The Sejong Corpora are widely used for Korean language processing and comprise a POS (part-of-speech) tagged corpus, a word-sense tagged corpus, a dependency-tree tagged corpus, and a Korean-English parallel corpus. However, they contain many kinds of errors even though they were built by well-trained annotators. In this thesis, we are specifically interested in the errors in the Sejong POS tagged corpus. These errors degrade the performance of systems trained on the corpus and should be minimized. It is not easy, however, to detect and correct them, because the proportion of errors is large and the kinds of errors are very diverse. Furthermore, detecting and correcting the errors is laborious, time-consuming, and expensive.
In this thesis, we propose an error correction tool for efficiently detecting and correcting errors in the Sejong POS tagged corpus. We automatically detect errors using methods for morphological generation and automatic word spacing. The former is used for insertion and deletion errors and spelling errors, and the latter for word spacing errors. We then correct the errors semi-automatically through a graphical user interface (GUI) implemented in Java. The GUI provides four major functions: spelling error correction, morpheme deletion correction, morpheme insertion correction, and morphological re-analysis. It is designed to reduce laborious tasks and repetitive behavior patterns.
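To illustrate the idea behind detection by morphological generation, the following is a minimal sketch and not the tool's actual implementation. It assumes that each entry pairs a surface word with an analysis written as `morph/TAG+morph/TAG`, and it "generates" the surface form by naive concatenation of the morphemes, which ignores Korean contraction and irregular conjugation, and does not handle morphemes that themselves contain `+`. The function names and the length-based classification are hypothetical simplifications.

```python
def parse_analysis(analysis):
    """Split 'morph/TAG+morph/TAG' into a list of (morph, tag) pairs."""
    pairs = []
    for token in analysis.split("+"):
        # The tag follows the last '/', so split from the right.
        morph, tag = token.rsplit("/", 1)
        pairs.append((morph, tag))
    return pairs


def generate_surface(pairs):
    """Naive morphological generation: concatenate the morphemes.

    Assumption: no contraction or irregular conjugation is modeled.
    """
    return "".join(morph for morph, _ in pairs)


def classify(surface, analysis):
    """Compare the generated form with the surface word.

    Returns 'ok', 'deletion' (the analysis appears to lack a morpheme),
    'insertion' (the analysis appears to have an extra morpheme), or
    'spelling' (same length but different characters). This length-based
    rule is a crude heuristic chosen only for illustration.
    """
    generated = generate_surface(parse_analysis(analysis))
    if generated == surface:
        return "ok"
    if len(generated) < len(surface):
        return "deletion"
    if len(generated) > len(surface):
        return "insertion"
    return "spelling"
```

For example, under these assumptions `classify("학교에서", "학교/NNG+에/JKB")` flags a possible morpheme deletion, because the generated form `학교에` is shorter than the surface word. Flagged entries would then be passed to a human annotator for correction.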
When applying the proposed tool to the Sejong POS tagged corpus, we observed at least a nine-fold reduction in the time required for error detection and correction. We also showed experimentally that error correction speed increased steadily. These results suggest that the proposed tool is very promising for error detection and correction.