Several indexing methods have been proposed for Korean information retrieval systems, including morpheme-based, word-phrase-based, and n-gram-based indexing. Among these, n-gram-based indexing with n = 2 or 3 is widely used. The method is very simple, yet it outperforms the others in precision and recall, the basic measures for evaluating information retrieval systems. However, it generates too many index terms, many of which are meaningless, so the index files become very large. To alleviate this problem, this paper proposes a new indexing method, called mixed n-gram indexing, which chooses between 2-grams and 3-grams according to a probabilistic criterion in order to remove the meaningless terms. The t-score is used as the criterion for this choice. This paper also describes a new stemming method that uses a greedy algorithm to speed up Korean indexing systems.
For the experiments, KT-SET and KEMONG-SET are used as Korean reference test collections, together with the storage and retrieval components of the Lemur information retrieval toolkit 2.2. The experiments show that the proposed method is not inferior to the others in recall and precision, and is superior to them in the number of index terms it generates.
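As a rough illustration of how a t-score could drive the choice between 2-grams and 3-grams, the following Python sketch may help. It is not the procedure used in this paper, whose exact criterion is not given in the abstract. It assumes the common collocation approximation t = (O - E) / sqrt(O), treats a 3-gram as a 2-gram followed by one character, and keeps the 3-gram only when the t-score reaches a threshold. The function names, the count tables, and the threshold value are all hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def t_score(joint_count, left_count, right_count, total):
    """Collocation t-score approximation: (O - E) / sqrt(O).

    O is the observed joint count and E is the count expected if the
    two parts were independent. Returns 0.0 when the pair is unseen.
    """
    if joint_count == 0:
        return 0.0
    expected = left_count * right_count / total
    return (joint_count - expected) / sqrt(joint_count)


def mixed_ngram_terms(text, c1, c2, c3, total, threshold=1.65):
    """Hypothetical selection rule (an assumption, not the paper's).

    At each position, emit the 3-gram if its 2-gram prefix and its last
    character are strongly associated, otherwise emit the 2-gram.
    c1, c2, c3 are Counters of 1-, 2-, and 3-gram corpus frequencies.
    """
    terms = []
    for i in range(len(text) - 1):
        bigram = text[i:i + 2]
        trigram = text[i:i + 3]
        if len(trigram) == 3:
            t = t_score(c3[trigram], c2[bigram], c1[trigram[2]], total)
            if t >= threshold:
                terms.append(trigram)
                continue
        terms.append(bigram)
    return terms
```

In this sketch, the count tables would be gathered from the collection with `Counter(char_ngrams(...))`, and `threshold` trades index size against term specificity. The default of 1.65 is only an illustrative value.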