한국해양대학교

Title
한글 정보 검색을 위한 혼합 n-그램 기반의 색인 방법
Alternative Title
An Indexing Method Based on the Mixed n-Gram for Korean Information Retrieval
Author(s)
정창용
Issued Date
2004
Publisher
한국해양대학교 대학원 (Graduate School, Korea Maritime University)
URI
http://kmou.dcollection.net/jsp/common/DcLoOrgPer.jsp?sItemId=000002176193
http://repository.kmou.ac.kr/handle/2014.oak/10568
Abstract
Several indexing methods have been proposed for Korean information retrieval systems, including morpheme-based, word-phrase-based, and n-gram-based methods. Among these, n-gram-based indexing with n = 2 or 3 is widely used. The method is very simple, yet it outperforms the others in precision and recall, the basic measures for evaluating information retrieval systems. However, it generates a very large number of index terms, many of them meaningless, so the index files become huge. To relieve this problem, this paper proposes a new indexing method, called the mixed n-gram indexing method, which chooses between 2-grams and 3-grams according to a probabilistic criterion in order to remove the meaningless terms. The t-score is used as the criterion for this choice. This paper also describes a new stemming method that uses a greedy algorithm to speed up Korean indexing systems.

The experiments use KT-SET and KEMONG-SET as Korean reference test collections, together with the storage and retrieval components of the Lemur information retrieval toolkit 2.2. They show that the proposed method is not inferior to the others in recall and precision, and is superior to them in the number of index terms.
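
The abstract does not give the exact selection procedure, so the following is only a minimal sketch of how a t-score could be used to choose between 2-grams and 3-grams. It assumes that n-grams are taken over characters (Hangul syllables) within a word phrase, that the t-score is the usual collocation statistic (observed count minus expected count, divided by the square root of the observed count), and that a 3-gram is kept when its leading 2-gram and final character are strongly associated. The names `char_ngrams`, `t_score`, `mixed_ngram_terms`, and the `threshold` parameter are hypothetical and do not come from the thesis.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of one word phrase."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def t_score(seq_count, left_count, right_count, total):
    """Collocation t-score expressed in counts: (observed - expected) / sqrt(observed).

    'expected' is the count the sequence would have if its left and right
    parts occurred independently of each other.
    """
    if seq_count == 0:
        return 0.0
    expected = left_count * right_count / total
    return (seq_count - expected) / math.sqrt(seq_count)


def mixed_ngram_terms(phrases, threshold=2.0):
    """Hypothetical selection rule (an assumption, not the thesis's criterion):
    keep a 3-gram when its t-score reaches the threshold, otherwise fall back
    to the two 2-grams it contains."""
    uni = Counter(ch for p in phrases for ch in p)
    bi = Counter(g for p in phrases for g in char_ngrams(p, 2))
    tri = Counter(g for p in phrases for g in char_ngrams(p, 3))
    total = sum(uni.values())

    terms = set()
    for p in phrases:
        if len(p) < 3:
            # Too short for a 3-gram: index its 2-gram, if it has one.
            terms.update(char_ngrams(p, 2))
            continue
        for g in char_ngrams(p, 3):
            score = t_score(tri[g], bi[g[:2]], uni[g[2]], total)
            if score >= threshold:
                terms.add(g)
            else:
                terms.update((g[:2], g[1:]))
    return terms
```

For example, `char_ngrams("정보검색", 2)` yields the three 2-grams `정보`, `보검`, and `검색`. Under this sketch, raising `threshold` keeps fewer 3-grams and so moves the index toward plain 2-gram indexing.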
Appears in Collections:
Department of Computer Engineering (컴퓨터공학과) > Thesis
Files in This Item:
000002176193.pdf
