KwonNam Son wrote:

First of all, I really appreciate your work on Lucene for Korean words.

But if we cannot support a stemming analyzer for Korean words, I think one
token per Korean character is better.

When we search for a word, we usually use "검색", not "검색하다" ("하다" is like
the "ed" in "searched").
If we cannot get any results for "검색", StandardAnalyzer is of no use
for Korean, and I may have to go back to using CJKAnalyzer.

How about leaving StandardAnalyzer unchanged and adding a new
Analyzer for Korean words?

Hello,

My knowledge of Korean is close to zero. However, your example above looks like a typical stemming process in any Western language. The stem is not necessarily a valid dictionary word, just something that uniquely "labels" a group of related words derived from the same root, and the transformation from an inflected word to its stem can be expressed as a series of "patch commands" (insert/remove substring).

I successfully used a Java package, originally created by Leon Galambos of the Egothor project, to create an algorithmic stemmer for Polish (http://www.getopt.org/stempel). The advantage of this particular approach is that you don't have to encode specific grammar rules in the stemmer; the stemmer learns the rules by itself from a training corpus. Such a training corpus consists of pairs of inflected and base forms, and the library automatically learns these "patch commands", i.e. instructions for inserting/removing parts of an inflected word to arrive at the base form. The training process produces a stemmer table that is reusable even for previously unseen words (based on the similarity of character patterns in the input words).
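
To make the idea of "patch commands" concrete, here is a minimal sketch in Java. It is a simplification for illustration only, not the actual Egothor/Stempel code: the class name PatchStemmerSketch and its methods are hypothetical, and it assumes that every patch only touches the end of the word (strip some trailing characters, then append a replacement). The real library is more general than this.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; NOT the Egothor/Stempel implementation.
// Assumption: a patch only rewrites the end of a word.
class PatchStemmerSketch {

    // A "patch command": delete `strip` trailing characters, then append `append`.
    private static final class Patch {
        final int strip;
        final String append;

        Patch(int strip, String append) {
            this.strip = strip;
            this.append = append;
        }
    }

    // Learned table: inflected ending -> patch that turns it into the base ending.
    private final Map<String, Patch> table = new HashMap<>();

    // Learn one rule from an (inflected, base) training pair.
    void train(String inflected, String base) {
        int common = 0;
        int max = Math.min(inflected.length(), base.length());
        while (common < max && inflected.charAt(common) == base.charAt(common)) {
            common++;
        }
        String ending = inflected.substring(common);
        if (ending.isEmpty()) {
            return; // nothing to strip, so there is no usable rule
        }
        table.put(ending, new Patch(ending.length(), base.substring(common)));
    }

    // Apply the longest learned ending that matches; unknown words pass through.
    String stem(String word) {
        // Start at 1 so that at least one character of the word is always kept.
        for (int start = 1; start < word.length(); start++) {
            Patch p = table.get(word.substring(start));
            if (p != null) {
                return word.substring(0, word.length() - p.strip) + p.append;
            }
        }
        return word;
    }

    public static void main(String[] args) {
        PatchStemmerSketch stemmer = new PatchStemmerSketch();
        stemmer.train("searched", "search"); // learns: strip "ed"
        stemmer.train("studies", "study");   // learns: strip "ies", append "y"

        // Neither of these words was in the training pairs.
        System.out.println(stemmer.stem("walked"));  // walk
        System.out.println(stemmer.stem("parties")); // party
    }
}
```

Under the same assumption, training this sketch on the pair ("검색하다", "검색") would record the rule "strip the ending 하다", and that rule would then apply to other words with the same ending.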

I suggest trying the code from the link above and testing how it works. Even with only a moderately sized training corpus (~500 pairs), the results should be positive.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


