Thanks a lot, Andrzej. I'm testing your program now.
Also, could you point me to the papers you cited as [1] and [2]? I searched for them (on the web and on Google Scholar) but found nothing.

On 11/8/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> KwonNam Son wrote:
>
> >First of all, I really appreciate your work on Lucene for Korean words.
> >
> >But if we cannot support a stemming analyzer for Korean words, I think
> >one token per Korean character is better.
> >
> >When we search for a word, we usually use "검색", not "검색하다". ("하다" is
> >like the "ed" of "searched").
> >If we cannot get any results for "검색", StandardAnalyzer is of no use
> >for Korean, and I may have to go back to using CJKAnalyzer.
> >
> >How about leaving StandardAnalyzer unchanged and adding a new
> >Analyzer for Korean words?
> >
>
> Hello,
>
> My knowledge of Korean is near absolute zero... however, your example
> above looks like a typical stemming process for any Western language.
> The stem is not necessarily a valid dictionary word, just something that
> uniquely "labels" a group of related words created from the same root -
> and the transformation from inflected words to a stem can be expressed
> as a series of "patch commands" (insert/remove substring).
>
> I successfully used a Java package, originally created by Leon Galambos
> from the Egothor project, to create an algorithmic stemmer for Polish
> (http://www.getopt.org/stempel). The advantage of this particular
> approach is that you don't have to encode specific grammar rules in the
> stemmer; the stemmer learns the rules by itself from a training corpus.
> Such a training corpus consists of pairs of inflected and base forms,
> and the library automatically learns these "patch commands", i.e.
> instructions for inserting/removing parts of an inflected word to arrive
> at the base form. This training process results in a stemmer table that
> is reusable even for previously unseen words (based on the similarity of
> character patterns in input words).
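
The "patch commands" idea quoted above can be illustrated with a minimal Java sketch. This is not the Egothor or Stempel code, which I have not reproduced here: the class name `PatchStemmer`, the `-N+suffix` patch encoding, the 4-character suffix limit, and the English training pairs are all assumptions made for illustration. It learns, from (inflected, base) pairs, how many trailing characters to remove and what to append, keyed by the ending of the inflected word, and then reuses those patches on unseen words with a similar ending.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the Egothor/Stempel implementation.
class PatchStemmer {
    // Assumption: only the last 4 characters of a word are used as context.
    private static final int MAX_SUFFIX = 4;

    // word ending -> (patch command -> how often it was seen in training)
    private final Map<String, Map<String, Integer>> table = new HashMap<>();

    // Encodes "remove N trailing chars, then append S" as the string "-N+S".
    static String diff(String inflected, String base) {
        int shared = 0;
        int limit = Math.min(inflected.length(), base.length());
        while (shared < limit && inflected.charAt(shared) == base.charAt(shared)) {
            shared++;
        }
        return "-" + (inflected.length() - shared) + "+" + base.substring(shared);
    }

    // Applies a "-N+S" patch; returns the word unchanged if it is too short.
    static String apply(String word, String patch) {
        int plus = patch.indexOf('+');
        int cut = Integer.parseInt(patch.substring(1, plus));
        if (cut > word.length()) {
            return word;
        }
        return word.substring(0, word.length() - cut) + patch.substring(plus + 1);
    }

    // Records the patch for one training pair under every ending of length 1..MAX_SUFFIX.
    void train(String inflected, String base) {
        String patch = diff(inflected, base);
        int longest = Math.min(MAX_SUFFIX, inflected.length());
        for (int n = 1; n <= longest; n++) {
            String ending = inflected.substring(inflected.length() - n);
            table.computeIfAbsent(ending, k -> new HashMap<>()).merge(patch, 1, Integer::sum);
        }
    }

    // Uses the longest known ending and its most frequent patch; unknown words pass through.
    String stem(String word) {
        for (int n = Math.min(MAX_SUFFIX, word.length()); n >= 1; n--) {
            Map<String, Integer> votes = table.get(word.substring(word.length() - n));
            if (votes == null) {
                continue;
            }
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> e : votes.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            return apply(word, best);
        }
        return word;
    }

    public static void main(String[] args) {
        PatchStemmer stemmer = new PatchStemmer();
        stemmer.train("searched", "search");
        stemmer.train("walked", "walk");
        stemmer.train("studies", "study");
        stemmer.train("carries", "carry");
        System.out.println(stemmer.stem("jumped"));  // unseen word, ending "ed"
        System.out.println(stemmer.stem("parties")); // unseen word, ending "ies"
    }
}
```

The same scheme would treat a pair such as "검색하다" / "검색" as "remove the last two characters, append nothing", so no Korean grammar rule has to be written by hand. A real table-driven stemmer would need a far larger training corpus and a more careful way of choosing between competing patches than this simple majority vote.
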
>
> I suggest trying the code from the link above to see how it works;
> even with only a moderately sized training corpus (~500 pairs),
> the results should be positive.
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com

--
Cheolgoo