KwonNam Son wrote:
First of all, I really appreciate your work on Lucene for Korean words,
But If we cannot support stem analyzer for Korean words, I think one
token for one Korean character is better.
When we search a word, usually we use "검색" not "검색하다". ("하다" is like
"ed" of "searched").
If we cannot get any result from "검색", StandardAnalyzer has no meaning
to Korean, I may have to go back to use CJKAnalyzer.
How about let the StandarAnalyzer be not changed, and add a new
Analyzer for Korea words?
Hello,
My knowledge of Korean is near absolute zero... however, your example
above looks like a typical stemming process for any Western language.
The stem is not necessarily a valid dictionary word, just something that
uniquely "labels" a group of related words created from the same root -
and the transformation from inflected words to a stem can be expressed
as a series of "patch commands" (insert/remove substring).
I successfully used a Java package, originally created by Leon Galambos
from Egothor project, to create an algorithmic stemmer for Polish
(http://www.getopt.org/stempel). The advantage of this particular
approach is that you don't have to encode specific grammar rules in the
stemmer, the stemmer learns rules by itself from a training corpus. Such
training corpus consists of pairs of inflected and base forms, and the
library automatically learns these "patch commands", i.e. instructions
for inserting/removing parts of an inflected word to arrive at the base
form. This training process results in creating a stemmer table,
reusable even for previously unseen words (based on the similarity of
character patterns in input words).
I suggest to try the code from the link above and test how it works,
even if you only have a moderately-sized training corpus (~500 pairs)
the results should be positive.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]