I would suggest you to have a look at Egothor stemmer (http://www.egothor.org/book/bk01ch01s06.html), can be trained rather easily (if your only use of "roots" is for searching)
I have only heard of it as a good thing, never tried it On Aug 4, 2006, at 1:29 PM, Marios Skounakis wrote: > > > > Hi all, > > The contrib section of Lucene contains a Greek Analyzer, which > however only does > some letter normalization (capitals to lowercase, accent removal) > and basic stop > word removal. > > I am interested in creating a Stemmer for the Greek Language to use > with Lucene > (i.e. implement it as an analyzer). The Greek Language is quite > different from > English (and most latin-related languages) in that it is highly > inflectional(?) - > meaning that there is a large number of suffixes, many of which are > not produced > in a very straightforward way. > > A quick internet search did not return much information - a couple of > non-publicly available papers and a Master's Thesis with a javascript > implementation which, however, seems to be somewhat lacking in > precision (i.e. > produces erroneous stems). A disappointing picture, admittedly, > when for the > english language it is so easy to find a public domain high quality > stemmer like > Porter's... > > Anyway, to cut a long story short, I had the following idea in > order to counter > the problem of the multiple suffixes and the high inflectional > nature of the > language: implement the stemmer using a combination of a lexicon of > stems (of the > most common words) and a list of all possible suffixes. The > algorithm for finding > the stem of a word would be something like: > > - for each suffix in the list of suffixes > - remove the suffix from the word (if possible), producing a > candindate stem > - search the lexicon of stems for the candidate stem > - if the search is succesful return the candidate stem > - if the search is unsuccesful go to next suffix > - if the suffixes are exhausted and no match is found, the word > cannot be stemmed > (return the original word) > > [By the way the algorithm is inspired by a paper which descripbes the > implementation of a lemmatizer in a similar way - citeseer link: > http://citeseer.ist.psu.edu/694579.html] > > The question is: > > Is such a strategy that depends on a leixcon of predefined stems > for implementing > a stemmer considered a major drawback? In theory it can be > (compared to an > algorith that works purely with rules, like Porter's) a drawback, > but in > practice, with a lexicon of a few thousand stems, the stemmer could > achieve > pretty good recall (and good precision too). > > Other issues to comment on are the lexicon size (which will have to > be embedded > in or accompany the stemmer), memory issues in running the stemmer > (keep the > lexicon in memory?), and performance issues (multiple lookups in > the lexicon > could make it much slower than a rule based stemmer?). In general, > any feedback > would be appreciated. > > Thanks in advance, > > Marios Skounakis > ---- Msg sent via eXis webmail > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > -------------------------- Grant Ingersoll Sr. Software Engineer Center for Natural Language Processing Syracuse University 335 Hinds Hall Syracuse, NY 13244 http://www.cnlp.org Voice: 315-443-5484 Skype: grant_ingersoll Fax: 315-443-6886 --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]