hey i've played around with trying to get towards a reasonable gpl hebrew analyzer for lucene but don't have anything yet... just messing during my spare time.
in general it wasnt hard to munge the hspell perl scripts with some java code into producing a morphological analyzer but from what I see this is pretty useless without some disambiguation, because precision is low even when things are written in pristine spelling, etc. i'm not aware of some good trec-like data for hebrew to benchmark any ideas against either, which creates some problems. i do have the idea of trying to only solve the easier problem of segmentation to create a reasonable search, and there's some test data i've been playing with here: http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-011.pdf problem is I can't train on it (if you read the paper it explains this). one thing i did do was upload some tokenization work here: https://issues.apache.org/jira/browse/LUCENE-1488 this uses RBBI and should at least tokenize your hindi correctly (according to unicode rules). it should also parse your hebrew better (but still not really correct), but really won't give you a useful hebrew search, just handle punctuation a bit better. it also has some practical problems that should be fixed as mentioned in the JIRA task. don't know if this helps... On Tue, Feb 17, 2009 at 9:54 PM, Zhang, Lisheng < lisheng.zh...@broadvision.com> wrote: > Hi, > > Are there free Hebrew and Hindi language analyzers for > lucene? I searched archive and found some discussions, > but did not see clear pointers to downloadable classes. > > Thanks very much for helps, Lisheng > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com