Re: Hebrew and Hindi analyzers

Robert Muir Tue, 17 Feb 2009 21:48:21 -0800

Hey, I've played around with trying to get towards a reasonable GPL Hebrew
analyzer for Lucene but don't have anything yet... just messing around in my
spare time.

In general it wasn't hard to munge the Hspell Perl scripts, plus some Java
code, into a morphological analyzer. But from what I see this is pretty
useless without some disambiguation, because precision is low even when the
text is written in pristine spelling.

I'm not aware of any good TREC-like data for Hebrew to benchmark ideas
against either, which makes it hard to tell whether a change actually helps.

I do have the idea of solving only the easier problem of segmentation to get
a reasonable search, and there's some test data I've been playing with
here:
http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-011.pdf

Problem is, I can't train on it (the paper explains why).
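
To make the segmentation problem a bit more concrete, here is a rough sketch (not taken from Hspell or from the paper) that only enumerates candidate prefix/stem splits for a Hebrew word. It assumes the seven single-letter proclitics (ו, ה, ב, כ, ל, מ, ש) and a minimum stem length of two letters; the class name `PrefixSplitSketch`, the method `candidates` and the `MIN_STEM` constant are made up for illustration. It does no disambiguation at all, which is exactly the part that is missing.

```java
import java.util.ArrayList;
import java.util.List;

final class PrefixSplitSketch {
    // Single-letter Hebrew proclitics: vav, he, bet, kaf, lamed, mem, shin.
    private static final String PREFIX_LETTERS =
            "\u05D5\u05D4\u05D1\u05DB\u05DC\u05DE\u05E9";

    // Assumption for this sketch: a stem keeps at least two letters.
    private static final int MIN_STEM = 2;

    /**
     * Returns every {prefix, stem} split where the prefix consists only of
     * proclitic letters. The first entry is always the unsplit word.
     */
    static List<String[]> candidates(String word) {
        List<String[]> out = new ArrayList<>();
        out.add(new String[] {"", word});
        int i = 0;
        // Peel off one leading proclitic letter at a time and record each split.
        while (i < word.length() - MIN_STEM
                && PREFIX_LETTERS.indexOf(word.charAt(i)) >= 0) {
            i++;
            out.add(new String[] {word.substring(0, i), word.substring(i)});
        }
        return out;
    }
}
```

For ובבית ("and in the house") this yields four candidates, and only one of them (וב + בית) splits off both particles. Picking that one needs a lexicon or a trained model, which is where the lack of training data hurts.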

One thing I did do was upload some tokenization work here:
https://issues.apache.org/jira/browse/LUCENE-1488

This uses RBBI (RuleBasedBreakIterator) and should at least tokenize your
Hindi correctly (according to Unicode rules). It should also parse your
Hebrew better (but still not really correctly). It won't give you a useful
Hebrew search, though; it just handles punctuation a bit better. It also has
some practical problems that should be fixed, as mentioned in the JIRA issue.

Don't know if this helps...

On Tue, Feb 17, 2009 at 9:54 PM, Zhang, Lisheng <
lisheng.zh...@broadvision.com> wrote:

> Hi,
>
> Are there free Hebrew and Hindi language analyzers for
> Lucene? I searched the archive and found some discussions,
> but did not see clear pointers to downloadable classes.
>
> Thanks very much for any help, Lisheng

-- 
Robert Muir
rcm...@gmail.com
