Re: snowball (english) and filenames

Doron Cohen Thu, 17 May 2007 23:49:48 -0700

> a.b.c.d.e.f.g.h is not broken apart like how the snowball demo
> indicates it should do.


I am not sure about the "should" here - the way I see it, this
is just how the demo works: Snowball stemmers operate on words,
so the demo first breaks the input text into words and only
then applies stemming.

> For my lucene testing, I indexed one text file with one
> "a.b.c.d.e.f.g.h" string in it and opened the index up using Luke.
> It only indexed the string a.b.c.d.e.f.g.h (and didn't parse the
> string based on the periods).

In Lucene, the way text is "broken" into words is up to the
application and depends on the analyzer being used.
WhitespaceAnalyzer breaks on whitespace only, which is why a
string like "a.b.c.d.e.f.g.h" stays a single token.
StandardAnalyzer does more sophisticated work. Analyzers are
extensible, so you could modify their behavior. The wiki page
"AnalysisParalysis" has some relevant info.

By the way, Lucene's SimpleAnalyzer splits text at every
non-letter character and lowercases it, so it would break "a.b.c"
into "a b c", which seems to be what you are looking for?
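
To make the difference concrete, here is a rough plain-Java sketch of the
two splitting strategies. It does not call Lucene at all: the class
TokenizeSketch and its methods are hypothetical names, and the code only
approximates how WhitespaceAnalyzer and SimpleAnalyzer are documented to
tokenize, so treat it as an illustration rather than the real analyzers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Lucene code.
public class TokenizeSketch {

    // Break on whitespace only (approximates WhitespaceAnalyzer).
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char ch : text.toCharArray()) {
            if (Character.isWhitespace(ch)) {
                flush(current, tokens);
            } else {
                current.append(ch);
            }
        }
        flush(current, tokens);
        return tokens;
    }

    // Break at every non-letter and lowercase (approximates SimpleAnalyzer).
    static List<String> splitOnNonLetters(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char ch : text.toCharArray()) {
            if (Character.isLetter(ch)) {
                current.append(Character.toLowerCase(ch));
            } else {
                flush(current, tokens);
            }
        }
        flush(current, tokens);
        return tokens;
    }

    // Emit the pending token, if any, and reset the buffer.
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String input = "a.b.c.d.e.f.g.h";
        System.out.println(splitOnWhitespace(input)); // [a.b.c.d.e.f.g.h]
        System.out.println(splitOnNonLetters(input)); // [a, b, c, d, e, f, g, h]
    }
}
```

With real Lucene you would pass the chosen analyzer to the IndexWriter
instead, and use the same analyzer at query time so that the query terms
match what was indexed.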

HTH,
Doron


