July 18, 2008 3:29 PM
To: java-user@lucene.apache.org
Subject: RE: Bug in CJKTokenizer
Hi Scott,
I think this sounds reasonable, but why not also add LATIN_EXTENDED_B and
LATIN_EXTENDED_ADDITIONAL? AFAICT, among other things, these cover some
eastern European languages and Vietnamese, respectively.
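
To make the suggestion concrete, here is a rough sketch of the kind of block check I have in mind. It is not the actual CJKTokenizer code: the class and method names are made up for illustration, and I am assuming the check would also cover BASIC_LATIN, LATIN_1_SUPPLEMENT and LATIN_EXTENDED_A. Only the java.lang.Character.UnicodeBlock constants themselves are real.

```java
import java.lang.Character.UnicodeBlock;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of Lucene.
class LatinBlockCheck {

    // Assumed set of blocks to treat as "Latin". The first three are a guess
    // at a baseline; the last two are the additions suggested above.
    private static final Set<UnicodeBlock> LATIN_BLOCKS = new HashSet<UnicodeBlock>(
        Arrays.asList(
            UnicodeBlock.BASIC_LATIN,
            UnicodeBlock.LATIN_1_SUPPLEMENT,
            UnicodeBlock.LATIN_EXTENDED_A,
            UnicodeBlock.LATIN_EXTENDED_B,          // e.g. Romanian s/t with comma below
            UnicodeBlock.LATIN_EXTENDED_ADDITIONAL  // e.g. Vietnamese
        ));

    // Returns true if the character belongs to one of the Latin blocks above.
    // UnicodeBlock.of returns null for unassigned code points, which is
    // simply not in the set.
    static boolean isLatinBlock(char c) {
        return LATIN_BLOCKS.contains(UnicodeBlock.of(c));
    }
}
```

With this, U+0219 (Latin Extended-B) and U+1EC7 (Latin Extended Additional) are both accepted, while a CJK ideograph such as U+4E2D is not.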
Steve
On 07/18/2008 at 5:03 PM, Scott Smith wrote:
org.apache.lucene.analysis.cjk.CJKTokenizer is in the "contrib" portion of
lucene, so I'm not sure if this is the right place to mention this or not. I
was doing some detailed analysis of how this tokenizer worked and noticed the
following behavior (which I would classify as a bug).
If you