
RE: Bug in CJKTokenizer

2008-07-18 Thread Scott Smith

July 18, 2008 3:29 PM
To: java-user@lucene.apache.org
Subject: RE: Bug in CJKTokenizer

Hi Scott,

I think this sounds reasonable, but why not also add LATIN_EXTENDED_B and LATIN_EXTENDED_ADDITIONAL? AFAICT, among other things, these cover some eastern European languages and Vietnamese, respect

RE: Bug in CJKTokenizer

2008-07-18 Thread Steven A Rowe

Hi Scott,

I think this sounds reasonable, but why not also add LATIN_EXTENDED_B and LATIN_EXTENDED_ADDITIONAL? AFAICT, among other things, these cover some eastern European languages and Vietnamese, respectively.

Steve

On 07/18/2008 at 5:03 PM, Scott Smith wrote:
> org.apache.lucene.analysis.
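For readers unfamiliar with the two block names, here is a minimal sketch of how one might check which Unicode block a character falls in, using the standard java.lang.Character.UnicodeBlock API. The class name SuggestedBlocks and the helper inSuggestedBlocks are hypothetical names made up for this illustration; they are not part of CJKTokenizer, and this is not the patch discussed in the thread.

```java
import java.lang.Character.UnicodeBlock;

public class SuggestedBlocks {

    // Hypothetical helper: true if c lies in one of the two blocks
    // named in the message above. Not taken from CJKTokenizer.
    static boolean inSuggestedBlocks(char c) {
        UnicodeBlock block = UnicodeBlock.of(c);
        return block == UnicodeBlock.LATIN_EXTENDED_B
            || block == UnicodeBlock.LATIN_EXTENDED_ADDITIONAL;
    }

    public static void main(String[] args) {
        char[] samples = {
            'a',       // U+0061, plain ASCII letter
            '\u0219',  // s with comma below, used in Romanian
            '\u1EBF'   // e with circumflex and acute, used in Vietnamese
        };
        for (char c : samples) {
            // Print the code point, its block, and the helper's verdict.
            System.out.println(String.format("U+%04X", (int) c)
                + " block=" + UnicodeBlock.of(c)
                + " suggested=" + inSuggestedBlocks(c));
        }
    }
}
```

Under this sketch, U+0219 falls in LATIN_EXTENDED_B and U+1EBF in LATIN_EXTENDED_ADDITIONAL, while the ASCII letter falls in neither.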

Bug in CJKTokenizer

2008-07-18 Thread Scott Smith

org.apache.lucene.analysis.cjk.CJKTokenizer is in the "contrib" portion of Lucene, so I'm not sure whether this is the right place to mention this. I was doing some detailed analysis of how this tokenizer works and noticed the following behavior (which I would classify as a bug). If you