Hi all. We discovered that fullwidth letters are not treated as <LETTER> and fullwidth digits are not treated as <DIGIT>.
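To make the characters in question concrete, here is a small sketch in plain Java (not Lucene code; the class name FullwidthFold and the method fold are made up for illustration). It assumes that only the fullwidth ASCII variants U+FF01..U+FF5E need folding, which sit exactly 0xFEE0 above their ASCII counterparts. NFKC normalisation from the JDK is shown for comparison, but it also rewrites other compatibility characters, so it may be broader than what is wanted.

```java
import java.text.Normalizer;

public class FullwidthFold {
    // Hypothetical helper: maps fullwidth ASCII variants (U+FF01..U+FF5E)
    // back to plain ASCII (U+0021..U+007E) and leaves everything else alone.
    public static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= '\uFF01' && c <= '\uFF5E') {
                out.append((char) (c - 0xFEE0)); // fixed offset to ASCII
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fullwidth "ABC123"
        String sample = "\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13";

        // What the JDK itself thinks of a fullwidth letter and digit.
        System.out.println(Character.isLetter('\uFF21'));
        System.out.println(Character.isDigit('\uFF11'));

        // Offset-based folding versus NFKC compatibility normalisation.
        System.out.println(fold(sample));
        System.out.println(Normalizer.normalize(sample, Normalizer.Form.NFKC));
    }
}
```

Both calls should give ABC123 for this sample. The JDK already classifies these characters as letters and digits; it is only the tokeniser that does not.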
This in itself is probably easy to fix (including the filter for normalising these back to the normal versions) but while sanity checking the blocks in StandardTokenizer.jj I found some suspicious parts and felt it necessary to check that this is by design as there is no comment explaining the anomalies.

Line 87:

    "\uffa0"-"\uffdc"

The halfwidth Katakana "letters" (as Unicode calls them) are in <CJ> as expected, so I'm wondering if these halfwidth Hangul "letters" should actually be in <KOREAN> instead of <LETTER>.

Line 92:

    "\u3040"-"\u318f",

This block appears to duplicate the ranges in the next three lines and suspiciously also includes a range which belongs to <KOREAN>, making me wonder what happens when a range is in two blocks.

In case anyone is wondering, the JFlex version of the tokeniser on Lucene trunk has the same ranges.
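For anyone who wants to check the two ranges quoted above without a code chart to hand, one option is to ask the JDK which Unicode blocks each range touches. This is only a rough sketch: RangeBlocks and blocksIn are made-up names, the ranges are copied from the grammar lines quoted above, and it says nothing about how JavaCC itself resolves a range that appears in two tokens.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class RangeBlocks {
    // Hypothetical helper: collects, in order, every Unicode block that the
    // JDK reports for the code points in [start, end] inclusive.
    public static Set<Character.UnicodeBlock> blocksIn(int start, int end) {
        Set<Character.UnicodeBlock> blocks = new LinkedHashSet<>();
        for (int cp = start; cp <= end; cp++) {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block != null) {
                blocks.add(block);
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        // The range quoted from line 92.
        System.out.println(blocksIn(0x3040, 0x318F));
        // The range quoted from line 87.
        System.out.println(blocksIn(0xFFA0, 0xFFDC));
    }
}
```

The first range should cover Hiragana, Katakana, Bopomofo and Hangul Compatibility Jamo, the last of which is the Korean part in question. The second range lies entirely inside Halfwidth and Fullwidth Forms.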