Hi all.

We discovered that fullwidth letters are not treated as <LETTER> and fullwidth 
digits are not treated as <DIGIT>.

This in itself is probably easy to fix (including adding a filter to 
normalise these back to their ordinary forms). While sanity-checking the 
ranges in StandardTokenizer.jj, though, I found some suspicious parts, and 
since no comment explains the anomalies I would like to confirm that they 
are by design.
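
For reference, here is a minimal sketch of what the normalising step could 
look like. It is not Lucene code: the class name FullwidthFolder and its 
methods are hypothetical, and it only relies on the fact that the fullwidth 
ASCII variants (U+FF01 to U+FF5E) sit at a fixed offset of 0xFEE0 from their 
Basic Latin counterparts. A real fix would presumably wrap this in a 
TokenFilter.

```java
public final class FullwidthFolder {
    private FullwidthFolder() {}

    // Distance between the fullwidth ASCII variants (U+FF01..U+FF5E)
    // and the matching Basic Latin characters (U+0021..U+007E).
    private static final int OFFSET = 0xFEE0;

    // Fullwidth Latin capital letters are U+FF21..U+FF3A,
    // fullwidth Latin small letters are U+FF41..U+FF5A.
    public static boolean isFullwidthLetter(char c) {
        return (c >= '\uFF21' && c <= '\uFF3A')
            || (c >= '\uFF41' && c <= '\uFF5A');
    }

    // Fullwidth digits are U+FF10..U+FF19.
    public static boolean isFullwidthDigit(char c) {
        return c >= '\uFF10' && c <= '\uFF19';
    }

    // Hypothetical helper: replaces fullwidth letters and digits with
    // their ordinary forms and leaves every other character untouched.
    public static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isFullwidthLetter(c) || isFullwidthDigit(c)) {
                out.append((char) (c - OFFSET));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The sketch deliberately folds only letters and digits, since those are the 
two token types in question; fullwidth punctuation is left alone.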

Line 87:
       "\uffa0"-"\uffdc"

  The halfwidth Katakana "letters" (as Unicode calls them) are in <CJ> as
  expected, so I'm wondering if these halfwidth Hangul "letters" should
  actually be in <KOREAN> instead of <LETTER>.
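
  As a quick way to check which script the JDK assigns to these code
  points, something like the following could be used. It is only a sketch:
  the class name ScriptCheck is made up, and the result reflects the
  Unicode data shipped with the JDK, not anything in the tokenizer grammar.

```java
public final class ScriptCheck {
    private ScriptCheck() {}

    // Returns the Unicode script the running JDK assigns to a code point.
    public static Character.UnicodeScript scriptOf(int codePoint) {
        return Character.UnicodeScript.of(codePoint);
    }

    public static void main(String[] args) {
        // U+FFA1 is a halfwidth Hangul letter, U+FF71 a halfwidth Katakana letter.
        System.out.println(scriptOf(0xFFA1));
        System.out.println(scriptOf(0xFF71));
    }
}
```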

Line 92:
       "\u3040"-"\u318f",

  This range appears to duplicate the ranges in the next three lines and,
  suspiciously, also includes a range which belongs to <KOREAN>, making me
  wonder what happens when a range appears in two token definitions.
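
  To see which Unicode blocks a range like this spans, a small helper along
  these lines can list them. This is a sketch only: RangeBlocks and blocksIn
  are hypothetical names, and the block boundaries come from the JDK's
  Character.UnicodeBlock data.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public final class RangeBlocks {
    private RangeBlocks() {}

    // Collects, in code point order, every Unicode block that has at
    // least one code point inside the inclusive range [first, last].
    public static Set<Character.UnicodeBlock> blocksIn(int first, int last) {
        Set<Character.UnicodeBlock> blocks = new LinkedHashSet<>();
        for (int cp = first; cp <= last; cp++) {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block != null) {
                blocks.add(block);
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        // The range quoted from line 92 of the grammar.
        System.out.println(blocksIn(0x3040, 0x318F));
    }
}
```

  For the range above the result includes HANGUL_COMPATIBILITY_JAMO
  alongside the Hiragana, Katakana and Bopomofo blocks, which is the
  overlap with <KOREAN> that I am asking about.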

In case anyone is wondering, the JFlex version of the tokeniser on Lucene 
trunk has the same ranges.
