On Mon, Sep 7, 2009 at 10:47 AM, Paul Taylor <paul_t...@fastmail.fm> wrote: > Robert Muir wrote: >>> >>> I think we would like to implement the complete unicode rules, so if you >>> could provide us with some code that would be great. >>> >> >> ok, I will followup... what version of lucene are you using, 2.9? >> >> ... >> > > Yes
I will update LUCENE-1488 with the latest code so you can steal the ICUTransformFilter from there. > I'm obviously misunderstanding I thought that Halfwidth was an encoding to > allow storing the most common Chinese characters in a single byte, therefore > the charcters would be read as different characters if you assumed they were > using the HalfWidth Encoding rather than Latin Encoding. But are you saying > Halfwidth characters are actually valid Unicode characters with their own > distinct unicode value so can just use a CharFilter again to map these, > please confirm. yes, fullwidth latin forms are distinct characters that have a different width: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:East_Asian_Width=Fullwidth:] so yes, you can use charfilter to map these to their standard latin forms. beware though, there is a similar issue with halfwidth characters: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:East_Asian_Width=Halfwidth:] example, ナ is halfwidth for the standard ナ so you might want to include mappings for those as well. the reason i brought up normalization is because this issue (width) is a subset of things normalization can help with. if you click on some of the characters in the two sets i provided you will notice properties like 'toNFKC' containing the 'standardized' form. if in the future, you run into trouble with things in other languages that aren't matching as expected, because they aren't being considered the "same" when perhaps they should, then a more general approach would be applying Unicode normalization form NFKC in a TokenFilter. -- Robert Muir rcm...@gmail.com --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org