Re: Unicode Normalization

2007-04-11 Thread Mike Klaas
On 4/11/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 4/11/07, Mike Klaas <[EMAIL PROTECTED]> wrote: > Unicode characters do not map > precisely to code points: a single character can often be represented > via a single codepoint or a combination of two (surrogate pair). I normally hear surrog

Re: Unicode Normalization

2007-04-11 Thread Daniel Noll
Yonik Seeley wrote: have no idea how java's String class handles this--I doubt it does any intelligent normalization. UTF-16 surrogates are handled as of Java5. And as of Java6 we have the java.text.Normalizer utility. Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo NSW
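As a rough illustration of the two Java features mentioned above (code point support for UTF-16 surrogates as of Java 5, and java.text.Normalizer as of Java 6), here is a minimal sketch. The class name UnicodeSketch and the helper toNfkd are made up for this example; they are not from the thread or from Lucene. It also shows that surrogate pairs and combining sequences are separate issues: only the latter is affected by normalization.

```java
import java.text.Normalizer;

// Hypothetical demo class; the names are illustrative only.
class UnicodeSketch {

    // Returns the NFKD (compatibility decomposition) form of the input.
    static String toNfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        // U+1D538 lies outside the BMP, so UTF-16 needs a surrogate pair:
        // two char units, but a single code point.
        String supplementary = "\uD835\uDD38";
        System.out.println(supplementary.length());                       // 2
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1

        // The same visible text as two different code point sequences.
        String composed = "caf\u00E9";     // e-acute as one code point
        String decomposed = "cafe\u0301";  // 'e' followed by a combining acute accent
        System.out.println(composed.equals(decomposed));                  // false
        System.out.println(toNfkd(composed).equals(toNfkd(decomposed)));  // true

        // NFKD also folds compatibility characters such as the "fi" ligature.
        System.out.println(toNfkd("\uFB01"));                             // fi
    }
}
```

If both indexed text and queries pass through the same normalization step, composed and decomposed spellings should compare equal, which is the mismatch described in the original question.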

Re: Unicode Normalization

2007-04-11 Thread Yonik Seeley
On 4/11/07, Mike Klaas <[EMAIL PROTECTED]> wrote: Unicode characters do not map precisely to code points: a single character can often be represented via a single codepoint or a combination of two (surrogate pair). I normally hear surrogates in the context of UTF-16 after the code point space

Re: Unicode Normalization

2007-04-11 Thread Mike Klaas
On 4/11/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: : I have encountered a problem searching in my application because of : inconsistent Unicode normalization forms in the corpus (and the : queries). I would like to normalize to form NFKD in an analyzer (I : think). I was thinking

Re: Unicode Normalization

2007-04-11 Thread Chris Hostetter
: I have encountered a problem searching in my application because of : inconsistent Unicode normalization forms in the corpus (and the : queries). I would like to normalize to form NFKD in an analyzer (I : think). I was thinking about creating a filter similar to the i'm very naive t

Unicode Normalization

2007-04-11 Thread David Woodward
Hi. I have encountered a problem searching in my application because of inconsistent Unicode normalization forms in the corpus (and the queries). I would like to normalize to form NFKD in an analyzer (I think). I was thinking about creating a filter similar to the lowercasefilter that would do