On 4/11/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
On 4/11/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> Unicode characters do not map
> precisely to code points: a single character can often be represented
> via a single code point or a combination of two (surrogate pair).
I normally hear surrog
Yonik Seeley wrote:
> I have no idea how Java's String class handles this--I doubt it does any
> intelligent normalization.
UTF-16 surrogates are handled as of Java 5.
And as of Java 6 we have the java.text.Normalizer utility.
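
A minimal sketch of both points, using only JDK classes. The class and method names (UnicodeBasics, codeUnits, codePoints, sameAfterNfd) are made up for illustration. It shows that a surrogate pair is a UTF-16 encoding detail, while composed versus decomposed forms are a separate issue that java.text.Normalizer addresses.

```java
import java.text.Normalizer;

final class UnicodeBasics {
    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (available since Java 5).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if the two strings are equal once both are put into NFD (Java 6+).
    static boolean sameAfterNfd(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) is outside the BMP, so UTF-16 stores it
        // as a surrogate pair: two code units, one code point.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(codeUnits(clef));   // prints 2
        System.out.println(codePoints(clef));  // prints 1

        // Separate issue: the same visible character as one code point
        // (precomposed) or as a base letter plus a combining accent.
        String composed = "\u00E9";
        String decomposed = "e\u0301";
        System.out.println(composed.equals(decomposed));        // prints false
        System.out.println(sameAfterNfd(composed, decomposed)); // prints true
    }
}
```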
Daniel
--
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW
On 4/11/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> Unicode characters do not map
> precisely to code points: a single character can often be represented
> via a single code point or a combination of two (surrogate pair).
I normally hear surrogates in the context of UTF-16 after the code point space
On 4/11/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: I have encountered a problem searching in my application because of
: inconsistent Unicode normalization forms in the corpus (and the
: queries). I would like to normalize to form NFKD in an analyzer (I
: think). I was thinking about creating a filter similar to the
i'm very naive t
Hi.
I have encountered a problem searching in my application because of
inconsistent Unicode normalization forms in the corpus (and the queries). I
would like to normalize to form NFKD in an analyzer (I think). I was thinking
about creating a filter similar to the LowerCaseFilter that would do
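
For reference, a rough sketch of the per-token step such a filter might perform, assuming Java 6 or later. NfkdTokenNormalizer and normalizeToken are hypothetical names, and the wiring into an analyzer is deliberately left out, so this is plain JDK code rather than a Lucene filter.

```java
import java.text.Normalizer;

final class NfkdTokenNormalizer {
    // Hypothetical helper: put one token's text into NFKD, skipping the
    // work when the text is already in that form.
    static String normalizeToken(String token) {
        if (Normalizer.isNormalized(token, Normalizer.Form.NFKD)) {
            return token;
        }
        return Normalizer.normalize(token, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        // Compatibility decomposition: the "fi" ligature U+FB01 becomes "fi".
        System.out.println(normalizeToken("\uFB01le").equals("file")); // prints true

        // Corpus and query that differ only in normalization form now match.
        String indexed = normalizeToken("caf\u00E9");
        String query = normalizeToken("cafe\u0301");
        System.out.println(indexed.equals(query)); // prints true
    }
}
```

The same step would have to run at both index time and query time, otherwise the two sides can still end up in different forms.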