Michael, as Simon mentioned, I created an issue describing where you
might run into trouble, at least in Lucene core.

The low-level Lucene code handles these characters just fine (as
surrogate pairs).

Most analyzers, however, run into some trouble (things like
WhitespaceAnalyzer are OK).
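
To illustrate the kind of trouble meant here, below is a minimal sketch
in plain Java (1.5+). It is not Lucene code: the class and method names
are made up for this example. It assumes an analyzer-like filter that
keeps only letters, once testing each char (UTF-16 code unit) and once
testing each code point. The char-based version silently drops a
supplementary letter, because neither half of a surrogate pair counts as
a letter on its own.

```java
public class SurrogatePairDemo {

    /** Keeps only letters, testing one char (UTF-16 code unit) at a time. */
    public static String keepLettersByChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // A lone high or low surrogate is not a letter, so both
            // halves of a supplementary character are discarded here.
            if (Character.isLetter(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Keeps only letters, testing one code point at a time. */
    public static String keepLettersByCodePoint(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+10400 (DESERET CAPITAL LETTER LONG I) is a letter above U+FFFF.
        String supplementary = new String(Character.toChars(0x10400));
        String text = "a" + supplementary + "b";

        System.out.println(text.length());                          // 4 chars
        System.out.println(text.codePointCount(0, text.length()));  // 3 code points
        System.out.println(keepLettersByChar(text).length());       // 2: the letter was dropped
        System.out.println(keepLettersByCodePoint(text).length());  // 4: the letter was kept
    }
}
```

Whitespace-style splitting avoids the problem because it only looks for
separator chars and passes everything else through untouched.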

Wildcard queries and similar features might also not work as you
expect. For example, the ? operator will not match a code point above
U+FFFF, because it matches a single char, but you can use ?? as a
workaround.
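
As a rough sketch of why ?? works, here is a toy matcher in plain Java
where ? consumes exactly one char (UTF-16 code unit). This is a
hypothetical illustration only, not Lucene's wildcard implementation,
and the names are invented for the example.

```java
public class CharWildcardDemo {

    /**
     * Toy matcher: '?' matches exactly one char (UTF-16 code unit),
     * every other pattern char must match literally.
     */
    public static boolean matches(String pattern, String text) {
        if (pattern.length() != text.length()) {
            return false;
        }
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            if (p != '?' && p != text.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // U+10400 is stored as two chars (a surrogate pair).
        String text = "a" + new String(Character.toChars(0x10400)) + "b";

        System.out.println(matches("a?b", text));   // false: one ? covers only half the pair
        System.out.println(matches("a??b", text));  // true: two ? cover both surrogates
    }
}
```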

On Fri, Jul 31, 2009 at 10:54 AM, Michael Thomsen<mikerthom...@gmail.com> wrote:
> Thanks for your quick response!
>
> Mike
>
> On Fri, Jul 31, 2009 at 10:25 AM, Simon
> Willnauer<simon.willna...@googlemail.com> wrote:
>> If I understand you correctly, you are asking whether Lucene can deal
>> with encodings that use more than 16 bits. Well, yes and no, but mainly
>> no. Support for Unicode 4.0 was introduced in Java 1.5, and Lucene core
>> still has back-compat requirements for Java 1.4. Lucene's analyzers
>> make use of char[] all over the place, which is a sequence of UTF-16
>> code units, not code points. As I said, support for code points was
>> introduced in 1.5, and I remember that there is an issue which aims
>> to implement support for supplementary characters (those above FFFF).
>> Such a character is represented as 2 chars, and most of the
>> analysis code will simply remove those characters.
>> Have a look at this issue:
>> https://issues.apache.org/jira/browse/LUCENE-1689 ( @ Robert are you
>> working on this?)
>>
>> I'm sure there will be support for that in Lucene 3.1.
>>
>> Simon
>> On Fri, Jul 31, 2009 at 4:08 PM, Michael Thomsen<mikerthom...@gmail.com> 
>> wrote:
>>> Is Lucene capable of handling UCS4 data natively?
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>
>



-- 
Robert Muir
rcm...@gmail.com
