Michael, just out of curiosity, did you have a particular Analyzer in
mind that you were planning to use, or were there certain features in
Lucene that you were concerned would not work with these codepoints?

On Fri, Jul 31, 2009 at 12:19 PM, Simon
Willnauer<simon.willna...@googlemail.com> wrote:
> Hey Robert, good to see that you found the link :)
>
> On Fri, Jul 31, 2009 at 6:06 PM, Robert Muir<rcm...@gmail.com> wrote:
>> Michael, as Simon mentioned, I created an issue describing where you
>> might run into trouble, at least in Lucene core.
>>
>> The low-level Lucene code handles these just fine (as surrogate pairs).
>>
>> But most analyzers run into some trouble (things like
>> WhitespaceAnalyzer are OK).
>>
>> Also, wildcard queries and similar features might not work as you
>> expect. For example, the ? operator will not match a codepoint above
>> U+FFFF, but of course you could use ?? as a workaround.
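
To make the ? behaviour above concrete, here is a minimal sketch in plain
Java. It is not Lucene's WildcardQuery code: the class WildcardSketch and
its method are hypothetical names invented for this example, and the only
assumption is that '?' consumes exactly one UTF-16 char.

```java
final class WildcardSketch {
    /**
     * Hypothetical matcher, NOT Lucene's implementation: '?' matches
     * exactly one UTF-16 code unit (one Java char); every other pattern
     * char must match literally.
     */
    static boolean matchesPerChar(String pattern, String text) {
        if (pattern.length() != text.length()) {
            return false;
        }
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            if (p != '?' && p != text.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 'a' followed by U+20000, which Java stores as a surrogate pair.
        String term = "a" + new String(Character.toChars(0x20000));
        System.out.println(matchesPerChar("a?", term));
        System.out.println(matchesPerChar("a??", term));
    }
}
```

With a term made of 'a' followed by U+20000, the pattern a? does not match
because the term is three chars long, while a?? does, which is the
workaround mentioned above.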
>>
>> On Fri, Jul 31, 2009 at 10:54 AM, Michael Thomsen<mikerthom...@gmail.com> 
>> wrote:
>>> Thanks for your quick response!
>>>
>>> Mike
>>>
>>> On Fri, Jul 31, 2009 at 10:25 AM, Simon
>>> Willnauer<simon.willna...@googlemail.com> wrote:
>>>> If I understand you correctly, you are asking whether Lucene can deal
>>>> with encodings that use more than 16 bits per character. Well, yes and
>>>> no, but mainly no. Support for Unicode 4.0 was introduced in Java 1.5,
>>>> and Lucene core still has back-compat requirements for Java 1.4.
>>>> Lucene's analyzers make use of char[] all over the place, which is a
>>>> sequence of UTF-16 code units, not code points. As I said, support for
>>>> codepoints was introduced in 1.5, and I remember that there is an issue
>>>> which aims to implement support for supplementary characters (those
>>>> above U+FFFF). Such a character is represented as 2 chars, and most of
>>>> the analysis code will simply remove those characters.
>>>> Have a look at this issue:
>>>> https://issues.apache.org/jira/browse/LUCENE-1689 ( @ Robert are you
>>>> working on this?)
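
The difference between code units and code points described above can be
sketched with the Java 1.5 Character/String API. This is only an
illustration, not code from any Lucene analyzer: the class
SupplementaryDemo and its methods are hypothetical names, and the
letters-only filter is a stand-in for the kind of char-at-a-time logic a
tokenizer might use.

```java
final class SupplementaryDemo {
    /** Counts Unicode code points; String.length() counts UTF-16 code units instead. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    /**
     * Keeps letters while inspecting one char at a time. A lone surrogate
     * is never a letter, so supplementary characters are silently dropped.
     */
    static String keepLettersPerChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetter(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Keeps letters while inspecting whole code points, so surrogate pairs stay together. */
    static String keepLettersPerCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'a', then U+20000 (a CJK Extension B ideograph), then 'b'.
        String s = "a" + new String(Character.toChars(0x20000)) + "b";
        System.out.println(s.length() + " chars, " + codePointCount(s) + " code points");
        System.out.println(keepLettersPerChar(s).length());
        System.out.println(keepLettersPerCodePoint(s).length());
    }
}
```

For the string 'a', U+20000, 'b' this gives 4 chars but 3 code points. The
per-char filter returns "ab" and loses the supplementary character, while
the per-code-point filter returns the string unchanged.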
>>>>
>>>> I'm sure there will be support for that in Lucene 3.1.
>>>>
>>>> Simon
>>>> On Fri, Jul 31, 2009 at 4:08 PM, Michael Thomsen<mikerthom...@gmail.com> 
>>>> wrote:
>>>>> Is Lucene capable of handling UCS4 data natively?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mike
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
>>
>



-- 
Robert Muir
rcm...@gmail.com
