Hi Simon,

actually I'm working with Solr from trunk but have followed the problem
all the way down to Lucene. I think Solr trunk is built with Lucene 3.0.3.
My field is:

  <field name="dcdescription" type="string" indexed="false" stored="true" />

No analysis is done at all; the content is just stored for result display.
But the result is unpredictable and can end up containing invalid UTF-8
sequences.

Regards,
Bernd

On 25.02.2011 13:43, Simon Willnauer wrote:
> On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
> <bernd.fehl...@uni-bielefeld.de> wrote:
>> Hi Simon,
>>
>> thanks for the details.
>>
>> My platform supports and uses code points above the BMP (0x10000 and up).
>> So the limit is Lucene.
>> I don't know how to handle this problem.
>> Maybe delete all code points above the BMP...?
>
> The code will work fine even if they are in your text. It will just not
> respect them, and may throw them away during tokenization etc., so it
> really depends on what you are using on the analyzer side. Maybe you can
> give us a little more detail on what you use for analysis. One option
> would be to build 3.1 from the source and use the analyzers from
> there?!
>
>>
>> Good to hear that Lucene 3.1 will come soon.
>> Any rough estimate of when Lucene 3.1 will be available?
>
> I hope it will happen within the next 4 weeks.
>
> simon
>
>>
>> Regards,
>> Bernd
>>
>> On 25.02.2011 12:04, Simon Willnauer wrote:
>>> Hey Bernd,
>>>
>>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>> Dear list,
>>>>
>>>> a very basic question about Lucene: which versions of
>>>> Unicode can be handled (indexed and searched) with Lucene?
>>>
>>> If you are asking what the indexer / query can handle, then it is
>>> really what UTF-8 can handle. Strings passed to the writer / reader
>>> are converted to UTF-8 internally (rough picture). On trunk we are
>>> indexing bytes only (UTF-8 bytes by default). So the question is
>>> really what your platform supports in terms of utilities / operations
>>> on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
>>> have the possibility to respect code points above the BMP.
>>> Lucene 2.9 still has Java 1.4 system requirements, which prevented us
>>> from moving forward to Unicode 4.0. If you look at Character.java,
>>> all methods have been converted to operate on UTF-32 code points
>>> instead of the UTF-16 code units of Java 1.4.
>>>
>>> Since 3.0 was only a Java Generics / move-to-Java-1.5 release, these
>>> APIs are not in use yet in the latest released version. Lucene 3.1
>>> holds a largely converted Analyzer / TokenFilter / Tokenizer codebase
>>> (I think there are one or two which still have problems, I should
>>> check... Robert, did we fix all the NGram stuff?).
>>>
>>> So, long story short: the 3.0 / 2.9 Tokenizers and TokenFilters only
>>> support characters within the BMP (<= 0xFFFF). 3.1 (to be released
>>> soon, I hope) will fix most of the problems and includes ICU-based
>>> analysis for full Unicode 5 support.
>>>
>>> Hope that helps,
>>>
>>> simon
>>>>
>>>> It looks like Lucene can only handle the very old Unicode 2.0,
>>>> but not the newer Unicode 3.1 (4-byte UTF-8 characters).
>>>>
>>>> Is that true?
>>>>
>>>> Regards,
>>>> Bernd
>>>>
>>
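To make the BMP limitation from the quoted thread concrete, here is a
minimal standalone sketch (plain JDK, not Lucene code; the class name and
the example code point U+10400 are chosen purely for illustration). Java
strings store supplementary characters as UTF-16 surrogate pairs, so
per-char processing, which is all a Java 1.4-era analyzer could do, sees
two surrogate halves, while the Java 1.5 code point APIs see one character:

public class SupplementaryDemo {
    public static void main(String[] args) {
        // U+10400 (DESERET CAPITAL LETTER LONG I), picked only as an
        // example of a code point above the BMP.
        String s = new String(Character.toChars(0x10400));

        System.out.println("chars: " + s.length());                            // 2
        System.out.println("code points: " + s.codePointCount(0, s.length())); // 1

        // Per-char iteration, all a Java 1.4-style tokenizer can do,
        // sees two surrogate halves instead of one character:
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            System.out.printf("char %d: U+%04X high=%b low=%b%n", i, (int) c,
                    Character.isHighSurrogate(c), Character.isLowSurrogate(c));
        }

        // Code-point-aware iteration (Java 1.5+) treats the pair as one:
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("code point U+%X isLetter=%b%n", cp,
                    Character.isLetter(cp));
            i += Character.charCount(cp);
        }
    }
}

Running it prints a char count of 2 but a code point count of 1, which is
exactly the mismatch the converted 3.1 analyzers are meant to handle.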
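And since the original symptom was invalid UTF-8 in stored-field output:
one quick sanity check for a retrieved value, again plain JDK and only a
sketch (Utf8Check is a made-up helper, not a Solr or Lucene class), is to
run it through a strict UTF-8 encoder. A typical culprit is an unpaired
surrogate left behind when a pair is split somewhere along the way:

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Utf8Check {
    // A fresh encoder reports malformed input (newEncoder() defaults to
    // CodingErrorAction.REPORT) instead of silently replacing it.
    public static boolean isValidUtf8(String s) {
        CharsetEncoder enc = Charset.forName("UTF-8").newEncoder();
        try {
            enc.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false; // e.g. an unpaired surrogate
        }
    }

    public static void main(String[] args) {
        String ok = new String(Character.toChars(0x10400)); // proper surrogate pair
        String broken = ok.substring(0, 1);                 // lone high surrogate
        System.out.println(isValidUtf8(ok));     // true
        System.out.println(isValidUtf8(broken)); // false
    }
}

With the example strings above, the intact pair encodes fine while the
lone high surrogate is reported as malformed.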
--
*************************************************************
Bernd Fehling                    Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)               Universitätsstr. 25
Tel. +49 521 106-4060            Fax. +49 521 106-4052
bernd.fehl...@uni-bielefeld.de   33615 Bielefeld
BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************