Solr trunk uses Lucene trunk, since Lucene and Solr have been merged.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de]
> Sent: Friday, February 25, 2011 2:19 PM
> To: simon.willna...@gmail.com
> Cc: java-user@lucene.apache.org
> Subject: Re: which unicode version is supported with lucene
>
> Hi Simon,
>
> actually I'm working with Solr from trunk but followed the problem all the
> way down to Lucene. I think Solr trunk is built with Lucene 3.0.3.
>
> My field is:
> <field name="dcdescription" type="string" indexed="false" stored="true" />
>
> No analysis is done at all, the content is just stored for result display.
> But the result is unpredictable and can end up as invalid UTF-8.
>
> Regards,
> Bernd
>
>
> On 25.02.2011 13:43, Simon Willnauer wrote:
> > On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
> > <bernd.fehl...@uni-bielefeld.de> wrote:
> >> Hi Simon,
> >>
> >> thanks for the details.
> >>
> >> My platform supports and uses code points above the BMP (0x10000 and up).
> >> So the limit is Lucene.
> >> I don't know how to handle this problem.
> >> Maybe delete all code points above the BMP...???
> >
> > the code will work fine even if they are in your text. It will just not
> > respect them, and may throw them away during tokenization etc., so it
> > really depends on what you are using on the analyzer side. Maybe you can
> > give us a little more detail on what you use for analysis. One option
> > would be to build 3.1 from source and use the analyzers from
> > there?!
> >
> >>
> >> Good to hear that Lucene 3.1 will come soon.
> >> Any rough estimate of when Lucene 3.1 will be available?
> >
> > I hope it will happen within the next 4 weeks
> >
> > simon
> >
> >>
> >> Regards,
> >> Bernd
> >>
> >> On 25.02.2011 12:04, Simon Willnauer wrote:
> >>> Hey Bernd,
> >>>
> >>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
> >>> <bernd.fehl...@uni-bielefeld.de> wrote:
> >>>> Dear list,
> >>>>
> >>>> a very basic question about Lucene: which version of Unicode can be
> >>>> handled (indexed and searched) with Lucene?
> >>>
> >>> if you ask what the indexer / query can handle, then it is really
> >>> what UTF-8 can handle. Strings passed to the writer / reader are
> >>> converted to UTF-8 internally (rough picture). On trunk we are
> >>> indexing bytes only (UTF-8 bytes by default), so the question is
> >>> really what your platform supports in terms of utilities / operations
> >>> on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
> >>> have the possibility to respect code points which are above the BMP.
> >>> Lucene 2.9 still has Java 1.4 system requirements, which prevented us
> >>> from moving forward to Unicode 4.0. If you look at Character.java,
> >>> the methods there now also operate on UTF-32 code points (int),
> >>> rather than only on UTF-16 code units (char) as in Java 1.4.
> >>>
> >>> Since 3.0 is a Java Generics / move to Java 1.5 only release, these
> >>> APIs are not in use yet in the latest released version. Lucene 3.1
> >>> holds a largely converted Analyzer / TokenFilter / Tokenizer
> >>> codebase (I think there are one or two which still have problems, I
> >>> should check... Robert, did we fix all the NGram stuff?).
> >>>
> >>> So long story short, the 3.0 / 2.9 Tokenizer and TokenFilter will only
> >>> support characters within the BMP (<= 0xFFFF). 3.1 (to be released
> >>> soon I hope) will fix most of the problems and includes ICU-based
> >>> analysis for full Unicode 5 support.
> >>>
> >>> hope that helps
> >>>
> >>> simon
> >>>>
> >>>> It looks like Lucene can only handle the very old Unicode 2.0 but
> >>>> not the newer 3.1 version (4-byte UTF-8 Unicode).
> >>>>
> >>>> Is that true?
> >>>>
> >>>> Regards,
> >>>> Bernd
> >>>>
> >>
>
> --
> *************************************************************
> Bernd Fehling                Universitätsbibliothek Bielefeld
> Dipl.-Inform. (FH)                        Universitätsstr. 25
> Tel. +49 521 106-4060                   Fax. +49 521 106-4052
> bernd.fehl...@uni-bielefeld.de                33615 Bielefeld
>
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> *************************************************************
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
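
To make the BMP discussion in this thread concrete, here is a minimal Java sketch of the difference between UTF-16 code units and code points that Simon describes. It is not Lucene code: the class and method names are hypothetical and chosen only for illustration. It assumes Java 7 or later for StandardCharsets (the code point methods themselves exist since Java 1.5), and U+1D50A is just an arbitrary example of a character above the BMP.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only (not part of Lucene).
class SupplementaryDemo {

    // Number of UTF-16 code units, which is what String.length() reports.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points (API available since Java 1.5).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the UTF-8 encoding of the string.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Walk the string by code point, not by char, so that a surrogate
    // pair is always seen as one unit.
    static boolean hasSupplementary(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // U+1D50A lies above the BMP, so Java stores it as a surrogate pair.
        String g = new String(Character.toChars(0x1D50A));
        System.out.println(utf16Units(g));        // 2
        System.out.println(codePoints(g));        // 1
        System.out.println(utf8Bytes(g));         // 4
        System.out.println(hasSupplementary(g));  // true
    }
}
```

Code that walks a string char by char sees the two surrogates of such a character as separate units. That is the kind of handling a BMP-only tokenizer or filter can get wrong, and splitting a pair is one plausible way to end up with text that no longer encodes to valid UTF-8.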