Hi Simon,

actually I'm working with Solr from trunk but have followed the problem
all the way down to Lucene. I think Solr trunk is built with Lucene 3.0.3.
My field is:

  <field name="dcdescription" type="string" indexed="false" stored="true" />

No analysis is done at all; the content is just stored for result display.
But the result is unpredictable and can end up containing invalid UTF-8
sequences.

Regards,
Bernd

On 25.02.2011 13:43, Simon Willnauer wrote:
> On Fri, Feb 25, 2011 at 1:02 PM, Bernd Fehling
> <bernd.fehl...@uni-bielefeld.de> wrote:
>> Hi Simon,
>>
>> thanks for the details.
>>
>> My platform supports and uses code points above the BMP (0x10000 and up).
>> So the limit is Lucene.
>> I don't know how to handle this problem.
>> Maybe delete all code points above the BMP...?
>
> The code will work fine even if they are in your text. It will just not
> respect them, and may throw them away during tokenization etc., so it
> really depends on what you are using on the analyzer side. Maybe you can
> give us a little more detail on what you use for analysis. One option
> would be to build 3.1 from the source and use the analyzers from
> there?!
>
>>
>> Good to hear that Lucene 3.1 will come soon.
>> Any rough estimate of when Lucene 3.1 will be available?
>
> I hope it will happen within the next 4 weeks.
>
> simon
>
>>
>> Regards,
>> Bernd
>>
>> On 25.02.2011 12:04, Simon Willnauer wrote:
>>> Hey Bernd,
>>>
>>> On Fri, Feb 25, 2011 at 11:23 AM, Bernd Fehling
>>> <bernd.fehl...@uni-bielefeld.de> wrote:
>>>> Dear list,
>>>>
>>>> a very basic question about Lucene: which versions of
>>>> Unicode can be handled (indexed and searched) with Lucene?
>>>
>>> If you are asking what the indexer / query can handle, then it is
>>> really what UTF-8 can handle. Strings passed to the writer / reader
>>> are converted to UTF-8 internally (rough picture). On trunk we are
>>> indexing bytes only (UTF-8 bytes by default). So the question is
>>> really what your platform supports in terms of utilities / operations
>>> on characters and strings. Since Lucene 3.0 we are on Java 1.5 and
>>> have the possibility to respect code points above the BMP.
>>> Lucene 2.9 still has Java 1.4 system requirements, which prevented us
>>> from moving forward to Unicode 4.0. If you look at Character.java,
>>> all methods have been converted to operate on UTF-32 code points
>>> instead of the UTF-16 code units of Java 1.4.
>>>
>>> Since 3.0 was only a Java Generics / move-to-Java-1.5 release, these
>>> APIs are not in use yet in the latest released version. Lucene 3.1
>>> holds a largely converted Analyzer / TokenFilter / Tokenizer codebase
>>> (I think there are one or two which still have problems, I should
>>> check... Robert, did we fix all the NGram stuff?).
>>>
>>> So, long story short: the 3.0 / 2.9 Tokenizers and TokenFilters only
>>> support characters within the BMP (<= 0xFFFF). 3.1 (to be released
>>> soon, I hope) will fix most of the problems and includes ICU-based
>>> analysis for full Unicode 5 support.
>>>
>>> Hope that helps,
>>>
>>> simon
>>>>
>>>> It looks like Lucene can only handle the very old Unicode 2.0,
>>>> but not the newer Unicode 3.1 (4-byte UTF-8 characters).
>>>>
>>>> Is that true?
>>>>
>>>> Regards,
>>>> Bernd
>>>>
>>
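To make the BMP limitation from the quoted thread concrete, here is a
minimal standalone sketch (plain JDK, not Lucene code; the class name and
the example code point U+10400 are chosen purely for illustration). Java
strings store supplementary characters as UTF-16 surrogate pairs, so
per-char processing, which is all a Java 1.4-era analyzer could do, sees
two surrogate halves, while the Java 1.5 code point APIs see one character:

public class SupplementaryDemo {
    public static void main(String[] args) {
        // U+10400 (DESERET CAPITAL LETTER LONG I), picked only as an
        // example of a code point above the BMP.
        String s = new String(Character.toChars(0x10400));

        System.out.println("chars: " + s.length());                            // 2
        System.out.println("code points: " + s.codePointCount(0, s.length())); // 1

        // Per-char iteration, all a Java 1.4-style tokenizer can do,
        // sees two surrogate halves instead of one character:
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            System.out.printf("char %d: U+%04X high=%b low=%b%n", i, (int) c,
                    Character.isHighSurrogate(c), Character.isLowSurrogate(c));
        }

        // Code-point-aware iteration (Java 1.5+) treats the pair as one:
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("code point U+%X isLetter=%b%n", cp,
                    Character.isLetter(cp));
            i += Character.charCount(cp);
        }
    }
}

Running it prints a char count of 2 but a code point count of 1, which is
exactly the mismatch the converted 3.1 analyzers are meant to handle.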
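And since the original symptom was invalid UTF-8 in stored-field output:
one quick sanity check for a retrieved value, again plain JDK and only a
sketch (Utf8Check is a made-up helper, not a Solr or Lucene class), is to
run it through a strict UTF-8 encoder. A typical culprit is an unpaired
surrogate left behind when a pair is split somewhere along the way:

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class Utf8Check {
    // A fresh encoder reports malformed input (newEncoder() defaults to
    // CodingErrorAction.REPORT) instead of silently replacing it.
    public static boolean isValidUtf8(String s) {
        CharsetEncoder enc = Charset.forName("UTF-8").newEncoder();
        try {
            enc.encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false; // e.g. an unpaired surrogate
        }
    }

    public static void main(String[] args) {
        String ok = new String(Character.toChars(0x10400)); // proper surrogate pair
        String broken = ok.substring(0, 1);                 // lone high surrogate
        System.out.println(isValidUtf8(ok));     // true
        System.out.println(isValidUtf8(broken)); // false
    }
}

With the example strings above, the intact pair encodes fine while the
lone high surrogate is reported as malformed.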
--
*************************************************************
Bernd Fehling                    Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)               Universitätsstr. 25
Tel. +49 521 106-4060            Fax. +49 521 106-4052
bernd.fehl...@uni-bielefeld.de   33615 Bielefeld
BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************