Simon, no problem. I am looking at it now. I will just post my approach and let people tear it apart / get things moving :)

On Fri, Jul 31, 2009 at 2:45 PM, Simon Willnauer<simon.willna...@googlemail.com> wrote:
> @Michael: add yourself as a Watcher for the issue.
> @Robert: I can start working on this within the next few weeks - can you help too?
>
> simon
>
> On Fri, Jul 31, 2009 at 7:49 PM, Robert Muir<rcm...@gmail.com> wrote:
>> Michael, makes sense. Most of the issues probably have some
>> workaround, so reply back if you need.
>>
>> Thanks for your feedback though, it is helpful to know that it's important!
>>
>> On Fri, Jul 31, 2009 at 1:36 PM, Michael Thomsen<mikerthom...@gmail.com> wrote:
>>> Not really. At this point, I just needed to know where the UCS4
>>> support stands. I'm reasonably familiar with the various analyzers and
>>> what they can do. It's just the state of UCS4 support that might be an
>>> issue for us.
>>>
>>> Thanks,
>>>
>>> Mike
>>>
>>> On Fri, Jul 31, 2009 at 12:25 PM, Robert Muir<rcm...@gmail.com> wrote:
>>>> Michael, just out of curiosity, did you have a particular Analyzer in
>>>> mind you were planning on using, or rather certain features in Lucene
>>>> you were concerned would work with these codepoints?
>>>>
>>>> On Fri, Jul 31, 2009 at 12:19 PM, Simon Willnauer<simon.willna...@googlemail.com> wrote:
>>>>> Hey Robert, good to see that you found the link :)
>>>>>
>>>>> On Fri, Jul 31, 2009 at 6:06 PM, Robert Muir<rcm...@gmail.com> wrote:
>>>>>> Michael, as Simon mentioned I created an issue describing where you
>>>>>> might run into trouble, at least in lucene core.
>>>>>>
>>>>>> The low-level lucene stuff treats these just fine (as surrogate
>>>>>> pairs).
>>>>>>
>>>>>> But most analyzers run into some trouble. (things like
>>>>>> WhitespaceAnalyzer are ok)
>>>>>>
>>>>>> Also wildcard queries and some things like that might not work as you
>>>>>> expect, for example the ? operator will not match a codepoint > FFFF, but
>>>>>> of course you could use ?? as a workaround.
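
For anyone following along, here is a rough sketch of the surrogate-pair behaviour described above. It uses only plain JDK 1.5+ calls and no Lucene API at all; the class name SurrogateDemo and its methods are made up for illustration, and this is not the LUCENE-1689 approach. It only shows why code that walks a char[] one char at a time sees a codepoint above FFFF as two non-letter units (which would also be consistent with needing ?? rather than ? for one such character).

```java
public class SurrogateDemo {
    // U+10400 DESERET CAPITAL LETTER LONG I: a letter above FFFF,
    // so UTF-16 stores it as a surrogate pair (two chars).
    static final String SUPPLEMENTARY = new String(Character.toChars(0x10400));

    /** Number of UTF-16 code units, i.e. what char[] and String.length() count. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Counts letters one char at a time; each surrogate half is not a letter. */
    static int lettersByChar(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    /** Counts letters one code point at a time, stepping over surrogate pairs. */
    static int lettersByCodePoint(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                n++;
            }
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "a" + SUPPLEMENTARY;
        System.out.println("code units:            " + codeUnits(text));
        System.out.println("code points:           " + codePoints(text));
        System.out.println("letters by char:       " + lettersByChar(text));
        System.out.println("letters by code point: " + lettersByCodePoint(text));
    }
}
```

The char-based loop is only a stand-in for the kind of per-char test a tokenizer might do; the real analyzers differ in detail.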
>>>>>>
>>>>>> On Fri, Jul 31, 2009 at 10:54 AM, Michael Thomsen<mikerthom...@gmail.com> wrote:
>>>>>>> Thanks for your quick response!
>>>>>>>
>>>>>>> Mike
>>>>>>>
>>>>>>> On Fri, Jul 31, 2009 at 10:25 AM, Simon Willnauer<simon.willna...@googlemail.com> wrote:
>>>>>>>> If I understand you correctly you are asking if lucene can deal with
>>>>>>>> encodings that use more than 16 bits. Well yes and no but mainly no.
>>>>>>>> The support for unicode 4.0 was introduced in Java 1.5 and lucene core
>>>>>>>> still has back-compat requirements for java 1.4. Lucene's analyzers
>>>>>>>> make use of char[] all over the place, which is a sequence of UTF-16
>>>>>>>> code units, not code points. As I said the support for codepoints was
>>>>>>>> introduced in 1.5 and I can remember that there is an issue which aims
>>>>>>>> to implement support for supplementary characters (those above FFFF).
>>>>>>>> Such a character is represented as 2 chars and most of the
>>>>>>>> analysis code will simply remove those characters.
>>>>>>>> Have a look at this issue:
>>>>>>>> https://issues.apache.org/jira/browse/LUCENE-1689 ( @ Robert are you
>>>>>>>> working on this?)
>>>>>>>>
>>>>>>>> I'm sure there will be support for that in lucene 3.1.
>>>>>>>>
>>>>>>>> Simon
>>>>>>>> On Fri, Jul 31, 2009 at 4:08 PM, Michael Thomsen<mikerthom...@gmail.com> wrote:
>>>>>>>>> Is Lucene capable of handling UCS4 data natively?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Mike
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Robert Muir
rcm...@gmail.com