Not really. At this point, I just needed to know where the UCS4 support stands. I'm reasonably familiar with the various analyzers and what they can do. It's just the state of UCS4 support that might be an issue for us.
Thanks, Mike On Fri, Jul 31, 2009 at 12:25 PM, Robert Muir<rcm...@gmail.com> wrote: > Michael just out of curiousity, did you have a particular Analyzer in > mind you were planning on using, or rather certain features in Lucene > you were concerned would work with these codepoints? > > On Fri, Jul 31, 2009 at 12:19 PM, Simon > Willnauer<simon.willna...@googlemail.com> wrote: >> Hey Robert, good to see that you found the link :) >> >> On Fri, Jul 31, 2009 at 6:06 PM, Robert Muir<rcm...@gmail.com> wrote: >>> Michael, as Simon mentioned I created an issue describing where you >>> might run into trouble, at least in lucene core. >>> >>> The low-level lucene stuff, it treats these just fine (as surrogate pairs). >>> >>> But most analyzers run into some trouble. (things like >>> WhitespaceAnalyzer are ok) >>> >>> Also wildcard queries and some things like that might not work as you >>> expect, for example ? operator will not match a codepoint > FFFF, but >>> of course you could use ?? as a workaround. >>> >>> On Fri, Jul 31, 2009 at 10:54 AM, Michael Thomsen<mikerthom...@gmail.com> >>> wrote: >>>> Thanks for your quick response! >>>> >>>> Mike >>>> >>>> On Fri, Jul 31, 2009 at 10:25 AM, Simon >>>> Willnauer<simon.willna...@googlemail.com> wrote: >>>>> If I understand you correctly you are asking if lucene can deal with >>>>> encodings that use more than 16 bit. Well yes and no but mainly no. >>>>> The support for unicode 4.0 was introduced in Java 1.5 and lucene core >>>>> has still back-compat requirements for java 1.4. Lucene's analyzers >>>>> make use of char[] all over the place which is a sequence of UTF-16 >>>>> code unit not a code point. As I said the support for codepoints was >>>>> introduced in 1.5 and I can remember that there is an issue which aims >>>>> to implement support for upplementary characters (those above FFFF). >>>>> Such a character is represented as 2 chars and the most of the >>>>> analysis code will simply remove those characters. >>>>> Have a look at this issue: >>>>> https://issues.apache.org/jira/browse/LUCENE-1689 ( @ Robert are you >>>>> working on this?) >>>>> >>>>> I'm sure there will be support for that in lucene 3.1. >>>>> >>>>> Simon >>>>> On Fri, Jul 31, 2009 at 4:08 PM, Michael Thomsen<mikerthom...@gmail.com> >>>>> wrote: >>>>>> Is Lucene capable of handling UCS4 data natively? >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Mike >>>>>> >>>>>> --------------------------------------------------------------------- >>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>>> >>>>>> >>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>> >>>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>>> >>> >>> >>> >>> -- >>> Robert Muir >>> rcm...@gmail.com >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > > > -- > Robert Muir > rcm...@gmail.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org