Hello, I think that if you analyze the text correctly, your highlighting will work too. What you need is an analyzer that tokenizes your text correctly; once you have that, I think everything else will work.
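To make this concrete, here is a minimal sketch of the three steps outlined further down in this mail (segmentation, normalization, case folding). It is not my implementation and it does not use ICU: it assumes the JDK's own java.text.BreakIterator and java.text.Normalizer as stand-ins, picks NFKC as the normalization form as an assumption, and approximates case folding by upper-casing and then lower-casing, which is close to but not the same as Unicode full case folding. The class and method names are made up for illustration.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; the names are for illustration only.
class UnicodeDefaultsSketch {

    // Segmentation: split at the word boundaries the JDK BreakIterator reports,
    // keeping only segments that contain a letter or digit (drops spaces and punctuation).
    static List<String> words(String text) {
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
        boundaries.setText(text);
        List<String> result = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    // Normalization: NFKC is an assumption here; pick the form that suits your index.
    static String normalize(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Case folding, approximated: upper-case then lower-case, locale-independent.
    // This is NOT Unicode full case folding, only a rough JDK-only substitute.
    static String foldApprox(String text) {
        return text.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Why letter-based tokenizers split Hindi words: these marks are not "letters".
        System.out.println(Character.isLetter('\u092B')); // DEVANAGARI LETTER PHA
        System.out.println(Character.isLetter('\u093C')); // DEVANAGARI SIGN NUKTA
        System.out.println(Character.isLetter('\u093F')); // DEVANAGARI VOWEL SIGN I

        // Word segments for a short Hindi phrase; inspect the output for your JDK.
        System.out.println(words("\u0939\u093F\u0928\u094D\u0926\u0940 \u092B\u093C\u093F\u0932\u094D\u092E"));

        // Precomposed FA (U+095E) and PHA + NUKTA (U+092B U+093C) normalize to the same string.
        System.out.println(normalize("\u095E").equals(normalize("\u092B\u093C")));

        // Plain lower-casing leaves the sharp s alone; the folding approximation does not.
        System.out.println("Stra\u00DFe".toLowerCase(Locale.ROOT));
        System.out.println(foldApprox("Stra\u00DFe"));
    }
}
```

If you go this way in Lucene, the same three steps would need to run at index time, at query time and in the highlighter, so that all three see identical tokens.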
Here's a short intro with some links.

You can get code that applies these algorithms here: http://site.icu-project.org/
None of it is too complex unless you need high performance; then it gets a bit tricky, which is why my code is not ready yet :(

segmentation (tokenization):
Basically, each character in Unicode has default word-break properties defined. Applying them will break your Hindi words correctly. SimpleAnalyzer and StandardAnalyzer incorrectly break words around combining marks such as your Hindi dependent vowels and the nukta dot, because the isLetter(x) property happens to be false for those characters.

"It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. The goal for the specification presented in this annex is to provide a workable default; tailored implementations can be more sophisticated."
http://www.unicode.org/reports/tr29/tr29-13.html
This is what you get if you apply BreakIterator.

For a "demo", put some text into Windows Notepad and start double-clicking. The way words are highlighted by your mouse selection is basically what we are talking about here.

normalization:
"For round-trip compatibility with existing standards, Unicode has encoded many entities that are really variants of the same abstract character."
This is the part that will ensure your PHA + NUKTA DOT and FA are treated the same.
http://www.unicode.org/reports/tr15/tr15-29.html
This is what you get if you apply Normalizer.

case folding:
Case folding is a special mapping which, if applied, erases case differences. This is different from lower-casing; for example, 'ß' maps to 'ss'.
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf (page 61)
This is what you get if you apply UCharacter.foldCase.

On Fri, May 22, 2009 at 12:38 AM, KK <dioxide.softw...@gmail.com> wrote:

> Thank you all.
> @Muir
> Thanks for sharing your views.
> I'ld like to have some more details on the process you mentioned as I've
> absolutely no idea on this highlighting stuffs, could not make much out of
> our mail. Can you point me to some tutorials/good write ups on the same, if
> you have some write ups on the same, do give me the pointers. it'll help me
> a lot.
> Pointers to the unicode default algorithms mentioned in your mail will be
> equally helpful.
>
> Thanks,
> KK.
>
> On Thu, May 21, 2009 at 8:03 PM, Robert Muir <rcm...@gmail.com> wrote:
>
> > its definitely an area in lucene that could use some improvement.
> >
> > my recommendation for multilingual text is to apply the unicode "default"
> > algorithms:
> >
> > Tokenize text according to UAX #29: unicode text segmentation
> > Apply full case-folding (unicode ch. 3.13) with FC_NFKC closure
> > Apply UAX #15: unicode normalization
> >
> > for now you will have to write code to do this, but i'm looking forward to
> > contributing my implementation soon.
> >
> > i definitely feel your pain.
> >
> > On Thu, May 21, 2009 at 9:12 AM, Joel Halbert <j...@su3analytics.com> wrote:
> >
> > > > If I index english pages
> > > > with the same indexer, it will not take care of stemming and stop word
> > > > removal?
> > >
> > > correct
> > >
> > > > Cant we have a single indexer that handles non-eng and eng in
> > > > equally good ways?
> > >
> > > You can have a single indexer, but, if you wanted to use one Analyzer for
> > > English documents (with stemming/stops) and another analyzer for other
> > > language documents
> > > then you would need to know, at the point of both *indexing* and *querying*
> > > what language your indexed document and your query were in.
> > >
> > > This makes the assumption that when a query is in English you only want to
> > > query English lang docs, and vica versa.
> > > You would also have to mark up your documents with a language identifier
> > > (i.e.
> > > 0=English, 1=Other Languages) so that when you query you have a
> > > conditional on the language.
> > >
> > > I've not had to deal with multi-language documents though - so I'm sure
> > > others will be better placed to offer their experience.
> > >
> > > -----Original Message-----
> > > From: KK <dioxide.softw...@gmail.com>
> > > Reply-To: java-user@lucene.apache.org
> > > To: java-user@lucene.apache.org
> > > Subject: Re: hit highlighting in lucene ?
> > > Date: Thu, 21 May 2009 18:31:44 +0530
> > >
> > > Initially I was using standardAnalyzer but I switched to simpleAnalyzer
> > > which I guess doesnot do more that tokenizing[and may be tokenizing] and I
> > > think this does not do stemming which I dont/cant do because I've no
> > > stemmer for the languages I'm indexing.
> > > For indexing and querring I'm using the same SimpelAnalyzer. So as you say
> > > I can go for the standard highlighter api which I mentioned in my last
> > > mail, and this will handle any language for highlighting support. I should
> > > start using this one, right?
> > >
> > > One more thing. I've a single indexer and searcher that I'm usign for
> > > indexing pages of many different non-english languages and as I mentioned
> > > earier I'm using simpleAnalyzer, does that mean If I index english pages
> > > with the same indexer, it will not take care of stemming and stop word
> > > removal? But I dont want to have multiple indexer that is specific to
> > > languages. Cant we have a single indexer that handles non-eng and eng in
> > > equally good ways? Or any other ideas on the same ?
> > >
> > > Thanks,
> > > KK.
> > >
> > > On Thu, May 21, 2009 at 6:18 PM, Joel Halbert <j...@su3analytics.com> wrote:
> > >
> > > > The highlighter should be language independent. So long as you are
> > > > consistent with your use of Analyzer between
> > > > indexing/query/highlighting.
> > > >
> > > > As for the most appropriate Analyzer to use for your local language,
> > > > this is a seperate question - especially if you are using stop word and
> > > > stemming filters.
> > > >
> > > > The StandardAnalyzer is designed for English since it used the
> > > > StopFilter (English words only).
> > > >
> > > > -----Original Message-----
> > > > From: KK <dioxide.softw...@gmail.com>
> > > > Reply-To: java-user@lucene.apache.org
> > > > To: java-user@lucene.apache.org
> > > > Subject: hit highlighting in lucene ?
> > > > Date: Thu, 21 May 2009 17:51:13 +0530
> > > >
> > > > Hi All,
> > > > I was looking for various ways of implementing hit highlighting in Lucene
> > > > and found some standard classes that does support highlighting like this
> > > > lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/highlight/package-summary.html
> > > >
> > > > ik but what i believe is that this is only for english or does it support
> > > > other languages. I actually wanted to support highlighting for some
> > > > non-english languages which I'm able to index and fetch using utf-8
> > > > encoding. So this means that if I want to have highlighting then I've to
> > > > get the utf-8 query and look for the same in the result and add apt tags
> > > > whereever required, it essentially boils down to implementing the
> > > > standard highlighter. I think the standard highlighter also supports
> > > > other languages.
> > > > Correct me if i'm wrong.
> > > >
> > > > Due to my requirement constraints I'm using just simpleAnalyzer and we
> > > > dont have tokenizers for these regional languages. Any other ideas of
> > > > doing the same would be helpful as well.
> > > >
> > > > Thanks,
> > > > KK.
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
>

--
Robert Muir
rcm...@gmail.com