its definitely an area in lucene that could use some improvement. my recommendation for multilingual text is to apply the unicode "default" algorithms:
Tokenize text according to UAX #29: unicode text segmentation Apply full case-folding (unicode ch. 3.13) with FC_NFKC closure Apply UAX #15: unicode normalization for now you will have to write code to do this, but i'm looking forward to contributing my implementation soon. i definitely feel your pain. On Thu, May 21, 2009 at 9:12 AM, Joel Halbert <j...@su3analytics.com> wrote: > > > If I index english pages > > with the same indexer, it will not take care of stemming and stop word > > removal? > > correct > > > > Cant we have a single indexer that handles non-eng and eng in > > equally good ways? > > You can have a single indexer, but, if you wanted to use one Analyzer for > English documents (with stemming/stops) and another analyzer for other > language documents > then you would need to know, at the point of both *indexing* and *querying* > what language your indexed document and your query were in. > > This makes the assumption that when a query is in English you only want to > query English lang docs, and vica versa. > You would also have to mark up your documents with a language identifier > (i.e. 0=English, 1=Other Languages) so that when you query you have a > conditional on the language. > > > > I've not had to deal with multi-language documents though - so I'm sure > others will be better placed to offer their experience. > > > > -----Original Message----- > From: KK <dioxide.softw...@gmail.com> > Reply-To: java-user@lucene.apache.org > To: java-user@lucene.apache.org > Subject: Re: hit highlighting in lucene ? > Date: Thu, 21 May 2009 18:31:44 +0530 > > Initially I was using standardAnalyzer but I switched to simpleAnalyzer > which I guess doesnot do more that tokenizing[and may be tokenizing] and I > think this does not do stemming which I dont/cant do because I've no > stemmer > for the languages I'm indexing. > For indexing and querring I'm using the same SimpelAnalyzer. So as you say > I > can go for the standard highlighter api which I mentioned in my last mail, > and this will handle any language for highlighting support. I should start > using this one, right? > > One more thing. I've a single indexer and searcher that I'm usign for > indexing pages of many different non-english languages and as I mentioned > earier I'm using simpleAnalyzer, does that mean If I index english pages > with the same indexer, it will not take care of stemming and stop word > removal? But I dont want to have multiple indexer that is specific to > languages. Cant we have a single indexer that handles non-eng and eng in > equally good ways? Or any other ideas on the same ? > > Thanks, > KK. > > On Thu, May 21, 2009 at 6:18 PM, Joel Halbert <j...@su3analytics.com> > wrote: > > > The highlighter should be language independent. So long as you are > > consistent with your use of Analyzer between > > indexing/query/highlighting. > > > > As for the most appropriate Analyzer to use for your local language, > > this is a seperate question - especially if you are using stop word and > > stemming filters. > > > > The StandardAnalyzer is designed for English since it used the > > StopFilter (English words only). > > > > > > -----Original Message----- > > From: KK <dioxide.softw...@gmail.com> > > Reply-To: java-user@lucene.apache.org > > To: java-user@lucene.apache.org > > Subject: hit highlighting in lucene ? > > Date: Thu, 21 May 2009 17:51:13 +0530 > > > > Hi All, > > I was looking for various ways of implementing hit highlighting in Lucene > > and found some standard classes that does support highlighting like this > > *lucene*. > apache.org/java/2_2_0/api/org/apache/*lucene*/search/*highlight* > > /package-summary.html< > http://apache.org/java/2_2_0/api/org/apache/*lucene*/search/*highlight*%0A/package-summary.html > > > > > > ik but what i believe is that this is only for english or does it support > > other languages. I actually wanted to support highlighting for some > > non-english languages which I'm able to index and fetch using utf-8 > > encoding. So this means that if I want to have highlighting then I've to > > get the utf-8 query and look for the same in the result and add apt tags > > whereever required, it essentially boils down to implementing the > standard > > highlighter. I think the standard highlighter also supports other > > languages. > > Correct me if i'm wrong. > > > > Due to my requirement constraints I'm using just simpleAnalyzer and we > dont > > have tokenizers for these regional languages. Any other ideas of doing > the > > same would be helpful as well. > > > > Thanks, > > KK. > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com