Uwe, what KK needs here is proper Unicode handling. Since the latest WordDelimiterFilter has pretty good handling of Unicode categories, combining it with WhitespaceTokenizer effectively gives you a good solution for Unicode tokenization.
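[Editor's note: a minimal sketch of the chain described above, written against the Lucene 2.x / Solr 1.3-era APIs. The class name is illustrative, and the int-flag constructor of WordDelimiterFilter changed across Solr versions, so treat the argument list as an assumption, not the definitive signature.]

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.solr.analysis.WordDelimiterFilter;

    /** Whitespace tokenization plus Unicode-aware sub-splitting. */
    public class UnicodeTokenizingAnalyzer extends Analyzer {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split on whitespace only; WordDelimiterFilter then sub-splits
        // each token on Unicode category transitions (letter/digit
        // boundaries, case changes, intra-word delimiters).
        TokenStream stream = new WhitespaceTokenizer(reader);
        // Assumed Solr 1.3-era constructor flags, in order:
        // generateWordParts=1, generateNumberParts=1,
        // catenateWords=0, catenateNumbers=0, catenateAll=0
        stream = new WordDelimiterFilter(stream, 1, 1, 0, 0, 0);
        return stream;
      }
    }

Dropping this class into an IndexWriter or QueryParser in place of StandardAnalyzer keeps the whole chain dependent only on the Lucene and Solr core jars.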
KK doesn't need detection of anything: the Porter stem filter will simply leave the Indic text alone, so it will just work.

On Thu, Jun 4, 2009 at 8:40 AM, Uwe Schindler <u...@thetaphi.de> wrote:

> > I request Uwe to give me some more ideas on using the analyzers from Solr
> > that will do the job for me, handling a mix of both English and
> > non-English content.
>
> Look here:
>
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html
>
> As you see, the Solr analyzers are just standard Lucene analyzers, so you
> can drop the Solr core jar into your project and just use them :-)
>
> Currently I am not sure which analyzer Robert means that can do English
> stemming while leaving the non-English parts alone, but you can look for
> it there.
>
> Uwe

--
Robert Muir
rcm...@gmail.com
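[Editor's note: a hedged sketch of the pass-through behavior Robert describes, again against the Lucene 2.x Analyzer API; the class name is mine, not from the thread. The Porter stemmer's suffix rules are defined only over the letters a-z, so tokens in Devanagari or other Indic scripts come out unchanged and no language-detection step is needed.]

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    /** English stemming over mixed-language text. */
    public class MixedLanguageAnalyzer extends Analyzer {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new WhitespaceTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        // Rewrites only tokens matching English suffix patterns;
        // Indic tokens pass through this filter untouched.
        stream = new PorterStemFilter(stream);
        return stream;
      }
    }

With this chain, a mixed input such as "running" followed by a Hindi word should yield the stemmed token "run" alongside the Hindi token exactly as it appeared in the text.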