its not too bad, here would be a simple one that only breaks words on whitespace and lowercases:
public class Example extends Analyzer { public TokenStream tokenStream(String fieldName, Reader reader) { TokenStream ts = new WhitespaceTokenizer(reader); ts = new LowerCaseFilter(ts); return ts; } } can you give a better idea as to what languages you have and what your search requirements are (accent marks, punctuation, etc etc) ? On Mon, Jun 15, 2009 at 5:39 PM, OBender Hotmail<osya_ben...@hotmail.com> wrote: > I've looked over SolR quickly, it is a bit too heavy for my project. > So what is required (at a minimum) to build an analyzer, sandbox has a few of > them varying in complexity. > > -----Original Message----- > From: Robert Muir [mailto:rcm...@gmail.com] > Sent: Monday, June 15, 2009 4:51 PM > To: java-user@lucene.apache.org > Subject: Re: Lucene and multi-lingual Unicode - advice needed > > Well just reply back if SolR is inappropriate for your needs. > > In that case, you will need to build a custom analyzer (its not too > bad), so that you can use compass. > > On Mon, Jun 15, 2009 at 4:19 PM, OBender Hotmail<osya_ben...@hotmail.com> > wrote: >> Hi, >> >> My goal is to find a framework that encapsulates as much low level >> indexing/search technology as possible and have it integrate nicely with >> Spring. >> It looked like Compass was/is a good encapsulation of the functionality. >> I'll take a look at SolR though, thanks for the pointer. >> >> -----Original Message----- >> From: Robert Muir [mailto:rcm...@gmail.com] >> Sent: Monday, June 15, 2009 1:14 PM >> To: java-user@lucene.apache.org >> Subject: Re: Lucene and multi-lingual Unicode - advice needed >> >> Hi, >> >> (Since this is an issue you brought up on the Compass forums) >> >> I wonder what stage you are in the development process? >> Have you considered SolR, or does compass provide some other >> functionality that you need? >> >> The reason I say this, is because the easiest solution might be to use >> a nightly SolR for your application. >> >> I'm not personally biased one way or the other for any particular >> framework, but recently there has been some improvements added to SolR >> so that the default type 'text' is pretty good for multilingual >> processing. >> >> In fact I hope in the future it will be improved in lucene so that >> your decision is really based upon other application needs... >> >> On Mon, Jun 15, 2009 at 1:10 PM, OBender Hotmail<osya_ben...@hotmail.com> >> wrote: >>> Hi All! >>> >>> >>> >>> I'm new to Lucene so forgive me if this question was asked before. >>> >>> I have a database with records in the same table in many different languages >>> (up to 70) it includes all W-European, Arabic, Eastern, CJK, Cyrillic, etc. >>> you name it. >>> I've looked at what people say about Lucene and it looks like for the most >>> part standard analyzers should do fine with most Unicode languages but there >>> are quite a few exceptions. >>> Here is some recently updated Lucene Jira thread: >>> https://issues.apache.org/jira/browse/LUCENE-1488 >>> >>> My question is, what would be the safest bet for me in terms of >>> analyzers/tokenizers? >>> Do I really have to write my own ones for the bunch of languages that are >>> not supported? >>> Did anyone already solve the problem similar to mine? I'm sure someone >>> already did :) >>> >>> And yes, I looked at the Lucene sandbox analyzers. It just adds more >>> confusion. For example why there analyzers for DE and FR? Wouldn't the >>> standard analyzer (which is Unicode complaint as I understood) deal with EU >>> languages just fine? >>> >>> Thanks in advance for advices :) >>> >>> >> >> >> >> -- >> Robert Muir >> rcm...@gmail.com >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > > > -- > Robert Muir > rcm...@gmail.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org