Really, you have a requirement that the system should search written Cornish?
I think you might have larger problems! On Mon, Jun 15, 2009 at 9:18 PM, OBender Hotmail<osya_ben...@hotmail.com> wrote: > Here is the list of possible languages. Don't laugh :) I know those are > almost all world languages but it is a true requirement. Well, actual number > will be closer to 70 not 100 but still I don't really know which ones from > the list below will end up in the DB. > > ------- > Afrikaans Albanian Arabic Armenian Austrian Aymara Azerbaijani > Basque Belorussian Bemba Bengali Blackfoot Bosnian Breton Bulgarian > Canadian French Catalan Cebuano Chamorro Chinese Chechen Cornish Croatian > Czech > Danish Dutch > Ecuadorian Quechua English English-Portuguese Esperanto Estonian > Faroese Farsi Finnish Flemish French Frisian > Galician Georgian German Greek Guarani > Haitian Creole Hausa Hawaiian Hebrew Hindi Hungarian Icelandic Indonesian > Inuktitut Irish Italian > Japanese > Kazakh Kongo Korean > Latin Latvian Lithuanian Luganda Luxembourgish > Macedonian Malagasy Malay Maori Maya Mohawk Mongolian > Nahuatl Norwegian > Papago Pashto Pidgin English Polish Portuguese (European) > Portuguese (Brazilian) Provençal > Quechua > Romanian Romansch Romany Ruanda Russian > Samoan Scottish Sepedi Serbian Shona Sicilian Slovak Slovene Somali > Sorbian Sotho Spanish Swahili Swazi Swedish > Tagalog Tahitian Thai Tongan Tswana Turkish Turkmen Tuvan > Ukrainian Urdu Uzbek Vietnamese > Welsh Wolof > Xhosa > Yiddish Yoruba > Zulu > > -----Original Message----- > From: Robert Muir [mailto:rcm...@gmail.com] > Sent: Monday, June 15, 2009 5:56 PM > To: java-user@lucene.apache.org > Subject: Re: Lucene and multi-lingual Unicode - advice needed > > its not too bad, here would be a simple one that only breaks words on > whitespace and lowercases: > > public class Example extends Analyzer { > public TokenStream tokenStream(String fieldName, Reader reader) { > TokenStream ts = new WhitespaceTokenizer(reader); > ts = new LowerCaseFilter(ts); > return ts; > } > } > > can you give a better idea as to what languages you have and what your > search requirements are (accent marks, punctuation, etc etc) ? > > On Mon, Jun 15, 2009 at 5:39 PM, OBender Hotmail<osya_ben...@hotmail.com> > wrote: >> I've looked over SolR quickly, it is a bit too heavy for my project. >> So what is required (at a minimum) to build an analyzer, sandbox has a few >> of them varying in complexity. >> >> -----Original Message----- >> From: Robert Muir [mailto:rcm...@gmail.com] >> Sent: Monday, June 15, 2009 4:51 PM >> To: java-user@lucene.apache.org >> Subject: Re: Lucene and multi-lingual Unicode - advice needed >> >> Well just reply back if SolR is inappropriate for your needs. >> >> In that case, you will need to build a custom analyzer (its not too >> bad), so that you can use compass. >> >> On Mon, Jun 15, 2009 at 4:19 PM, OBender Hotmail<osya_ben...@hotmail.com> >> wrote: >>> Hi, >>> >>> My goal is to find a framework that encapsulates as much low level >>> indexing/search technology as possible and have it integrate nicely with >>> Spring. >>> It looked like Compass was/is a good encapsulation of the functionality. >>> I'll take a look at SolR though, thanks for the pointer. >>> >>> -----Original Message----- >>> From: Robert Muir [mailto:rcm...@gmail.com] >>> Sent: Monday, June 15, 2009 1:14 PM >>> To: java-user@lucene.apache.org >>> Subject: Re: Lucene and multi-lingual Unicode - advice needed >>> >>> Hi, >>> >>> (Since this is an issue you brought up on the Compass forums) >>> >>> I wonder what stage you are in the development process? >>> Have you considered SolR, or does compass provide some other >>> functionality that you need? >>> >>> The reason I say this, is because the easiest solution might be to use >>> a nightly SolR for your application. >>> >>> I'm not personally biased one way or the other for any particular >>> framework, but recently there has been some improvements added to SolR >>> so that the default type 'text' is pretty good for multilingual >>> processing. >>> >>> In fact I hope in the future it will be improved in lucene so that >>> your decision is really based upon other application needs... >>> >>> On Mon, Jun 15, 2009 at 1:10 PM, OBender Hotmail<osya_ben...@hotmail.com> >>> wrote: >>>> Hi All! >>>> >>>> >>>> >>>> I'm new to Lucene so forgive me if this question was asked before. >>>> >>>> I have a database with records in the same table in many different >>>> languages >>>> (up to 70) it includes all W-European, Arabic, Eastern, CJK, Cyrillic, etc. >>>> you name it. >>>> I've looked at what people say about Lucene and it looks like for the most >>>> part standard analyzers should do fine with most Unicode languages but >>>> there >>>> are quite a few exceptions. >>>> Here is some recently updated Lucene Jira thread: >>>> https://issues.apache.org/jira/browse/LUCENE-1488 >>>> >>>> My question is, what would be the safest bet for me in terms of >>>> analyzers/tokenizers? >>>> Do I really have to write my own ones for the bunch of languages that are >>>> not supported? >>>> Did anyone already solve the problem similar to mine? I'm sure someone >>>> already did :) >>>> >>>> And yes, I looked at the Lucene sandbox analyzers. It just adds more >>>> confusion. For example why there analyzers for DE and FR? Wouldn't the >>>> standard analyzer (which is Unicode complaint as I understood) deal with EU >>>> languages just fine? >>>> >>>> Thanks in advance for advices :) >>>> >>>> >>> >>> >>> >>> -- >>> Robert Muir >>> rcm...@gmail.com >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >> >> >> >> -- >> Robert Muir >> rcm...@gmail.com >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > > > -- > Robert Muir > rcm...@gmail.com > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > No virus found in this incoming message. > Checked by AVG - www.avg.com > Version: 8.5.339 / Virus Database: 270.12.62/2168 - Release Date: 06/15/09 > 17:54:00 > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > -- Robert Muir rcm...@gmail.com --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org