Re: Lucene and multi-lingual Unicode - advice needed

Robert Muir Mon, 15 Jun 2009 19:45:45 -0700

ok, well at first i thought you must be playing a joke on me or something...


Maybe you want to create a lucene analyzer that mimic's solr defaults.

Search the mail archives for this recent thread, and KK posted his code:
Re: How to support stemming and case folding for english content mixed
with non-english content?

Then again, maybe the sample code i gave you (whitespace + lowercase)
is good enough. By the time the company in question manages to get its
Chamorro, Cornish, Blackfoot, and Pashto testers together to evaluate
the search you will be retired :)


On Mon, Jun 15, 2009 at 10:30 PM, OBender
Hotmail<osya_ben...@hotmail.com> wrote:
> That's the thing there is no actual requirement.
> I've been presented with all the languages that company theoretically 
> provides.
> My guess is that what I'm going to end up with is all western languages, good 
> share of Arabic family, complete set of Eastern and Eastern European ones and 
> of course CJK.
>
> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Monday, June 15, 2009 9:52 PM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>
> Really, you have a requirement that the system should search written Cornish?
>
> I think you might have larger problems!
>
> On Mon, Jun 15, 2009 at 9:18 PM, OBender Hotmail<osya_ben...@hotmail.com> 
> wrote:
>> Here is the list of possible languages. Don't laugh :) I know those are 
>> almost all world languages but it is a true requirement. Well, actual number 
>> will be closer to 70 not 100 but still I don't really know which ones from 
>> the list below will end up in the DB.
>>
>> -------
>> Afrikaans  Albanian Arabic Armenian Austrian Aymara Azerbaijani
>> Basque Belorussian Bemba Bengali Blackfoot Bosnian Breton Bulgarian     
>> Canadian French  Catalan Cebuano Chamorro Chinese Chechen Cornish Croatian 
>> Czech
>> Danish Dutch
>> Ecuadorian Quechua  English  English-Portuguese Esperanto Estonian
>> Faroese Farsi Finnish Flemish French Frisian
>> Galician Georgian German Greek Guarani
>> Haitian Creole  Hausa Hawaiian Hebrew Hindi Hungarian Icelandic Indonesian   
>>   Inuktitut Irish Italian
>> Japanese
>> Kazakh Kongo Korean
>> Latin Latvian Lithuanian Luganda Luxembourgish
>> Macedonian Malagasy Malay Maori Maya Mohawk Mongolian
>> Nahuatl Norwegian
>> Papago Pashto Pidgin English Polish Portuguese (European)
>> Portuguese (Brazilian) Provencal
>> Quechua
>> Romanian Romansch Romany Ruanda Russian
>> Samoan Scottish Sepedi Serbian Shona Sicilian Slovak Slovene Somali    
>> Sorbian Sotho Spanish Swahili Swazi Swedish
>> Tagalog Tahitian Thai Tongan Tswana Turkish Turkmen Tuvan
>> Ukrainian Urdu Uzbek Vietnamese
>> Welsh Wolof
>> Xhosa
>> Yiddish Yoruba
>> Zulu
>>
>> -----Original Message-----
>> From: Robert Muir [mailto:rcm...@gmail.com]
>> Sent: Monday, June 15, 2009 5:56 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>>
>> its not too bad, here would be a simple one that only breaks words on
>> whitespace and lowercases:
>>
>> public class Example extends Analyzer {
>>  public TokenStream tokenStream(String fieldName, Reader reader) {
>>   TokenStream ts = new WhitespaceTokenizer(reader);
>>   ts = new LowerCaseFilter(ts);
>>   return ts;
>>  }
>> }
>>
>> can you give a better idea as to what languages you have and what your
>> search requirements are (accent marks, punctuation, etc etc) ?
>>
>> On Mon, Jun 15, 2009 at 5:39 PM, OBender Hotmail<osya_ben...@hotmail.com> 
>> wrote:
>>> I've looked over SolR quickly, it is a bit too heavy for my project.
>>> So what is required (at a minimum) to build an analyzer, sandbox has a few 
>>> of them varying in complexity.
>>>
>>> -----Original Message-----
>>> From: Robert Muir [mailto:rcm...@gmail.com]
>>> Sent: Monday, June 15, 2009 4:51 PM
>>> To: java-user@lucene.apache.org
>>> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>>>
>>> Well just reply back if SolR is inappropriate for your needs.
>>>
>>> In that case, you will need to build a custom analyzer (its not too
>>> bad), so that you can use compass.
>>>
>>> On Mon, Jun 15, 2009 at 4:19 PM, OBender Hotmail<osya_ben...@hotmail.com> 
>>> wrote:
>>>> Hi,
>>>>
>>>> My goal is to find a framework that encapsulates as much low level 
>>>> indexing/search technology as possible and have it integrate nicely with 
>>>> Spring.
>>>> It looked like Compass was/is a good encapsulation of the functionality. 
>>>> I'll take a look at SolR though, thanks for the pointer.
>>>>
>>>> -----Original Message-----
>>>> From: Robert Muir [mailto:rcm...@gmail.com]
>>>> Sent: Monday, June 15, 2009 1:14 PM
>>>> To: java-user@lucene.apache.org
>>>> Subject: Re: Lucene and multi-lingual Unicode - advice needed
>>>>
>>>> Hi,
>>>>
>>>> (Since this is an issue you brought up on the Compass forums)
>>>>
>>>> I wonder what stage you are in the development process?
>>>> Have you considered SolR, or does compass provide some other
>>>> functionality that you need?
>>>>
>>>> The reason I say this, is because the easiest solution might be to use
>>>> a nightly SolR for your application.
>>>>
>>>> I'm not personally biased one way or the other for any particular
>>>> framework, but recently there has been some improvements added to SolR
>>>> so that the default type 'text' is pretty good for multilingual
>>>> processing.
>>>>
>>>> In fact I hope in the future it will be improved in lucene so that
>>>> your decision is really based upon other application needs...
>>>>
>>>> On Mon, Jun 15, 2009 at 1:10 PM, OBender Hotmail<osya_ben...@hotmail.com> 
>>>> wrote:
>>>>> Hi All!
>>>>>
>>>>>
>>>>>
>>>>> I'm new to Lucene so forgive me if this question was asked before.
>>>>>
>>>>> I have a database with records in the same table in many different 
>>>>> languages
>>>>> (up to 70) it includes all W-European, Arabic, Eastern, CJK, Cyrillic, 
>>>>> etc.
>>>>> you name it.
>>>>> I've looked at what people say about Lucene and it looks like for the most
>>>>> part standard analyzers should do fine with most Unicode languages but 
>>>>> there
>>>>> are quite a few exceptions.
>>>>> Here is some recently updated Lucene Jira thread:
>>>>> https://issues.apache.org/jira/browse/LUCENE-1488
>>>>>
>>>>> My question is, what would be the safest bet for me in terms of
>>>>> analyzers/tokenizers?
>>>>> Do I really have to write my own ones for the bunch of languages that are
>>>>> not supported?
>>>>> Did anyone already solve the problem similar to mine? I'm sure someone
>>>>> already did :)
>>>>>
>>>>> And yes, I looked at the Lucene sandbox analyzers. It just adds more
>>>>> confusion. For example why there analyzers for DE and FR? Wouldn't the
>>>>> standard analyzer (which is Unicode complaint as I understood) deal with 
>>>>> EU
>>>>> languages just fine?
>>>>>
>>>>> Thanks in advance for advices :)
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Robert Muir
>>>> rcm...@gmail.com
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Robert Muir
>>> rcm...@gmail.com
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>
>>
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>> No virus found in this incoming message.
>> Checked by AVG - www.avg.com
>> Version: 8.5.339 / Virus Database: 270.12.62/2168 - Release Date: 06/15/09 
>> 17:54:00
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>



-- 
Robert Muir
rcm...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Lucene and multi-lingual Unicode - advice needed

Reply via email to