On Sun, Apr 6, 2014 at 10:30 AM, Herb Roitblat <herb.roitb...@orcatec.com> wrote:

> Just curious, what are some of the things that people do to properly
> tokenize the queries with mixed language collections? What do you do
> with mixed language queries?
>

You can either force the user to tell you the language, or you can run
a language detector (these are less accurate for short strings), or you
can process the query in _all_ of the languages and OR the results
together.

>
> On 4/6/2014 4:51 AM, Benson Margulies wrote:
>
>> You must know what language each text is in, and use an appropriate
>> analyzer. Some people do this by using a separate field (text_eng,
>> text_spa, text_jpn). Other people put some extra information at the
>> beginning of the field, and then make an analyzer that peeks in order
>> to dispatch to the correct tokenizer.
>>
>>
>> On Sat, Apr 5, 2014 at 9:59 PM, <j7a42e4fd7...@softbank.ne.jp> wrote:
>>
>>> I am pretty new to Lucene, but I have no problem understanding what
>>> it is about.
>>> My big problem is understanding how Kuromoji works. I need to
>>> implement search functionality that initially supports English,
>>> Spanish and Japanese. The first two don't seem to be a big deal, as
>>> I can just use analyzers-common to index content in both languages,
>>> but Japanese has its own analyzer. I couldn't find any clues about
>>> combining analyzers, so I still don't know whether I can combine all
>>> languages under the same index (which would be ideal, as I expect
>>> mixed searches in the context of my project) or whether I have to
>>> detect the language first and then index Japanese texts separately
>>> (which would be a big disadvantage when it comes to mixed searches
>>> and future localization expansion).
>>> I found out about Lucene through Kuromoji; it would be great to find
>>> a solution that lets me use all the greatness that Lucene offers.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
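
For what it's worth, the "separate field per language" and "OR up the
results" suggestions above can be sketched roughly as follows. This is
only a minimal illustration in plain Java and does not call Lucene at
all. The class LanguageFieldRouter, its methods, and the three-letter
language codes are hypothetical names made up for this sketch; only the
field names text_eng, text_spa and text_jpn come from the thread. It
also assumes the query text has already been escaped for whatever query
parser consumes the resulting string.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; it is not part of Lucene.
final class LanguageFieldRouter {

    // Per-language field names, following the text_eng / text_spa /
    // text_jpn convention mentioned in the thread.
    static final List<String> ALL_FIELDS =
            Arrays.asList("text_eng", "text_spa", "text_jpn");

    // Index time: the document's language is known, so pick exactly one
    // field. The three-letter codes are an assumption of this sketch.
    static String fieldFor(String languageCode) {
        String field = "text_" + languageCode;
        if (!ALL_FIELDS.contains(field)) {
            throw new IllegalArgumentException(
                    "unsupported language: " + languageCode);
        }
        return field;
    }

    // Query time: use the declared language if the user gave one,
    // otherwise fall back to every language field.
    static List<String> fieldsForQuery(String declaredLanguage) {
        if (declaredLanguage != null) {
            return Collections.singletonList(fieldFor(declaredLanguage));
        }
        return ALL_FIELDS;
    }

    // Builds a query string that ORs the same text across the chosen
    // fields. Assumes queryText is already escaped for the query parser.
    static String buildQuery(String queryText, String declaredLanguage) {
        return fieldsForQuery(declaredLanguage).stream()
                .map(field -> field + ":(" + queryText + ")")
                .collect(Collectors.joining(" OR "));
    }
}
```

With no declared language, buildQuery("search", null) yields a clause
for each of the three fields joined by OR; with a declared language it
yields a single clause. In a real Lucene setup each field would still
need its own analyzer, for example by mapping field names to analyzers
with PerFieldAnalyzerWrapper, so that text_jpn is tokenized by the
Kuromoji analyzer and the other fields by their own language analyzers.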