Hi Paul,
Sounds like a really interesting system! I'm curious: are your users
fluent in multiple languages, or are you using some type of
translation component?
A few thoughts here, plus some comments inline below.
How are you querying? Are users entering mixed-language queries too?
Do you have a cross-language component as well? Or is it the case that
if they enter an English query they only want English results? If so,
having multiple indexes (or multiple fields) would make things easier:
you could simply detect the query language (or know it, based on user
profile information), select the appropriate index/field, and skip the
extra complexity of #4.
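For instance (a rough sketch only; the field names are made up, and
WhitespaceAnalyzer just stands in for whatever Arabic analyzer you end
up with, since core Lucene doesn't ship one):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;

    // Pick the field and analyzer from the known (or detected)
    // query language, then parse and search as usual.
    public static Hits search(Searcher searcher, String lang, String text)
        throws Exception {
      String field = "ar".equals(lang) ? "body_ar" : "body_en";
      Analyzer analyzer = "ar".equals(lang)
          ? new WhitespaceAnalyzer()   // stand-in for an Arabic analyzer
          : new StandardAnalyzer();
      Query query = new QueryParser(field, analyzer).parse(text);
      return searcher.search(query);
    }

The nice property is that the query text is analyzed the same way the
matching field was analyzed at index time.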
Also, is the text always as cleanly delineated as in your example? We
sometimes run across cases where text in one language drops into
another (mostly English) mid-sentence, and that makes things quite
ugly. Approach #4 should handle this, though.
It seems no matter which approach you take, except for #3, you have to
have a way of delineating the languages.
Also, you could use PerFieldAnalyzerWrapper and have one field per
language per document; that way you wouldn't have to manage multiple
indexes. You would still have to demarcate your text before indexing,
I suppose, so you would have to process it twice, but that may not be
a big deal for you.
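Roughly like this (another sketch, same made-up field names, and
WhitespaceAnalyzer again standing in for an Arabic analyzer):

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // englishText/arabicText are the pre-demarcated chunks of one
    // message; the wrapper picks the analyzer by field name.
    public static void indexMessage(String englishText, String arabicText)
        throws Exception {
      PerFieldAnalyzerWrapper wrapper =
          new PerFieldAnalyzerWrapper(new StandardAnalyzer());
      wrapper.addAnalyzer("body_ar", new WhitespaceAnalyzer()); // stand-in

      IndexWriter writer = new IndexWriter("/indexes/mail", wrapper, true);
      Document doc = new Document();
      doc.add(new Field("body_en", englishText,
          Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("body_ar", arabicText,
          Field.Store.YES, Field.Index.TOKENIZED));
      writer.addDocument(doc);
      writer.close();
    }

You can hand the same wrapper to QueryParser at search time, so each
field is queried with the analyzer it was indexed with.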
-Grant
Paul Cowan wrote:
Hi everyone,
We are currently using Lucene to index correspondence between various
people, who may or may not use the same language in their discussions
to each other. Think an email system where participants might use the
language that seems most appropriate to the thought at the time, just
as they would in conversation.
An example (CN = some Chinese text. Use your imagination!):
From: Someone in the UK
To: Someone in China
Subject: Re: CNCNCNCNCNCNCNCNCNCNCN
> CNCNCNCNCNCNCNCN
Yes, I think that's fine. I'm OK with that as long as Bob is.
> CNCNCNCNCNCN
CNCN?
> Tuesday OK?
I need it by Monday, sorry. CNCN!
We need to index that, and be able to search on it -- for both the
Chinese and English text. Note that stemming is not a particular need
of ours -- we're happy to search for literal tokens, but of course
that may not apply to other languages where stemming is expected
behaviour, not just a 'nicety'.
Anyway: so far, fine -- StandardAnalyzer is perfectly suitable for our
needs. The problem is, the next language out of the starting blocks is
Arabic, which StandardAnalyzer doesn't seem to be up to.
I've looked into previous discussions about this on the various lists,
and it seems to me there are a few options:
1) Maintain multiple indexes (StandardAnalyzer-analyzed,
ArabicAnalyzer-analyzed, LanguageXXXAnalyzer-analyzed) and search
across all of them, merging results
It is not always straightforward to merge results, as scores do not
translate well across indexes.
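For what it's worth, the plumbing itself is easy with MultiSearcher
(sketch below, with hypothetical index paths); it's the score
comparability that's the real problem:

    import java.io.IOException;
    import org.apache.lucene.search.*;

    // Search the per-language indexes as one. Caveat: scores are
    // computed per index and are not directly comparable, so the
    // merged ranking should be treated with caution.
    public static Hits searchAll(Query query) throws IOException {
      Searchable[] perLanguage = {
          new IndexSearcher("/indexes/english"),   // hypothetical paths
          new IndexSearcher("/indexes/arabic")
      };
      return new MultiSearcher(perLanguage).search(query);
    }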
2) Maintain multiple indexes, ask the user which one to use at
search-time:
Search for the [Arabic \/] text: [______________________]
3) Use StandardAnalyzer and hope for the best.
I don't think this is viable. You are probably safe to use
StandardAnalyzer as the default case for #4, though.
4) Write a new... "Super Analyzer" that tries to deal with this. This
is POSSIBLY the best idea -- and, of course, almost certainly the
hardest!
Basically, what we're considering is writing some sort of new
CompositeAnalyzer class which applies the following algorithm (in very
simple terms):
a) Start reading the stream
b) Look at the next character
c) Use some sort of Character.UnicodeBlock (or Character.Subset
generally) -> Analyzer mapping to work out which Analyzer we want to
use. e.g. find a member of Character.UnicodeBlock.GREEK, load a
GreekAnalyzer.
d) Keep reading until we hit something that makes us think we need to
change analyzers (either end-of-stream or something incongruous --
e.g. something from Character.UnicodeBlock.CYRILLIC). Then bundle up
what we've got, hand it to the GreekAnalyzer, and then start the
process again with a RussianAnalyzer (or whatever).
Obviously the best way to do this would be to have these mappings
dynamic, not set in stone -- some people might like all
CJK_COMPATIBILITY to be handed to the CJKAnalyzer, some to the
ChineseAnalyzer, some might like to use their own, etc. Of course
there's no reason default mappings can't be supplied.
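In rough code, we're picturing something like the following (a very
loose sketch, not real Analyzer plumbing -- it just prints what each
run would produce, and it glosses over details like correcting token
offsets by each run's start position):

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Steps a)-d): split the text into runs of characters from the
    // same Unicode block and hand each run to the Analyzer registered
    // for that block, falling back to StandardAnalyzer.
    public class CompositeAnalyzerSketch {
      private final Map<Character.UnicodeBlock, Analyzer> blockToAnalyzer =
          new HashMap<Character.UnicodeBlock, Analyzer>();
      private final Analyzer fallback = new StandardAnalyzer();

      public void register(Character.UnicodeBlock block, Analyzer a) {
        blockToAnalyzer.put(block, a);
      }

      public void analyze(String text) throws IOException {
        if (text.length() == 0) return;
        int start = 0;
        Character.UnicodeBlock current = Character.UnicodeBlock.of(text.charAt(0));
        for (int i = 1; i <= text.length(); i++) {
          boolean atEnd = (i == text.length());
          Character.UnicodeBlock block =
              atEnd ? null : Character.UnicodeBlock.of(text.charAt(i));
          if (atEnd || block != current) {   // block changed: flush the run
            Analyzer a = blockToAnalyzer.get(current);
            if (a == null) a = fallback;
            TokenStream ts =
                a.tokenStream("body", new StringReader(text.substring(start, i)));
            for (Token t = ts.next(); t != null; t = ts.next()) {
              System.out.println(t.termText() + "\t[" + current + "]");
            }
            start = i;
            current = block;
          }
        }
      }
    }

The register() calls are where the pluggable mappings would come in --
e.g. register(Character.UnicodeBlock.GREEK, new GreekAnalyzer()) if
you have the sandbox GreekAnalyzer around.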
I guess the basic question is -- what does everyone think? Is this
useful/workable/are there any fatal flaws with it? Obviously the
biggie is that sometimes Unicode ranges are not sufficient to
determine which analyzer to use -- for example, we may want to
specifically use the GermanAnalyzer for German text, but that is
basically impossible to tell from English purely based on the Unicode
block of the next character. At least this way, though, we'd have the
OPTION of farming off to more specific Analyzers based on character
set; being able to have an Analyzer that can tell Urdu from Arabic is
something of a separate issue, but at least the "CompositeAnalyzer"
would bring us a bit closer to the goal. It may be rudimentary, but I
think the 'pluggable' architecture could be useful -- certainly more
useful in our case than just running the StandardAnalyzer over
everything.
This sounds like a reasonable approach. I wonder a bit about how the
TokenStream mechanism will work, considering tokenization can be quite
different for Chinese and some of the other Asian languages compared
to Latin-based languages. Essentially, as things come into the
Tokenizer, you will need to indicate to the analyzer which Filter to
apply. I guess this could be done by setting the Type property on the
Token and having a Filter that wraps all of your other Filters and,
based on Type, hands it off to the appropriate filter for that language.
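Something like the following, perhaps. "TokenTransform" is a made-up
hook, not a Lucene interface (real per-language filters wrap whole
TokenStreams, so this simplifies things to per-token handling), but it
shows the Type-based dispatch:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Hypothetical per-language, per-token hook.
    interface TokenTransform {
      Token transform(Token t);
    }

    // Routes each Token to the handling registered for its type(),
    // which the tokenizer has previously set to a language tag.
    class LanguageDispatchFilter extends TokenFilter {
      private final Map<String, TokenTransform> byType;

      LanguageDispatchFilter(TokenStream input,
                             Map<String, TokenTransform> byType) {
        super(input);
        this.byType = byType;
      }

      public Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        TokenTransform tt = byType.get(t.type());
        return (tt == null) ? t : tt.transform(t);  // default: pass through
      }
    }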
If this project goes ahead, it's possible (even likely) that it would
be contributed back to the Lucene sandbox. As such, I'm very
interested to hear about any suggestions, criticisms, or other
feedback you might have.
Cheers,
Paul Cowan
--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886