Hi Paul,
Sounds like a really interesting system! I'm curious: are your users
fluent in multiple languages, or are you using some type of
translation component?
A few thoughts here, plus some comments inline below.
How are you querying? Are users entering mixed-language queries too?
Do you have a cross-language component as well? Or is it the case that
if they enter an English query they only want English results? If so,
having multiple indexes (or multiple fields) would make things easier:
you could simply detect the query language (or know it, based on user
profile information), select the appropriate index/field, and skip the
extra complexity of #4.
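For instance (a rough sketch only; the field names are made up, and
WhitespaceAnalyzer just stands in for whatever Arabic analyzer you end
up with, since core Lucene doesn't ship one):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;

    // Pick the field and analyzer from the known (or detected)
    // query language, then parse and search as usual.
    public static Hits search(Searcher searcher, String lang, String text)
        throws Exception {
      String field = "ar".equals(lang) ? "body_ar" : "body_en";
      Analyzer analyzer = "ar".equals(lang)
          ? new WhitespaceAnalyzer()   // stand-in for an Arabic analyzer
          : new StandardAnalyzer();
      Query query = new QueryParser(field, analyzer).parse(text);
      return searcher.search(query);
    }

The nice property is that the query text is analyzed the same way the
matching field was analyzed at index time.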
Also, is the text always as cleanly delineated as in your example? We
sometimes run across cases where text in one language drops into
another (mostly English) mid-sentence, and that makes things quite
ugly. Approach #4 should handle this, though.
It seems no matter which approach you take, except for #3, you have to
have a way of delineating the languages.
Also, you could use PerFieldAnalyzerWrapper and have one field per
language per document; that way you wouldn't have to manage multiple
indexes. You would still have to demarcate your text before indexing,
I suppose, so you would have to process it twice, but that may not be
a big deal for you.
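Roughly like this (another sketch, same made-up field names, and
WhitespaceAnalyzer again standing in for an Arabic analyzer):

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    // englishText/arabicText are the pre-demarcated chunks of one
    // message; the wrapper picks the analyzer by field name.
    public static void indexMessage(String englishText, String arabicText)
        throws Exception {
      PerFieldAnalyzerWrapper wrapper =
          new PerFieldAnalyzerWrapper(new StandardAnalyzer());
      wrapper.addAnalyzer("body_ar", new WhitespaceAnalyzer()); // stand-in

      IndexWriter writer = new IndexWriter("/indexes/mail", wrapper, true);
      Document doc = new Document();
      doc.add(new Field("body_en", englishText,
          Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("body_ar", arabicText,
          Field.Store.YES, Field.Index.TOKENIZED));
      writer.addDocument(doc);
      writer.close();
    }

You can hand the same wrapper to QueryParser at search time, so each
field is queried with the analyzer it was indexed with.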
-Grant
Paul Cowan wrote:
Hi everyone,
We are currently using Lucene to index correspondence between various
people, who may or may not use the same language in their discussions
to each other. Think an email system where participants might use the
language that seems most appropriate to the thought at the time, just
as they would in conversation.
An example (CN = some Chinese text. Use your imagination!):
From: Someone in the UK
To: Someone in China
Subject: Re: CNCNCNCNCNCNCNCNCNCNCN
> CNCNCNCNCNCNCNCN
Yes, I think that's fine. I'm OK with that as long as Bob is.
> CNCNCNCNCNCN
CNCN?
> Tuesday OK?
I need it by Monday, sorry. CNCN!
We need to index that, and be able to search on it -- for both the
Chinese and English text. Note that stemming is not a particular need
of ours -- we're happy to search for literal tokens, but of course
that may not apply to other languages where stemming is expected
behaviour, not just a 'nicety'.
Anyway: so far, fine -- StandardAnalyzer is perfectly suitable for our
needs. The problem is, the next language out of the starting blocks is
Arabic, which StandardAnalyzer doesn't seem to be up to.
I've looked into previous discussions about this on the various lists,
and it seems to me there are a few options:
1) Maintain multiple indexes (StandardAnalyzer-analyzed,
ArabicAnalyzer-analyzed, LanguageXXXAnalyzer-analyzed) and search
across all of them, merging results
It is not always straightforward to merge results, as scores do not
translate well across indexes.
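For what it's worth, the plumbing itself is easy with MultiSearcher
(sketch below, with hypothetical index paths); it's the score
comparability that's the real problem:

    import java.io.IOException;
    import org.apache.lucene.search.*;

    // Search the per-language indexes as one. Caveat: scores are
    // computed per index and are not directly comparable, so the
    // merged ranking should be treated with caution.
    public static Hits searchAll(Query query) throws IOException {
      Searchable[] perLanguage = {
          new IndexSearcher("/indexes/english"),   // hypothetical paths
          new IndexSearcher("/indexes/arabic")
      };
      return new MultiSearcher(perLanguage).search(query);
    }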
2) Maintain multiple indexes, ask the user which one to use at
search-time:
Search for the [Arabic \/] text: [______________________]
3) Use StandardAnalyzer and hope for the best.
I don't think this is viable. You are probably safe to use
StandardAnalyzer as the default case for #4, though.
4) Write a new... "Super Analyzer" that tries to deal with this. This
is POSSIBLY the best idea -- and, of course, almost certainly the
hardest!
Basically, what we're considering is writing some sort of new
CompositeAnalyzer class which applies the following algorithm (in very
simple terms):
a) Start reading the stream
b) Look at the next character
c) Use some sort of Character.UnicodeBlock (or Character.Subset
generally) -> Analyzer mapping to work out which Analyzer we want to
use. e.g. find a member of Character.UnicodeBlock.GREEK, load a
GreekAnalyzer.
d) Keep reading until we hit something that makes us think we need to
change analyzers (either end-of-stream or something incongruous --
e.g. something from Character.UnicodeBlock.CYRILLIC). Then bundle up
what we've got, hand it to the GreekAnalyzer, and then start the
process again with a RussianAnalyzer (or whatever).
Obviously the best way to do this would be to have these mappings
dynamic, not set in stone -- some people might like all
CJK_COMPATIBILITY to be handed to the CJKAnalyzer, some to the
ChineseAnalyzer, some might like to use their own, etc. Of course
there's no reason default mappings can't be supplied.
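In rough code, we're picturing something like the following (a very
loose sketch, not real Analyzer plumbing -- it just prints what each
run would produce, and it glosses over details like correcting token
offsets by each run's start position):

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    // Steps a)-d): split the text into runs of characters from the
    // same Unicode block and hand each run to the Analyzer registered
    // for that block, falling back to StandardAnalyzer.
    public class CompositeAnalyzerSketch {
      private final Map<Character.UnicodeBlock, Analyzer> blockToAnalyzer =
          new HashMap<Character.UnicodeBlock, Analyzer>();
      private final Analyzer fallback = new StandardAnalyzer();

      public void register(Character.UnicodeBlock block, Analyzer a) {
        blockToAnalyzer.put(block, a);
      }

      public void analyze(String text) throws IOException {
        if (text.length() == 0) return;
        int start = 0;
        Character.UnicodeBlock current = Character.UnicodeBlock.of(text.charAt(0));
        for (int i = 1; i <= text.length(); i++) {
          boolean atEnd = (i == text.length());
          Character.UnicodeBlock block =
              atEnd ? null : Character.UnicodeBlock.of(text.charAt(i));
          if (atEnd || block != current) {   // block changed: flush the run
            Analyzer a = blockToAnalyzer.get(current);
            if (a == null) a = fallback;
            TokenStream ts =
                a.tokenStream("body", new StringReader(text.substring(start, i)));
            for (Token t = ts.next(); t != null; t = ts.next()) {
              System.out.println(t.termText() + "\t[" + current + "]");
            }
            start = i;
            current = block;
          }
        }
      }
    }

The register() calls are where the pluggable mappings would come in --
e.g. register(Character.UnicodeBlock.GREEK, new GreekAnalyzer()) if
you have the sandbox GreekAnalyzer around.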
I guess the basic question is -- what does everyone think? Is this
useful/workable/are there any fatal flaws with it? Obviously the
biggie is that sometimes Unicode ranges are not sufficient to
determine which analyzer to use -- for example, we may want to
specifically use the GermanAnalyzer for German text, but that is
basically impossible to tell from English purely based on the Unicode
block of the next character. At least this way, though, we'd have the
OPTION of farming off to more specific Analyzers based on character
set; being able to have an Analyzer that can tell Urdu from Arabic is
something of a separate issue, but at least the "CompositeAnalyzer"
would bring us a bit closer to the goal. It may be rudimentary, but I
think the 'pluggable' architecture could be useful -- certainly more
useful in our case than just running the StandardAnalyzer over
everything.
This sounds like a reasonable approach. I wonder a bit about how the
TokenStream mechanism will work, considering tokenization can be quite
different for Chinese and some of the other Asian languages compared
to Latin-based languages. Essentially, as things come into the
Tokenizer, you will need to indicate to the analyzer which Filter to
apply. I guess this could be done by setting the Type property on the
Token and having a Filter that wraps all of your other Filters and,
based on Type, hands it off to the appropriate filter for that language.
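Something like the following, perhaps. "TokenTransform" is a made-up
hook, not a Lucene interface (real per-language filters wrap whole
TokenStreams, so this simplifies things to per-token handling), but it
shows the Type-based dispatch:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Hypothetical per-language, per-token hook.
    interface TokenTransform {
      Token transform(Token t);
    }

    // Routes each Token to the handling registered for its type(),
    // which the tokenizer has previously set to a language tag.
    class LanguageDispatchFilter extends TokenFilter {
      private final Map<String, TokenTransform> byType;

      LanguageDispatchFilter(TokenStream input,
                             Map<String, TokenTransform> byType) {
        super(input);
        this.byType = byType;
      }

      public Token next() throws IOException {
        Token t = input.next();
        if (t == null) return null;
        TokenTransform tt = byType.get(t.type());
        return (tt == null) ? t : tt.transform(t);  // default: pass through
      }
    }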
If this project goes ahead, it's possible (even likely) that it would
be contributed back to the Lucene sandbox. As such, I'm very
interested to hear about any suggestions, criticisms, or other
feedback you might have.
Cheers,
Paul Cowan
--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886