Hi Ivan and Robert,

>> sounds like you should talk to Tom Burton-West!
Ok, I'll bite.

A few questions:

1. Are you planning to have separate fields for each language, or the same 
fields with contents in different languages?
2. If the latter, are you planning to have a field indicating the language so 
you can do filter queries? (See the sketch below.)
3. Do you need to accommodate searches where you don't know what language the 
user is searching in?
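
For the filter-query case, here's a minimal sketch against the Lucene 3.x API 
(the "text"/"lang" field names are placeholders, and ocrText, searcher, and 
userQuery are assumed to already exist):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // Index time: store the language code alongside the text.
    Document doc = new Document();
    doc.add(new Field("text", ocrText, Field.Store.NO, Field.Index.ANALYZED));
    doc.add(new Field("lang", "fr", Field.Store.YES, Field.Index.NOT_ANALYZED));

    // Search time: restrict the query to one language with a filter.
    Filter langFilter = new QueryWrapperFilter(new TermQuery(new Term("lang", "fr")));
    TopDocs hits = searcher.search(userQuery, langFilter, 10);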

>> 2. What about mixing Latin and non-Latin languages?  We ran tests on English 
>> and Chinese collections mixed together and didn't see any negative impact 
>> (precision/recall).

Interesting.  I've wondered whether mixing languages would cause any issues 
with idf stats in the ranking formula, especially if the number of documents 
in each language differs significantly.
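
To make the idf concern concrete: Lucene's classic similarity computes idf 
over the whole index, so a term from the smaller language looks rarer than it 
actually is within its own language.  A sketch (the document counts are made 
up):

    // DefaultSimilarity's idf: 1 + ln(numDocs / (docFreq + 1))
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // A term appearing in 5,000 of 10,000 Chinese docs:
    //   Chinese-only index:           idf(5000, 10000)   ~= 1.69
    //   mixed with 1,000,000 English: idf(5000, 1010000) ~= 6.31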

This may not be relevant to your use case, but we found that dirty OCR 
combined with multiple languages can produce a very large number of unique 
terms.  If you have a large enough index, this can make multiterm queries 
(e.g. prefix, wildcard) computationally expensive.  It can also seriously 
increase memory use.  We started by changing the termInfosIndexDivisor to 
deal with this at search time 
(http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again), 
but when we were re-indexing we discovered that the termInfosIndexDivisor 
doesn't currently affect the IndexReader opened during indexing, so we 
changed the termIndexInterval from 128 to 1024.  This took our memory use 
from over 18GB down to under 4GB and also eliminated long stop-the-world 
garbage collection pauses.  (Our index is about 350GB.)
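
In plain Lucene the index-time knob looks something like this (a sketch 
assuming the Lucene 3.1 IndexWriterConfig API; the path is a placeholder):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    Directory dir = FSDirectory.open(new File("/path/to/index"));
    IndexWriterConfig cfg =
        new IndexWriterConfig(Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
    // Default is 128; a larger interval shrinks the in-memory term index
    // at the cost of slightly slower term lookups.
    cfg.setTermIndexInterval(1024);
    IndexWriter writer = new IndexWriter(dir, cfg);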

Tom

-----Original Message-----
On Mon, May 9, 2011 at 5:32 PM, Provalov, Ivan
<ivan.prova...@cengage.com> wrote:
> We are planning to ingest some non-English content into our application.  All 
> content is OCR'ed, and there are a lot of misspellings and garbage terms 
> because of this.  Each document has one primary language, with some 
> exceptions (e.g. a few English terms mixed into a primarily non-English 
> document).
>

sounds like you should talk to Tom Burton-West!

> 1. Does it make sense to mix two or more different Latin-based languages in 
> the same index directory in Lucene (e.g. Spanish/French/English)?

I think it depends upon the application. If the user is specifying the
language via the UI somehow, then it's probably simplest to just use
different indexes for each collection.
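
In that setup you'd just route each query to the right per-language index, 
e.g. (a sketch; the paths and language codes are placeholders, and 
userLanguage is whatever the UI reports):

    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    Map<String, IndexSearcher> searchers = new HashMap<String, IndexSearcher>();
    searchers.put("es", new IndexSearcher(IndexReader.open(FSDirectory.open(new File("/indexes/es")))));
    searchers.put("fr", new IndexSearcher(IndexReader.open(FSDirectory.open(new File("/indexes/fr")))));

    // Pick the searcher for the language the user selected.
    IndexSearcher searcher = searchers.get(userLanguage);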

> 2. What about mixing Latin and non-Latin languages?  We ran tests on English 
> and Chinese collections mixed together and didn't see any negative impact 
> (precision/recall).  Any other potential issues?

Right, none of the terms would overlap here... the only "issue" would
be a skewed maxDoc, but that's probably not a big deal at all. But
what's the benefit of mixing them?

> 3. Any recommendations for an Urdu analyzer?
>

You can always start with StandardAnalyzer, as it will at least tokenize the
text... you might be able to make use of resources such as
http://www.crulp.org/software/ling_resources/UrduClosedClassWordsList.htm
and http://www.crulp.org/software/ling_resources/UrduHighFreqWords.htm
as a stoplist.
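
If you go that route, wiring the lists in as stopwords might look like this 
(a sketch assuming Lucene 3.x's WordlistLoader and a one-word-per-line file 
you'd build from those pages):

    import java.io.File;
    import java.util.Set;
    import org.apache.lucene.analysis.WordlistLoader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    // urdu-stopwords.txt: one word per line, built from the CRULP lists above.
    Set<String> urduStops = WordlistLoader.getWordSet(new File("urdu-stopwords.txt"));
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_31, urduStops);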
