Indexing and searching multiple languages

Alexander Mashtakov Sun, 09 Jul 2006 04:50:53 -0700

Hi folks,
I'd like to ask your advice on how to organize an index for documents
in multiple languages.


As an input:

A database holds the document metadata. Each document consists of
language-neutral attributes, such as document_id, date, and category
mappings, and language-dependent attributes, such as title, author,
abstract, etc.

Each document has a default language record ("EN") and may have several
more records with the language-dependent attributes translated into
other languages, for example Russian (one record per language, with an
FK to document_id).

Each document has a list of attachments (PDF, MS Office files), each
with a language indicator. The attachment's language is selected from a
controlled vocabulary during file upload; the vocabulary includes
Western and Eastern European languages.

The search should cover both the document metadata and the attachments,
in -ALL- languages. That is, the user just types in a search term and
clicks a button. (How to detect the input language in order to apply
the appropriate analyzer in QueryParser is probably a separate topic.)


At the moment I'm considering the following alternatives:


1. The simple one: create one record per document with the basic
  metadata structure and include all languages for a given attribute
  in a single field. For example, the title field will contain -ALL-
  translations (EN, RU, etc.). The "Contents" field will hold the text
  of -ALL- attachments for a given document.

2. Create a single record for each metadata language in one index.
  Create a second index with attachments, one record per document.
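
To make the two alternatives concrete, here is a rough sketch of the
records each one would produce. It is plain Python used only to show
the shape of the data, not Lucene code; the field names, the sample
values, and the layout_one/layout_two helpers are all hypothetical.

```python
# Plain-Python illustration only; none of this is Lucene API.
# All field names and sample values are hypothetical.

metadata = {"document_id": 42, "date": "2006-07-09"}
translations = {
    "EN": {"title": "Annual report", "abstract": "Summary of the year"},
    "RU": {"title": "Годовой отчёт", "abstract": "Итоги года"},
}
attachments = [("EN", "text of report.pdf"), ("RU", "текст отчёта")]


def layout_one(metadata, translations, attachments):
    """Alternative 1: one record per document, all languages in each field."""
    record = dict(metadata)
    for field in ("title", "abstract"):
        record[field] = " ".join(t[field] for t in translations.values())
    record["contents"] = " ".join(text for _, text in attachments)
    return [record]


def layout_two(metadata, translations, attachments):
    """Alternative 2: one metadata record per language, plus a separate
    attachment index holding one record per document."""
    metadata_index = [
        {**metadata, "lang": lang, **fields}
        for lang, fields in translations.items()
    ]
    attachment_index = [{
        "document_id": metadata["document_id"],
        "contents": " ".join(text for _, text in attachments),
    }]
    return metadata_index, attachment_index
```

With alternative 1 every field mixes languages, so a single analyzer
has to cope with all of them; with alternative 2 each metadata record
carries its own language tag, at the cost of several records per
document.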


The first approach is easier, but I'm not sure whether the score will
be calculated correctly. In the second approach, I don't know how to
"join" the results from MultiQuery, and I don't know how it will affect
performance. (Sorry, I've only just started experimenting with Lucene.)
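
For illustration, the kind of "join" in question might look roughly
like the sketch below. It is plain Python, not Lucene code; the
(document_id, score) hit format, the join_hits name, and the rule of
keeping the best score per document are assumptions made for the
example. Scores coming from two separate indexes may not be directly
comparable, so the ranking here is only indicative.

```python
# Plain-Python illustration only; not Lucene API. The (document_id, score)
# hit format and the "keep the best score" rule are assumptions.

def join_hits(metadata_hits, attachment_hits):
    """Merge hits from two indexes into one ranked entry per document_id."""
    best = {}
    for doc_id, score in list(metadata_hits) + list(attachment_hits):
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    # Highest score first; ties broken by document_id for a stable order.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Document 42 matched in two language records; document 7 matched in both
# the metadata index and the attachment index.
metadata_hits = [(42, 0.8), (42, 0.5), (7, 0.3)]
attachment_hits = [(7, 0.9), (13, 0.4)]
ranked = join_hits(metadata_hits, attachment_hits)
```

The merge itself is a single pass over the hits followed by a sort, so
its cost depends on the number of hits returned rather than on the size
of either index.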

Any ideas or suggestions?

Thank you,
/Alexander
