Hi folks, I'd like to ask your advice about how to organize an index for documents in multiple languages.
The input is a database that holds the document metadata. Each document consists of language-neutral attributes (document_id, date, categories mapping) and language-dependent attributes (title, author, abstract, etc.). Each document has a default language record, "EN", and may have several records with the language-dependent attributes translated into other languages, for example "Russian" (one record per language, with an FK to document_id).

Each document also has a list of "attachments" (PDF, MS Office files), each with a language "indicator". The attachment's language is selected from a controlled vocabulary during file upload; the vocabulary includes Western and Eastern European languages.

The search should run over both the document metadata and the attachments, in -ALL- languages (i.e. the user just types in a search term and clicks the button; how to detect the input language in order to apply the appropriate analyzer in QueryParser is probably a separate topic).

At the moment I'm considering the following alternatives:

1. The simple one: create one record per document with the basic metadata structure and include all languages for a given attribute in a single field. For example, the title field will contain -ALL- translations (EN, RU, etc.). The "Contents" field will hold -ALL- attachment texts for a given document.
2. Create one record per metadata language in one index. Create a second index with the attachments, one record per document.

The first approach is easier, but I'm not sure whether the score will be calculated correctly. With the second approach, I don't know how to "join" the results from MultiQuery, and I don't know how it will affect performance. (Sorry, I've just started to experiment with Lucene.)

Any ideas or suggestions?

Thank you,
/Alexander
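
P.S. To make the "join" in alternative 2 a bit more concrete, here is a rough sketch in plain Python (not Lucene code) of what I have in mind: hits from the metadata index and from the attachments index are grouped by document_id and combined into one ranked list. The function name, the (document_id, score) hit tuples, and the rule "best metadata score plus attachment score" are all assumptions of mine for illustration; I don't know whether scores from two separate indexes are even comparable.

```python
def merge_hits(metadata_hits, attachment_hits):
    """Combine hits from two indexes into one list ranked per document.

    Both arguments are lists of (document_id, score) tuples (a hypothetical
    shape, not a Lucene type). The combination rule is only a placeholder.
    """
    # Several language records may match the same document in the metadata
    # index, so keep only the best-scoring record per document_id.
    best_metadata = {}
    for doc_id, score in metadata_hits:
        best_metadata[doc_id] = max(score, best_metadata.get(doc_id, 0.0))

    # The attachments index has one record per document; add its score on top.
    combined = dict(best_metadata)
    for doc_id, score in attachment_hits:
        combined[doc_id] = combined.get(doc_id, 0.0) + score

    # Highest combined score first; ties are broken by document_id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Document 1 matches in two languages and in its attachments,
# document 2 only in metadata, document 3 only in attachments.
metadata_hits = [(1, 0.5), (1, 0.25), (2, 0.75)]
attachment_hits = [(1, 0.5), (3, 0.25)]
ranked = merge_hits(metadata_hits, attachment_hits)
```

With alternative 1 this step would not be needed, since each document is a single record; that is the trade-off I am trying to weigh.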