I am trying to index in Lucene a field that could have label of concepts in different languages. Most of the approaches I have seen so far are:
- Use a single index, where each document has a field per each language it uses, or - Use M indexes, M being the number of languages in the corpus. Lucene 2.9+ has a feature called Payload that allows to attach attributes to term. Is anyone use this mechanism to store language (or other attributes such as datatypes) information ? Does this approach if labels are the same in different languages (does it break inverted index) ? How is performance compared to the two other approaches ? Any pointer on source code showing how it is done would help. Thanks -- Stephane Fellah, M.Sc, B.Sc Principal Engineer/Product Manager smartRealm LLC 201 Loudoun St. SW Leesburg, VA 20175 Tel: 703 669 5514 Cell: 571 502 8478 Fax: 703 669 5515