Erick,

I am trying to index multilingual taxonomies such as SKOS, WordNet, and EuroWordNet. Taxonomies are composed of concepts that have preferred and alternative labels in different languages. Some labels have the same lexical form in several languages. I want to index these concepts in Lucene so that I can search for concepts by their labels in one or several languages. I also want to be able to display a concept's definition together with all of its alternative labels in the different languages.

My question is: could we use the payload mechanism to store the language assigned to a word? (I read somewhere that Google was using payloads to store information such as the font, so why not the language?) Wouldn't that be a better approach than using one field per language or one index per language?
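The requirement above can be made concrete with a minimal sketch in plain Java. This is not Lucene code: `LabelIndexSketch`, `Posting`, `addLabel`, and `search` are hypothetical names invented for illustration, and the in-memory map only stands in for an inverted index. It shows one label ("chat") shared by two languages, with the language kept on each posting, which is the role a payload or a per-language field would have to play.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of the use case; none of this is Lucene API.
class LabelIndexSketch {

    // One posting: the concept a label belongs to and the label's language.
    static final class Posting {
        final String conceptId;
        final String lang;

        Posting(String conceptId, String lang) {
            this.conceptId = conceptId;
            this.lang = lang;
        }
    }

    // Normalized label -> postings. The same lexical form can map to
    // several concepts in several languages.
    private final Map<String, List<Posting>> postings = new HashMap<>();

    void addLabel(String conceptId, String lang, String label) {
        postings.computeIfAbsent(normalize(label), k -> new ArrayList<>())
                .add(new Posting(conceptId, lang));
    }

    // Returns the ids of concepts carrying this label.
    // A null lang means "any language".
    Set<String> search(String label, String lang) {
        Set<String> hits = new TreeSet<>();
        List<Posting> found =
                postings.getOrDefault(normalize(label), Collections.emptyList());
        for (Posting p : found) {
            if (lang == null || lang.equals(p.lang)) {
                hits.add(p.conceptId);
            }
        }
        return hits;
    }

    private static String normalize(String label) {
        return label.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        LabelIndexSketch index = new LabelIndexSketch();
        index.addLabel("concept:cat", "en", "cat");
        index.addLabel("concept:cat", "fr", "chat");
        index.addLabel("concept:conversation", "en", "chat");

        // Any language: both concepts share the lexical form "chat".
        System.out.println(index.search("chat", null));
        // prints [concept:cat, concept:conversation]

        // French only: the language on the posting disambiguates.
        System.out.println(index.search("chat", "fr"));
        // prints [concept:cat]
    }
}
```

Whichever mechanism holds the language (payload, field name, or separate index), the search side needs the same two behaviors shown here: match a label regardless of language, and narrow the match to one language on request.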
Regards,
Stephane

On Fri, Mar 11, 2011 at 7:52 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> It's not so much a matter of problems with indexing/searching
> as it is with search behavior. The reason these strategies
> are implemented is that using English stemming, say, on
> other languages will produce "interesting" results.
>
> There's no a-priori reason you can't index multiple languages
> in the same field.
>
> So I don't see what you would accomplish by using payloads
> to indicate which language the term is in. Could you expand
> a bit on what you're trying to accomplish here? Maybe there
> are better solutions....
>
> Best
> Erick
>
> On Thu, Mar 10, 2011 at 10:29 PM, Stephane Fellah
> <sfel...@smartrealm.com> wrote:
> > I am trying to index in Lucene a field that could hold labels of
> > concepts in different languages. Most of the approaches I have seen
> > so far are:
> >
> > - Use a single index, where each document has one field per language
> >   it uses, or
> > - Use M indexes, M being the number of languages in the corpus.
> >
> > Lucene 2.9+ has a feature called Payload that allows attaching
> > attributes to a term. Is anyone using this mechanism to store language
> > information (or other attributes such as datatypes)? Does this approach
> > work if labels are the same in different languages (does it break the
> > inverted index)? How is performance compared to the two other
> > approaches? Any pointer to source code showing how it is done would
> > help.
> >
> > Thanks

-- 
Stephane Fellah, M.Sc, B.Sc
Principal Engineer/Product Manager
smartRealm LLC
201 Loudoun St. SW
Leesburg, VA 20175
Tel: 703 669 5514
Cell: 571 502 8478
Fax: 703 669 5515