Hello Stephane,

I think a better way is to keep one resource file per language and store a pointer in the index that leads to the correct resource file (something like the I18N and L10N approach). Store the internationalised string in the index and all related localised strings in the resource files.

This way the index size is reduced, which should help performance too
(adding the language data as payloads would have an impact on performance).

Your total search time would then be (search time + time to retrieve the language-based data).
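
As a rough illustration of the two-step lookup described above, here is a minimal sketch in plain Java. It is not Lucene code: the class name `ConceptResources`, its methods, and the in-memory maps standing in for the index and for the per-language resource files are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: the "index" only maps the internationalised string to
// a concept id (the pointer); localised strings live in per-language maps
// that stand in for resource files.
final class ConceptResources {
    // Stand-in for the search index: internationalised string -> concept id.
    private final Map<String, String> index = new HashMap<>();
    // Stand-in for resource files: language -> (concept id -> localised labels).
    private final Map<String, Map<String, List<String>>> resources = new HashMap<>();

    void addConcept(String conceptId, String internationalisedLabel) {
        index.put(internationalisedLabel.toLowerCase(Locale.ROOT), conceptId);
    }

    void addLocalised(String language, String conceptId, String label) {
        resources.computeIfAbsent(language, k -> new HashMap<>())
                 .computeIfAbsent(conceptId, k -> new ArrayList<>())
                 .add(label);
    }

    // Step 1: search the small index. Step 2: retrieve the language-based data.
    List<String> search(String query, String language) {
        String conceptId = index.get(query.toLowerCase(Locale.ROOT));
        if (conceptId == null) {
            return Collections.emptyList();
        }
        Map<String, List<String>> perLanguage =
                resources.getOrDefault(language, Collections.emptyMap());
        return perLanguage.getOrDefault(conceptId, Collections.emptyList());
    }
}
```

The second step is where the extra retrieval time in the formula above comes from; in this sketch it is just a map lookup, whereas a real setup would read from whatever store holds the localised strings.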

Hope this helps.
-Vinaya

On Friday 11 March 2011 09:05 PM, Stephane Fellah wrote:
Erick,

I am trying to index multilingual taxonomies such as SKOS, WordNet, and
EuroWordNet. Taxonomies are composed of concepts which have preferred and
alternative labels in different languages. Some labels have the same lexical
form in different languages. I want to be able to index these concepts in
Lucene in order to be able to search concepts by their label in one or
several languages. I also want to be able to display a concept's definition
with all its alternative labels in different languages. My question is: could
we use the payload mechanism to store the language assigned to the word? (I
read somewhere that Google was using payloads to store information such as
font, for example, so why not language.) Wouldn't that be a better approach
than using one field per language or one index per language?

Regards
Stephane
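
To make the payload idea in the message above more concrete, here is a toy model in plain Java. It deliberately does not use Lucene's payload API; it only mimics the concept of a language tag that rides along with each posting. The class `TaggedPostingIndex`, the concept ids, and the sample labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model only, NOT Lucene's payload API: every posting carries a language
// tag next to the concept id, the way a payload would accompany a term.
final class TaggedPostingIndex {
    // One posting = (concept id, language of the label).
    static final class Posting {
        final String conceptId;
        final String language;

        Posting(String conceptId, String language) {
            this.conceptId = conceptId;
            this.language = language;
        }
    }

    // A single shared term dictionary: term -> postings from all languages.
    private final Map<String, List<Posting>> postings = new HashMap<>();

    void add(String conceptId, String language, String label) {
        postings.computeIfAbsent(label.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
                .add(new Posting(conceptId, language));
    }

    // An empty language set means "any language".
    Set<String> search(String label, Set<String> languages) {
        Set<String> hits = new LinkedHashSet<>();
        List<Posting> found =
                postings.getOrDefault(label.toLowerCase(Locale.ROOT), Collections.emptyList());
        for (Posting p : found) {
            // The tag is only checked after the term has matched.
            if (languages.isEmpty() || languages.contains(p.language)) {
                hits.add(p.conceptId);
            }
        }
        return hits;
    }
}
```

In this model, a label such as "pain" (an English word and the French word for bread) is one term with two postings, so the shared dictionary stays intact and the language tag acts as a filter applied after the term match. Whether the same trade-off holds in a real Lucene index is not something this sketch demonstrates.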

On Fri, Mar 11, 2011 at 7:52 AM, Erick Erickson <erickerick...@gmail.com> wrote:

It's not so much a matter of problems with indexing/searching
as it is with search behavior. The reason these strategies
are implemented is that using English stemming, say, on
other languages will produce "interesting" results.

There's no a-priori reason you can't index multiple languages
in the same field.

So I don't see what you would accomplish by using payloads
to indicate which language the term is in. Could you expand
a bit on what you're trying to accomplish here? Maybe there
are better solutions....

Best
Erick


On Thu, Mar 10, 2011 at 10:29 PM, Stephane Fellah <sfel...@smartrealm.com> wrote:
I am trying to index in Lucene a field that could hold labels of concepts in
different languages. Most of the approaches I have seen so far are:

   - Use a single index, where each document has one field per language it
     uses, or
   - Use M indexes, M being the number of languages in the corpus.

Lucene 2.9+ has a feature called Payload that allows attaching attributes to
a term. Is anyone using this mechanism to store language (or other attributes
such as datatypes) information? Does this approach work if labels are the same
in different languages (does it break the inverted index)? How is performance
compared to the two other approaches? Any pointer to source code showing
how it is done would help.
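
For comparison, the first approach listed above (one field per language) can be sketched in plain Java as follows. This is a toy in-memory model, not Lucene code; the class `PerLanguageFields` and the `label_<language>` field naming are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of "one field per language": each language gets its own term
// dictionary, keyed by a hypothetical field name such as "label_en".
final class PerLanguageFields {
    // field name -> term -> concept ids
    private final Map<String, Map<String, Set<String>>> fields = new HashMap<>();

    // Hypothetical naming convention for the per-language field.
    static String fieldFor(String language) {
        return "label_" + language;
    }

    void add(String conceptId, String language, String label) {
        fields.computeIfAbsent(fieldFor(language), k -> new HashMap<>())
              .computeIfAbsent(label.toLowerCase(Locale.ROOT), k -> new TreeSet<>())
              .add(conceptId);
    }

    // Searching several languages means querying several fields and merging.
    Set<String> search(String label, List<String> languages) {
        Set<String> hits = new TreeSet<>();
        String term = label.toLowerCase(Locale.ROOT);
        for (String language : languages) {
            Map<String, Set<String>> field = fields.get(fieldFor(language));
            if (field != null) {
                hits.addAll(field.getOrDefault(term, Collections.emptySet()));
            }
        }
        return hits;
    }
}
```

The M-indexes variant has the same shape, with each inner map replaced by a separate index. In both cases identical labels in different languages never collide, because each language has its own term dictionary.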

Thanks

--
Stephane Fellah, M.Sc, B.Sc
Principal Engineer/Product Manager
smartRealm LLC
201 Loudoun St. SW
Leesburg, VA 20175
Tel: 703 669 5514
Cell: 571 502 8478
Fax: 703 669 5515

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



