Hi,
During a recent Solr project we needed to index documents in many
languages. The natural solution with Lucene and Solr is to define one
field per language. Each field is configured in the schema.xml file to
use language-specific processing (tokenizing, stop words, stemming,
...). This is really not easy to manage if you have a lot of languages,
and it means that 1) the search interface needs to know in which
language you are searching and 2) the search interface can't search in
all languages at the same time.
So I decided that the only solution was to index all languages in a
single field.
Obviously, each language needs to be processed specifically. For this, I
developed an analyzer that is in charge of redirecting content to the
correct tokenizer, filters and stemmer according to its language.
This analyzer is also used at query time. If the user specifies the
language of the query, the query is processed by the appropriate
tokenizer, filters and stemmer; otherwise the query is processed by a
default tokenizer, filters and stemmer.
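As an illustration only, the dispatch idea can be sketched in plain Java. This is not the analyzer described above and it does not use the Lucene Analyzer API; the class name LanguageDispatchSketch, the tiny stop-word lists and the naive "strip a trailing s" stemmer are all hypothetical placeholders, chosen just to show the routing and the fallback to a default chain.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class LanguageDispatchSketch {

    // Hypothetical registry: language code -> processing chain (text in, terms out).
    static final Map<String, Function<String, List<String>>> CHAINS = new HashMap<>();

    // Default chain: tokenize and lowercase only, no stop words, no stemming.
    static final Function<String, List<String>> DEFAULT_CHAIN =
            chain(Set.of(), term -> term);

    static {
        // Toy stop-word lists and a toy stemmer, for illustration only.
        Function<String, String> stripPluralS =
                t -> (t.length() > 3 && t.endsWith("s")) ? t.substring(0, t.length() - 1) : t;
        CHAINS.put("en", chain(Set.of("the", "a", "of"), stripPluralS));
        CHAINS.put("fr", chain(Set.of("le", "la", "les", "des"), stripPluralS));
    }

    // Split on anything that is not a letter or digit, and lowercase the tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Build one chain: tokenizer -> stop-word filter -> stemmer.
    static Function<String, List<String>> chain(Set<String> stopWords,
                                                Function<String, String> stemmer) {
        return text -> {
            List<String> terms = new ArrayList<>();
            for (String t : tokenize(text)) {
                if (!stopWords.contains(t)) {
                    terms.add(stemmer.apply(t));
                }
            }
            return terms;
        };
    }

    // Used for both indexing and querying: route by language,
    // fall back to the default chain when the language is missing or unknown.
    static List<String> analyze(String language, String text) {
        Function<String, List<String>> c =
                (language == null) ? DEFAULT_CHAIN : CHAINS.getOrDefault(language, DEFAULT_CHAIN);
        return c.apply(text);
    }

    public static void main(String[] args) {
        System.out.println(analyze("en", "The languages of the documents"));
        System.out.println(analyze("fr", "Les langues des documents"));
        System.out.println(analyze(null, "The languages")); // no language given: default chain
    }
}
```

All terms end up in the same list whatever the language, which mirrors the single-field approach: the language only selects how the text is processed, not where it is stored.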
With this solution:
1. I only need one field (or two if I want both stemmed and unstemmed
processing)
2. The user can search in all documents regardless of their language
I hope this helps.
Dominique
www.zoonix.fr
www.crawl-anywhere.com
On 20/01/11 00:29, Bill Janssen wrote:
Paul Libbrecht <p...@hoplahup.net> wrote:
I made several changes of this sort, and the precision and recall
measures improved, in particular in the presence of language-indication
failures, which happened to be very common in our authoring environment.
There are two kinds of failures: no language, or wrong language.
For no language, I fall back to StandardAnalyzer, so I should have
results similar to yours. For wrong language, well, I'm using OTS
trigram-based language guessers, and they're pretty good these days.
Wouldn't it be better to prefer precise matches (a field that is
analyzed with StandardAnalyzer, for example) but also allow matches that
are stemmed?
Yes, I think it might improve things, but again, by how much? Stemming is
better than no stemming, in terms of recall. But this approach would also
improve precision.
Bill
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org