Isn't this approach somewhat bad for term frequency? Words that appear in several languages would be much more frequent overall, and hence less significant.

I still prefer the split-field method with proper query expansion: that way, term frequency is evaluated against the corpus of a single language.
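As a rough illustration, query expansion over per-language fields could look like the sketch below. This assumes the Lucene 3.x-era API; the "text_<lang>" field naming and the analyzer map are my own assumptions, not anything established in this thread.

import java.util.List;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class SplitFieldSearch {

    // One analyzer per language, e.g. "en" -> EnglishAnalyzer, "fr" -> FrenchAnalyzer.
    private final Map<String, Analyzer> analyzers;

    public SplitFieldSearch(Map<String, Analyzer> analyzers) {
        this.analyzers = analyzers;
    }

    // Expand the user's query over one field per expected language, e.g. the
    // profile language plus the browser's Accept-Language list, capped at ~8.
    public Query expand(String userQuery, List<String> languages) throws ParseException {
        BooleanQuery expanded = new BooleanQuery();
        for (String lang : languages) {
            Analyzer analyzer = analyzers.get(lang);
            if (analyzer == null) {
                continue; // no dedicated field for this language
            }
            // Each language's clause is parsed with that language's analyzer
            // against that language's field, so stemming and stop words match
            // how the field was indexed.
            QueryParser parser = new QueryParser(Version.LUCENE_30, "text_" + lang, analyzer);
            expanded.add(parser.parse(userQuery), BooleanClause.Occur.SHOULD);
        }
        return expanded;
    }
}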
Dominique, in your case, at least on the web, you have:
- the user's preferred language (if defined in a profile)
- the list of languages the browser says it accepts

That can easily be limited to around 8 languages, so you cover any language the user could expect to search in.

paul

On 20 Jan 2011, at 10:46, Dominique Bejean wrote:

> Hi,
>
> During a recent Solr project we needed to index documents in many
> languages. The natural solution with Lucene and Solr is to define one field
> per language. Each field is configured in the schema.xml file to use
> language-specific processing (tokenizing, stop words, stemmer, ...). This is
> really not easy to manage if you have a lot of languages, and it means that
> 1) the search interface needs to know which language you are searching in, and
> 2) the search interface can't search all languages at the same time.
>
> So, I decided that the only solution was to index all languages in a single
> field.
>
> Obviously, each language still needs to be processed specifically. For this, I
> developed an analyzer that redirects content to the correct tokenizer,
> filters and stemmer according to its language. This analyzer is also used at
> query time. If the user specifies the language of the query, the query is
> processed by the appropriate tokenizer, filters and stemmer; otherwise the
> query is processed by a default tokenizer, filters and stemmer.
>
> With this solution:
>
> 1. I only need one field (or two if I want both stemmed and unstemmed
> processing).
> 2. The user can search all documents regardless of their language.
>
> I hope this helps.
>
> Dominique
> www.zoonix.fr
> www.crawl-anywhere.com
>
>
> On 20/01/11 00:29, Bill Janssen wrote:
>> Paul Libbrecht <p...@hoplahup.net> wrote:
>>
>>> I made several changes of this sort, and the precision and recall
>>> measures improved, particularly in the presence of language-indication
>>> failures, which happened to be very common in our authoring environment.
>>
>> There are two kinds of failure: no language, or the wrong language.
>>
>> For no language, I fall back to StandardAnalyzer, so I should have
>> results similar to yours. For the wrong language, well, I'm using OTS
>> trigram-based language guessers, and they're pretty good these days.
>>
>>>>> Wouldn't it be better to prefer precise matches (a field that is
>>>>> analyzed with StandardAnalyzer, for example) but also allow matches
>>>>> that are stemmed?
>>
>> Yes, I think it might improve things, but again, by how much? Stemming is
>> better than no stemming in terms of recall, but this approach would also
>> improve precision.
>>
>> Bill
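For reference, the single-field delegating analyzer Dominique describes might be sketched roughly as below. This again assumes the Lucene 3.x Analyzer API; the class name, the language map, and the setLanguage hook are hypothetical, and a real implementation would also need to address thread safety and per-document language detection.

import java.io.Reader;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class LanguageDelegatingAnalyzer extends Analyzer {

    // One delegate per language, e.g. "en" -> EnglishAnalyzer, "fr" -> FrenchAnalyzer.
    private final Map<String, Analyzer> byLanguage;

    // Default processing for unknown or missing languages.
    private final Analyzer fallback = new StandardAnalyzer(Version.LUCENE_30);

    // Language of the document (or query) currently being analyzed;
    // must be set before tokenStream() is called. Not thread-safe as written.
    private String currentLanguage;

    public LanguageDelegatingAnalyzer(Map<String, Analyzer> byLanguage) {
        this.byLanguage = byLanguage;
    }

    public void setLanguage(String language) {
        this.currentLanguage = language;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Redirect the content to the tokenizer/filter/stemmer chain of the
        // current language, falling back to default processing otherwise.
        Analyzer delegate = byLanguage.get(currentLanguage);
        if (delegate == null) {
            delegate = fallback;
        }
        return delegate.tokenStream(fieldName, reader);
    }
}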