Re: Designing a multilingual index

2012-01-03 Thread Robert Muir
On Tue, Jan 3, 2012 at 10:10 AM, Paul Libbrecht wrote: > I think the idf is also about terms and not about tokens. > Maybe an expert can confirm my belief or we have to invent a test. > idf is computed from docFreq and maxDoc. docFreq is per-field; maxDoc is not. This might not even matter, though. If you are
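To make the docFreq/maxDoc point concrete, here is a minimal sketch in plain Python (not Lucene code). It assumes the classic Lucene-style formula idf = 1 + ln(maxDoc / (docFreq + 1)); the field names are borrowed from elsewhere in this thread and the document counts are invented for illustration.

```python
import math

def classic_idf(doc_freq, max_doc):
    """idf in the style of Lucene's classic similarity (assumed formula):
    1 + ln(maxDoc / (docFreq + 1))."""
    return 1.0 + math.log(max_doc / (doc_freq + 1))

# Hypothetical index: 1000 documents overall, most of them English.
MAX_DOC = 1000  # index-wide: the same value is used for every field

# docFreq is counted per (field, token) pair, so each language field
# has its own count for the same token.
doc_freq = {
    ("text-stemmed-en", "firewall"): 90,
    ("text-stemmed-de", "firewall"): 9,
}

idf_en = classic_idf(doc_freq[("text-stemmed-en", "firewall")], MAX_DOC)
idf_de = classic_idf(doc_freq[("text-stemmed-de", "firewall")], MAX_DOC)
print(f"idf en={idf_en:.3f} de={idf_de:.3f}")
```

Under these assumptions the German term gets a higher idf simply because the German field is populated in fewer documents while maxDoc still counts the whole index.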

Re: Designing a multilingual index

2012-01-03 Thread Paul Libbrecht
I think the idf is also about terms and not about tokens. Maybe an expert can confirm my belief or we have to invent a test. paul On 3 Jan 2012, at 15:43, heikki wrote: > hi Paul, > > yes, but my concern isn't about the term frequency, but rather the > inverse document frequency, which als

Re: Designing a multilingual index

2012-01-03 Thread heikki
Hi Paul, yes, but my concern isn't about the term frequency, but rather the inverse document frequency, which is also used in the relevance score and which takes into account all documents in the index. In this way the relevance score of one document is influenced by the contents of all other do

Re: Designing a multilingual index

2012-01-03 Thread Paul Libbrecht
Heikki, it does solve your main concern: a term in Lucene is a pair of a token and a field name. The term frequency is, thus, the frequency of a token in a field. So the term frequency of text-stemmed-de:firewall is independent of the term frequency of text-stemmed-en:firewall (for example). But
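The "term = (field name, token)" idea can be sketched in a few lines of plain Python. This is only an illustrative model, not Lucene's API; the document contents and the helper name `term_frequencies` are made up.

```python
from collections import Counter

def term_frequencies(doc):
    """Count occurrences per (field, token) pair, so the same token
    in two different fields is counted as two different terms."""
    counts = Counter()
    for field, tokens in doc.items():
        for token in tokens:
            counts[(field, token)] += 1
    return counts

# Hypothetical document with one analyzed field per language.
doc = {
    "text-stemmed-en": ["firewall", "rule", "firewall"],
    "text-stemmed-de": ["firewall", "regel"],
}

tf = term_frequencies(doc)
print(tf[("text-stemmed-en", "firewall")])
print(tf[("text-stemmed-de", "firewall")])
```

The two counts are kept apart because the field name is part of the key, which is the independence described above.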

Re: Designing a multilingual index

2012-01-03 Thread heikki
Hi, thanks for your response: > On the web it is often hard to trust such (e.g. because of people working in multiple languages, internet cafés...) but... it is your choice. Our web app has a language selector for the user to choose the GUI language > After? > Would "shallow matches" in the righ

Re: Designing a multilingual index

2012-01-03 Thread Paul Libbrecht
On 3 Jan 2012, at 13:56, heikki wrote: > In our case, it is "known" in which language the user is searching (because > he tells us, and if he doesn't, we use the current GUI language). On the web it is often hard to trust such (e.g. because of people working in multiple languages, internet c

Re: Designing a multilingual index

2012-01-03 Thread heikki
appropriate stopwords/different analyzers when indexing and searching a particular language, but that's a different issue obviously. thanks in advance, Heikki Doeleman -- View this message in context: http://lucene.472066.n3.nabble.com/Designing-a-multiling

Re: Designing a multilingual index

2010-04-02 Thread henrib
sion. The simple route is to ignore the language, use n-grams, forget stemmers et al. and just fire; recall will likely be good, precision not so much. Cheers Henrib -- View this message in context: http://n3.nabble.com/Designing-a-multilingual-index-tp688766p692481.html Sent from the Lucene - J
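As a rough illustration of the language-agnostic n-gram route, the sketch below splits text into character trigrams and compares two strings by trigram overlap. It is plain Python with hypothetical helper names (`char_ngrams`, `overlap`), not a Lucene analyzer, and the Jaccard measure is just one possible choice for the comparison.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def overlap(a, b, n=3):
    """Jaccard overlap of the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

print(char_ngrams("segment"))
print(overlap("segment", "segments"))
print(overlap("segment", "firewall"))
```

No stemmer or language detection is involved, which is why inflected forms still match to a degree (good recall) while unrelated words that share fragments can match as well (weaker precision).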

Re: Designing a multilingual index

2010-04-02 Thread Paul Libbrecht
On 1 Apr 2010, at 16:29, henrib wrote: By issuing multiple queries, one against each localized index, results being clustered by locale. You can further refine by translating the end-user input query terms for each locale and issue "translated" queries against the respective indices. I've
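The "one query per localized index, results clustered by locale" approach might look roughly like this. The sketch uses tiny in-memory dictionaries in place of real indices, and the translation table is hard-coded; all names and data here are hypothetical.

```python
def search_all_locales(indices, query_by_locale):
    """Run one (possibly translated) query per localized index and
    return the hits grouped by locale."""
    clusters = {}
    for locale, index in indices.items():
        query = query_by_locale.get(locale)
        if query is None:
            continue  # no translation available for this locale
        clusters[locale] = sorted(
            doc_id for doc_id, tokens in index.items() if query in tokens
        )
    return clusters

# Hypothetical localized indices: doc id -> set of indexed tokens.
indices = {
    "en": {"doc1": {"segment", "line"}, "doc2": {"circle"}},
    "fr": {"doc3": {"segment", "droite"}, "doc4": {"cercle"}},
}

# Hypothetical translation of the user's query term into each locale.
translated = {"en": "circle", "fr": "cercle"}

clusters = search_all_locales(indices, translated)
print(clusters)
```

Because each locale keeps its own hit list, no cross-index score merging is needed; the cost is that one document present in several indices can show up in several clusters.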

Re: Designing a multilingual index

2010-04-01 Thread henrib
ith "key" terms dictionaries. -- View this message in context: http://n3.nabble.com/Designing-a-multilingual-index-tp688766p691687.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsub

Re: Designing a multilingual index

2010-04-01 Thread henrib
per names), then how do I go about mixing the two > rankings? (as I don't want to display the same result twice) I think using > a single index would solve that problem, since the ranking would already > take all fields (hence all languages) into account. > If we consider a uniq

Re: Designing a multilingual index

2010-04-01 Thread henrib
ith "key" terms dictionaries. -- View this message in context: http://n3.nabble.com/Designing-a-multilingual-index-tp688766p690881.html

Re: Designing a multilingual index

2010-04-01 Thread henrib
ent in the French index (which would quite often be the > case for e.g. proper names), then how do I go about mixing the two > rankings? (as I don't want to display the same result twice) I think using > a single index would solve that problem, since the ranking would already > take

Re: Designing a multilingual index

2010-04-01 Thread henrib
ith "key" terms dictionaries. Henri -- View this message in context: http://n3.nabble.com/Designing-a-multilingual-index-tp688766p691744.html

Re: Designing a multilingual index

2010-04-01 Thread David Vergnaud
using a single index would solve that problem, since the ranking would already take all fields (hence all languages) into account. Cheers, David - Original Message From: henrib To: java-user@lucene.apache.org Sent: Thu, April 1, 2010 2:19:07 PM Subject: Re: Designing a multilingual ind

Re: Designing a multilingual index

2010-04-01 Thread Paul Libbrecht
How? paul On 1 Apr 2010, at 14:19, henrib wrote: Finally, query expansion can also be used in the multiple indices case and might even use automated/guided translation.

Re: Designing a multilingual index

2010-04-01 Thread henrib
his message in context: http://n3.nabble.com/Designing-a-multilingual-index-tp688766p690625.html

Re: Designing a multilingual index

2010-04-01 Thread David Vergnaud
rience. Does anyone have any technical arguments why the one (several indices) or the other (localized fields in a single index) method might be better? Cheers, David - Original Message From: Paul Libbrecht To: java-user@lucene.apache.org Sent: Wed, March 31, 2010 10:00:

Re: Designing a multilingual index

2010-03-31 Thread Paul Libbrecht
David, I'm doing exactly that. And I think there's one further crucial advantage: multilingual queries. If your user requests "segment", you have no way to know which language he is searching in; erm, well, you have the user's language(s) (through the browser's Accept-Language header, for example)
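For the Accept-Language hint mentioned above, a minimal parsing sketch could look like the following. It is plain Python with a hypothetical helper name, handles only the common `tag;q=value` form, and is not a complete implementation of the HTTP header grammar.

```python
def parse_accept_language(header):
    """Return language tags ordered by descending q-value.
    Tags without an explicit q-value default to 1.0; q=0 entries are dropped."""
    prefs = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        tag = pieces[0]
        if not tag:
            continue
        quality = 1.0
        for param in pieces[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0  # treat an unparsable weight as "not wanted"
        prefs.append((tag, quality))
    # sorted() is stable, so equal weights keep their header order.
    ranked = sorted(prefs, key=lambda pair: -pair[1])
    return [tag for tag, quality in ranked if quality > 0]

print(parse_accept_language("fr-FR,fr;q=0.9,en;q=0.8,de;q=0.7"))
```

The resulting list could then decide which language-specific fields to query first, with the remaining languages used as a fallback.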

Designing a multilingual index

2010-03-31 Thread David Vergnaud
Hi everyone! I'm about to build a search engine that will handle documents in several languages (4 for now but the number will increase in the near future). In order to index them properly and offer the best user experience, I'm automatically recognizing the language of each document in order t