On Tue, Jan 3, 2012 at 10:10 AM, Paul Libbrecht wrote:
> I think the idf is also about terms and not about tokens.
> Maybe an expert can confirm my belief or we have to invent a test.
>
idf is computed from docFreq and maxDoc.
docFreq is per-field, maxDoc is not. This might not even matter though.
if you are
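To make the docFreq/maxDoc point concrete, here is a minimal sketch in plain Python (not Lucene code). It assumes the classic formula idf = 1 + ln(maxDoc / (docFreq + 1)) used by Lucene's default similarity of that era; the counts below are invented for illustration.

```python
import math

def classic_idf(doc_freq: int, max_doc: int) -> float:
    """Assumed classic formula: idf = 1 + ln(max_doc / (doc_freq + 1))."""
    return 1.0 + math.log(max_doc / (doc_freq + 1))

# maxDoc is a single number for the whole index (illustrative value).
max_doc = 1000

# docFreq is counted per (field, token) term (illustrative counts).
doc_freq = {
    ("text-stemmed-en", "firewall"): 99,
    ("text-stemmed-de", "firewall"): 9,
}

idf = {term: classic_idf(df, max_doc) for term, df in doc_freq.items()}
for term, value in idf.items():
    print(term, round(value, 3))
```

Under these assumptions the same token gets a different idf in each field, because only docFreq changes while maxDoc is shared.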
I think the idf is also about terms and not about tokens.
Maybe an expert can confirm my belief or we have to invent a test.
paul
On Jan 3, 2012, at 15:43, heikki wrote:
> hi Paul,
>
> yes, but my concern isn't about the term frequency, but rather the
> inverse document frequency, which als
hi Paul,
yes, but my concern isn't about the term frequency, but rather the
inverse document frequency, which is also used in the relevance score and
which takes into account all documents in the index. In this way the
relevance score of one document is influenced by the contents of all other
do
Heikki,
it does solve your main concern: a term in Lucene is a pair of a token and
a field name.
The term frequency is, thus, the frequency of a token in a field.
So the term frequency of text-stemmed-de:firewall is independent of the
term frequency of text-stemmed-en:firewall (for example).
But
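A small sketch of the "term = (field, token)" idea, in plain Python rather than Lucene's API. The document contents are made up; only the field names come from the message above.

```python
from collections import Counter

def term_frequencies(doc):
    """Count (field, token) pairs; doc maps a field name to its token list."""
    return Counter(
        (field, token)
        for field, tokens in doc.items()
        for token in tokens
    )

# Hypothetical document with one analyzed field per language.
doc = {
    "text-stemmed-en": ["firewall", "rule", "firewall"],
    "text-stemmed-de": ["firewall", "regel"],
}

tf = term_frequencies(doc)
print(tf[("text-stemmed-en", "firewall")])
print(tf[("text-stemmed-de", "firewall")])
```

Because the field name is part of the key, the two "firewall" counts never mix.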
hi,
thanks for your response :
> On the web it is often hard to trust such (e.g. because of people working
in multiple languages, internet cafés...) but... it is your choice.
our web app has a language selector for the user to choose the GUI language
>After?
>Would "shallow matches" in the righ
On Jan 3, 2012, at 13:56, heikki wrote:
> In our case, it is "known" in which language the user is searching (because
> he tells us, and if he doesn't, we use the current GUI language).
On the web it is often hard to trust such (e.g. because of people working in
multiple languages, internet c
appropriate stopwords/different analyzers when indexing and searching a
particular language, but that's a different issue obviously.
thanks in advance,
Heikki Doeleman
--
View this message in context:
http://lucene.472066.n3.nabble.com/Designing-a-multiling
sion.
The simple route is to ignore the language, use ngrams, forget stemmers & al
and just fire; recall will likely be good, precision not that much.
Cheers
Henrib
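For illustration, the language-agnostic n-gram route could look roughly like this in plain Python. The trigram size and the overlap measure are assumptions made for this sketch, not something stated in the thread.

```python
def char_ngrams(text, n=3):
    """Lowercase, collapse whitespace, and return overlapping character n-grams."""
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]

def ngram_overlap(query, document, n=3):
    """Fraction of the query's distinct n-grams that also occur in the document."""
    q = set(char_ngrams(query, n))
    d = set(char_ngrams(document, n))
    return len(q & d) / len(q) if q else 0.0

print(char_ngrams("Fire"))
print(ngram_overlap("firewall", "Firewalls"))
print(ngram_overlap("firewall", "segment"))
```

No stemmer or language detection is involved, which is why inflected forms still match (good recall) but unrelated words sharing n-grams can match too (weaker precision).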
--
View this message in context:
http://n3.nabble.com/Designing-a-multilingual-index-tp688766p692481.html
Sent from the Lucene - J
On Apr 1, 2010, at 16:29, henrib wrote:
By issuing multiple queries, one against each localized index, results
being clustered by locale.
You can further refine by translating the end-user input query terms for
each locale and issue "translated" queries against the respective indices.
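A rough sketch of that approach, using toy in-memory "indices" in plain Python instead of real Lucene indices. The index contents, the scoring (number of matching terms) and the translation dictionary are all hypothetical.

```python
def search_index(index, terms):
    """Return (doc_id, score) pairs; score = number of query terms the doc contains."""
    hits = []
    for doc_id, tokens in index.items():
        score = sum(1 for term in terms if term in tokens)
        if score:
            hits.append((doc_id, score))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))

def search_by_locale(indices, terms, translations=None):
    """Query every localized index, translating terms per locale when possible."""
    translations = translations or {}
    results = {}
    for locale, index in indices.items():
        dictionary = translations.get(locale, {})
        local_terms = [dictionary.get(term, term) for term in terms]
        results[locale] = search_index(index, local_terms)
    return results

# Hypothetical data: one toy index per locale, doc id -> set of tokens.
indices = {
    "en": {"en-1": {"network", "segment"}, "en-2": {"firewall"}},
    "fr": {"fr-1": {"segment", "réseau"}},
}
translations = {"fr": {"network": "réseau"}}

clustered = search_by_locale(indices, ["network", "segment"], translations)
print(clustered)
```

The result stays grouped by locale, so each cluster can be displayed separately without comparing scores across indices.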
I've
ith "key" terms dictionaries.
--
View this message in context:
http://n3.nabble.com/Designing-a-multilingual-index-tp688766p691687.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
-
To unsub
per names), then how do I go about mixing the two
> rankings? (as I don't want to display the same result twice) I think using
> a single index would solve that problem, since the ranking would already
> take all fields (hence all languages) into account.
>
If we consider a uniq
ith "key" terms dictionaries.
--
View this message in context:
http://n3.nabble.com/Designing-a-multilingual-index-tp688766p690881.html
ent in the French index (which would quite often be the
> case for e.g. proper names), then how do I go about mixing the two
> rankings? (as I don't want to display the same result twice) I think using
> a single index would solve that problem, since the ranking would already
> take
ith "key" terms dictionaries.
Henri
--
View this message in context:
http://n3.nabble.com/Designing-a-multilingual-index-tp688766p691744.html
using a single index would solve
that problem, since the ranking would already take all fields (hence all
languages) into account.
Cheers,
David
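For the multiple-indices case, one possible way to mix two rankings without showing a result twice is sketched below in plain Python. It assumes that both indices share a document id and that scores are positive; normalizing each list by its top score is an arbitrary choice made here, since raw scores from different indices are not directly comparable.

```python
def merge_rankings(*result_lists):
    """Merge (doc_id, score) lists, keeping each doc once with its best normalized score."""
    best = {}
    for hits in result_lists:
        if not hits:
            continue
        top = max(score for _, score in hits)
        if top <= 0:
            continue  # assumption: scores are positive; skip degenerate lists
        for doc_id, score in hits:
            normalized = score / top
            if normalized > best.get(doc_id, 0.0):
                best[doc_id] = normalized
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical hits from an English and a French index.
english_hits = [("doc-7", 4.0), ("doc-2", 1.0)]
french_hits = [("doc-7", 1.5), ("doc-9", 3.0)]

merged = merge_rankings(english_hits, french_hits)
print(merged)
```

With a single index and localized fields this merging step disappears, which is the advantage described above.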
- Original Message
From: henrib
To: java-user@lucene.apache.org
Sent: Thu, April 1, 2010 2:19:07 PM
Subject: Re: Designing a multilingual ind
How?
paul
On Apr 1, 2010, at 14:19, henrib wrote:
Finally, query expansion can also be used in the multiple indices case and
might even use automated/guided translation.
-
To unsubscribe, e-mail: java-user-unsubscr...@lu
his message in context:
http://n3.nabble.com/Designing-a-multilingual-index-tp688766p690625.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
rience.
Does anyone have any technical arguments why the one (several indices) or the
other (localized fields in a single index) method might be better?
Cheers,
David
- Original Message
From: Paul Libbrecht
To: java-user@lucene.apache.org
Sent: Wed, March 31, 2010 10:00:
David,
I'm doing exactly that.
And I think there's one crucial advantage aside: multilingual queries:
if your user requests "segment" you have no way to know which language
he is searching for; erm, well, you have the user-language(s) (through
the browser Accept-Language header for example)
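As an illustration of reading the user's languages from the browser, here is a minimal Accept-Language parser in plain Python. It is a simplified sketch (it only handles the comma-separated tags and optional q-values) and the header value is an example, not taken from the thread.

```python
def parse_accept_language(header):
    """Return language tags ordered by descending q-value (default q = 1.0)."""
    prefs = []
    for position, part in enumerate(header.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed q-value as "not acceptable"
        if quality > 0:
            # position keeps the header order for tags with equal q-values
            prefs.append((-quality, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(prefs)]

langs = parse_accept_language("fr-CH, fr;q=0.9, en;q=0.8, de;q=0.7, *;q=0.5")
print(langs)
```

The resulting ordered list could then decide which localized fields (or indices) to query first for an ambiguous word such as "segment".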
Hi everyone!
I'm about to build a search engine that will handle documents in several
languages (4 for now but the number will increase in the near future). In order
to index them properly and offer the best user experience, I'm automatically
recognizing the language of each document in order t