Michael McCandless wrote:
I don't understand how this would address the "docFreq does
not reflect deletions".
Bad mail-quoting, sorry. I am not interested by document deletion, I
just index Wikipedia once, and want to get a co-occurrence-based
similarity distance between words called NGD (normalized google
distance). No documents get deleted. I wanted to do it the cheap way by
indexing everything and than counting hits but it's not fast enough (I
want to calculate all distances between all the words, that is approx
10,000 x 10,000 comparisons).
You can use the shingles analyzer (under contrib/analyzers)
to create and index bigrams. (But the docFreq would still not
reflect deletions).
This similarity measure is based on co-occurence within a text window.
If I'm not mistaken, bigrams require the words to be exactly next to
each other. But thanks for pointing out that analyzer, I may be
interested in some n-grams processing later.
Adrian.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org