Hi Jack, thanks for your ideas. I've added some comments to your
questions; maybe you can shed some more light on this...
On 01/08/2013 11:34 PM, Jack Krupansky wrote:
The term "arv" is on the first list, but not the second. Maybe its
document frequency fell below the setting for minimum document frequency
on the second run.
Or, maybe the minimum word length was set to 4 or more on the second run.
The same parameters are used (same code) for each run; all that changes
is the path, which points to a different folder, one containing 16 files,
the other 4. The smaller folder was made by simply deleting the
unwanted 12 files.
Are you using MoreLikeThisQuery or directly using MoreLikeThis?
I'm using MoreLikeThis directly. Does this make a difference?
Or, possibly "arv" appears later in a document on the second run, after
the number of tokens specified by maxNumTokensParsed.
The files used in the second run are identical, and each file is read
from disk and indexed individually (as is common, I'm sure). I have
checked this: when all 16 files are indexed together, the results are
identical on every run, and the same holds for the 4-file runs. I.e. the
outcomes for both 16 and 4 files can be reproduced.
The reason for my question (and for doing these runs) is that I'm using
Lucene in an application where I want to use the similarity measurements
between documents as a metric in another area. If the similarity score
changes when the size of the index changes, I need to understand why.
thanks again,
Peter
-- Jack Krupansky
-----Original Message----- From: Peter Lavin
Sent: Tuesday, January 08, 2013 1:46 PM
To: java-user@lucene.apache.org
Subject: Differences in MLT Query Terms Question
Dear Users,
I am running some simple experiments with Lucene and am seeing something
I don't understand.
I have 16 text files on 4 different topics, ranging in size from 50-900
KB. When I index all 16 of these and run an MLT query based on one of
the indexed documents, I get an expected result (i.e. similar topics are
found).
When I reduce the number of text files to 4 and index them (having taken
care to overwrite the previous index files), and then run the same MLT
query (based on the same document from the index), I get slightly
different scores. I'm assuming this is because the IDF is now different
because there are fewer documents.
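To make the IDF effect concrete, here is a minimal sketch. It assumes the
classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)); the
class name IdfSketch and the document counts are made up for illustration
and are not taken from the actual index.

```java
public class IdfSketch {

    // Classic TF-IDF style inverse document frequency (assumed formula).
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        // Hypothetical term that occurs in 4 of the 16 documents.
        double idfLarge = idf(4, 16);
        // The same term after 12 files are deleted, assuming it now
        // occurs in only 1 of the remaining 4 documents.
        double idfSmall = idf(1, 4);

        System.out.println("idf in 16-doc index: " + idfLarge);
        System.out.println("idf in  4-doc index: " + idfSmall);
        // Any score built from tf * idf moves with these values, even
        // though the query document itself has not changed.
    }
}
```

Both numDocs and docFreq are properties of the whole index, so a term's
weight can change when files are removed even if the document the query is
built from stays the same.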
For each run, I have set the max number of terms as...
mlt.setMaxQueryTerms(100);
However, when I compare the terms which get used for the MLT query on
the 16 document index and the 4 document index, they are slightly
different. I've printed, parsed and sorted them into two columns of a
CSV file. I've pasted a small part of it at the end of this email.
My Question(s)...
1) Can anybody explain why the set of terms used for the MLT query is
different when the source document comes from an index of 16 documents
rather than one of 4 documents?
2) Am I right in assuming that the reason for the slightly different
scores is the IDF, or could it be this slight difference in the sets of
terms used (or possibly both)?
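One way both effects could arise together is sketched below. This is not
Lucene's code: it assumes that candidate terms are dropped when their
document frequency is below a minimum, that the rest are ranked by
tf * idf, and that only the top maxQueryTerms are kept. The class
TermSelectionSketch, its helper methods, and all counts are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermSelectionSketch {

    // Assumed classic TF-IDF style inverse document frequency.
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Small helper to build a term -> count map from parallel arrays.
    public static Map<String, Integer> counts(String[] terms, int[] values) {
        Map<String, Integer> m = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            m.put(terms[i], values[i]);
        }
        return m;
    }

    // Drop terms below minDocFreq, rank the rest by tf * idf (highest
    // first), and keep at most maxQueryTerms of them.
    public static List<String> selectTerms(Map<String, Integer> termFreq,
                                           Map<String, Integer> docFreq,
                                           int numDocs,
                                           int minDocFreq,
                                           int maxQueryTerms) {
        final Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            Integer boxed = docFreq.get(e.getKey());
            int df = (boxed == null) ? 0 : boxed;
            if (df < minDocFreq) {
                continue; // too rare in this index to be used
            }
            scores.put(e.getKey(), e.getValue() * idf(df, numDocs));
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        Collections.sort(ranked,
                (a, b) -> Double.compare(scores.get(b), scores.get(a)));
        int keep = Math.min(maxQueryTerms, ranked.size());
        return new ArrayList<>(ranked.subList(0, keep));
    }

    public static void main(String[] args) {
        String[] terms = {"arv", "grid", "lucene"};
        // Term frequencies in the query document: same in both runs.
        Map<String, Integer> tf = counts(terms, new int[] {10, 12, 8});
        // Hypothetical document frequencies in each index.
        Map<String, Integer> df16 = counts(terms, new int[] {2, 8, 16});
        Map<String, Integer> df4 = counts(terms, new int[] {1, 2, 4});

        System.out.println("16 docs: " + selectTerms(tf, df16, 16, 2, 2));
        System.out.println(" 4 docs: " + selectTerms(tf, df4, 4, 2, 2));
    }
}
```

With these made-up counts, "arv" is selected from the 16-document index
but falls below the minimum document frequency in the 4-document index, so
another term takes its place. The query document and the parameters are
identical in both calls; only the index statistics differ.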
regards,
Peter
--
with best regards,
Peter Lavin,
PhD Candidate,
CAG - Computer Architecture & Grid Research Group,
Lloyd Institute, 005,
Trinity College Dublin, Ireland.
+353 1 8961536