I've never used MoreLikeThis myself, but based on how i know it works, your problem probably has more to do with the size of your test corpus and th frequency of the words in your docs then by the size of the docs themselves.
: There's still the issue of the queries from MoreLikeThis not : returning results for terms I had expected ("bikes"). A quick glance at the source for MoreLikeThis turns up these lines... /** * Ignore terms with less than this frequency in the source doc. * @see #getMinTermFreq * @see #setMinTermFreq */ public static final int DEFAULT_MIN_TERM_FREQ = 2; /** * Ignore words which do not occur in at least this many docs. * @see #getMinDocFreq * @see #setMinDocFreq */ public static final int DEFALT_MIN_DOC_FREQ = 5; ...which i'm guessing mean that unless a word appears in a doc at least twice, it's ignored for that doc, and unless a word appears in at least 5 docs, it's ignored completely. that could easily explain your bike examples. : I then loaded some large (5K+) documents and I noticed that : MoreLikeThis's query started to return similar documents, but explain : () said they were similar because of words like "from" and "can" rather : than the text I expected to be used for similarity in the documents. Other then a stop words list, one other thing you might consider is the notion of a "maxDocFreq" option you could set to ignore words that appear in lots of documents -- or a maxDocFreqRatio that would take a percentage of the total number of docs ... it should be fairly straightforward to add. -Hoss --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]