Re: More like this returning similarities that are too generic

Chris Hostetter Tue, 08 Aug 2006 00:46:21 -0700

I've never used MoreLikeThis myself, but based on how i know it works,
your problem probably has more to do with the size of your test corpus and
th frequency of the words in your docs then by the size of the docs
themselves.


: There's still the issue of the queries from MoreLikeThis not
: returning results for terms I had expected ("bikes").

A quick glance at the source for MoreLikeThis turns up these lines...

    /**
     * Ignore terms with less than this frequency in the source doc.
         * @see #getMinTermFreq
         * @see #setMinTermFreq
     */
    public static final int DEFAULT_MIN_TERM_FREQ = 2;

    /**
     * Ignore words which do not occur in at least this many docs.
         * @see #getMinDocFreq
         * @see #setMinDocFreq
     */
    public static final int DEFALT_MIN_DOC_FREQ = 5;

...which i'm guessing mean that unless a word appears in a doc at least
twice, it's ignored for that doc, and unless a word appears in at least 5
docs, it's ignored completely.  that could easily explain your bike
examples.

: I then loaded some large (5K+) documents and I noticed that
: MoreLikeThis's query started to return similar documents, but explain
: () said they were similar because of words like "from" and "can" rather
: than the text I expected to be used for similarity in the documents.

Other then a stop words list, one other thing you might consider is
the notion of a "maxDocFreq" option you could set to ignore words that
appear in lots of documents -- or a maxDocFreqRatio that would take a
percentage of the total number of docs ... it should be fairly
straightforward to add.




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: More like this returning similarities that are too generic

Reply via email to