You're so right! I'm totally new to Lucene (and text analysis, searching, etc.), but now that you've shown me, I "get it". Thank you so much for your reply.

Chad


On Aug 8, 2006, at 12:45 AM, Chris Hostetter wrote:


I've never used MoreLikeThis myself, but based on how i know it works,
your problem probably has more to do with the size of your test corpus and
the frequency of the words in your docs than with the size of the docs
themselves.

: There's still the issue of the queries from MoreLikeThis not
: returning results for terms I had expected ("bikes").

A quick glance at the source for MoreLikeThis turns up these lines...

    /**
     * Ignore terms with less than this frequency in the source doc.
     * @see #getMinTermFreq
     * @see #setMinTermFreq
     */
    public static final int DEFAULT_MIN_TERM_FREQ = 2;

    /**
     * Ignore words which do not occur in at least this many docs.
     * @see #getMinDocFreq
     * @see #setMinDocFreq
     */
    public static final int DEFAULT_MIN_DOC_FREQ = 5;

...which i'm guessing mean that unless a word appears in a doc at least
twice, it's ignored for that doc, and unless a word appears in at least 5
docs, it's ignored completely.  that could easily explain your bike
examples.
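
To make that concrete, here is a rough sketch of the filtering as i read
it from those two comments.  It is plain Java and does not touch Lucene at
all: the class name, the method name and the sample counts are made up for
illustration, and the real MoreLikeThis code may well differ in the details.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the term filtering, NOT the MoreLikeThis source.
class MinFreqSketch {
    // Values copied from the two defaults quoted above.
    static final int MIN_TERM_FREQ = 2;
    static final int MIN_DOC_FREQ = 5;

    /**
     * Returns, in sorted order, the terms of one source doc that pass both
     * thresholds.
     *
     * @param termFreqs term -> number of occurrences in the source doc
     * @param docFreqs  term -> number of docs in the index containing the term
     */
    static List<String> keptTerms(Map<String, Integer> termFreqs,
                                  Map<String, Integer> docFreqs) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int tf = e.getValue();
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            if (tf < MIN_TERM_FREQ) continue; // too rare in this doc
            if (df < MIN_DOC_FREQ) continue;  // too rare across the corpus
            kept.add(e.getKey());
        }
        Collections.sort(kept);
        return kept;
    }

    public static void main(String[] args) {
        // Made-up counts for a tiny test corpus.
        Map<String, Integer> tf = Map.of("bikes", 3, "from", 4, "helmet", 1);
        Map<String, Integer> df = Map.of("bikes", 2, "from", 9, "helmet", 6);
        // "bikes" is in only 2 docs and "helmet" occurs only once in this
        // doc, so only "from" survives.
        System.out.println(keptTerms(tf, df)); // prints [from]
    }
}
```

If that is what's going on, lowering the thresholds through the
setMinTermFreq / setMinDocFreq setters the javadoc points at (or indexing a
bigger test corpus) should bring "bikes" back into the query.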

: I then loaded some large (5K+) documents and I noticed that
: MoreLikeThis's query started to return similar documents, but explain()
: said they were similar because of words like "from" and "can" rather
: than the text I expected to be used for similarity in the documents.

Other than a stop words list, one other thing you might consider is
the notion of a "maxDocFreq" option you could set to ignore words that
appear in lots of documents -- or a maxDocFreqRatio that would take a
percentage of the total number of docs ... it should be fairly
straightforward to add.
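
A sketch of what i have in mind, again in plain Java with no Lucene
involved.  Neither option exists in the MoreLikeThis source i looked at, so
the names maxDocFreq and maxDocFreqRatio, the class and its methods, and the
numbers are all hypothetical:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a "maxDocFreq" style cutoff.
class MaxDocFreqSketch {
    /** Keeps (sorted) the terms that appear in at most maxDocFreq docs. */
    static List<String> filterByMaxDocFreq(Map<String, Integer> docFreqs,
                                           int maxDocFreq) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (e.getValue() <= maxDocFreq) {
                kept.add(e.getKey());
            }
        }
        Collections.sort(kept);
        return kept;
    }

    /** Same filter, with the cutoff given as a fraction of all docs. */
    static List<String> filterByMaxDocFreqRatio(Map<String, Integer> docFreqs,
                                                int numDocs, double ratio) {
        int maxDocFreq = (int) Math.floor(ratio * numDocs);
        return filterByMaxDocFreq(docFreqs, maxDocFreq);
    }

    public static void main(String[] args) {
        // Made-up doc frequencies for an index of 100 docs.
        Map<String, Integer> df = Map.of("from", 95, "can", 80, "bikes", 12);
        // Ignore anything that shows up in more than half of the docs.
        System.out.println(filterByMaxDocFreqRatio(df, 100, 0.5)); // prints [bikes]
    }
}
```

The ratio version is probably the more convenient of the two, since the
cutoff keeps making sense as the index grows.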




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


