Thank you, Erick. That is what I anticipated would be necessary.

There's still the issue of the queries from MoreLikeThis not returning results for terms I had expected ("bikes").

For example, I have these four very short documents:

"bikes are a handy tool for getting from diffrent locations on the earth, Ireally like bikes"
"Bikes can be really fast"
"bikes can be dangerous"
"I like bikes because they are fun"


I expected the query from MoreLikeThis to say that these short documents are similar because they all have "Bikes", but it doesn't.

Is there some reason for this to be so? Perhaps because the documents are so short?
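
One possible explanation, sketched under stated assumptions: MoreLikeThis has minimum term frequency and minimum document frequency settings (adjustable through setMinTermFreq and setMinDocFreq), and terms that fall below either threshold are left out of the generated query. The defaults assumed below (2 and 5) should be checked against the Lucene version in use. The class ThresholdSketch and its methods are hypothetical plain Java, not Lucene code; they only mimic that kind of filtering on the four documents above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ThresholdSketch {
    // Assumed defaults; verify against the MoreLikeThis version in use.
    static final int ASSUMED_MIN_TERM_FREQ = 2;
    static final int ASSUMED_MIN_DOC_FREQ = 5;

    // The four short documents from this message, copied as-is.
    static final List<String> DOCS = Arrays.asList(
        "bikes are a handy tool for getting from diffrent locations on the earth, Ireally like bikes",
        "Bikes can be really fast",
        "bikes can be dangerous",
        "I like bikes because they are fun");

    // Very rough tokenizer: lowercase, split on anything that is not a letter.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Number of times the term occurs in one document.
    static int termFreq(String term, String doc) {
        int count = 0;
        for (String t : tokens(doc)) {
            if (t.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Number of documents that contain the term at least once.
    static int docFreq(String term, List<String> docs) {
        int count = 0;
        for (String doc : docs) {
            if (termFreq(term, doc) > 0) {
                count++;
            }
        }
        return count;
    }

    // A term is kept only if it clears both thresholds.
    static boolean survives(String term, String sourceDoc, List<String> docs,
                            int minTermFreq, int minDocFreq) {
        return termFreq(term, sourceDoc) >= minTermFreq
            && docFreq(term, docs) >= minDocFreq;
    }

    public static void main(String[] args) {
        for (String doc : DOCS) {
            System.out.println("tf=" + termFreq("bikes", doc)
                + " df=" + docFreq("bikes", DOCS)
                + " kept=" + survives("bikes", doc, DOCS,
                                      ASSUMED_MIN_TERM_FREQ, ASSUMED_MIN_DOC_FREQ));
        }
    }
}
```

With only four documents in the index, "bikes" can never reach a document frequency of 5, and it occurs twice in only one of them, so under these assumed thresholds it would be dropped for every document. Lowering both thresholds to 1 in the sketch keeps the term.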

I then loaded some large (5K+) documents and noticed that the MoreLikeThis query started to return similar documents, but explain() said they were similar because of words like "from" and "can" rather
than the text I expected to be used for similarity in the documents.




Chad



On Aug 7, 2006, at 1:29 PM, Erick Erickson wrote:

Well, I expect that defining "less common" is tricky and doesn't lend itself to a canned answer <G>. Would it work to create your own list of stop words
(possibly very large) to use for indexing and/or searching? This would
simply exclude the words you consider too common, leaving only the
"less common" ones (as you define them).
StandardAnalyzer, for instance, can take a File of stop words in one of its
constructors...
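
The stop-word idea can be illustrated without Lucene. The following is only a sketch in plain Java: the class StopWordSketch, the method removeStopWords, and the tiny stop list are all hypothetical, and a real setup would hand the list to the analyzer instead of filtering by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordSketch {
    // Lowercase the text, split on non-letters, and drop any token on the stop list.
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative stop list only; a real one would be much longer.
        Set<String> stopWords = new HashSet<>(Arrays.asList("from", "can", "be", "are", "a"));
        System.out.println(removeStopWords("Bikes can be really fast", stopWords));
    }
}
```

Words such as "from" and "can" never reach the index (or the query) once they are on the list, so they cannot be picked as similarity terms.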

Erick

On 8/7/06, Chad Hardin <[EMAIL PROTECTED]> wrote:

Hi all,

I'm new to Lucene but I'm loving it!  I'm writing a prototype that
links documents together based upon similarities.  Obviously the
first thing I did was use MoreLikeThis.  However, it seems to be
finding matches based upon words that are too common, in this case
the words "from" and "can" and seems to be missing matches using the
terms I would expect (in this case documents about "bikes").

It seems I need a more custom-tailored Filter that only passes through
less common words.  Does something like this already exist?

Thanks,

Chad



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



