Hello all,

This is my first time on the list and my first question...forgive me it this
has been hacked out in the past.

We have set up Lucene/Solr and are getting somewhat spurious results. It
appears to be a result of heterogeneous document sizes. In other words, the
top results are sometimes (at least when the user is using typical search
terms) monopolized by a distinct type of document, which is otherwise small
(in number of terms). It appears that TF/IDF even with the cosine similarity
is sensitive to document size. I have run some tests and it in fact does
appear to be the case.

(Number of times the term appears in a document)/(Total Number of terms in
that document) * Log10(Number of total documents/Number of times search term
appears in all documents)

Are there any suggestions or best practices to deal with the intrinsic
heterogeneity in a corpus.

Thank you,
Rich

Reply via email to