Re: Random selection of files

Erick Erickson Mon, 04 Feb 2008 07:11:07 -0800

Well, assuming that by "same weight" you are referring to the document
scores (relevance), you certainly have to do the search first. But you can
use TopDocs to get a list of the document IDs arranged by decreasing
score i.e. sorted by relevance.

But "same weight" is tricky. It's virtually certain that your first 1,000
documents will NOT have the same relevance score. Whether they differ by a
little or a lot is entirely dependent upon the query and your corpus. Note
that TopDocs scores are NOT normalized, although you can normalize them
because TopDocs will give you the max score of any doc in this search. So I
assume you'll have to make a decision about how many of the docs constitute
the set relevant enough to have in your set of choices.
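
To make the normalization and cutoff idea concrete, here is a minimal sketch
in plain Java. It deliberately does not call Lucene: it assumes you have
already copied the doc IDs and raw scores out of the search result (in Lucene
these would come from the ScoreDoc entries of a TopDocs, with the max score
from TopDocs) into parallel arrays sorted by decreasing score. The class name
ScoreCutoff, the method relevantEnough and the minRatio threshold are
hypothetical names chosen for illustration; the right threshold depends on
your query and corpus.

```java
import java.util.ArrayList;
import java.util.List;

class ScoreCutoff {
    /**
     * Keeps the doc IDs whose normalized score (score / maxScore) is at
     * least minRatio. Assumes docIds and scores are parallel arrays that
     * are already sorted by decreasing score, as a relevance-sorted
     * result list would be.
     */
    static List<Integer> relevantEnough(int[] docIds, float[] scores,
                                        float maxScore, float minRatio) {
        List<Integer> keep = new ArrayList<>();
        if (maxScore <= 0f) {
            return keep; // nothing sensible to normalize against
        }
        for (int i = 0; i < docIds.length; i++) {
            float normalized = scores[i] / maxScore;
            if (normalized < minRatio) {
                break; // input is sorted, so every later score is lower too
            }
            keep.add(docIds[i]);
        }
        return keep;
    }
}
```

The list that comes back is the "relevant enough" pool; everything below the
cutoff is never considered for the random pick.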

So, it should be pretty straightforward to get a list of the top N document
IDs. If speed is an issue, think about lazy loading the fields you need to
extract from each document.
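
Once you have the top N document IDs, picking 10 of them at random does not
require loading any documents. The sketch below is one way to do it, using a
partial Fisher-Yates shuffle on a copy of the ID array; it is plain Java with
no Lucene calls, and the names RandomPick and pickRandom are hypothetical.
Only the IDs it returns would then need their stored fields fetched.

```java
import java.util.Arrays;
import java.util.Random;

class RandomPick {
    /**
     * Returns up to k distinct entries chosen uniformly at random from
     * docIds. Works on a copy, so the caller's array keeps its
     * relevance order.
     */
    static int[] pickRandom(int[] docIds, int k, Random rng) {
        int[] pool = docIds.clone();
        int n = Math.min(k, pool.length);
        for (int i = 0; i < n; i++) {
            // Swap a random not-yet-chosen element into position i.
            int j = i + rng.nextInt(pool.length - i);
            int tmp = pool[i];
            pool[i] = pool[j];
            pool[j] = tmp;
        }
        return Arrays.copyOf(pool, n);
    }
}
```

For the case in the question this would be called as
pickRandom(topIds, 10, new Random()), where topIds holds the 1,000 (or
however many you decide are relevant enough) document IDs.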

But you certainly don't want to do this with a Hits object for any number
greater than about 100, since the Hits object will re-execute the query every
100 docs or so.

Best
Erick

On Feb 4, 2008 6:37 AM, Juerg Meier <[EMAIL PROTECTED]> wrote:

> Hi,
>
> We have the requirement for an "i'm feeling lucky" button, at least sort
> of. Whereas google just delivers the first record in a result set, we should
> deliver 10 arbitrary hits chosen out of, let's say, 1000. All of these
> documents have the same importance i.e. have the same weight.
>
> So, is there an elegant way with the Lucene API to achieve this? Or do we
> need to retrieve all 1000 docs first, to do a random selection on our own
> afterwards? That appears to be quite expensive.
>
> Thanks for any hint,
> -- Juerg
>
