On Sep 17, 2008, at 4:39 PM, Dino Korah wrote:

I know in applications where we search for words or phrases and expect the
results sorted by relevance, TopDocCollector would work like a dream.
But what about scenarios where the results need to be sorted chronologically
or by some kind of metadata?

A very common example would be email applications. If someone searches the Inbox, the results are expected to appear sorted by date.

Wouldn't you expect by relevance and then by date? One way to achieve roughly what you want is a Function Query that uses the date as a factor in the relevance score.


If there are too many results, the user will most probably be willing to look through a fair part of the result list, which means paging through the
generated hits/results is quite a handy feature for a generic library.

Well, the way this is typically done is you ask for increasingly more results and re-execute the query. Another way is to cache. In my experience, it usually is very fast to requery, especially once things are in the OS cache, etc. I just don't see how you can say "give me results 100-200" if you don't know what results 1-99 are.
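
A rough sketch of the requery approach described above, in plain Java rather than against the Lucene API. The `topN` function is a hypothetical stand-in for "run the query and return the best n doc ids in ranked order"; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical sketch; none of these names come from Lucene.
final class RequeryPagingSketch {
    // Returns hits [start, start + size) by asking for the top (start + size)
    // hits and discarding the first `start` of them.
    static List<Integer> page(IntFunction<List<Integer>> topN, int start, int size) {
        List<Integer> ranked = topN.apply(start + size);
        if (start >= ranked.size()) {
            return new ArrayList<>(); // paged past the end of the results
        }
        int end = Math.min(start + size, ranked.size());
        return new ArrayList<>(ranked.subList(start, end));
    }
}
```

Asking for hits 100-200 this way still ranks the first 100, which is the point made above: the page boundary is only known once everything ahead of it has been ordered.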

You said scoring was expensive, which maybe is true. Have you actually seen an issue w/ performance? Are you doing really complex queries? Or are you searching on really common terms? In your original email you said that you have a 100M+ index. Is this all on one machine?




2008/9/17 Grant Ingersoll <[EMAIL PROTECTED]>


On Sep 17, 2008, at 11:51 AM, Cam Bazz wrote:

And how about queries that need starting position, like hits between
100 and 200?


could we pass something to the collector that will count between 0 to
100 and then get the next 100 records?


The collector uses a Priority Queue to store doc ids and scores as they are collected. All the collector knows is the document id and the score and, presumably, what it has seen so far, to some extent. Ordering is not defined
until all the candidate docs have been scored.
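
To make the priority-queue idea concrete, here is a minimal sketch in plain Java. It is not Lucene's TopDocCollector; `TopNSketch`, `Hit`, and `topN` are hypothetical names, and the tie-break on doc id is an assumption made only so the output is deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of a bounded top-N queue; not Lucene code.
final class TopNSketch {
    // A (docId, score) pair, which is all the collector sees per hit.
    static final class Hit {
        final int docId;
        final float score;
        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Orders weakest first: lower score, then (assumed tie-break) higher docId.
    private static int compareWeakestFirst(Hit a, Hit b) {
        int byScore = Float.compare(a.score, b.score);
        return byScore != 0 ? byScore : Integer.compare(b.docId, a.docId);
    }

    // Keeps at most n hits in a min-heap and returns their doc ids, best first.
    static List<Integer> topN(int[] docIds, float[] scores, int n) {
        PriorityQueue<Hit> queue = new PriorityQueue<>(TopNSketch::compareWeakestFirst);
        for (int i = 0; i < docIds.length; i++) {
            Hit hit = new Hit(docIds[i], scores[i]);
            if (queue.size() < n) {
                queue.add(hit);
            } else if (n > 0 && compareWeakestFirst(hit, queue.peek()) > 0) {
                queue.poll(); // evict the current weakest hit
                queue.add(hit);
            }
        }
        // Only now is the final order known: drain weakest-first, then reverse.
        List<Integer> ranked = new ArrayList<>();
        while (!queue.isEmpty()) {
            ranked.add(queue.poll().docId);
        }
        Collections.reverse(ranked);
        return ranked;
    }
}
```

The queue never holds more than n entries, but every candidate still has to be compared against the weakest entry, so there is no way to jump straight to hits 100-200 without having ranked the ones before them.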

If you expect to do a lot of paging on a given set of results, I could imagine using an approach whereby you don't bother to insert entries if you've already seen them and could maybe save on some queue operations, but
not sure how well it would work.

The other thing to do is just ask for slightly more than you think you will need in the first query, but it depends on your users. Most users, in my experience, don't go beyond page 2 or 3 at most, so you could consider paying the cost to get the top 30 or 50 and caching that for your paging. If you have other application-specific knowledge, you can then adjust as
appropriate.
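
One way the over-fetch-and-cache idea might look, again as a plain-Java sketch under assumptions rather than anything from Lucene: `search` is a hypothetical hook that returns the top n doc ids for a query string, `prefetch` is the 30 or 50 mentioned above, and the cache is an unbounded map with no eviction or invalidation, which a real application would need to add.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch; none of these names come from Lucene.
final class CachedPagingSketch {
    // One cached result list plus how many hits were requested to produce it.
    private static final class Entry {
        final int requested;
        final List<Integer> hits;
        Entry(int requested, List<Integer> hits) {
            this.requested = requested;
            this.hits = hits;
        }
    }

    private final BiFunction<String, Integer, List<Integer>> search;
    private final int prefetch;
    private final Map<String, Entry> cache = new HashMap<>();
    int searchCalls = 0; // exposed only so the sketch can be checked

    CachedPagingSketch(BiFunction<String, Integer, List<Integer>> search, int prefetch) {
        this.search = search;
        this.prefetch = prefetch;
    }

    // Serves a page from the cache; requeries only when the page lies beyond
    // what was requested last time.
    List<Integer> page(String query, int pageIndex, int pageSize) {
        int start = pageIndex * pageSize;
        int end = start + pageSize;
        Entry entry = cache.get(query);
        if (entry == null || end > entry.requested) {
            int n = Math.max(prefetch, end);
            entry = new Entry(n, search.apply(query, n));
            searchCalls++;
            cache.put(query, entry);
        }
        if (start >= entry.hits.size()) {
            return new ArrayList<>();
        }
        int stop = Math.min(end, entry.hits.size());
        return new ArrayList<>(entry.hits.subList(start, stop));
    }
}
```

With a prefetch of 50 and a page size of 10, the first five pages cost a single query; only a user who goes further triggers a second, larger one.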

-Grant


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




--
d i n o k o r a h
Tel: +44 7956 66 52 83
---------------------------
51°21'50.5902"N 0°6'11.8116"W

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







