On Sep 17, 2008, at 4:39 PM, Dino Korah wrote:
I know in applications where we search for a words or phrases and
expect the
result sorted by relevance, TopDocCollector would work like a dream.
But what about scenario where the result needs to be sorted
chronologically
or by some kind of metadata.
A very common application would be email applications. If someone is
to
search on the Inbox, the result will be expected to appear sorted by
date.
Wouldn't you expect by relevance and then by date? One way to achieve
kind of what you want is a Function Query that uses the date as a
factor in the relevance score.
If there are too many results, the user will most probably be
willing to
look through a fair part of the result list, which means paging
through the
generated hits/result is quite handy feature for a generic library.
Well, the way this is typically done is you ask for increasingly more
results and re-execute the query. Another way is to cache. In my
experience, it usually is very fast to requery, especially once things
are in the OS cache, etc. I just don't see how you can say give me
results 100-100 if you don't know what results 1-99 are.
You said scoring was expensive, which maybe is true. Have you
actually seen an issue w/ performance? Are you doing really complex
queries? Or are you searching on really common terms? In your
original email that you have a 100M+ index. Is this all on one machine?
2008/9/17 Grant Ingersoll <[EMAIL PROTECTED]>
On Sep 17, 2008, at 11:51 AM, Cam Bazz wrote:
And how about queries that need starting position, like hits between
100 and 200?
could we pass something to the collector that will count between 0
to
100 and then get the next 100 records?
The collector uses a Priority Queue to store doc ids and scores as
they are
collected. All the collector knows is the document id and the
score and,
presumably what it has seen so far, to some extent. Ordering is
not defined
until all the candidate docs have been scored.
If you expect to do a lot of paging on a given set of results, I
could
imagine using an approach whereby you don't bother to insert
entries if
you've already seen them and could maybe save on some queue
operations, but
not sure how well it would work.
The other thing to do is just ask for slightly more than you think
you will
need in the first query, but it depends on your users. Most users,
in my
experience, don't go beyond page 2 or 3 at most, so you could
consider
paying the cost to get the top 30 or 50 and caching that for your
paging.
If you have other application specific knowledge, you can then
adjust as
appropriate.
-Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
d i n o k o r a h
Tel: +44 7956 66 52 83
---------------------------
51°21'50.5902"N 0°6'11.8116"W
--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]