Top field count scoring across documents

Peter 4U Sun, 22 Nov 2009 08:43:12 -0800

Hello Lucene Experts,
 
I wonder if someone might be able to shed some insight on this interesting 
scoring question:
 
The problem:
Build a search query that will return [ordered] hits by the top number of 
occurences of field values across matched documents (or as close to this as 
possible).
The built-in scoring is great for scoring number of hits within a document, but 
is there an efficient way to do this across the same field in a set of matched 
documents? (maybe scoring isn't the best way?)
 
Example:
Let's say you have an index containing book information. Each document has a 
'title' field.
Let's say the index contains 100 entries, with:
65 'title's containing the word 'tiger'
21 containing 'lion'
6 containing 'panther'
5 containing 'kitten'
3 containing 'slug'
 
What would be the best way to build a query such that returned documents are 
ordered in this way:
Rank    Value         Occurences
================================
1       tiger            65
2       lion             21
3       panther          6
4       kitten           5
5       slug             3
 
I can, of course, build a standard query, traverse the returned documents and 
build such a list, but if the returned query had many 100,000's of hits, the 
performance would degrade linearly, particularly if only the 'Top 5' are 
actually required.



One idea is to maintain a separate index with this information - the main 
problem with this is that you essentially need to know what you're searching 
for at index-time, which isn't ideal.


Has anyone come across and solved this particular issue using Lucene?
 
Many thanks,
Peter
 

                                          
_________________________________________________________________
Add your Gmail and Yahoo! Mail email accounts into Hotmail - it's easy
http://clk.atdmt.com/UKM/go/186394592/direct/01/

Top field count scoring across documents

Reply via email to