>>you would still have the major problem of which matches do you keep 
>>information for

Yes, doing this efficiently is the main issue. Some vague thoughts I had:
1) A special HighlightObserverQuery could wrap any query and use its rewrite 
method to further wrap child component queries if necessary.
2) A ThreadLocal could be used to contain low-level match info generated by 
child query components e.g. position info of phrase/span queries (maybe 
generatable by a HighlightingIndexReader wrapper which observed TermPositions 
access)
3) For each call to scorer.next() on the top level query, the HighlightObserver 
class would check to see if the doc was a "keeper" (i.e. its score places it 
in the required top "n" docs PriorityQueue) and if so, would retain a copy of 
all the transient match info currently held in the ThreadLocal for this doc and 
associate it with the new TopDoc object placed in the top docs PriorityQueue.
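
To make steps 2 and 3 concrete, here is a rough, self-contained sketch in plain Java. It does not use any Lucene classes: HighlightObserverSketch, Hit, recordMatch and collect are all hypothetical names invented for illustration. recordMatch stands in for whatever a wrapped child query or reader would report, and collect stands in for the check made after each scorer.next().

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch only: none of these names exist in Lucene. */
public class HighlightObserverSketch {

    /** A scored hit plus a private copy of the match positions seen for it. */
    public static final class Hit {
        public final int doc;
        public final float score;
        public final List<Integer> positions;

        Hit(int doc, float score, List<Integer> positions) {
            this.doc = doc;
            this.score = score;
            this.positions = positions;
        }
    }

    // Transient per-thread buffer that child query components would append to.
    private static final ThreadLocal<List<Integer>> MATCH_INFO =
            ThreadLocal.withInitial(ArrayList::new);

    private final int n;
    // Min-heap on score, so the weakest of the current top n is at the head.
    private final PriorityQueue<Hit> topDocs =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));

    public HighlightObserverSketch(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
    }

    /** Stand-in for a child query component reporting a matched position. */
    public static void recordMatch(int position) {
        MATCH_INFO.get().add(position);
    }

    /** Stand-in for the check made after each scorer.next() on the top-level query. */
    public void collect(int doc, float score) {
        List<Integer> transientInfo = MATCH_INFO.get();
        boolean keeper = topDocs.size() < n || score > topDocs.peek().score;
        if (keeper) {
            if (topDocs.size() == n) {
                topDocs.poll(); // evict the weakest hit along with its match info
            }
            // Copy the transient info only for docs that make the queue.
            topDocs.add(new Hit(doc, score, new ArrayList<>(transientInfo)));
        }
        transientInfo.clear(); // reset the buffer for the next doc either way
    }

    /** Retained hits, best score first. */
    public List<Hit> results() {
        List<Hit> out = new ArrayList<>(topDocs);
        out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return out;
    }

    public static void main(String[] args) {
        HighlightObserverSketch observer = new HighlightObserverSketch(2);
        recordMatch(3); recordMatch(4); observer.collect(0, 0.5f);
        recordMatch(7);                 observer.collect(1, 0.2f);
        recordMatch(1); recordMatch(9); observer.collect(2, 0.9f);
        for (Hit h : observer.results()) {
            System.out.println(h.doc + " " + h.score + " " + h.positions);
        }
    }
}
```

Note that even in this toy version the buffer is filled for every doc and only copied for the keepers, which is exactly where the wasted work would come from.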

This approach tries hard not to require changes to existing Query/scorer 
classes by using wrappers/ThreadLocals and would only hold low-level match 
highlighting info for N documents where "N" is the maximum number of results to 
be returned. 
However, there are likely to be many detailed complications with implementing 
this. I haven't pursued this train of thought further because the main killer 
is likely to be the performance overhead from all the unnecessary object 
creation when generating match info objects for documents that don't make the 
final selection anyway. That and the cost of synchronization around ThreadLocal 
accesses.

I think we're right to stick with the existing highlighting approach of 
searching for the top N docs then re-considering the basis of the match for 
just these few docs.
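
For contrast, the two-pass idea can be sketched the same way. Again this is plain Java with made-up names (TwoPassHighlightSketch, topDocs, offsets, highlight) and a toy substring "query", not the contrib Highlighter API: pass one ranks docs without building any match objects, and pass two works out offsets for the top N only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch only: a toy substring "query", not Lucene's Highlighter. */
public class TwoPassHighlightSketch {

    /** Cheap scoring for pass 1: counts occurrences without allocating match objects. */
    static int count(String text, String term) {
        if (term.isEmpty()) {
            throw new IllegalArgumentException("term must not be empty");
        }
        int c = 0;
        for (int i = text.indexOf(term); i >= 0; i = text.indexOf(term, i + term.length())) {
            c++;
        }
        return c;
    }

    /** Pass 1: ids of the n best-scoring matching docs, best first. */
    public static List<Integer> topDocs(List<String> docs, String term, int n) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (count(docs.get(i), term) > 0) {
                ids.add(i);
            }
        }
        ids.sort(Comparator.comparingInt((Integer i) -> count(docs.get(i), term)).reversed());
        return new ArrayList<>(ids.subList(0, Math.min(n, ids.size())));
    }

    /** Pass 2: start/end offsets of each match, computed only for docs we will show. */
    public static List<int[]> offsets(String text, String term) {
        if (term.isEmpty()) {
            throw new IllegalArgumentException("term must not be empty");
        }
        List<int[]> out = new ArrayList<>();
        for (int i = text.indexOf(term); i >= 0; i = text.indexOf(term, i + term.length())) {
            out.add(new int[] {i, i + term.length()});
        }
        return out;
    }

    /** Wraps each match in B tags using the offsets from pass 2. */
    public static String highlight(String text, String term) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (int[] o : offsets(text, term)) {
            sb.append(text, last, o[0]).append("<B>").append(text, o[0], o[1]).append("</B>");
            last = o[1];
        }
        return sb.append(text, last, text.length()).toString();
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "the quick fox", "fox fox fox", "no match here", "a fox and a fox");
        for (int id : topDocs(docs, "fox", 2)) {
            System.out.println(id + ": " + highlight(docs.get(id), "fox"));
        }
    }
}
```

The match is re-derived after the event for each of the N docs, but nothing is allocated for the docs that never reach the results.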

Cheers
Mark




----- Original Message ----
From: Mark Miller <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, 27 June, 2007 12:59:21 PM
Subject: Re: Highlighter that works with phrase and span queries



markharw00d wrote:
>
> I was thinking along the lines of wrapping some core classes such as 
> IndexReader to somehow observe the query matching process and deduce 
> from that what to highlight (avoiding the need for MemoryIndex)  but 
> I'm not sure that is viable. It would be nice to get some more match 
> info out of the main query logic as it runs to aid highlighting rather 
> than reverse engineering the basis of a match after the event. 

I have been thinking about a way to pursue this, and it does not seem 
clear that there is a nice solution. Even if you could wrap Query objects or 
other classes to observe matched tokens (non-trivial, since a Query is 
only concerned if it matches a doc, not which tokens it matches at which 
positions), you would still have the major problem of which matches do 
you keep information for. It does not seem practical to save all of the 
information to highlight *any* doc after a search and it also seems 
unlikely that you would know which docs you wanted to highlight before 
the search. The only compromise that I can see is maybe just storing 
info to highlight the first n docs, but even here, while the scoring is 
occurring you do not yet know the return order. Also, there is probably 
little value in knowing which Tokens were matches for highlighting 
unless you have stored offsets as well.

Unless someone has any suggestions on how to accomplish this, I think 
time would be better spent improving the existing Highlighter framework.

Perhaps Ronnie's Highlighter should be added as an alternate Highlighter 
that is less feature rich but much faster on large docs. It looks to me 
like there is unlikely to be a faster Highlighting method for simple 
non-position aware highlighting.

- Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





