Re: Multiword Highlighting

Otis Gospodnetic Sat, 27 Jan 2007 22:02:22 -0800

For what it's worth Mark (Miller), there *is* a need for "just highlight the 
query terms without trying to get excerpts" functionality - something a la 
Google cache (different colours...mmm, nice).  I've had people ask me for this 
before, and I know I could use this functionality, too.  Please contrib to 
contrib/ if you end up working on this.
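
To make the "highlight everything, no excerpts" idea concrete, here is a 
minimal sketch in plain Java. It is not the contrib Highlighter API and it 
does no Lucene analysis: the class name, the colour palette and the 
whole-word regex matching are all assumptions made for illustration. It also 
assumes the terms are plain words and that the text needs no HTML escaping.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeDocHighlighter {
    // Hypothetical palette; colours cycle if there are more terms than entries.
    private static final String[] COLOURS = {"#ffff66", "#a0ffff", "#99ff99"};

    /** Wraps each whole-word, case-insensitive occurrence of a term in a coloured span. */
    public static String highlight(String text, List<String> terms) {
        if (terms.isEmpty()) {
            return text;
        }
        // One colour per distinct term, assigned in the order the terms are given.
        Map<String, String> colourByTerm = new LinkedHashMap<>();
        StringBuilder alternation = new StringBuilder();
        for (String term : terms) {
            String key = term.toLowerCase(Locale.ROOT);
            if (colourByTerm.containsKey(key)) {
                continue;
            }
            colourByTerm.put(key, COLOURS[colourByTerm.size() % COLOURS.length]);
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(term));
        }
        Pattern pattern = Pattern.compile("\\b(" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher m = pattern.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the untouched text, then the match wrapped in its term's colour.
            out.append(text, last, m.start());
            String colour = colourByTerm.get(m.group(1).toLowerCase(Locale.ROOT));
            out.append("<span style=\"background:").append(colour).append("\">")
               .append(m.group(1)).append("</span>");
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }
}
```

Calling WholeDocHighlighter.highlight("Lucene in Action", 
Arrays.asList("lucene", "action")) wraps "Lucene" and "Action" in two 
differently coloured spans and leaves " in " alone. A real version would 
want to work from analyzed tokens and their offsets rather than a regex, so 
that stemming and the like behave the same way as at search time.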


Otis
--
Simpy -- http://www.simpy.com/ -- Tag.  Search.  Share.

----- Original Message ----
From: Mark Miller <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Sunday, January 28, 2007 7:39:29 AM
Subject: Re: Multiword Highlighting


markharw00d wrote:
> >>Isn't it semi trivial if you are not interested in the fragments (I 
> swear it seems that most people are not)? I
>
> I haven't conducted a survey but it's the typical web search engine 
> scenario - select only a small subset of the matching document content 
> for display in SERPS. I would expect that to be a pretty commonplace 
> requirement for which we should retain a solution.
No doubt. I certainly am not suggesting you ditch fragments and I have 
no evidence more people just want to highlight a doc...it's just that the 
impression I get from the mailing list is that most people just 
want to highlight the returned doc...I am sure plenty of people need 
google style results too, but my experience with Lucene has not often 
been in the area of web search engines. I bet a lot of users would 
benefit from a highlighter that highlights actual hits and doesn't 
summarize though (both would be great). I wouldn't claim to be an 
authority on any of this though...take my opinion for what it's worth -- 
very little.
>
> Maybe a new highlighter with no attempt at summarising could more 
> easily address phrase support for small pieces of content. It will 
> always be hard to faithfully represent all possible query match logic 
> - especially if there are NOTs, ANDs and ORs mixed in with all the 
> term proximity logic e.g. NotNear. Some compromise is required. I did 
> suggest that spans may be a better basis for highlighting than terms 
> and pointed at some existing code to get you along this path - see 
> here http://marc.theaimsgroup.com/?l=lucene-user&m=112496111224218&w=2
I have some code that you wrote that seems to turn almost any query into 
a series of spans. Perhaps it is not as robust as my limited testing 
made it seem.
>
> There are also a couple of other Highlighter packages contributed 
> recently which I listed in my previous mail but I simply haven't had 
> the time to look at in detail so they may be useful. Anyone had any 
> experience of those?
None of them seem to do full span highlighting...again based on my 
limited investigation.
>
> >> every new highlight has to be compared against every previous 
> highlight for overlap
> Yes, Analyzers that produce overlapping tokens are an added 
> complication when implementing highlighting logic. I think we have a 
> reasonable Junit test containing several of the more exotic analyzer 
> scenarios which you could/should use for testing any other highlighter 
> implementation.
thanks for the tip.
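
On the overlap point, one common way to avoid comparing every new highlight 
against every previous one is to collect the hit offsets first, sort them by 
start, and merge in a single pass. A minimal sketch follows, assuming each 
hit is a [start, end) pair of character offsets; the class and method names 
are hypothetical and nothing here comes from the contrib Highlighter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class HighlightRanges {
    /**
     * Merges overlapping or touching [start, end) character ranges.
     * Sorting by start offset means each range only needs to be compared
     * with the most recently merged range, not with all earlier ones.
     */
    public static List<int[]> merge(List<int[]> ranges) {
        List<int[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt((int[] r) -> r[0]));

        List<int[]> merged = new ArrayList<>();
        for (int[] r : sorted) {
            int[] previous = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (previous != null && r[0] <= previous[1]) {
                // Overlaps or touches the previous range: extend it.
                previous[1] = Math.max(previous[1], r[1]);
            } else {
                // Starts after the previous range ends: begin a new range.
                merged.add(new int[] {r[0], r[1]});
            }
        }
        return merged;
    }
}
```

With the ranges [0,5), [3,8) and [10,12) this yields [0,8) and [10,12). It 
only settles where the highlight tags go; deciding which hits are real 
matches for a phrase or span query is still the hard part.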

I appreciate your response Mark. I will continue to look at your span 
extractor...I thought that it alone was enough to do what I wanted, but 
your comments seem to suggest maybe I'll need more. I hope not <g> If I 
do manage something I will be sure to post my results.


- Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

