I'm posting this primarily hoping to give back a tiny bit to a very
helpful community. More likely however, someone else will open my eyes
to an easier approach than what I outline below...
I've come up with a very ugly conversion approach from regular Query
objects into SpanQuery objects. I then use the converted SpanQuery to
get span positions (currently both token #, and start/end position). In
effect, I have highlighting for simple queries with a very inefficient
approach (yea for me!).
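To make the highlighting step concrete, here is a minimal sketch of turning span positions into marked-up text. It uses no Lucene classes at all: Hit, SpanHighlightSketch, highlight and the [[ ]] markers are names made up for illustration. It assumes the hits have already been reduced to non-overlapping character offsets into the original text, sorted by start.

```java
import java.util.List;

// Hypothetical stand-in for one hit: character offsets into the source
// text, start inclusive, end exclusive. This is NOT a Lucene class.
final class Hit {
    final int start;
    final int end;

    Hit(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

final class SpanHighlightSketch {
    // Wraps each hit in [[ ]] markers. Assumes the hits are sorted by
    // start offset and do not overlap (an assumption of this sketch).
    static String highlight(String text, List<Hit> hits) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Hit h : hits) {
            out.append(text, cursor, h.start);            // text before the hit
            out.append("[[").append(text, h.start, h.end).append("]]");
            cursor = h.end;
        }
        out.append(text, cursor, text.length());          // text after the last hit
        return out.toString();
    }
}
```

Only the text inside a hit is marked, which is the "highlight the hits, and only the hits" behaviour I am after.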
The goals I am trying to accomplish are rather specific, I think, so I
imagine the use of my hacking is rather limited (i.e. just to me).
At the moment my code:
* parses the search text (i.e. the user-entered query)
* rewrites the resulting query to expand wildcards and such against the
  index
* calls a recursive conversion function that understands only very basic
  conversions:
    o TermQuery -> SpanTermQuery
    o PhraseQuery -> SpanNearQuery
    o others in progress as time permits
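The shape of the recursive conversion can be sketched as below. This is only a model under stated assumptions, not my actual code and not Lucene's API: Q, TermQ, PhraseQ, OrQ and the SpanQ classes are simplified stand-ins invented for illustration, and OrQ models a boolean query made only of optional (OR) clauses. Anything the function does not understand is rejected rather than silently dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the "regular" query shapes. NOT Lucene classes.
abstract class Q {}

final class TermQ extends Q {
    final String term;
    TermQ(String term) { this.term = term; }
}

final class PhraseQ extends Q {
    final List<String> terms;
    PhraseQ(List<String> terms) { this.terms = terms; }
}

final class OrQ extends Q {              // models a boolean OR of sub-queries
    final List<Q> clauses;
    OrQ(List<Q> clauses) { this.clauses = clauses; }
}

// Hypothetical stand-ins for the span side of the conversion.
abstract class SpanQ {}

final class SpanTermQ extends SpanQ {
    final String term;
    SpanTermQ(String term) { this.term = term; }
    @Override public String toString() { return "spanTerm(" + term + ")"; }
}

final class SpanNearQ extends SpanQ {
    final List<SpanQ> clauses;
    final int slop;
    final boolean inOrder;
    SpanNearQ(List<SpanQ> clauses, int slop, boolean inOrder) {
        this.clauses = clauses;
        this.slop = slop;
        this.inOrder = inOrder;
    }
    @Override public String toString() {
        return "spanNear(" + clauses + ", slop=" + slop + ", inOrder=" + inOrder + ")";
    }
}

final class SpanOrQ extends SpanQ {
    final List<SpanQ> clauses;
    SpanOrQ(List<SpanQ> clauses) { this.clauses = clauses; }
    @Override public String toString() { return "spanOr(" + clauses + ")"; }
}

final class SpanConversionSketch {
    // Recursively maps the simplified query tree onto the span model.
    static SpanQ convert(Q q) {
        if (q instanceof TermQ) {
            return new SpanTermQ(((TermQ) q).term);
        }
        if (q instanceof PhraseQ) {
            // A phrase becomes an ordered "near" over its terms with no slop.
            List<SpanQ> parts = new ArrayList<>();
            for (String t : ((PhraseQ) q).terms) {
                parts.add(new SpanTermQ(t));
            }
            return new SpanNearQ(parts, 0, true);
        }
        if (q instanceof OrQ) {
            // Each OR clause is converted recursively.
            List<SpanQ> parts = new ArrayList<>();
            for (Q clause : ((OrQ) q).clauses) {
                parts.add(convert(clause));
            }
            return new SpanOrQ(parts);
        }
        throw new IllegalArgumentException(
                "unsupported query type: " + q.getClass().getSimpleName());
    }
}
```

Throwing on unknown types is a deliberate choice in this sketch: a query that is converted only in part would highlight the wrong things without any warning.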
Currently, I only process simple query strings like:
  "blue green yellow" => SpanOrQuery
  "luce* acti*" => SpanOrQuery with wildcards expanded
      e.g.: lucene lucent action acting ... all OR'ed together in a
      braindead fashion
  "luce* acti* \"book rocks\"" => SpanOrQuery combining SpanTerms and
      SpanNear (no slop)
Hopefully you get the picture; I'm not up to showing a vector of
this one... :-)
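For the wildcard case, the "all OR'ed together" expansion can be sketched roughly as follows. This is an illustration only and does not use Lucene: PrefixExpansionSketch, expand and expandAll are hypothetical names, and the sketch assumes the index's terms are available as a sorted set and that every wildcard is a trailing '*'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

final class PrefixExpansionSketch {
    // Returns every term in the sorted set that starts with the prefix.
    static List<String> expand(String prefix, SortedSet<String> indexTerms) {
        List<String> matches = new ArrayList<>();
        // tailSet starts at the first term >= prefix; in sorted order all
        // terms sharing the prefix are contiguous, so stop at the first miss.
        for (String term : indexTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    // Expands each "xyz*" pattern and collects everything into one flat
    // list, which is the set of terms that would be OR'ed together.
    static List<String> expandAll(List<String> patterns, SortedSet<String> indexTerms) {
        List<String> all = new ArrayList<>();
        for (String p : patterns) {
            if (p.endsWith("*")) {
                all.addAll(expand(p.substring(0, p.length() - 1), indexTerms));
            } else {
                all.add(p);   // plain term, kept as-is
            }
        }
        return all;
    }
}
```

Each expanded term would then become one term clause inside the single OR query.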
I would be happy to discuss my approach if anyone is interested. I
assume I am pretty much alone in finding this inefficient approach
useful. For me, the functionality outweighs the performance
issues. I have something which can take user search strings and do hit
highlighting for the exact hit found. This is really only useful for
"termA near 'some phrase'" at the moment, but might become more advanced
in the next 2-3 months.
Sean
Paul Elschot wrote:
On Thursday 20 October 2005 00:40, Sean O'Connor wrote:
Hello,
I have user entered search commands which I want to convert to
SpanQueries. I have seen in the book "Lucene in Action" that no parser
existed at time of publication, but there was someone working on a
SpanQuery parser. Can anyone point me to that code, or provide any
suggestions?
I want to use SpanQueries for their detail on the number of hits
from a query, and more importantly, the location (position start and
end) of each hit. My application requires me to do precise hit
highlighting. I also need to perform calculations on the number of hits
per document, as well as per query (sum of document hits).
You may want to use the getSpans() method of SpanQuery and operate
on the result directly.
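As a rough illustration of the counting described above, the sketch below tallies hits per document and sums them for the whole query. It does not call Lucene: SpanMatch and HitCountSketch are hypothetical names, and the sketch assumes each match has already been read off the result of getSpans() into a plain (doc, start, end) record.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical record of one span match: a document id plus the start and
// end positions of the match. NOT a Lucene class.
final class SpanMatch {
    final int doc;
    final int start;
    final int end;

    SpanMatch(int doc, int start, int end) {
        this.doc = doc;
        this.start = start;
        this.end = end;
    }
}

final class HitCountSketch {
    // Number of matches per document id, ordered by document id.
    static Map<Integer, Integer> hitsPerDoc(List<SpanMatch> matches) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (SpanMatch m : matches) {
            counts.merge(m.doc, 1, Integer::sum);
        }
        return counts;
    }

    // Hits for the whole query: the sum of the per-document counts.
    static int totalHits(Map<Integer, Integer> perDoc) {
        int total = 0;
        for (int count : perDoc.values()) {
            total += count;
        }
        return total;
    }
}
```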
It is fairly critical that I highlight the hits, and only the hits. From
what I've read, SpanQueries (with dumpSpans) are a better approach than
using 'regular' queries. I _think_ regular queries currently use a
highlighter which shows all terms highlighted. This can give more
highlighting than actual hits (i.e. false positives).
So, that being said, should I stick with SpanQueries? Is there any
current work on a parser to convert a string, or a regular (Term,
Boolean, Phrase, Prefix, ...) query, into a SpanQuery?
I have written some very duct tape-ish code which will convert basic
booleanOR and prefix queries into SpanQueries. I just realized I'm in
deeper water than I expected when I tried converting my first query
string containing several boolean queries, AND a phrase query. So now I
am looking to either help an existing effort, or just continue with my
own hacking.
:)
Have a look at the surround query parser in the svn trunk:
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/surround/
There is also some code that does highlighting based on Spans,
but I don't know where that is. Hopefully someone else can point you at that.
Regards,
Paul Elschot
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]