Re: Counting hits in a document

Erick Erickson Thu, 18 Jan 2007 15:29:26 -0800

Hoss:

It was late this afternooon and I was square-eyed, so I didn't add the
detail. The app we're working on first returns a summary list of all the
books that match a query, no hit information. Next, the user clicks on a
returned title and we show the hits by chapter. That is, a list of chapters
and the count of the hits for each. The index is nearing 15G at present, so
I *assumed* that I really didn't want to re-query the entire index when I
know the particular document I care about already. But what do I know?


Mark:

Very most excellent. I'll give it a look in the morning. I hope that the
class doesn't need the raw text since I don't have it any more, but your
comment "Give it a query it will give you the spans" makes me hopeful.



The real issue is that it looks like I'm reverting to my old "C" days. The
code I was writing the last couple of days started to look like a program
from...well...a long time ago. So I *know* it must be wrong <G>...... It's a
real pain in the neck to *think* in Java terms when much of my training was
before this new-fangled way of looking at programming problems happened. I
suppose I could go into management, but that would be giving in to the dark
side....

Thanks all
Erick


On 1/18/07, Mark Miller <[EMAIL PROTECTED]> wrote:


Just threw together a highlighter that can handle spans (combining a
rewrite with dumspans from LIA) and used this:
http://issues.apache.org/bugzilla/attachment.cgi?id=15568

Nice spans extractor from Mark (not me <G>). Give it a query it will
give you the spans.

- Mark

Erick Erickson wrote:
> Hi again.
>
> I've been struggling for the last couple of days and getting nowhere, so
> it's time to swallow my pride and say "Help"....
>
> OK, let's say I have a document indexed and I do NOT have access to
> the raw
> text. I need to find the offset of all the hits for a query on a single
> document. Advice was offered a while ago to use getSpans from a
> spanquery,
> but for the life of me I don't see how to make this work. As I remember,
> Erik was talking about rewriting the original query as a set of spans.
>
> The trouble I'm having is that I sure don't see how to rewrite the
> standard
> query as a span query, then feed that back into my index for a
particular
> document (that I have a unique ID for). It seems that the getSpans looks
> through my entire index, which is *probably* prohibitive.
>
> I can make each part of the query into a SpanTermQuery. I can assemble
> these
> together into a bunch of nested span queries. At the end of this, I
> have a
> single Span query that I can call getSpans on. But what now? I don't
> see how
> the spans relate to the document I need to focus on. From what I see
> of the
> Spans interface, it's intended to look at the entire index rather than
be
> confined to a subset of the documents (in this case, exactly one.
> Guaranteed).
>
> I've thought about putting the documentID in a MUST clause of a
> BooleanQuery, and adding my span query to that, but it doesn't look like
> getSpans does me any good there.
>
> I looked at the SrndQuery family and don't see anything there that
> lets me
> get the offsets of my matches.
>
> I don't have the text, so I can't highlight all the hits and count.
>
> The code I've been writing feels like the wrong solution to the wrong
> problem at the wrong time. Given that I know the document ID on the
> way in,
> is my best bet to roll my own? That is, enumerate the relevant terms
> in my
> document and measure the distance between the terms and aggregate the
> results myself? I'd rather not do that, of course, but can if necessary.
>
> I *want* someone to say "just call <fill in magic method here>"....
>
> Any help greatly appreciated...
>
> Thanks
> Erick
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Counting hits in a document

Reply via email to