See below

On Jan 11, 2008 9:36 AM, <[EMAIL PROTECTED]> wrote:

> Hi,
>
>
> > You could even store all of the page offsets in your
> > meta-data document
> > in a special field if you wanted, then lazy-load that field
> > rather than
> > dynamically counting.
>
> How can I lazy-load a field?
>

See FieldSelector. Basically, it allows you to determine
which fields are loaded when you fetch a doc. You can still
load the other fields explicitly.
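
For instance, something like this (just a sketch against the 2.x
API, untested; the "pageOffsets" field name is made up for
illustration, and reader/docId are assumed to be an open
IndexReader and a document number):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.document.Fieldable;

// Load every field eagerly except "pageOffsets", which is only
// read from the index when its value is first asked for.
FieldSelector selector = new FieldSelector() {
    public FieldSelectorResult accept(String fieldName) {
        return "pageOffsets".equals(fieldName)
                ? FieldSelectorResult.LAZY_LOAD
                : FieldSelectorResult.LOAD;
    }
};

Document doc = reader.document(docId, selector);
Fieldable offsets = doc.getFieldable("pageOffsets"); // not loaded yet
String value = offsets.stringValue();                // loaded here

Note that you have to use getFieldable() rather than getField()
when a field may be lazily loaded.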


>
> > You'd have to be careful that your offsets
> > corresponded to the data *after* it was analyzed, but that shouldn't
> > be too hard.
>
> Is the TermPosition the position after analysis?
>

Yes. The position increment directly affects this: it lets the
positions of two consecutive terms differ by something other than 1.

To see why it must be this way, consider stop words. Let's say
"are" is a stop word. Then indexing "democrats are running for
office" must position "democrats" right next to "running",
because span queries with a slop of 0 for both "democrats
running" and "democrats are running" should match. But they
wouldn't if the position of "democrats" was 1 and "running" was 3.
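
If you want to see this for yourself, a quick sketch (2.x-era
analysis API, untested) that prints each term with the absolute
position it ends up at:

import java.io.StringReader;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Print each term and its position after stop word removal.
// Whether a gap is left where "are" used to be depends on your
// Lucene version and how the stop filter is configured.
TokenStream ts = new StopAnalyzer().tokenStream(
        "f", new StringReader("democrats are running for office"));
int position = -1;
for (Token t = ts.next(); t != null; t = ts.next()) {
    position += t.getPositionIncrement();
    System.out.println(t.termText() + " @ " + position);
}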


>
> > You'd have to read this field before deleting the doc
> > and make sure it was stored with the replacement.....
>
> Don't understand...
>

If you are deleting a document but NOT reindexing all the text,
you want to preserve any special field you have stored with the
offsets (if that field is stored in the Lucene doc being replaced)
and put it back into the replacement, unless you're willing to
re-tokenize all the text to recreate the field.
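
Roughly, the dance looks like this (just a sketch; "pageOffsets"
and the unique-key field "id" are made-up names, and reader/writer
are assumed to be open):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;

// Grab the stored offsets from the old doc before it goes away...
Document old = reader.document(docId);
String savedOffsets = old.get("pageOffsets");

// ...and carry them over into the replacement document.
Document replacement = new Document();
// (add the re-indexed fields to the replacement here)
replacement.add(new Field("pageOffsets", savedOffsets,
        Field.Store.YES, Field.Index.NO));

// updateDocument deletes the old doc(s) matching the term and
// adds the new one.
writer.updateDocument(new Term("id", old.get("id")), replacement);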


>
> > And, since I'm getting random ideas anyway, here's another.
> > The position increment is the "distance" (measured in
> > terms) between two consecutive tokens. Let's claim that you
> > have no page with more than 10,000 (or whatever) tokens. Just
> > bump the positions to the next multiple of 10,000 at the
> > start of each page. So, the first term on the first page
> > has an offset of 0, the first term on the second page
> > has an offset of 10,000, and the first term on the third
> > page has an offset of 20,000.
> >
> > Now, with the SpanNearQuery trick from above, your
> > term position divided by 10,000 is also your page. This would
> > also NOT match across pages. Hmmmm, I kind of like that
> > idea.
>
> But then I'd have to know how many tokens each page has!?


No, you only need to know the maximum number of tokens on any page
you're going to index.

Make (that number * some slop factor) * (page number - 1) be the starting
term position of each page. I'm assuming here that parsing a number of
pages in your data set and counting will give you a reasonable upper limit.

I suspect you'll find that the actual number of tokens on a page is
*very* much lower than 10,000, more in the 1,500 range at most, but
only examining your dataset will tell. You must also be sure that whatever
number you use, times the most pages you'll index (10,000 in your example?),
still fits in an int.

And, of course, build in something that'll tell you if this expectation is
violated during actual indexing of the dataset.
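
Something like this filter would do the bumping (a sketch against
the 2.x-era TokenStream API, not battle-tested; it assumes you can
inject a marker token such as "_PAGEBREAK_" between pages and that
your tokenizer passes it through):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class PageBoundaryFilter extends TokenFilter {
    static final int PAGE_GAP = 10000; // your max-tokens-per-page bound
    private int position = -1;         // absolute position of the last term
    private int tokensOnPage = 0;
    private boolean atPageBreak = false;

    public PageBoundaryFilter(TokenStream in) { super(in); }

    public Token next() throws IOException {
        for (Token t = input.next(); t != null; t = input.next()) {
            if ("_PAGEBREAK_".equals(t.termText())) {
                atPageBreak = true;    // swallow the marker itself
                continue;
            }
            if (atPageBreak) {
                // jump so the first term of page p lands at p * PAGE_GAP
                int nextStart = (position / PAGE_GAP + 1) * PAGE_GAP;
                t.setPositionIncrement(nextStart - position);
                atPageBreak = false;
                tokensOnPage = 0;
            }
            // the "tell me if my expectation is violated" check
            if (++tokensOnPage > PAGE_GAP)
                throw new IOException("page exceeded " + PAGE_GAP + " tokens");
            position += t.getPositionIncrement();
            return t;
        }
        return null;
    }
}

At search time, term position / PAGE_GAP is then the page number.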

But you could also vary this scheme by simply storing in your document
the offsets for the beginning of each page. You'd have to load that offsets
field whenever you wanted to correlate positions to pages, but this may be
acceptable for your problem space; I haven't a clue, since I don't know the
characteristics of your app.
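
For that variant, mapping a match's term position back to its page
is just a binary search over the stored page starts (sketch; assumes
you've parsed the stored field into an int[] where pageStarts[p] is
the position of the first token on page p):

import java.util.Arrays;

static int pageForPosition(int[] pageStarts, int position) {
    int idx = Arrays.binarySearch(pageStarts, position);
    // when not found, binarySearch returns -(insertionPoint) - 1;
    // the page is the one whose start precedes the position
    return idx >= 0 ? idx : -idx - 2;
}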

Hope this helps
Erick


>
> Thank you!
>
