See below

On Jan 11, 2008 9:36 AM, <[EMAIL PROTECTED]> wrote:
> Hi,
>
> > You could even store all of the page offsets in your
> > meta-data document in a special field if you wanted, then
> > lazy-load that field rather than dynamically counting.
>
> How can I lazy load a field?

See FieldSelector (sketch #1 in the P.S. below). Basically, it allows
you to determine which fields are loaded when you fetch a doc. You can
still load the other fields explicitly.

> > You'd have to be careful that your offsets corresponded to the
> > data *after* it was analyzed, but that shouldn't be too hard.
>
> The TermPosition is the position after analyzing?

Yes. The PositionIncrementGap directly affects this by changing the
positions of two consecutive terms by something other than 1. To see
why it must be this way, consider stop words. Let's say a stop word is
"are". Then indexing "democrats are running for office" must position
"democrats" right next to "running", because a span query with a slop
of 0 should match both "democrats running" and "democrats are
running". But it wouldn't if the position of "democrats" was 1 and
"running" was 3. (Sketch #2 in the P.S. shows such a query.)

> > You'd have to read this field before deleting the doc
> > and make sure it was stored with the replacement.....
>
> Don't understand...

If you are deleting a document but NOT reindexing all the text, you
want to preserve any special field you have stored with the offsets
(if this field is stored in the Lucene doc being replaced) and put it
back into the replacement, unless you're willing to re-tokenize all
the text to recreate the field. (See sketch #3 in the P.S.)

> > And, since I'm getting random ideas anyway, here's another.
> > The PositionIncrementGap is the "distance" (measured in
> > terms) between two tokens. Let's claim that you have no
> > page with more than 10,000 (or whatever) tokens. Just
> > bump the PositionIncrementGap to the next multiple of
> > 10,000 at the start of each page. So, the first term on
> > the first page has an offset of 0, the first term on the
> > second page has an offset of 10,000, and the first term
> > on the third page has an offset of 20,000.
> >
> > Now, with the SpanNearQuery trick from above, your term
> > position divided by 10,000 (integer division) is also your
> > page. This would also NOT match across pages. Hmmmm, I kind
> > of like that idea.
>
> But I have to know how many tokens each page has!?

No, you only need to know the maximum number of tokens on any page
you're going to index. Make (that number * some slop factor) *
(page number - 1) the starting term position of each page. I'm
assuming here that parsing a number of pages in your data set and
counting will give you a reasonable upper limit.

I suspect that you'll find that the actual number of tokens on a page
is *very* much lower than 10,000. More in the 1,500 range at most. But
only examining your dataset will tell. You must be sure, though, that
whatever number you use times the most pages you'll index (10,000 in
your example?) still fits in an int. And, of course, build in
something that'll tell you if this expectation is violated during
actual indexing of the dataset. (Sketch #4 in the P.S. maps a match
position back to a page.)

But you could also vary this scheme by simply storing in your document
the offsets for the beginning of each page. You'd have to load the
offsets field whenever you wanted to correlate positions to pages, but
that may be acceptable for your problem space; I haven't a clue, since
I don't know the characteristics of your app. (See sketch #5 in the
P.S.)

Hope this helps

Erick

> Thank you!
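P.S. Here are the sketches referred to above. They're minimal sketches
against the Lucene 2.x APIs current as of this mail; field names like
"pageOffsets", "body", and "id" are made up for illustration, and
exception handling is omitted throughout.

Sketch #1: lazy loading a field with a FieldSelector. The doc is
fetched with "title" loaded eagerly and "pageOffsets" loaded lazily:

    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Fieldable;
    import org.apache.lucene.document.SetBasedFieldSelector;
    import org.apache.lucene.index.IndexReader;

    Set eager = new HashSet();   // loaded as soon as the doc is fetched
    eager.add("title");
    Set lazy = new HashSet();    // loaded only when first accessed
    lazy.add("pageOffsets");

    int docId = 42;              // whichever doc you're after
    IndexReader reader = IndexReader.open("/path/to/index");
    Document doc = reader.document(docId,
            new SetBasedFieldSelector(eager, lazy));

    // Nothing has been read from disk for pageOffsets yet;
    // this call is what actually loads it:
    Fieldable offsets = doc.getFieldable("pageOffsets");
    String raw = offsets.stringValue();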
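Sketch #2: the slop-0 span query from the stop word example. It only
matches if "running" sits at the position immediately after
"democrats", which is why the positions have to end up adjacent after
analysis:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    SpanQuery query = new SpanNearQuery(
            new SpanQuery[] {
                    new SpanTermQuery(new Term("body", "democrats")),
                    new SpanTermQuery(new Term("body", "running")) },
            0,      // slop: no intervening positions allowed
            true);  // terms must appear in order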
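Sketch #3: carrying the stored offsets field over when replacing a
doc, assuming a unique "id" field to key on and a
buildReplacementDoc() of your own that re-creates everything else:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    Term key = new Term("id", "book-17");

    // 1) Pull the stored offsets out of the doc about to be replaced.
    IndexReader reader = IndexReader.open("/path/to/index");
    TermDocs td = reader.termDocs(key);
    String saved = td.next()
            ? reader.document(td.doc()).get("pageOffsets") : null;
    reader.close();

    // 2) Put the field back into the replacement document.
    Document replacement = buildReplacementDoc();
    if (saved != null) {
        replacement.add(new Field("pageOffsets", saved,
                Field.Store.YES, Field.Index.NO));
    }
    IndexWriter writer = new IndexWriter("/path/to/index",
            new StandardAnalyzer(), false);
    writer.updateDocument(key, replacement);  // delete + add in one call
    writer.close();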
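Sketch #4: mapping a span match back to a page under the stride
scheme. PAGE_STRIDE has to be the same constant you used to bump the
positions at indexing time:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.search.spans.Spans;

    final int PAGE_STRIDE = 10000;  // > max tokens on any one page

    IndexReader reader = IndexReader.open("/path/to/index");
    SpanQuery query = new SpanNearQuery(new SpanQuery[] {
            new SpanTermQuery(new Term("body", "democrats")),
            new SpanTermQuery(new Term("body", "running")) }, 0, true);

    Spans spans = query.getSpans(reader);
    while (spans.next()) {
        // Integer division, since every page starts at a
        // multiple of the stride.
        int page = spans.start() / PAGE_STRIDE;
        System.out.println("doc " + spans.doc() + ", page " + page);
    }
    reader.close();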
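Sketch #5: the variant that stores the first-token position of each
page in the doc instead, here in a made-up comma-separated format. A
binary search over the parsed positions maps any term position to its
page:

    import java.util.Arrays;

    String raw = "0,812,1620,2444";   // e.g. from doc.get("pageOffsets")
    String[] parts = raw.split(",");
    int[] pageStarts = new int[parts.length];
    for (int i = 0; i < parts.length; i++) {
        pageStarts[i] = Integer.parseInt(parts[i]);
    }

    int position = 1700;              // from Spans.start(), say
    int idx = Arrays.binarySearch(pageStarts, position);
    // On a miss, binarySearch returns -(insertion point) - 1,
    // so the containing page is the insertion point minus one.
    int page = (idx >= 0) ? idx : (-idx - 2);
    // Here page == 2 (0-based), since 1620 <= 1700 < 2444.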