OK, I will give this a try. Now I have the problem that I do not know how to get the offsets (or positions? What is the difference?) back from the searched document...
There is an IndexReader#termPositions(Term t), but this returns the positions for the whole index, not a single document.

> -----Original Message-----
> From: Erick Erickson [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 24, 2008 20:56
> To: java-user@lucene.apache.org
> Subject: Re: Design questions
>
> I think you'll have to implement your own Analyzer and count.
> That is, every call to next() that returns a token will have to
> also increment some counter by 1.
>
> To use this, you must have some way of knowing when a page
> ends, and at that point you call your instance of your custom
> analyzer to see what the count is. Or your analyzer maintains
> the list and you can call for it after you've added all the pages.
>
> Analyzer.getPositionIncrementGap is called every time you
> call document.add("field".....
>
> So, you have something like this
> while (more pages for doc) {
>     String pagedata = getPageText();
>     doc.add("text", pagedata);
> }
>
> Under the covers, your custom analyzer adds the current offset
> (which you've kept track of) to, say, an ArrayList. And after the
> last page is added, you get this ArrayList and add it to your
> document.
>
> Or, you could just do things twice. That is, send your text through
> a TokenStream, then call next() and count. Then send it all
> through doc.add().
>
> There are probably cleverer ways, but that should do for a start.
>
> Best
> Erick
>
> On Jan 24, 2008 2:33 PM, <[EMAIL PROTECTED]> wrote:
>
> > > -----Original Message-----
> > > From: Erick Erickson [mailto:[EMAIL PROTECTED]
> > > Sent: Friday, January 11, 2008 16:16
> > > To: java-user@lucene.apache.org
> > > Subject: Re: Design questions
> >
> > > But you could also vary this scheme by simply storing in your document
> > > the offsets for the beginning of each page.
> >
> > Well, this is the best for my app I think, but...
> >
> > How do I find out these offsets?
> >
> > I'm adding the content field with:
> >
> > IndexWriter#add(new Field("content", myContentReader));
> >
> > I have no clue how to find out the offsets in this reader. Must be something
> > with an analyzer and a TokenStream?
> >
> > Thank you
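
To make the counting idea quoted above concrete, here is a minimal sketch in plain Java. It deliberately uses no Lucene classes: the whitespace split is only a stand-in for calling next() on the real TokenStream and counting, so the numbers will match the index only if the real analyzer produces the same tokens. The class name PageStartTracker and its methods are made up for illustration, and the sketch assumes a position increment gap of 0 between the per-page doc.add() calls. It tracks token positions (the ordinal of each token in the field), not character offsets.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: remembers the token position at which each page starts. */
class PageStartTracker {
    private final List<Integer> pageStarts = new ArrayList<>();
    private int tokenCount = 0;

    /** Call once per page, in order, with the same text that is added to the document. */
    void addPage(String pageText) {
        // The page starts at the position of the next token to be produced.
        pageStarts.add(tokenCount);
        String trimmed = pageText.trim();
        if (!trimmed.isEmpty()) {
            // Assumption: whitespace tokenization stands in for the real analyzer.
            tokenCount += trimmed.split("\\s+").length;
        }
    }

    /** Start positions of all pages added so far (a copy, safe to store). */
    List<Integer> getPageStarts() {
        return new ArrayList<>(pageStarts);
    }

    /** Zero-based page index containing the given token position, or -1 if out of range. */
    int pageForPosition(int position) {
        if (position < 0 || position >= tokenCount) {
            return -1;
        }
        // Scan from the last page backwards so empty pages (duplicate starts) are skipped.
        for (int i = pageStarts.size() - 1; i >= 0; i--) {
            if (pageStarts.get(i) <= position) {
                return i;
            }
        }
        return -1;
    }
}
```

The list from getPageStarts() is what would be stored alongside the document after the last page is added; at search time, a term position for a hit can then be mapped back to a page with pageForPosition().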