I'm running into a set of issues with Lucene that all come down to the difficulty of retrieving offsets. I need some advice on how best to proceed, or a pointer if this has already been answered somewhere.
My app requires that I display all portions of the documents where the search term or terms are found. Because of this, I always use SpanQuery.getSpans(), since knowing only which documents matched isn't enough. However, this still leaves me with a lot of unresolved problems.

- I cannot find any standard way to map the returned span positions to offsets. For single-term queries, I can get at offsets by writing a custom TermVectorMapper. For more complex queries, I have to (I think) use rewrite(), extract the target terms, then load their term vectors and go through them to find the positions that match what's in the span, and pull up the corresponding offsets. This is...surprising. We took considerable pains during indexing to maintain the offset information through several layers of analysis filters, but now we can't get to it while searching without considerably more pain. Am I missing something obvious?

- More generally, I would like to be able to iterate over positions in a document, collecting offset information for those positions as I go. Is there any way to do this? I didn't find such an iterator, but I may not know where to look. Everything I did find was tied to iterating over positions for specific terms, which is not relevant here.

Right now, I can think of these options:

1) Get at offsets via term vectors, and try to make that as fast as possible by "short-circuiting" how much of the term vector we load.
2) Maintain external per-document position->offset maps outside Lucene.
3) Maybe store offsets as payloads?

But is there already a (non-term-vector-based) way of getting at offsets that I don't know about? My ideal solution would be an iterable position->offset map for each document; failing that, an enhancement to getSpans() that returns offset information along with position.

It seems like LUCENE-2878 and LUCENE-3318 are concerned with at least some of these issues, but the comments are a bit inside-baseball for me at this stage.
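To make option 2 (the external position->offset map) concrete, here is a rough sketch of the kind of structure I have in mind. Nothing in it is Lucene API: PositionOffsetMap, resolveSpan() and fromWhitespaceTokens() are hypothetical names, and the whitespace splitter is only a stand-in for our real analysis chain. It assumes every token advances the position by exactly 1, and that a span's end position is exclusive, as with Spans.end().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucene): token position -> character offsets for one document. */
final class PositionOffsetMap {
    // List index is the token position; each entry is {startOffset, endOffset}.
    private final List<int[]> offsets = new ArrayList<>();

    /** Records offsets for the next position (assumes a position increment of 1 per token). */
    void add(int startOffset, int endOffset) {
        offsets.add(new int[] {startOffset, endOffset});
    }

    int size() {
        return offsets.size();
    }

    /**
     * Maps a span of positions [spanStart, spanEnd) to {startOffset, endOffset}.
     * The end position is treated as exclusive.
     */
    int[] resolveSpan(int spanStart, int spanEnd) {
        if (spanStart < 0 || spanEnd <= spanStart || spanEnd > offsets.size()) {
            throw new IllegalArgumentException("bad span: " + spanStart + ".." + spanEnd);
        }
        int start = offsets.get(spanStart)[0];   // start of the first token in the span
        int end = offsets.get(spanEnd - 1)[1];   // end of the last token in the span
        return new int[] {start, end};
    }

    /** Stand-in for the real analyzer: one token per run of non-whitespace characters. */
    static PositionOffsetMap fromWhitespaceTokens(String text) {
        PositionOffsetMap map = new PositionOffsetMap();
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;                              // skip the gap between tokens
            }
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;                              // consume one token
            }
            if (i > start) {
                map.add(start, i);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox";
        PositionOffsetMap map = fromWhitespaceTokens(text);
        int[] range = map.resolveSpan(1, 3);      // positions 1 and 2
        System.out.println(text.substring(range[0], range[1])); // prints "quick brown"
    }
}
```

In a real setup the map would presumably be filled during indexing from the same token stream that feeds Lucene, so that positions agree, and it would need to handle position increments other than 1 (stop words, synonyms), which this sketch ignores.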
So I would greatly appreciate any advice on this issue.

nishad