Re: Incorrect Token Offset when using multiple fieldable instance

2008-07-02 Thread Michael McCandless
Toph wrote: Michael McCandless-2 wrote: We could alternatively extend TokenStream so you could query it for the final offset, then fix indexing to use that value instead of the endOffset of the last token that it saw. Querying the tokenstream for the final offset would be good, but then w

Re: Incorrect Token Offset when using multiple fieldable instance

2008-07-02 Thread Toph
Michael McCandless-2 wrote:
> This would actually be a fairly large change: it's a change to the
> index format and all APIs that handle offsets during indexing &
> searching/retrieving.

For now I just changed the offset calculation in DocumentWriter as specified here by the OP:

Re: Incorrect Token Offset when using multiple fieldable instance

2008-07-02 Thread Michael McCandless
This would actually be a fairly large change: it's a change to the index format and all APIs that handle offsets during indexing & searching/retrieving. We could alternatively extend TokenStream so you could query it for the final offset, then fix indexing to use that value instead of the

Re: Incorrect Token Offset when using multiple fieldable instance

2008-06-30 Thread Toph
Interesting discussion... glad I'm not the only one with this challenge.

Michael McCandless-2 wrote:
> EG, if you use Highlighter on a
> multi-valued field indexed with stored field & term vectors and say
> the first field ended with a stop word that was filtered out, then
> your offset

Re: Incorrect Token Offset when using multiple fieldable instance

2008-03-05 Thread Michael McCandless
Well, first off, sometimes the thing being indexed isn't a string, so you have no stringValue to get its length. It could be a Reader or a TokenStream. Second off, it's conceivable that an analyzer computes its own "interesting" offsets that are not in fact simple indices into the stri

Re: Incorrect Token Offset when using multiple fieldable instance

2008-03-05 Thread Renaud Delbru
Do you know if there will be side effects if we replace, in DocumentWriter$FieldData#invertField, offset = offsetEnd+1; with offset = stringValue.length();? I still do not understand the reason for incrementing the start offset this way. Regards. Michael McCandless wrote: This is ho

Re: Incorrect Token Offset when using multiple fieldable instance

2008-03-05 Thread Michael McCandless
This is how Lucene has worked for quite some time (since 1.9). When there are multiple fields with the same name in one Document, each field's offset starts from the last offset (offset of the last token) seen in the previous field. If tokens are skipped at the end there's no way IndexWri
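
The behaviour described here can be illustrated with a minimal, self-contained sketch. It does not use the Lucene API: the class OffsetSketch, its methods, and the stop-word set are all hypothetical names made up for illustration. It compares two ways of choosing the base offset for the next same-named field value: "end offset of the last token seen, plus one" (the rule discussed in this thread) and "length of the string value" (along the lines of the change proposed by the OP; adding it to the current base is an assumption of this sketch).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OffsetSketch {
    // Hypothetical stop-word set, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    /** Splits on spaces, drops stop words, and returns {start, end} offsets shifted by base. */
    static List<int[]> tokenize(String value, int base) {
        List<int[]> offsets = new ArrayList<>();
        int i = 0;
        while (i < value.length()) {
            while (i < value.length() && value.charAt(i) == ' ') i++;   // skip spaces
            int start = i;
            while (i < value.length() && value.charAt(i) != ' ') i++;   // scan one word
            if (i > start && !STOP_WORDS.contains(value.substring(start, i))) {
                offsets.add(new int[] {base + start, base + i});
            }
        }
        return offsets;
    }

    /** Next base = end offset of the last token that survived filtering, plus one. */
    static int baseFromLastToken(List<int[]> tokens, int currentBase) {
        if (tokens.isEmpty()) return currentBase;  // sketch's own choice for an empty value
        return tokens.get(tokens.size() - 1)[1] + 1;
    }

    /** Next base = current base plus the full length of the string value (assumed variant). */
    static int baseFromLength(String value, int currentBase) {
        return currentBase + value.length();
    }

    public static void main(String[] args) {
        String first = "end of the";               // ends with two filtered stop words
        List<int[]> tokens = tokenize(first, 0);   // only "end" survives, at [0, 3]
        System.out.println(baseFromLastToken(tokens, 0)); // prints 4
        System.out.println(baseFromLength(first, 0));     // prints 10
    }
}
```

With the first rule, the second value's tokens start at offset 4 even though the first value is 10 characters long, so offsets no longer line up with the stored text when trailing tokens are filtered out. The length-based variant only works when the value is a plain string, which is the objection raised elsewhere in the thread for Reader and TokenStream values.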