On Nov 17, 2009, at 10:37 AM, Christopher Tignor wrote:

> Hello,
> 
> Hoping someone might clear up a question for me:
> 
> When tokenizing, we provide the start and end character offsets for each
> token, locating it within the source text.
> 
> If I tokenize the text "word" and then search for the term "word" in the
> same field, how can I recover this character offset information in the
> matching documents to precisely locate the word?  I have been storing this
> character info myself using payload data, but if Lucene stores it, then I am
> doing so needlessly.  If recovering this character offset info isn't
> possible, what is this character offset info used for?

Lucene doesn't currently store offset information by default, so you are not 
duplicating anything.

There are 4 possible ways to store it that I know of, one of which is under 
construction now and will eventually be the best solution:

1. Payloads - in fact, there is a TokenFilter in the payloads package under 
contrib/analysis that does just this.
2. Term Vectors - Stores lots of info along with the offsets.  I've often used 
this along with SpanQueries to get precise locations.
3. Hack up the highlighter (assuming you aren't just doing this for 
highlighting)
4. Flexible indexing (future) - Create your index your way and store the offset 
info in a strongly typed payload.  This will require writing your own code.
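For option 2, here's a rough sketch against the 2.9 APIs (the field name, the 
RAMDirectory, and the sample text are just for illustration; adapt to your own 
schema):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class OffsetDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        // The key bit: ask for term vectors WITH offsets at indexing time.
        doc.add(new Field("body", "some word here",
                Field.Store.YES, Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = IndexReader.open(dir, true);
        // Because offsets were stored, the vector is a TermPositionVector.
        TermPositionVector tpv =
                (TermPositionVector) reader.getTermFreqVector(0, "body");
        int idx = tpv.indexOf("word");
        for (TermVectorOffsetInfo info : tpv.getOffsets(idx)) {
            // Prints the character span of "word" in the source text: 5-9
            System.out.println(info.getStartOffset() + "-" + info.getEndOffset());
        }
        reader.close();
    }
}
```

In practice you'd run a SpanQuery to find the matching positions and then look 
up the offsets for those positions from the term vector, rather than walking 
every occurrence of the term.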

-Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org