On Nov 17, 2009, at 10:37 AM, Christopher Tignor wrote: > Hello, > > Hoping someone might clear up a question for me: > > When Tokenizing we provide the start and end character offsets for each > token locating it within the source text. > > If I tokenize the text "word" and then serach for the term "word" in the > same field, how can I recover this character offset information in the > matching documents to precisely locate the word? I have been storing this > character info myself using payload data but if lucene stores it, then I am > doing so needlessly. If recovering this character offset info isn't > possible, what is this charcter offset info used for?
Lucene doesn't, currently, store offset information, so you are not duplicating this. There are 4 possible ways to store it that I know of, one of which is under construction now and will eventually be the best solution: 1. Payloads - in fact, there is a TokenFilter in the payloads package under contrib/analysis that does just this. 2. Term Vectors - Stores lots of info along w/ the offsets. I've often used this along w/ SpanQueries to get precise locations 3. Hack up the highlighter (assuming you aren't just doing this for highlighting) 4. Flexible indexing (future) - Create your index your way and store the offset info in a strongly typed payload. This will require writing your own code. -Grant --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org