Hi, if I use Tika to parse HTML and feed the extracted string to a Lucene analyzer, what happens to the offset information needed for KWIC and for mapping hits back to the original text (like the Google cache view)? How can I keep track of offsets between the Tika parser and the Lucene analyzer?
What are the solutions or ideas for building a kind of Google cache view with Tika and the Lucene analyzer API? With the provided API I can't keep the original content as the cache; I have to cache the Tika output instead, which results in a degraded cache view. I haven't looked too closely at Tika, but maybe there is a way with SAX Locators? For example, build an associative array mapping offsets in the Tika-parsed string to offsets in the original document, and use some kind of token filter to correct the OffsetAttribute?

--
David Causse
Spotter
http://www.spotter.com/
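To make the associative-array idea concrete, here is a minimal sketch in plain Java. It is only an illustration: the OffsetMap class and its methods are hypothetical names, not part of Tika or Lucene, and it assumes the parser can report where each run of extracted text started in the original document (which is exactly the open question). Wiring it into a token filter that rewrites OffsetAttribute is left out.

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical helper (not a Tika or Lucene class): maps offsets in the
 * extracted text back to offsets in the original HTML.
 */
class OffsetMap {
    // Key: offset in the extracted text where a run of text starts.
    // Value: offset in the original document where that same run starts.
    private final TreeMap<Integer, Integer> runs = new TreeMap<>();

    /** Record that the run starting at parsedStart came from originalStart. */
    void addRun(int parsedStart, int originalStart) {
        runs.put(parsedStart, originalStart);
    }

    /** Translate a start offset in the extracted text to the original. */
    int toOriginal(int parsedOffset) {
        Map.Entry<Integer, Integer> run = runs.floorEntry(parsedOffset);
        if (run == null) {
            return parsedOffset; // before the first recorded run: no correction known
        }
        return run.getValue() + (parsedOffset - run.getKey());
    }

    /**
     * Translate an (exclusive) end offset. Mapping the last character and
     * adding one keeps an end that falls on a run boundary inside its own run.
     */
    int toOriginalEnd(int parsedEnd) {
        return parsedEnd == 0 ? toOriginal(0) : toOriginal(parsedEnd - 1) + 1;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        // The extracted text is assumed to be "Hello world".
        OffsetMap map = new OffsetMap();
        map.addRun(0, html.indexOf("Hello ")); // run "Hello " starts at 0 in the text
        map.addRun(6, html.indexOf("world"));  // run "world" starts at 6 in the text

        // A token "world" with offsets [6, 11) in the extracted text:
        int start = map.toOriginal(6);
        int end = map.toOriginalEnd(11);
        System.out.println(html.substring(start, end)); // prints "world"
    }
}
```

With a map like this, a filter would replace each token's start and end offsets with the mapped values, so highlighting could be applied to the cached original instead of the Tika output.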