Re: Use of tika for parsing, offsets questions

Grant Ingersoll Thu, 03 Sep 2009 06:51:59 -0700


On Sep 2, 2009, at 5:40 AM, David Causse wrote:

Hi,

If I use Tika to parse HTML and pass the parsed String to a Lucene
analyzer, what happens to the offset information needed for KWIC and
for returning to the text (like the Google cache view)? How can I keep
track of the offsets between the Tika parser and the Lucene analyzer?

What are the solutions/ideas for building a sort of Google cache view
with Tika and the Lucene analyzer API?

With the provided API I can't keep the original content as the cache; I
have to cache the Tika output instead, which results in a degraded
cache view. I didn't look too closely at Tika, but maybe there is a way
with SAX Locators? Build an associative array of Tika parsed-string
offsets vs. actual offsets, and use a sort of token filter to rectify
the OffsetAttribute?
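
A minimal sketch of the associative-array idea, in plain Java with no Tika or Lucene dependency. The class name OffsetMap and its methods are hypothetical, and the sketch assumes each text run is copied verbatim from the original (an entity such as &amp; would need its own run, since its length differs between markup and extracted text). A token filter could consult such a map before setting the corrected offsets on the OffsetAttribute.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: maps offsets in the extracted text back to the original markup. */
class OffsetMap {
    // key: start of a text run in the extracted string
    // value: start of the same run in the original document
    private final TreeMap<Integer, Integer> runs = new TreeMap<>();

    /** Record that the run starting at extractedStart came from originalStart. */
    public void addRun(int extractedStart, int originalStart) {
        runs.put(extractedStart, originalStart);
    }

    /** Translate an offset in the extracted text into an offset in the original. */
    public int toOriginal(int extractedOffset) {
        Map.Entry<Integer, Integer> run = runs.floorEntry(extractedOffset);
        if (run == null) {
            throw new IllegalArgumentException("No run covers offset " + extractedOffset);
        }
        return run.getValue() + (extractedOffset - run.getKey());
    }

    public static void main(String[] args) {
        String original = "<p>Hello <b>world</b></p>";
        // Extracted text is "Hello world": "Hello " starts at 3 and "world" at 12 in the original.
        OffsetMap map = new OffsetMap();
        map.addRun(0, 3);
        map.addRun(6, 12);
        int start = map.toOriginal(6);
        // Map the last character of the token, then add 1 to get an exclusive end.
        int end = map.toOriginal(10) + 1;
        System.out.println(original.substring(start, end)); // prints "world"
    }
}
```

The end offset is derived from the token's last character rather than from its exclusive end, because an exclusive end can coincide with the start of the next run and would then be translated with the wrong base.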

Hmm, maybe you could implement a ContentHandler for Tika that, instead of creating a string for the Document, creates a TokenStream. Then, you can have it add the offsets as payloads so that you then have those offsets later when rendering your view.
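
A rough sketch of that handler, using only the SAX classes in the JDK. Tika hands its output to an org.xml.sax.ContentHandler, so a handler like this could in principle be passed to a Tika parser; here the JDK SAX parser stands in for Tika so the example runs on its own. The names TokenCollectingHandler, Token and tokenize are hypothetical. The offsets recorded are positions in the extracted text; the Locator only offers an approximate line number, and only if the parser supplies one. Wrapping the collected tokens in a Lucene TokenStream and storing the offsets as payloads is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects whitespace-separated tokens instead of one big string. */
class TokenCollectingHandler extends DefaultHandler {

    /** Hypothetical token record. */
    static final class Token {
        final String term;
        final int start; // inclusive offset in the extracted text
        final int end;   // exclusive offset in the extracted text
        final int line;  // approximate source line, or -1 without a Locator

        Token(String term, int start, int end, int line) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.line = line;
        }
    }

    private final List<Token> tokens = new ArrayList<>();
    private final StringBuilder pending = new StringBuilder();
    private Locator locator;
    private int position;        // number of text characters seen so far
    private int pendingStart;
    private int pendingLine = -1;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            if (Character.isWhitespace(ch[i])) {
                flush();
            } else {
                if (pending.length() == 0) {
                    pendingStart = position;
                    pendingLine = (locator != null) ? locator.getLineNumber() : -1;
                }
                pending.append(ch[i]);
            }
            position++;
        }
    }

    // Element boundaries and the end of the document also end the current token.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flush();
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
    }

    @Override
    public void endDocument() {
        flush();
    }

    private void flush() {
        if (pending.length() > 0) {
            tokens.add(new Token(pending.toString(), pendingStart,
                    pendingStart + pending.length(), pendingLine));
            pending.setLength(0);
        }
    }

    public List<Token> getTokens() {
        return tokens;
    }

    /** Demo helper: parses well-formed XHTML with the JDK SAX parser (a stand-in for Tika). */
    public static List<Token> tokenize(String xhtml) {
        TokenCollectingHandler handler = new TokenCollectingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
        return handler.getTokens();
    }

    public static void main(String[] args) {
        for (Token t : tokenize("<html><body><p>Hello <b>world</b></p></body></html>")) {
            System.out.println(t.term + " [" + t.start + "," + t.end + ") line " + t.line);
        }
    }
}
```

For the sample input this collects "Hello" at [0,5) and "world" at [6,11) in the extracted text. Getting true character offsets into the original HTML would need more than SAX exposes, which is where the mapping question above comes back in.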

--
David Causse
Spotter
http://www.spotter.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search


