Re: Bypassing ExtractingRequestHandler

Charlie Hull Fri, 10 Jun 2016 01:23:00 -0700

On 10/06/2016 02:20, Justin Lee wrote:

Has anybody had any experience bypassing ExtractingRequestHandler and
simply managing Tika manually?  I want to make a small modification to Tika
to get and save additional data from my PDFs, but I have been
procrastinating in no small part due to the unpleasant prospect of setting
up a development environment where I could compile and debug modifications
that might run through PDFBox, Tika, and ExtractingRequestHandler.  It
occurs to me that it would be much easier if the two were separate, so I
could have direct control over Tika and just submit the text to Solr after
extraction.  Am I going to regret this approach?  I'm not sure what
ExtractingRequestHandler really does for me that Tika doesn't already do.

We tend to prefer running Tika externally as it's entirely possible that
Tika will crash or hang with certain files - and that will bring down
Solr if you're running Tika within it. Here's a Dropwizard wrapper
around Tika that might be of use:

https://github.com/mattflax/dropwizard-tika-server
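
For illustration, here is a minimal sketch of that separation: extract the
text in a separate Tika process, then post it to Solr as an ordinary
document. It assumes a stock Apache Tika Server (not the Dropwizard wrapper
above, whose API may differ) listening on localhost:9998, a Solr core
called "docs", and a text field called "content". Those URLs, the core name
and the field name are placeholders, not anything from this thread.

```python
import json
import urllib.request

# Assumed endpoints: adjust to wherever Tika Server and Solr actually run.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def build_tika_request(pdf_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking Tika Server for the plain text of a PDF."""
    return urllib.request.Request(
        tika_url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def build_solr_request(doc_id, text, solr_url=SOLR_UPDATE_URL):
    """Build a POST request adding one document through Solr's JSON update handler."""
    # "content" is a hypothetical field name; use whatever the schema defines.
    payload = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def extract_and_index(doc_id, pdf_bytes, timeout=60):
    """Extract text out of process, then index it in Solr.

    The timeout limits how long each network operation may block, so a
    hung extraction raises an error here instead of stalling Solr.
    """
    with urllib.request.urlopen(build_tika_request(pdf_bytes), timeout=timeout) as resp:
        text = resp.read().decode("utf-8")
    with urllib.request.urlopen(build_solr_request(doc_id, text), timeout=timeout) as resp:
        return resp.status
```

Any extra data pulled out of the PDFs by a modified Tika could be added as
further fields in the JSON document before it is posted.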

Cheers

Charlie


Also, I was reading this
<http://stackoverflow.com/questions/33292776/solr-tika-processor-not-crawling-my-pdf-files-prefectly>
stackoverflow entry and someone offhandedly mentioned that
ExtractingRequestHandler might be separated in the future anyway. Is there
a public roadmap for the project, or does one have to follow the
developers' mailing list and hunt through JIRA entries to keep up with the
pulse of the project?

Thanks,
Justin



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
