On 10/06/2016 02:20, Justin Lee wrote:
Has anybody had any experience bypassing ExtractingRequestHandler and simply managing Tika manually? I want to make a small modification to Tika to get and save additional data from my PDFs, but I have been procrastinating in no small part due to the unpleasant prospect of setting up a development environment where I could compile and debug modifications that might run through PDFBox, Tika, and ExtractingRequestHandler. It occurs to me that it would be much easier if the two were separate, so I could have direct control over Tika and just submit the text to Solr after extraction. Am I going to regret this approach? I'm not sure what ExtractingRequestHandler really does for me that Tika doesn't already do.
We tend to prefer running Tika externally as it's entirely possible that Tika will crash or hang with certain files - and that will bring down Solr if you're running Tika within it. Here's a Dropwizard wrapper around Tika that might be of use:
https://github.com/mattflax/dropwizard-tika-server

Cheers

Charlie
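To make the "run Tika externally, then send the text to Solr" idea concrete, here is a minimal sketch rather than a definitive implementation. It assumes a stock Apache Tika server listening on its default port 9998 (not the Dropwizard wrapper linked above, whose endpoints may differ) and a hypothetical Solr core named "docs". The field names `content_txt` and `pages_i`, and the helper names, are illustrative assumptions that rely on dynamic fields being enabled in the schema.

```python
import json
import urllib.request

# Assumed endpoints: tika-server's default port and a hypothetical core "docs".
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def build_tika_request(pdf_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking the Tika server for plain-text extraction."""
    return urllib.request.Request(
        tika_url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def build_solr_request(doc_id, text, extra_fields=None, solr_url=SOLR_UPDATE_URL):
    """Build a JSON update request carrying the extracted text to Solr."""
    # Field names are hypothetical; adjust them to match your schema.
    doc = {"id": doc_id, "content_txt": text}
    doc.update(extra_fields or {})
    body = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_pdf(path, timeout=60):
    """Extract text with the external Tika server, then index it in Solr."""
    with open(path, "rb") as f:
        pdf_bytes = f.read()
    # The timeout bounds how long a hung extraction can block the indexer;
    # a crash or hang raises here instead of taking Solr down with it.
    with urllib.request.urlopen(build_tika_request(pdf_bytes), timeout=timeout) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    with urllib.request.urlopen(build_solr_request(path, text), timeout=timeout) as resp:
        return resp.status
```

Calling `index_pdf("report.pdf")` would run the whole round trip. Any additional data pulled from the PDFs by a modified Tika could be passed through `extra_fields`, which keeps that customisation entirely outside Solr.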
Also, I was reading this <http://stackoverflow.com/questions/33292776/solr-tika-processor-not-crawling-my-pdf-files-prefectly> Stack Overflow entry, and someone offhandedly mentioned that ExtractingRequestHandler might be separated in the future anyway. Is there a public roadmap for the project, or does one have to follow the developers' mailing list and hunt through JIRA entries to keep up with the pulse of the project? Thanks, Justin
-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk