jnioche opened a new issue, #1901: URL: https://github.com/apache/stormcrawler/issues/1901
### What would you like to be improved?

Every so often, an open crawl stumbles upon a very large PDF. Such documents can take a lot of CPU to parse, even though effectively most of their content will not be indexed.

### How should we improve?

https://github.com/apache/tika/pull/2803 introduced a configuration option for PDF parsing in Tika that stops processing after X pages. We should make use of it as soon as the next version of Tika is released (currently 3.3.0).
