[I] [Improvement] Add configuration to stop parsing PDFs after X pages [stormcrawler]

via GitHub Fri, 08 May 2026 06:18:13 -0700


jnioche opened a new issue, #1901:
URL: https://github.com/apache/stormcrawler/issues/1901


   ### What would you like to be improved?
   
   Every so often, an open crawl stumbles upon a very large PDF. Such files can 
take a lot of CPU to parse, even though most of their content will effectively 
not be indexed.
   
   ### How should we improve?
   
   https://github.com/apache/tika/pull/2803 introduced a configuration option for 
PDF parsing in Tika that stops processing after X pages. We should make use of 
it as soon as the next version of Tika is released (currently 3.3.0).


