Tim Allison created TIKA-4674:
---------------------------------
Summary: Add a progress timeout feature
Key: TIKA-4674
URL: https://issues.apache.org/jira/browse/TIKA-4674
Project: Tika
Issue Type: New Feature
Reporter: Tim Allison
When processing a 100-page PDF that requires OCR, we want to allow a LOT of
time, but we don't want to allow that much time for a file that triggers
an infinite loop in a parser.
I propose adding a progress timeout feature that will be enforced in
tika-pipes. We'll update the progress counter in the OCR parsers and anywhere
else where we expect processing to take a while.
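
As a rough illustration of the progress counter idea, here is a minimal sketch.
ProgressTracker and its methods are hypothetical names made up for this
example, not existing Tika classes, and the injectable clock is only there to
make the behavior easy to test. A parser would call update() whenever it
completes a unit of work (for example, after each OCR'd page), and tika-pipes
would poll millisSinceLastProgress().

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch only; this is not an existing Tika class.
public class ProgressTracker {

    // Source of the current time in milliseconds (assumed injectable for testing).
    private final LongSupplier clockMillis;

    // Timestamp of the most recent progress update, in milliseconds.
    private final AtomicLong lastProgressMillis;

    public ProgressTracker(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
        // Creating the tracker counts as the first sign of progress.
        this.lastProgressMillis = new AtomicLong(clockMillis.getAsLong());
    }

    // Called by a parser to signal that it is still making progress.
    public void update() {
        lastProgressMillis.set(clockMillis.getAsLong());
    }

    // Milliseconds elapsed since the last call to update() (or since construction).
    public long millisSinceLastProgress() {
        return clockMillis.getAsLong() - lastProgressMillis.get();
    }
}
```

In real use, the clock could be something monotonic such as
() -> System.nanoTime() / 1_000_000L, so that wall-clock adjustments do not
affect the measurement.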
TotalTaskTimeout will still be operative.
So, one scenario would be a TotalTaskTimeout of an hour, with the progress
timeout set to 2 minutes. If a call to tesseract takes more than 2 minutes,
then the job is stopped. Likewise, if a rogue parser runs for longer than 2
minutes (and the progress counter is not updated inside the loop where it is
going rogue!), then the task will time out after 2 minutes.
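
The scenario above can be sketched as a small decision function. This is only
an illustration under stated assumptions: TimeoutPolicy, Verdict and check()
are hypothetical names, not part of tika-pipes, and checking the total timeout
before the progress timeout is an arbitrary choice made for this example.

```java
import java.time.Duration;

// Hypothetical sketch only; this is not an existing tika-pipes class.
public class TimeoutPolicy {

    public enum Verdict { CONTINUE, PROGRESS_TIMEOUT, TOTAL_TIMEOUT }

    private final Duration totalTaskTimeout;
    private final Duration progressTimeout;

    public TimeoutPolicy(Duration totalTaskTimeout, Duration progressTimeout) {
        this.totalTaskTimeout = totalTaskTimeout;
        this.progressTimeout = progressTimeout;
    }

    // Decides whether a task may keep running, given how long it has been
    // running overall and how long ago it last reported progress.
    public Verdict check(Duration sinceTaskStart, Duration sinceLastProgress) {
        // The overall limit always applies, even if progress is still reported.
        if (sinceTaskStart.compareTo(totalTaskTimeout) > 0) {
            return Verdict.TOTAL_TIMEOUT;
        }
        // No progress update within the window: treat the task as stuck.
        if (sinceLastProgress.compareTo(progressTimeout) > 0) {
            return Verdict.PROGRESS_TIMEOUT;
        }
        return Verdict.CONTINUE;
    }
}
```

With a one-hour total timeout and a 2-minute progress timeout, a task that is
30 minutes in and last reported progress 30 seconds ago continues, while the
same task with 3 minutes of silence is stopped for lack of progress.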
We could then get rid of the separate timeouts on the external parsers and
have them read these global timeouts instead, with a focus on the progress
timeout.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)