Tim Allison created TIKA-4674:
---------------------------------

             Summary: Add a progress timeout feature
                 Key: TIKA-4674
                 URL: https://issues.apache.org/jira/browse/TIKA-4674
             Project: Tika
          Issue Type: New Feature
            Reporter: Tim Allison


When processing a 100-page PDF that requires OCR, we want to allow a LOT of 
time, but we don't want to allow that much time for a file that triggers an 
infinite loop in a parser.

I propose adding a progress timeout feature that will be enforced in 
tika-pipes. We'll update the progress counter in the OCR parsers and anywhere 
else we expect processing to take a while.

TotalTaskTimeout will still be operative. 

So, in one scenario, TotalTaskTimeout is an hour and the progress timeout is 
set to 2 minutes. If a call to tesseract takes more than 2 minutes, the job is 
stopped. Likewise, if a rogue parser runs for longer than 2 minutes (and the 
progress counter is not updated in the loop where it is going rogue!), it will 
time out after 2 minutes.
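
To make the interplay of the two timeouts concrete, here is a minimal sketch 
in Java. It is not Tika's actual API: the class ProgressWatchdog, its methods, 
and the injected clock are all hypothetical names chosen for illustration. It 
assumes a parser calls updateProgress() whenever it finishes a unit of work 
(say, one OCR'd page) and that a supervisor polls check() periodically.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of a progress timeout combined with a total task timeout.
// None of these names come from Tika; they are assumptions for illustration.
final class ProgressWatchdog {

    enum Status { OK, PROGRESS_TIMEOUT, TOTAL_TIMEOUT }

    private final long totalTimeoutMillis;
    private final long progressTimeoutMillis;
    private final LongSupplier clock;          // injected so the logic is testable
    private final long startMillis;
    private volatile long lastProgressMillis;  // written by parser, read by supervisor

    ProgressWatchdog(long totalTimeoutMillis, long progressTimeoutMillis,
                     LongSupplier clock) {
        this.totalTimeoutMillis = totalTimeoutMillis;
        this.progressTimeoutMillis = progressTimeoutMillis;
        this.clock = clock;
        this.startMillis = clock.getAsLong();
        this.lastProgressMillis = this.startMillis;
    }

    // Called by a parser each time it completes a unit of work.
    void updateProgress() {
        lastProgressMillis = clock.getAsLong();
    }

    // Polled by the supervisor; the total timeout is checked first, so it
    // still applies even when progress keeps being reported.
    Status check() {
        long now = clock.getAsLong();
        if (now - startMillis > totalTimeoutMillis) {
            return Status.TOTAL_TIMEOUT;
        }
        if (now - lastProgressMillis > progressTimeoutMillis) {
            return Status.PROGRESS_TIMEOUT;
        }
        return Status.OK;
    }
}
```

In real use the clock would be something like System::currentTimeMillis, with 
a one-hour total timeout and a 2-minute progress timeout as in the scenario 
above. A parser stuck in a loop that never calls updateProgress() would be 
reported as PROGRESS_TIMEOUT after 2 minutes, while a long OCR job that keeps 
reporting progress would only be stopped by the total timeout.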

We could then get rid of the separate timeouts on the external parsers and 
have them read these global timeouts instead, with a focus on the progress 
timeout.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
