jnioche commented on PR #1944:
URL: https://github.com/apache/stormcrawler/pull/1944#issuecomment-4717937566

   Yes, the split makes sense.
   
   > One thing I'd like your take on: should the re-emitted URLs go out as
   > `Status.ERROR` (reuses the existing path, but carries error/retry semantics),
   > or should we set an explicit future `nextFetchDate` so the scheduler honors the
   > exact back-off?
   
   `Status.ERROR` is not the right status: it indicates an irremediable problem
   with the document, for instance a PDF that cannot be parsed or a URL blocked
   by robots.txt.
   
   We could set an explicit `nextFetchDate`, but I think just mimicking what is
   done via `crawl-delay-too-long` would be good enough.

