rzo1 commented on PR #1944: URL: https://github.com/apache/stormcrawler/pull/1944#issuecomment-4722757632
Thanks @jnioche and @sebastian-nagel for the review and discussion.

Summary of where we landed: honouring `Retry-After` by holding the internal queue inside `FetcherBolt` is workable, but fragile. To make it correct we'd also need to make queue reaping back-off-aware (don't reap a queue while its `nextFetchTime` is in the future; see @sebastian-nagel's note and Nutch's `FetchItemQueues`). Even then, a long delay risks back-pressure and tuple timeouts, and the number of held queues can grow large in a broad crawl.

Given that #784 has been open for a long time without concrete user demand, investing in the interim in-bolt workaround doesn't seem worth the added complexity. The proper long-term home for this is the host-aware spout / host stream design (#867), which avoids the back-pressure problem entirely.

So I'll close this PR in favour of pursuing #867. #784 stays open and we can revisit `Retry-After` there as part of the host-aware implementation.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
