sebastian-nagel commented on code in PR #1944:
URL: https://github.com/apache/stormcrawler/pull/1944#discussion_r3420836230


##########
core/src/main/java/org/apache/stormcrawler/bolt/FetcherBolt.java:
##########


Review Comment:
   If the next fetch time is in the future, the queue must not be released 
yet. Otherwise the retry-after delay is not honored for new fetch items of the 
same site arriving through the topology.
   
   See how [Nutch's 
FetchItemQueues](https://github.com/apache/nutch/blob/8e03a3e998c1aac32ae3eb06f9f7cdf116c0c5f8/src/java/org/apache/nutch/fetcher/FetchItemQueues.java#L193)
 code releases queues in combination with the exponential back-off.
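
To illustrate the release condition, here is a minimal sketch. It is neither the FetcherBolt code nor the Nutch implementation: `RetryAwareQueues`, `SiteQueue` and all method names are hypothetical, and time is passed in explicitly as milliseconds to keep the example deterministic. The point is only that a queue is dropped when it is empty *and* its next fetch time has passed, so a stored delay still applies to items arriving later.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-site queues that keep a retry-after delay alive. */
class RetryAwareQueues {

    /** Per-site state: pending URLs plus the earliest allowed next fetch time. */
    static final class SiteQueue {
        final Deque<String> pending = new ArrayDeque<>();
        long nextFetchTimeMs = 0L;
    }

    private final Map<String, SiteQueue> queues = new HashMap<>();

    /** Adds a URL, reusing an existing queue so that a stored delay still applies. */
    void add(String siteKey, String url) {
        queues.computeIfAbsent(siteKey, k -> new SiteQueue()).pending.add(url);
    }

    /** Records a retry-after delay for the site, creating the queue if needed. */
    void setRetryAfter(String siteKey, long nowMs, long delayMs) {
        queues.computeIfAbsent(siteKey, k -> new SiteQueue()).nextFetchTimeMs = nowMs + delayMs;
    }

    /** Returns the next URL if the site may be fetched now, otherwise null. */
    String poll(String siteKey, long nowMs) {
        SiteQueue q = queues.get(siteKey);
        if (q == null || nowMs < q.nextFetchTimeMs) {
            return null;
        }
        return q.pending.poll();
    }

    /** Releases the queue only if it is empty and its next fetch time is not in the future. */
    boolean releaseIfIdle(String siteKey, long nowMs) {
        SiteQueue q = queues.get(siteKey);
        if (q == null || !q.pending.isEmpty() || q.nextFetchTimeMs > nowMs) {
            return false;
        }
        queues.remove(siteKey);
        return true;
    }

    /** Number of queues currently kept in memory. */
    int size() {
        return queues.size();
    }
}
```

In this sketch an empty queue with a pending delay stays in the map, which is exactly the memory cost that grows with the number of sites in a broad crawl.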
   
   > or should we set an explicit future `nextFetchDate` so the scheduler 
honors the exact back-off?
   
   This would only help if there are not too many items from the same site.
   
   > I'd treat the host stream / host-aware spout design 
(https://github.com/apache/stormcrawler/issues/867, your branch 990) as the 
proper long-term home for this,
   
   Definitely. Avoiding back-pressure is important. And in a broad crawl, the 
number of queues that must be kept to enforce the retry-after delay can 
grow large.



