GitHub user michaeldinzinger added a comment to the discussion: WARCHdfsBolt forwarding WARC file path to StatusUpdaterBolt
Another thing that came up on our end regarding this issue: besides the previously mentioned information

https://stormcrawler.net/faq/ --is_stored_in--> s3://path/to/file/WARC_file_0815.warc.gz

the following information would be especially good to have:

s3://path/to/file/WARC_file_0815.warc.gz --was_created_on--> Timestamp.now()

As far as I can tell, this is also not possible, because (1) the WARCHdfsBolt is a dead end, and (2) information within the StormCrawler topology is only propagated per URL, so to speak. (That is dangerous half-knowledge on my side.) Am I right about these two points?

The background of this question is that we want to trigger further processing of a WARC file once it has been completely written. So I'm wondering whether the crawler can provide us with a signal along the lines of "WARC file is now ready".

GitHub link: https://github.com/apache/stormcrawler/discussions/1566#discussioncomment-13495258

----
This is an automatically sent email for dev@stormcrawler.apache.org.
To unsubscribe, please send an email to: dev-unsubscr...@stormcrawler.apache.org