GitHub user michaeldinzinger added a comment to the discussion: WARCHdfsBolt 
forwarding WARC file path to StatusUpdaterBolt

Another thing that came up on our end regarding this issue:
Besides the previously mentioned information
https://stormcrawler.net/faq/ --is_stored_in--> 
s3://path/to/file/WARC_file_0815.warc.gz
it would be especially good to have the information
s3://path/to/file/WARC_file_0815.warc.gz --was_created_on--> Timestamp.now()
as well.
As far as I understand, this is also not possible at the moment, because
(1) the WARCHdfsBolt is a dead end, and
(2) information within the StormCrawler topology is only propagated per URL, 
so to speak. (That's dangerous half-knowledge on my side.)
Am I right about these two points?

The background to this question is that we want to trigger further processing 
of a WARC file as soon as it has been completely written. So I'm wondering 
whether the crawler can provide us with the info "WARC file is now ready".
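
To make the request concrete, here is a minimal Java sketch of the kind of
notification we have in mind. Everything in it (WarcReadySketch,
WarcFileReady, WarcReadyListener, SketchWarcWriter) is hypothetical and not
part of the StormCrawler or Storm API; the writer is a stand-in that only
tracks which file is open and writes no WARC data. It just shows a listener
being handed the path and a timestamp at the moment a file is closed.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class WarcReadySketch {

    /** Hypothetical event: path of a finished WARC file plus its completion time. */
    public static final class WarcFileReady {
        public final String path;
        public final Instant createdOn;

        public WarcFileReady(String path, Instant createdOn) {
            this.path = path;
            this.createdOn = createdOn;
        }
    }

    /** Hypothetical callback, invoked once per completely written file. */
    public interface WarcReadyListener {
        void onReady(WarcFileReady event);
    }

    /** Stand-in for the writing bolt (assumption): tracks the open file only. */
    public static final class SketchWarcWriter {
        private final List<WarcReadyListener> listeners = new ArrayList<>();
        private String currentPath; // null while no file is open

        public void addListener(WarcReadyListener listener) {
            listeners.add(listener);
        }

        public void open(String path) {
            currentPath = path;
        }

        /** Closes the open file, if any, and notifies every listener once. */
        public void closeCurrent() {
            if (currentPath == null) {
                return;
            }
            WarcFileReady event = new WarcFileReady(currentPath, Instant.now());
            currentPath = null;
            for (WarcReadyListener listener : listeners) {
                listener.onReady(event);
            }
        }
    }

    public static void main(String[] args) {
        SketchWarcWriter writer = new SketchWarcWriter();
        writer.addListener(e ->
                System.out.println(e.path + " --was_created_on--> " + e.createdOn));
        writer.open("s3://path/to/file/WARC_file_0815.warc.gz");
        writer.closeCurrent(); // the listener fires here, after the file is closed
    }
}
```

In this sketch the event is keyed by file rather than by URL, which is
exactly the part we are unsure the current topology can express.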

GitHub link: 
https://github.com/apache/stormcrawler/discussions/1566#discussioncomment-13495258

