GitHub user sebastian-nagel added a comment to the discussion: WARCHdfsBolt 
forwarding WARC file path to StatusUpdaterBolt

Hi @michaeldinzinger, this overlaps with #567, and I recently started exploring 
potential ways to implement a CDX indexer:

1. the first idea was to send a tuple with the URL, metadata, WARC file 
name and WARC record offsets forward in the topology. This seems more elegant 
because it's up to the user to define which bolt consumes the WARC record 
location. However, it looks challenging to implement because the method 
[execute(tuple) in 
AbstractHdfsBolt](https://github.com/apache/storm/blob/bf29d1cc9914d4fe596b5e65532322e3dfd3e4ff/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/AbstractHdfsBolt.java#L129)
 is final. I haven't yet explored any "dirty" tricks, such as holding a 
reference to the collector in the writer. It seems the HdfsBolt is designed 
to be a dead end (although there is nothing about that in the [storm-hdfs 
docs](https://storm.apache.org/releases/2.3.0/storm-hdfs.html)).
2. the alternative would be to write the CDX file along with the WARC file. 
This is a viable use case of the HdfsBolt, cf. apache/storm#1044.
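
Either way, the same piece of information has to be captured per record. A minimal sketch of what that could look like, assuming a plain value class: `WarcRecordLocation`, its fields and `toIndexLine()` are hypothetical names, not part of StormCrawler or storm-hdfs, the record length is an added assumption, and the tab-separated layout is only a placeholder, not the actual CDX format.

```java
// Hypothetical value class: the names are illustrative and not part of
// StormCrawler or storm-hdfs.
final class WarcRecordLocation {
    final String url;
    final String warcFileName;
    final long offset; // byte offset of the record within the WARC file
    final long length; // record length in bytes (assumed to be available)

    WarcRecordLocation(String url, String warcFileName, long offset, long length) {
        this.url = url;
        this.warcFileName = warcFileName;
        this.offset = offset;
        this.length = length;
    }

    // Placeholder tab-separated layout, NOT the actual CDX format.
    String toIndexLine() {
        return String.join("\t", url, warcFileName,
                Long.toString(offset), Long.toString(length));
    }
}
```

In variant 1 these fields would travel as tuple values to whichever bolt consumes them; in variant 2 the writer would append one such line per record to a file next to the WARC file.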

Given that there is more general interest, I'd continue to explore variant 1, 
but I cannot promise when or whether this will be successful. Any 
suggestions or help are welcome!

GitHub link: 
https://github.com/apache/stormcrawler/discussions/1566#discussioncomment-13495256

----
This is an automatically sent email for dev@stormcrawler.apache.org.
To unsubscribe, please send an email to: dev-unsubscr...@stormcrawler.apache.org
