GitHub user sebastian-nagel added a comment to the discussion: WARCHdfsBolt forwarding WARC file path to StatusUpdaterBolt
Hi @michaeldinzinger,

this overlaps with #567, and I recently started to explore potential ways to implement a CDX indexer:

1. The first idea was to send a tuple with the URL, metadata, WARC file name and WARC record offsets forward in the topology. This seems more elegant because it is up to the user to define which bolt consumes the WARC record location. However, it looks challenging to implement because the method [execute(tuple) in AbstractHdfsBolt](https://github.com/apache/storm/blob/bf29d1cc9914d4fe596b5e65532322e3dfd3e4ff/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/AbstractHdfsBolt.java#L129) is final. I haven't yet explored "dirty" tricks, such as holding a reference to the collector in the writer. It seems the HdfsBolt is designed to be a dead end (however, there is nothing about that in the [storm-hdfs docs](https://storm.apache.org/releases/2.3.0/storm-hdfs.html)).
2. The alternative would be to write the CDX file along with the WARC file. This is a viable use case of the HdfsBolt, cf. apache/storm#1044.

Given that there is a more general interest, I'd continue to explore variant 1, but I cannot promise when and whether this will be successful. Any suggestions or help are welcome!

GitHub link: https://github.com/apache/stormcrawler/discussions/1566#discussioncomment-13495256

----
This is an automatically sent email for dev@stormcrawler.apache.org.
To unsubscribe, please send an email to: dev-unsubscr...@stormcrawler.apache.org
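
To make the "dirty" trick from variant 1 more concrete, here is a minimal, self-contained sketch of the idea. It does not use the Storm API: `Collector`, `Writer`, `DeadEndBolt`, `LocationEmittingWriter`, `ForwardingBolt` and `makeWriter` are hypothetical stand-ins invented for illustration, and the file name `example.warc.gz` is a placeholder. The sketch assumes that the bolt creates its own writer and that offsets are plain byte counts; it ignores tuple anchoring, acking, compression and file rotation, all of which a real implementation would have to deal with.

```java
import java.util.ArrayList;
import java.util.List;

public class ForwardingSketch {

    /** Hypothetical stand-in for an output collector: it only records what is emitted. */
    static class Collector {
        final List<List<Object>> emitted = new ArrayList<>();

        void emit(List<Object> values) {
            emitted.add(values);
        }
    }

    /** Hypothetical writer interface: appends one record and returns its start offset. */
    interface Writer {
        long write(String url, byte[] record);
    }

    /** Mimics the constraint: execute() is final, so a subclass cannot emit from it. */
    abstract static class DeadEndBolt {
        protected Collector collector;
        protected Writer writer;

        void prepare(Collector collector) {
            this.collector = collector;
            this.writer = makeWriter();
        }

        /** The only extension point left to subclasses in this sketch. */
        protected abstract Writer makeWriter();

        final void execute(String url, byte[] record) {
            writer.write(url, record); // writes only, nothing is forwarded here
        }
    }

    /** The trick: the writer keeps a reference to the collector and emits the record location. */
    static class LocationEmittingWriter implements Writer {
        private final Collector collector;
        private final String warcFileName;
        private long position = 0; // bytes written so far (uncompressed, for simplicity)

        LocationEmittingWriter(Collector collector, String warcFileName) {
            this.collector = collector;
            this.warcFileName = warcFileName;
        }

        @Override
        public long write(String url, byte[] record) {
            long offset = position;
            position += record.length;
            // Forward URL, WARC file name, record offset and record length.
            collector.emit(List.<Object>of(url, warcFileName, offset, (long) record.length));
            return offset;
        }
    }

    /** Subclass that cannot touch execute() but can hand the collector to its writer. */
    static class ForwardingBolt extends DeadEndBolt {
        @Override
        protected Writer makeWriter() {
            return new LocationEmittingWriter(collector, "example.warc.gz");
        }
    }

    public static void main(String[] args) {
        Collector collector = new Collector();
        ForwardingBolt bolt = new ForwardingBolt();
        bolt.prepare(collector);
        bolt.execute("https://example.org/a", new byte[100]);
        bolt.execute("https://example.org/b", new byte[50]);
        System.out.println(collector.emitted);
    }
}
```

Whether something along these lines is feasible with the actual HdfsBolt is exactly the open question; the sketch only shows why a final `execute` does not by itself rule out forwarding the record location.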