So you have a single Kafka topic with a very high retention period (which determines how much data the topic stores), and you want to process all the historical data first using Camus and then start the streaming process?
The challenge is that Camus and Spark are two different consumers of the Kafka topic, and each maintains its consumed offsets in its own way: Camus stores offsets in HDFS, while the Spark consumer stores them in ZooKeeper. As I understand it, you need something that identifies the point up to which Camus has pulled (for each partition of the topic), so you can start the Spark receiver from there?

Dib

On Wed, Sep 24, 2014 at 2:29 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Hi,
>
> I have a setup (in mind) where data is written to Kafka and this data is
> persisted in HDFS (e.g., using Camus) so that I have an all-time archive of
> all stream data ever received. Now I want to process that all-time archive
> and, when I am done with that, continue with the live stream, using Spark
> Streaming. (In a perfect world, Kafka would have infinite storage and I
> would always use the Kafka receiver, starting from offset 0.)
> Does anyone have an idea how to realize such a setup? Would I write a
> custom receiver that first reads the HDFS file and then connects to Kafka?
> Is there an existing solution for that use case?
>
> Thanks
> Tobias
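To make the offset hand-off concrete, here is a minimal Scala sketch of the idea in Dib's reply: read the per-partition offsets that Camus recorded in HDFS and start Spark's direct (receiver-less) Kafka stream from exactly those offsets. Note the assumptions: the direct API (KafkaUtils.createDirectStream with an explicit fromOffsets map) only exists in Spark 1.3+, the broker address and history path are placeholders, and readCamusOffsets is a hypothetical helper, since Camus records its offsets in internal sequence files under its execution-history path and the parsing is left abstract here.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ResumeFromCamus {
  // Hypothetical helper: parse the per-partition offsets that Camus wrote
  // to its HDFS execution-history directory. The on-disk format is
  // Camus-internal, so the parsing is left abstract in this sketch.
  def readCamusOffsets(historyPath: String): Map[TopicAndPartition, Long] = ???

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("resume-from-camus"), Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val fromOffsets = readCamusOffsets("/camus/exec/history")

    // Direct (receiver-less) Kafka stream that starts each partition
    // exactly where Camus left off. Requires Spark 1.3+.
    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD { rdd =>
      // Apply the same processing here as in the batch phase.
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```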
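As for the original question: rather than a custom receiver, one pattern is two phases in a single application, sketched below under the same assumptions. Run an ordinary batch job over the Camus output in HDFS first, and only start the StreamingContext on the live topic once it completes. The paths, topic name, ZooKeeper address, and process function are placeholders. Without the offset hand-off above, records arriving between the end of the batch phase and the start of the stream could be missed or duplicated, so in practice you would combine the two sketches.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ArchiveThenStream {
  // Placeholder for whatever per-record logic you actually need.
  def process(records: RDD[String]): Unit = records.foreach(println)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("archive-then-stream"))

    // Phase 1: batch over the all-time archive that Camus wrote to HDFS.
    // This is a plain RDD job and runs to completion before phase 2 begins.
    process(sc.textFile("hdfs:///camus/topics/mytopic/hourly/*/*/*/*"))

    // Phase 2: switch to the live stream via the receiver-based Kafka API,
    // which tracks its offsets in ZooKeeper for the given consumer group.
    val ssc = new StreamingContext(sc, Seconds(10))
    val live = KafkaUtils.createStream(
      ssc, "zkhost:2181", "archive-then-stream", Map("mytopic" -> 1))
    live.foreachRDD(rdd => process(rdd.map(_._2)))

    ssc.start()
    ssc.awaitTermination()
  }
}
```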