So you have a single Kafka topic with a very high retention period
(which determines how much data the topic keeps) and you want to
process all the historical data first using Camus and then start the
streaming process?

The challenge is that Camus and Spark are two different consumers of
the Kafka topic, and each maintains its consumed offsets in a different
way: Camus stores offsets in HDFS, while the Spark consumer stores them
in ZooKeeper. As I understand it, you need something that tells you up
to which offset Camus has pulled (for each partition of the topic) so
that you can start the Spark receiver from there?
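
If that is the case, here is a minimal sketch of the handoff, assuming
you can extract the last offset Camus committed for each partition from
its files in HDFS (camusOffsetsFromHdfs below is a hypothetical stub,
and the broker list / history path are placeholders), and assuming a
consumer API that accepts explicit starting offsets, such as the direct
Kafka stream in newer spark-streaming-kafka versions:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Hypothetical helper: parse the per-partition offsets Camus last
    // committed out of its history files in HDFS. The file layout
    // depends on your Camus version, so the parsing is left as a stub.
    def camusOffsetsFromHdfs(historyPath: String): Map[TopicAndPartition, Long] = ???

    val conf = new SparkConf().setAppName("camus-to-spark-handoff")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val fromOffsets = camusOffsetsFromHdfs("/camus/exec/history")

    // Start consuming exactly where Camus stopped, per partition.
    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

(If your Spark version predates the direct stream, the same idea
applies with a custom receiver that seeds Kafka's SimpleConsumer with
those offsets.)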


Dib

On Wed, Sep 24, 2014 at 2:29 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:

> Hi,
>
> I have a setup (in mind) where data is written to Kafka and this data is
> persisted in HDFS (e.g., using Camus) so that I have an all-time archive of
> all stream data ever received. Now I want to process that all-time archive
> and when I am done with that, continue with the live stream, using Spark
> Streaming. (In a perfect world, Kafka would have infinite storage and I
> would always use the Kafka receiver, starting from offset 0.)
> Does anyone have an idea how to realize such a setup? Would I write a
> custom receiver that first reads the HDFS file and then connects to Kafka?
> Is there an existing solution for that use case?
>
> Thanks
> Tobias
>
>
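
For the archive-first-then-stream flow you describe, the driver could
run a batch pass over the HDFS archive and only then start the
streaming context, reusing one processing function for both. A rough
sketch (the topic name, paths, and ZooKeeper address are placeholders;
it assumes the Camus output is readable as text files, and avoiding
overlap between the archive and the stream still needs the offset
handoff above):

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("archive-then-stream")
    val ssc  = new StreamingContext(conf, Seconds(10))
    val sc   = ssc.sparkContext

    // The same processing logic runs over archived and live data.
    def process(records: RDD[String]): Unit = { /* your logic here */ }

    // Phase 1: batch over the all-time archive Camus wrote to HDFS.
    process(sc.textFile("hdfs:///camus/topics/mytopic/*"))

    // Phase 2: attach the same logic to the live Kafka stream.
    val stream = KafkaUtils.createStream(ssc, "zk1:2181",
      "archive-then-stream", Map("mytopic" -> 1))
    stream.map(_._2).foreachRDD(rdd => process(rdd))
    ssc.start()
    ssc.awaitTermination()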
