If a Spark streaming job stops at 12:01 and I resume it at 12:02, will it
still consume the data that were produced to Kafka at 12:01? Or will it just
start consuming from the current time?
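
For concreteness, here is a minimal sketch of the setup I have in mind (the
broker list, the topic name "logs", and the checkpoint directory are
placeholders, not our real values). With StreamingContext.getOrCreate the
restarted driver rebuilds the context from the checkpoint, so the direct
stream should resume from the offsets recorded there rather than from the
current time, as long as Kafka still retains that data:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ResumableKafkaJob {
  // Placeholder checkpoint location; ours lives on HDFS.
  val checkpointDir = "hdfs:///tmp/spark-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("resumable-kafka-job")
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092,broker2:9092",
      // Only used when there is no checkpoint yet; on restart the
      // checkpointed offsets take precedence.
      "auto.offset.reset" -> "smallest")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("logs"))

    stream.foreachRDD { rdd =>
      // Stand-in for the real per-batch save logic.
      println(s"batch size: ${rdd.count()}")
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Clean start: calls createContext(). After a kill/restart: rebuilds the
    // context, including unfinished offset ranges, from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
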


On Tue, May 19, 2015 at 10:58 AM, Cody Koeninger <c...@koeninger.org> wrote:

> Have you read
> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md ?
>
> 1.  There's nothing preventing that.
>
> 2. Checkpointing will give you at-least-once semantics, provided you have
> sufficient Kafka retention.  Be aware that checkpoints aren't recoverable
> if you upgrade code.
>
> On Tue, May 19, 2015 at 12:42 PM, Bill Jay <bill.jaypeter...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I am currently using Spark streaming to consume and save logs every hour
>> in our production pipeline. The current setup is a crontab job that checks
>> every minute whether the job is still running and, if not, resubmits a
>> Spark streaming job. I am using the direct approach for the Kafka
>> consumer. I have two questions:
>>
>> 1. In the direct approach, no offset is stored in ZooKeeper and no group
>> id is specified. Can two consumers (one being Spark streaming and the other
>> a Kafka console consumer from the Kafka package) read from the same topic
>> on the brokers together (I would like both of them to get all messages,
>> i.e. publish-subscribe mode)? What about two Spark streaming jobs reading
>> from the same topic?
>>
>> 2. How can we avoid data loss if a Spark job is killed? Does checkpointing
>> serve this purpose? The default behavior of Spark streaming is to read the
>> latest logs. However, if a job is killed, can the new job resume from where
>> the old one left off, to avoid losing logs?
>>
>> Thanks!
>>
>> Bill
>>
>
>
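
Regarding Bill's first question: with the direct approach each application
computes and tracks its own offset ranges (nothing is committed to
ZooKeeper), so two Spark streaming jobs, or a Spark job plus a console
consumer, should each see the full topic. A minimal sketch, assuming a
hypothetical broker address and a topic named "logs":

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// Each application builds its own direct stream; offsets are tracked per
// application (in the stream / its checkpoint), not in ZooKeeper, so every
// application reads the whole topic independently of the others.
object IndependentConsumer {
  def main(args: Array[String]): Unit = {
    val appName = args.headOption.getOrElse("independent-consumer")
    val ssc = new StreamingContext(new SparkConf().setAppName(appName), Seconds(60))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("logs"))

    stream.foreachRDD { rdd =>
      // The offset ranges covered by each batch are visible per application.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach(r =>
        println(s"$appName ${r.topic}-${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Running two instances of this with different app names should print the same
offset ranges in each, which is effectively publish-subscribe behavior.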
