gwen, i thought about it a little more and i feel pretty confident i can make it so that it's deterministic in case of node failure. will push that change out after the holidays.
On Mon, Dec 15, 2014 at 12:03 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
> hey gwen,
>
> no immediate plans to contribute it to spark but of course we are open to
> this. given spark's pull request backlog my suspicion is that the spark
> community prefers a user library at this point.
>
> if you lose a node the task will restart. and since each task reads until
> the end of a kafka partition, which is somewhat of a moving target, the
> resulting data will not be the same (it will include whatever was written
> to the partition in the meantime). i am not sure if there is an elegant way
> to implement this where the data would be the same upon a task restart.
>
> if you need the data read to be the same upon retry, this can be done with
> a transformation on the rdd in some cases. for example, if you need data
> exactly up to midnight, you can include a timestamp in the data, start
> the KafkaRDD sometime just after midnight, and then filter to remove any
> data with a timestamp after midnight. now the filtered rdd will be the same
> even if there is a node failure.
>
> On Sun, Dec 14, 2014 at 8:27 PM, Gwen Shapira <gshap...@cloudera.com>
> wrote:
>>
>> Thank you, this is really cool! Are you planning on contributing this to
>> Spark?
>>
>> Another question: What's the behavior if I lose a node while my Spark
>> App is running? Will the RDD recovery process get the exact same data
>> from Kafka as the original? Even if we wrote additional data to Kafka
>> in the meantime?
>>
>> Gwen
>>
>> On Sun, Dec 14, 2014 at 5:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
>> > hello all,
>> > we at tresata wrote a library that provides batch integration between
>> > spark and kafka. it supports:
>> > * distributed write of rdd to kafka
>> > * distributed read of rdd from kafka
>> >
>> > our main use cases are (in lambda architecture speak):
>> > * periodic appends to the immutable master dataset on hdfs from kafka
>> >   using spark
>> > * making non-streaming data available in kafka with periodic data drops
>> >   from hdfs using spark. this is to facilitate merging the speed and
>> >   batch layers in spark-streaming
>> > * distributed writes from spark-streaming
>> >
>> > see here:
>> > https://github.com/tresata/spark-kafka
>> >
>> > best,
>> > koert
>>
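
to make the midnight example above a bit more concrete, here is a rough sketch in scala. it assumes the kafka read has already produced an rdd of (key, value) byte-array pairs (the actual spark-kafka call is not shown), and the extractTimestamp helper and the encoding it assumes are made up purely for illustration:

    import org.apache.spark.rdd.RDD

    object MidnightFilter {
      // hypothetical helper: pull an epoch-millis timestamp out of a record's value bytes.
      // how the timestamp is encoded is application-specific; a leading comma-separated
      // field is assumed here only for the sake of the example.
      def extractTimestamp(value: Array[Byte]): Long =
        new String(value, "UTF-8").split(",")(0).toLong

      // keep only records with a timestamp up to and including the cutoff (e.g. midnight,
      // as epoch millis). because the cutoff is fixed when the job is defined, recomputing
      // the filtered rdd after a task restart yields the same records, even if the
      // underlying kafka read happens to pick up extra data written in the meantime.
      def upTo(cutoffMillis: Long)(records: RDD[(Array[Byte], Array[Byte])]): RDD[(Array[Byte], Array[Byte])] =
        records.filter { case (_, value) => extractTimestamp(value) <= cutoffMillis }
    }

    // usage (kafkaRdd obtained from the spark-kafka library, not shown):
    // val stable = MidnightFilter.upTo(midnightMillis)(kafkaRdd)

the point is just that the cutoff is fixed up front, so a retried task can only ever contribute the same filtered output.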