gwen,
i thought about it a little more and i feel pretty confident i can make it
so that it's deterministic in case of node failure. will push that change
out after the holidays.
On Mon, Dec 15, 2014 at 12:03 AM, Koert Kuipers wrote:
>
> hey gwen,
>
> no immediate plans to contribute it to spark but
hey joe,
glad to hear you think it's a good use case for SimpleConsumer.
not sure if i understand your question. are you asking why we make the
offsets available in the rdd? we have a daily-partitioned dataset on hdfs,
and processes that run at night that do the following: 1) read the last
daily partition
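to make the use case concrete, here is a rough sketch of what such a
nightly job could look like. kafkaRDD, its arguments, and the
(partition, offset, message) element type are made-up names for
illustration, not the actual api of the library:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object NightlyAppend {
  // stub standing in for the batch read; assumed to yield
  // (kafka partition, offset, message) tuples
  def kafkaRDD(sc: SparkContext, topic: String,
               startOffsets: Map[Int, Long]): RDD[(Int, Long, Array[Byte])] = ???

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("nightly-append"))

    // offsets saved alongside yesterday's daily partition
    val startOffsets = sc.textFile("hdfs:///master/2014-12-14/_offsets")
      .map(_.split("\t")).map(a => (a(0).toInt, a(1).toLong)).collect().toMap

    val kafka = kafkaRDD(sc, "events", startOffsets)
    kafka.cache()

    // append today's messages as a new daily partition
    kafka.map { case (_, _, msg) => new String(msg, "UTF-8") }
      .saveAsTextFile("hdfs:///master/2014-12-15")

    // persist where we stopped so tomorrow's run knows where to resume;
    // this is why having the offsets in the rdd itself is convenient
    kafka.map { case (p, o, _) => (p, o + 1) }.reduceByKey(math.max)
      .map { case (p, o) => s"$p\t$o" }.coalesce(1)
      .saveAsTextFile("hdfs:///master/2014-12-15/_offsets")
  }
}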
hey gwen,
no immediate plans to contribute it to spark, but of course we are open to
this. given spark's pull request backlog my suspicion is that the spark
community prefers a user library at this point.
if you lose a node the task will restart. and since each task reads until
the current end of its kafka partition, a restarted task can pick up
messages that arrived after the first attempt, so the read is not
deterministic today.
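for what it's worth, a minimal sketch of the determinism fix hinted at
further up in the thread: resolve the stop offsets once on the driver,
before any task runs, so a retried task re-reads exactly the same range.
fetchLatestOffsets and readPartition are placeholders here, not real api:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object DeterministicRead {
  // placeholders for the SimpleConsumer plumbing
  def fetchLatestOffsets(brokers: String, topic: String): Map[Int, Long] = ???
  def readPartition(topic: String, partition: Int,
                    start: Long, stop: Long): Iterator[Array[Byte]] = ???

  def read(sc: SparkContext, brokers: String, topic: String,
           startOffsets: Map[Int, Long]): RDD[Array[Byte]] = {
    // fix "the end of the partition" once, at job launch
    val stopOffsets = fetchLatestOffsets(brokers, topic)
    sc.parallelize(startOffsets.toSeq, startOffsets.size).flatMap { case (p, start) =>
      // a task retried after a node failure re-reads exactly [start, stop),
      // yielding the same messages as the first attempt
      readPartition(topic, p, start, stopOffsets(p))
    }
  }
}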
I like the idea of the KafkaRDD and a Spark partition/split per Kafka
partition. That is a good use of the SimpleConsumer.
I can see a few different strategies for the commitOffsets and
partitionOwnership.
What use case are you committing your offsets for?
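A bare-bones sketch of the partition-per-partition shape Joe describes,
with the fetch stubbed out; none of these names are the library's actual
api:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// one spark partition per kafka partition, carrying its offset range
case class KafkaSplit(index: Int, start: Long, stop: Long) extends Partition

class SketchKafkaRDD(sc: SparkContext, topic: String,
                     offsets: Map[Int, (Long, Long)])
    extends RDD[Array[Byte]](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    offsets.toSeq.sortBy(_._1).map { case (p, (start, stop)) =>
      KafkaSplit(p, start, stop): Partition
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Array[Byte]] = {
    val s = split.asInstanceOf[KafkaSplit]
    // each task would open a SimpleConsumer against the leader for its
    // partition and fetch messages in [start, stop)
    fetch(topic, s.index, s.start, s.stop)
  }

  // stub standing in for the actual SimpleConsumer fetch loop
  private def fetch(topic: String, partition: Int,
                    start: Long, stop: Long): Iterator[Array[Byte]] = ???
}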
hello all,
we at tresata wrote a library to provide batch integration between
spark and kafka. it supports:
* distributed write of rdd to kafka (see the sketch at the end of this mail)
* distributed read of rdd from kafka
our main use cases are (in lambda architecture speak):
* periodic appends to the immutable master dataset on hdfs
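to give a flavor of the write side mentioned above: the generic pattern is
foreachPartition with one producer per spark partition, using the plain
kafka 0.8 producer api. this is a sketch of the pattern, not necessarily
how the tresata library implements it:

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.apache.spark.rdd.RDD

object KafkaWrite {
  // distributed write: every spark partition opens its own producer and
  // pushes its records to kafka, so the whole cluster writes in parallel
  def writeToKafka(rdd: RDD[String], topic: String, brokerList: String): Unit =
    rdd.foreachPartition { records =>
      val props = new Properties()
      props.put("metadata.broker.list", brokerList)
      props.put("serializer.class", "kafka.serializer.StringEncoder")
      val producer = new Producer[String, String](new ProducerConfig(props))
      try records.foreach(r => producer.send(new KeyedMessage[String, String](topic, r)))
      finally producer.close()
    }
}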