On 28 Mar 2014, at 01:37, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> I see! As I said in the other thread, no one reported these issues until now! 
> A good and not-too-hard fix is to add the ability to limit the data rate at 
> which the receivers receive. I have opened a JIRA. 
> 
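A minimal sketch of what such a rate limit might look like: a buffer that caps how many messages the receiver accepts per batch interval and refills its budget when the batch is handed off. The `maxPerInterval` name and the token-count shape are illustrative assumptions, not the actual JIRA design.

```scala
// Sketch: cap how many messages a receiver accepts per interval.
// `maxPerInterval` and this token-count approach are illustrative
// assumptions, not the actual Spark JIRA design.
class RateLimitedBuffer(maxPerInterval: Int) {
  private var tokens = maxPerInterval
  private val buffer = scala.collection.mutable.ArrayBuffer.empty[String]

  // Called by the receiver thread for each incoming Kafka message.
  // Returns false when the per-interval budget is exhausted.
  def offer(msg: String): Boolean = {
    if (tokens <= 0) false
    else { tokens -= 1; buffer += msg; true }
  }

  // Called once per batch interval: hand off the batch, refill tokens.
  def drain(): List[String] = {
    val batch = buffer.toList
    buffer.clear()
    tokens = maxPerInterval
    batch
  }
}

object RateLimitDemo extends App {
  val buf = new RateLimitedBuffer(maxPerInterval = 3)
  val accepted = (1 to 5).map(i => buf.offer(s"msg-$i"))
  println(accepted)           // Vector(true, true, true, false, false)
  println(buf.drain().size)   // 3
}
```

With a cap like this, a consumer restarting far behind the latest offset pulls a bounded amount of data per batch instead of everything at once, which is exactly the OOM scenario described further down in this thread.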

Yes, and actually there should be another JIRA for this:
https://github.com/apache/incubator-spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaInputDStream.scala#L106

This code just erases the offsets from ZooKeeper. But auto.offset.reset has a 
different meaning:

What to do when there is no initial offset in ZooKeeper, or if an offset is out 
of range:
* smallest: automatically reset the offset to the smallest offset
* largest: automatically reset the offset to the largest offset
* anything else: throw an exception to the consumer

If this is set to largest, the consumer may lose some messages when the number 
of partitions, for the topics it subscribes to, changes on the broker. To 
prevent data loss during partition addition, set auto.offset.reset to smallest.
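Those semantics can be written down as a small function (a self-contained sketch, not Kafka's actual code): a committed, in-range offset always wins, and the reset policy only applies when the offset is missing or out of range.

```scala
// Sketch of auto.offset.reset semantics (not Kafka's actual code):
// a committed, in-range offset always wins; the reset policy only
// kicks in when the offset is missing or out of range.
def resolveStartOffset(committed: Option[Long],
                       smallest: Long,
                       largest: Long,
                       autoOffsetReset: String): Long =
  committed match {
    case Some(off) if off >= smallest && off <= largest => off  // resume
    case _ => autoOffsetReset match {
      case "smallest" => smallest
      case "largest"  => largest
      case other      => throw new IllegalArgumentException(
        s"no valid offset and auto.offset.reset=$other")
    }
  }
```

A restarted consumer with a valid committed offset should hit the first case and simply resume; deleting the offsets from ZooKeeper, as the DStream code above does, forces every restart into the reset branch.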

I will stress it: WHEN THERE IS NO INITIAL OFFSET, OR IT IS OUT OF RANGE.
Not "hey! I'll just reset your position because you restarted the app".
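For reference, this is roughly how one would set it in the high-level consumer configuration (config keys as in Kafka 0.8; the group id, ZooKeeper address, and the other values here are just examples):

```scala
import java.util.Properties

// Standard Kafka 0.8 high-level consumer settings; the group id and
// ZooKeeper address are example values, not anything from this thread.
val props = new Properties()
props.put("zookeeper.connect", "localhost:2181")
props.put("group.id", "my-streaming-app")   // example group id
// Only consulted when there is NO committed offset, or it is out of
// range -- it should not apply on a normal restart:
props.put("auto.offset.reset", "smallest")  // avoid losing messages
props.put("auto.commit.enable", "true")
```

Setting smallest trades possible duplicates for no data loss, which is usually the right default for a streaming job that must not silently skip messages.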

> TD
> 
> 
> On Thu, Mar 27, 2014 at 3:28 PM, Evgeny Shishkin <itparan...@gmail.com> wrote:
> 
> On 28 Mar 2014, at 01:13, Tathagata Das <tathagata.das1...@gmail.com> wrote:
> 
>> Seems like the configuration of the Spark worker is not right. Either the 
>> worker has not been given enough memory or the allocation of the memory to 
>> the RDD storage needs to be fixed. If configured correctly, the Spark 
>> workers should not get OOMs.
> 
> 
> Yes, it is easy to start with the latest offsets, get a steady configuration, 
> and everything is nice.
> 
> Then your machine fails. And you stop receiving anything from Kafka.
> 
> Then you notice this and restart your app, hoping it will continue from the 
> offsets in ZooKeeper.
> BUT NO
> YOUR DEFAULT STREAM CONSUMERS JUST ERASED OFFSETS FROM ZOOKEEPER
> 
> After we fixed erasing offsets, we start from Some Offsets in the past.
> And within a batch duration we can't limit how many messages we get from Kafka.
> AND HERE WE OOM
> 
> And it's just a pain. Complete pain.
> 
> And remember, only some machines consume. Usually two or three. Because of 
> the broken high-level consumer in Kafka.
> 
