Re: Data loss - Spark streaming and network receiver

Wei Liu Mon, 18 Aug 2014 22:23:24 -0700

BTW, we heard that it is not so easy to setup/admin kafka on AWS, if any of
you had good or bad experiences, do you mind sharing them with us? Thanks.
I knew this is an off topic for spark user group, I wouldn't mind if you
just reply to my email address.  Thanks in advance.


Wei


On Mon, Aug 18, 2014 at 10:18 PM, Wei Liu <[email protected]>
wrote:

> Thank you all for responding to my question. I am pleasantly surprised by
> this many prompt responses I got. It shows the strength of the spark
> community.
>
> Kafka is still an option for us, I will check out the link provided by
> Dibyendu.
>
> Meanwhile if someone out there already figured this out with Kinesis,
> please keep your suggestion coming. Thanks.
>
> Thanks,
> Wei
>
>
> On Mon, Aug 18, 2014 at 9:31 PM, Dibyendu Bhattacharya <
> [email protected]> wrote:
>
>> Dear All,
>>
>> Recently I have written a Spark Kafka Consumer to solve this problem.
>> Even we have seen issues with KafkaUtils which is using Highlevel Kafka
>> Consumer and consumer code has no handle to offset management.
>>
>> The below code solves this problem, and this has is being tested in our
>> Spark Cluster and this working fine as of now.
>>
>> https://github.com/dibbhatt/kafka-spark-consumer
>>
>> This is Low Level Kafka Consumer using Kafka Simple Consumer API.
>>
>> Please have a look at it and let me know your opinion. This has been
>> written to eliminate the Data loss by committing the offset after it is
>> written to BM. Also existing HighLevel KafkaUtils does not have any feature
>> to control Data Flow, and is gives Out Of Memory error is there is too much
>> backlogs in Kafka. This consumer solves this problem as well.  And this
>> code has been modified from earlier Storm Kafka consumer code and it has
>> lot of other features like recovery from Kafka node failures, ZK failures,
>> recover from Offset errors etc.
>>
>> Regards,
>> Dibyendu
>>
>>
>> On Tue, Aug 19, 2014 at 9:49 AM, Shao, Saisai <[email protected]>
>> wrote:
>>
>>>  I think Currently Spark Streaming lack a data acknowledging mechanism
>>> when data is stored and replicated in BlockManager, so potentially data
>>> will be lost even pulled into Kafka, say if data is stored just in
>>> BlockGenerator not BM, while in the meantime Kafka itself commit the
>>> consumer offset, also at this point node is failed, from Kafka’s point this
>>> part of data is feed into Spark Streaming but actually this data is not yet
>>> processed, so potentially this part of data will never be processed again,
>>> unless you read the whole partition again.
>>>
>>>
>>>
>>> To solve this potential data loss problem, Spark Streaming needs to
>>> offer a data acknowledging mechanism, so custom Receiver can use this
>>> acknowledgement to do checkpoint or recovery, like Storm.
>>>
>>>
>>>
>>> Besides, driver failure is another story need to be carefully
>>> considered. So currently it is hard to make sure no data loss in Spark
>>> Streaming, still need to improve at some points J.
>>>
>>>
>>>
>>> Thanks
>>>
>>> Jerry
>>>
>>>
>>>
>>> *From:* Tobias Pfeiffer [mailto:[email protected]]
>>> *Sent:* Tuesday, August 19, 2014 10:47 AM
>>> *To:* Wei Liu
>>> *Cc:* user
>>> *Subject:* Re: Data loss - Spark streaming and network receiver
>>>
>>>
>>>
>>> Hi Wei,
>>>
>>>
>>>
>>> On Tue, Aug 19, 2014 at 10:18 AM, Wei Liu <[email protected]>
>>> wrote:
>>>
>>> Since our application cannot tolerate losing customer data, I am
>>> wondering what is the best way for us to address this issue.
>>>
>>> 1) We are thinking writing application specific logic to address the
>>> data loss. To us, the problem seems to be caused by that Kinesis receivers
>>> advanced their checkpoint before we know for sure the data is replicated.
>>> For example, we can do another checkpoint ourselves to remember the kinesis
>>> sequence number for data that has been processed by spark streaming. When
>>> Kinesis receiver is restarted due to worker failures, we restarted it from
>>> the checkpoint we tracked.
>>>
>>>
>>>
>>> This sounds pretty much to me like the way Kafka does it. So, I am not
>>> saying that the stock KafkaReceiver does what you want (it may or may not),
>>> but it should be possible to update the "offset" (corresponds to "sequence
>>> number") in Zookeeper only after data has been replicated successfully. I
>>> guess "replace Kinesis by Kafka" is not in option for you, but you may
>>> consider pulling Kinesis data into Kafka before processing with Spark?
>>>
>>>
>>>
>>> Tobias
>>>
>>>
>>>
>>
>>
>

Re: Data loss - Spark streaming and network receiver

Reply via email to