Re: Can spark support exactly once based kafka ? Due to these following question?

Michal Šenkýř Sun, 04 Dec 2016 23:59:53 -0800

Hello John,

1. If a task completes the operation, it will notify the driver. The driver may not receive the message due to the network, and think the task is still running. Then the child stage won't be scheduled?

Spark's fault tolerance policy is: if there is a problem in processing a task or an executor is lost, run the task (and any dependent tasks) again. Spark attempts to minimize the number of tasks it has to recompute, so usually only a small part of the data is recomputed.

So in your case, the driver simply schedules the task on another executor and continues to the next stage when it receives the data.
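
To make the retry idea concrete, here is a toy sketch in plain Python. It is not Spark's scheduler: the names run_with_retries, ExecutorLost and double_partition are made up for illustration, and the limit of 4 attempts only mirrors what I believe is the default of spark.task.maxFailures.

```python
class ExecutorLost(Exception):
    """Raised when the (hypothetical) executor cannot finish a task."""


def run_with_retries(task, executors, max_attempts=4):
    """Run `task` on the executors in turn until one attempt succeeds.

    Returns the task result and the list of executors that were tried.
    """
    attempts = []
    for attempt in range(max_attempts):
        executor = executors[attempt % len(executors)]
        attempts.append(executor)
        try:
            return task(executor), attempts
        except ExecutorLost:
            continue  # reschedule the same task on the next executor
    raise RuntimeError(f"task failed after {max_attempts} attempts")


def double_partition(executor):
    # Hypothetical task: "exec-1" is treated as lost, the others work.
    if executor == "exec-1":
        raise ExecutorLost(executor)
    return [x * 2 for x in [1, 2, 3]]


result, attempts = run_with_retries(double_partition, ["exec-1", "exec-2"])
```

Only the failed task is run again here; nothing else is recomputed, which is the point of the policy described above.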

2. How does Spark guarantee the downstream task can receive the shuffle data completely? In fact, I can't find the checksum for blocks in Spark. For example, the upstream task may shuffle 100Mb data, but the downstream task may receive 99Mb data due to the network. Can Spark verify the data is received completely based on size?

Spark uses compression with checksumming for shuffle data, so it should know when the data is corrupt and initiate a recomputation.
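
As an illustration of how a receiver can notice a short or damaged block, here is a small sketch using a length prefix and a CRC32. This is not Spark's shuffle code or its wire format; the frame and unframe helpers and the 8-byte header layout are assumptions made up for this example.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its length and CRC32 (4 bytes each, big-endian)."""
    crc = zlib.crc32(payload)
    return len(payload).to_bytes(4, "big") + crc.to_bytes(4, "big") + payload


def unframe(data: bytes) -> bytes:
    """Return the payload, or raise ValueError if it is truncated or corrupt."""
    if len(data) < 8:
        raise ValueError("header missing")
    length = int.from_bytes(data[:4], "big")
    crc = int.from_bytes(data[4:8], "big")
    payload = data[8:]
    if len(payload) != length:
        raise ValueError("truncated block")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return payload


block = frame(b"shuffle data " * 100)
```

Calling unframe(block[:-10]) raises ValueError because the length no longer matches, which is the "99Mb instead of 100Mb" case from the question. A flipped byte is caught by the CRC check instead. In either case the receiver can ask for the data to be produced again.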


As for your question in the subject:

All of this means that Spark supports at-least-once processing. There is no way that I know of to ensure exactly-once. You can try to minimize more-than-once situations by updating your offsets as soon as possible, but that does not eliminate the problem entirely.
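
A toy simulation of why this is so, in plain Python with no Spark or Kafka involved. The consume function, its parameters and the single simulated crash are all hypothetical; the only thing it models is "write the output first, store the offset afterwards".

```python
def consume(log, commit_every, crash_after):
    """Process `log` from the last committed offset, crashing once mid-run.

    Returns every record the sink saw, duplicates included.
    """
    sink = []
    committed = 0      # next offset to read after a restart
    crashed = False
    while committed < len(log):
        offset = committed
        try:
            while offset < len(log):
                sink.append(log[offset])   # side effect happens first
                offset += 1
                if not crashed and len(sink) == crash_after:
                    crashed = True
                    raise RuntimeError("simulated failure")
                if offset % commit_every == 0 or offset == len(log):
                    committed = offset     # offset is stored afterwards
        except RuntimeError:
            continue  # restart from the last committed offset
    return sink


records = list(range(10))
batch = consume(records, commit_every=5, crash_after=4)  # 14 records seen
eager = consume(records, commit_every=1, crash_after=4)  # 11 records seen
```

Committing after every record cuts the duplicates from four to one in this run, but record 3 is still written twice, because the failure falls between the write and the offset update.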


Hope this helps,

Michal Senkyr
