The boundary is a bit flexible. In terms of observed DStream effective
state the direct stream semantics is exactly-once.
In terms of external system observations (like message emission), Spark
Streaming semantics is at-least-once.

Regards,
Piotr

On Mon, Dec 5, 2016 at 8:59 AM, Michal Šenkýř <bina...@gmail.com> wrote:

> Hello John,
>
> 1. If a task complete the operation, it will notify driver. The driver may
> not receive the message due to the network, and think the task is still
> running. Then the child stage won't be scheduled ?
>
> Spark's fault tolerance policy is, if there is a problem in processing a
> task or an executor is lost, run the task (and any dependent tasks) again.
> Spark attempts to minimize the number of tasks it has to recompute, so
> usually only a small part of the data is recomputed.
>
> So in your case, the driver simply schedules the task on another executor
> and continues to the next stage when it receives the data.
>
> 2. how do spark guarantee the downstream-task can receive the shuffle-data
> completely. As fact, I can't find the checksum for blocks in spark. For
> example, the upstream-task may shuffle 100Mb data, but the downstream-task
> may receive 99Mb data due to network. Can spark verify the data is received
> completely based size ?
>
> Spark uses compression with checksuming for shuffle data so it should know
> when the data is corrupt and initiate a recomputation.
>
> As for your question in the subject:
> All of this means that Spark supports at-least-once processing. There is
> no way that I know of to ensure exactly-once. You can try to minimize
> more-than-once situations by updating your offsets as soon as possible but
> that does not eliminate the problem entirely.
>
> Hope this helps,
>
> Michal Senkyr
>

Reply via email to