Hi all,

I was curious about the details of Spark speculation. My understanding
is that, when "speculated" tasks are scheduled on other machines, the
original tasks keep running until the entire stage completes. This seems
to leave room for duplicated work, because some Spark actions are not
idempotent: a partition might be counted twice in the case of RDD.count,
or written to HDFS twice in the case of RDD.save*(). How does Spark
prevent this kind of duplicated work?
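To make the two cases concrete, here is a minimal sketch of what I have
in mind (assuming a running cluster; the app name and HDFS paths below
are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object SpeculationQuestion {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("speculation-question")  // placeholder name
          .set("spark.speculation", "true")    // enable speculative execution
        val sc = new SparkContext(conf)

        val rdd = sc.textFile("hdfs:///tmp/input")  // placeholder path

        // Case 1: if both the original and the speculative attempt of a
        // task complete, is that partition counted twice?
        val n = rdd.count()
        println("count = " + n)

        // Case 2: could both attempts write the same partition's output
        // file to HDFS?
        rdd.saveAsTextFile("hdfs:///tmp/output")  // placeholder path

        sc.stop()
      }
    }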

Mingyu

