Hi all, I was curious about the details of Spark speculation. My understanding is that when "speculated" tasks are newly scheduled on other machines, the original tasks keep running until the entire stage completes. This seems to leave room for duplicated work, because some Spark actions are not idempotent: a partition might be counted twice in the case of RDD.count(), or written to HDFS twice in the case of RDD.save*(). How does Spark prevent this kind of duplicated work?
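
For concreteness, here is a minimal sketch of the setup I have in mind (Scala; the app name, paths, and the tuning value are placeholders, not from a real job):

    import org.apache.spark.{SparkConf, SparkContext}

    object SpeculationExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("speculation-example")           // placeholder app name
          .set("spark.speculation", "true")            // launch backup copies of slow tasks
          .set("spark.speculation.multiplier", "1.5")  // a task is "slow" at 1.5x the median runtime

        val sc = new SparkContext(conf)
        val rdd = sc.textFile("hdfs:///path/to/input") // placeholder path

        // Result-returning action: could a speculative copy and the
        // original both contribute a partition's count?
        val n = rdd.count()
        println(s"count = $n")

        // Output-writing action: could two attempts both write the
        // same partition's files?
        rdd.saveAsTextFile("hdfs:///path/to/output")   // placeholder path

        sc.stop()
      }
    }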
Mingyu