Spark Streaming checkpoint recovery causes IO re-execution

RodrigoB Thu, 21 Aug 2014 03:17:34 -0700

Dear Spark users,

We have a Spark Streaming application that receives events from Kafka and
has an updateStateByKey call that performs IO, such as writing to Cassandra or
sending events to other systems.


Upon metadata checkpoint recovery (before the data checkpoint occurs), all
lost RDDs get recomputed. This means that the RDDs used by the
updateStateByKey function get recomputed, and consequently
updateStateByKey gets called again on them to recreate the final RDD.

This is all good except for the fact that Spark re-executes all IO
operations when the partitions are re-created. This is a problem because, while
re-computing thousands of partitions from the last few seconds, IO operations
like DB writes and reactive events get triggered again, leaving the overall
system inconsistent. Ideally, only the re-computation should occur in this case.
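
A minimal sketch of the effect, in plain Python rather than real Spark code:
the update function below only mimics the shape of an updateStateByKey
function, and every name in it (external_writes, update_state,
compute_batches) is hypothetical. It shows that recomputing the same batches
rebuilds the same state but repeats every side effect.

```python
# Plain-Python simulation, not Spark code. All names here are hypothetical.
external_writes = []  # stands in for Cassandra rows or events sent downstream


def update_state(new_values, previous_state):
    """Update function in the style of updateStateByKey, with an IO side effect."""
    total = (previous_state or 0) + sum(new_values)
    external_writes.append(total)  # side effect: runs every time the function runs
    return total


def compute_batches(batches):
    """Fold all batches into a final state, the way a recomputation would."""
    state = None
    for values in batches:
        state = update_state(values, state)
    return state


batches = [[1, 2], [3], [4, 5]]
first_state = compute_batches(batches)      # original run
recovered_state = compute_batches(batches)  # recomputation after recovery
```

Both runs end in the same state, but external_writes now holds every write
twice, which is the inconsistency described above.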

Could there be an easy way of adding recomputation awareness into the
updateStateByKey function so that we can avoid IO execution while
recomputing? I've looked for posts on this but couldn't find any.
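
To make the request concrete, the following plain-Python sketch shows what
such awareness could look like. It is not Spark code, and the
is_recomputation flag is purely hypothetical; it is an assumption for
illustration, not something the updateStateByKey API is known to provide.

```python
# Plain-Python sketch, not Spark code. The is_recomputation flag is hypothetical.
sent_events = []  # stands in for DB writes or events sent to other systems


def update_state(new_values, previous_state, is_recomputation):
    """Rebuild the state always, but perform IO only on the original run."""
    total = (previous_state or 0) + sum(new_values)
    if not is_recomputation:
        sent_events.append(total)  # side effect skipped while recomputing
    return total


def run(batches, is_recomputation):
    """Fold all batches into a final state, passing the flag through."""
    state = None
    for values in batches:
        state = update_state(values, state, is_recomputation)
    return state


batches = [[1, 2], [3], [4, 5]]
live_state = run(batches, is_recomputation=False)      # original run, IO happens
recovered_state = run(batches, is_recomputation=True)  # recovery, IO is skipped
```

With such a flag the recovered state matches the live one, while sent_events
contains each event only once.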

Any comment will be much appreciated.

Thanks,
Rod





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-causes-IO-re-execution-tp12568.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
