Dear Spark users, We have a spark streaming application which receives events from kafka and has an updatestatebykey call that executes IO like writing to Cassandra or sending events to other systems.
Upon metadata checkpoint recovery (before the data checkpoint occurs) all lost RDDs get recomputed, this means that those RDDS which are used for the updatestatebyKey function will get recomputed and consequently the updateStateByKey will get called again on those to recreate the final RDD. This is all good except for the fact that Spark re-executes all IO operations on the partition re-creation. This is a problem has while re-computing thousands of partitions from the last seconds, IO execution like DB writes and reactive events would get re-thrown generating system overall inconsistency. Ideally only re-computation should occur in this case Could there be an easy way of adding recomputation awareness into the updateStateByKey function so that we can avoid IO execution while recomputing? I've looked for posts on this but couldn't find it. Any comment will be much appreciated. Tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-causes-IO-re-execution-tp12568.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
