Hello Sparkers,

I'm currently running load tests on a Spark Streaming job. When the task duration increases beyond the batchDuration, the job becomes unstable. In the logs I see tasks failing with the following message:
Job aborted due to stage failure: Task 266.0:1 failed 4 times, most recent failure: Exception failure in TID 19929 on host dnode-0.hdfs.private: java.lang.Exception: Could not compute split, block input-2-1409835930000 not found
    org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)

I understand it's not healthy for the task execution duration to be longer than the batchDuration, but I'd expect we should be able to absorb peaks. I'm wondering: is this Spark Streaming's 'graceful degradation', or is data being lost at that moment? What is the reason for the lost block, and what is the recommended approach to deal with this?

Thanks in advance,
Gerard.
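
P.S. For reference, the job is set up roughly along these lines (a simplified sketch; the socket source, host, port, batch interval and workload below are just illustrative placeholders, not the actual production code):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingLoadTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-load-test")

    // batchDuration: tasks are expected to complete within this interval,
    // which is what breaks down under peak load in the test
    val ssc = new StreamingContext(conf, Seconds(10))

    // Receiver-based input stream (placeholder source); the storage level is
    // set explicitly so received blocks can spill to disk rather than live
    // only in memory
    val lines = ssc.socketTextStream("localhost", 9999,
      StorageLevel.MEMORY_AND_DISK_SER_2)

    // Placeholder workload standing in for the real per-batch processing
    lines.map(_.length).reduce(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}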