Hmm, I don't see what "partition failure" means here.
You can have a node or executor failure, etc. So let us look at a scenario, irrespective of it being streaming or micro-batch. Spark replicates the partitions among multiple nodes. *If one executor fails*, it moves the processing over to another executor. However, if the data is lost, it re-executes the processing that generated the data, and it will have to go back to the source. *In case of failure, there will be a delay in getting results*. The amount of delay depends on how much reprocessing Spark has to do.

- Driver executing an action

When the driver executes an action, it submits a job to the Cluster Manager. The Cluster Manager starts submitting tasks to executors and monitoring them. If an executor dies, the Cluster Manager does the work of reassigning its tasks.

- Scaling

Spark does not add executors when executors fail. It just moves the tasks to other executors. If you are installing Spark on your own cluster, you will need to figure out how to bring back executors. For example, Spark on Kubernetes
<https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-/>
will replace the failed nodes. However, if the driver dies, the Spark job dies and there is no recovery from that. The only way to recover is to run the job again. Batch jobs do not have checkpointing, so they will need to reprocess everything from the beginning and be idempotent. Streaming jobs have checkpointing (they write intermediate progress to persistent storage, such as a directory) and they will restart from the last micro-batch. This means they might have to repeat the last micro-batch (a minimal sketch of setting the checkpoint location follows at the end of this message, after the quoted question).

- What to run in case of task(s) failure

The RDD lineage has the full history of what was run. In a multi-stage job, Spark may have to rerun earlier stages. For example, if you have done a groupBy, you will have 2 stages. After the first stage, the data will be shuffled by hashing the groupBy key, so that data for the same key value lands in the same partition. Now, if one of those partitions is lost during execution of the second stage, Spark will have to go back and re-execute the first-stage tasks that produced it (see the second sketch at the end of this message).

HTH

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On Fri, 21 Jan 2022 at 17:54, Siddhesh Kalgaonkar <kalgaonkarsiddh...@gmail.com> wrote:

> Hello team,
>
> I am aware that in case of memory issues when a task fails, it will try to
> restart 4 times since it is a default number and if it still fails then it
> will cause the entire job to fail.
>
> But suppose if I am reading a file that is distributed across nodes in
> partitions. So, what will happen if a partition fails that holds some data?
> Will it re-read the entire file and get that specific subset of data since
> the driver has the complete information? or will it copy the data to the
> other working nodes or tasks and try to run it?
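
PS. To illustrate the checkpointing point above, here is a minimal PySpark Structured Streaming sketch (the paths and app name are made up for illustration, not taken from any real job). The checkpointLocation is where Spark records the progress of each micro-batch; if the driver dies and the job is resubmitted with the same checkpoint location, the query resumes from the last committed micro-batch rather than from the beginning.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

# Built-in "rate" source, so the sketch needs no external system
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (
    events.writeStream
          .format("parquet")
          .option("path", "/tmp/demo/output")             # hypothetical output directory
          .option("checkpointLocation", "/tmp/demo/chk")  # micro-batch progress is written here
          .start()
)
query.awaitTermination()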
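
And a small sketch of the two-stage groupBy example (again, names are only illustrative). The groupBy introduces a shuffle, so the physical plan has an Exchange between the two stages; if a second-stage task loses its shuffle input (for instance the executor holding those shuffle files dies), Spark uses the lineage to re-run the first-stage tasks that produced that partition.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()

df = spark.range(0, 1000000)                                   # stage 1: generate the input partitions
counts = df.groupBy((F.col("id") % 10).alias("key")).count()   # shuffle boundary -> stage 2

counts.show()         # the action: the driver submits the job
counts.explain(True)  # the Exchange node in the plan marks the stage boundary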