Thank you so much for the detailed explanation. I was able to recollect a few things about Spark.
Thanks for your time once again :)

On Mon, Jan 24, 2022 at 2:20 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hm,
>
> I don't see what partition failure means here.
>
> You can have a node or executor failure etc. So let us look at a scenario here, irrespective of whether the job is streaming or micro-batch.
>
> Spark distributes the partitions among multiple nodes. *If one executor fails*, it moves the processing over to another executor. However, if the data is lost, it re-executes the processing that generated the data, and it will have to go back to the source. *In case of failure, there will be a delay in getting results*. The amount of delay depends on how much reprocessing Spark has to do.
>
> - Driver executing an action
>
> When the driver executes an action, it submits a job to the Cluster Manager. The Cluster Manager starts submitting tasks to executors and monitoring them. If an executor dies, the Cluster Manager does the work of reassigning its tasks.
>
> - Scaling
>
> Spark does not add executors when executors fail. It just moves the tasks to other executors. If you are installing Spark on your own cluster, you will need to work out how to bring back failed executors. For example, Spark on Kubernetes <https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-/> will replace the failed nodes. However, if the driver dies, the Spark job dies and there is no recovery from that. The only way to recover is to run the job again. Batch jobs do not have checkpointing, so they will need to reprocess everything from the beginning and should therefore be idempotent. Streaming jobs have checkpointing (they write intermediate progress to persistent storage such as a directory), so they will restart from the last micro-batch. This means they might have to repeat the last micro-batch.
>
> - What to rerun in case of task(s) failure
>
> The RDD lineage has the full history of what was run. In a multi-stage job, Spark may have to rerun earlier stages. For example, if you have done a groupBy, you will have two stages. After the first stage, the data is shuffled by hashing the groupBy key, so that data for the same key value lands in the same partition. Now, if one of those partitions is lost during execution of the second stage, Spark will have to go back and re-execute the tasks in the first stage that produced it.
>
> HTH
>
> On Fri, 21 Jan 2022 at 17:54, Siddhesh Kalgaonkar <kalgaonkarsiddh...@gmail.com> wrote:
>
>> Hello team,
>>
>> I am aware that, in case of memory issues, when a task fails it will be retried 4 times (the default), and if it still fails the entire job fails.
>>
>> But suppose I am reading a file that is distributed across nodes in partitions. What will happen if a partition that holds some data fails? Will Spark re-read the entire file and get that specific subset of data, since the driver has the complete information? Or will it copy the data to the other working nodes or tasks and try to run it?
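To make the lineage and stage re-execution point in the quoted explanation concrete, here is a minimal PySpark sketch. The file paths and the column name "key" are assumptions for illustration only; the point is that the groupBy introduces a shuffle boundary, and a partition lost during the second stage is rebuilt by re-running the first-stage tasks that produced it rather than by restarting the whole job.

    # Minimal sketch; paths and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("lineage-recovery-sketch")
        # A failed task is retried up to 4 times by default; shown explicitly here.
        .config("spark.task.maxFailures", "4")
        .getOrCreate()
    )

    # Stage 1: read the input; each split of the file becomes a partition.
    df = spark.read.csv("/data/events.csv", header=True, inferSchema=True)

    # groupBy adds a shuffle boundary: rows are hash-partitioned on "key" so that
    # all rows with the same key end up in the same partition for stage 2.
    agg = df.groupBy("key").agg(F.count("*").alias("cnt"))

    # The action triggers the job. If a shuffle partition is lost while stage 2 is
    # running, Spark follows the lineage back, re-runs the stage-1 tasks whose
    # output is missing, and then resumes stage 2.
    agg.write.mode("overwrite").parquet("/output/counts")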
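The micro-batch recovery described above relies on a checkpoint location. Below is a minimal Structured Streaming sketch, again with assumed paths and an assumed schema, showing where that checkpoint is configured. Restarting the same query with the same checkpoint directory resumes from the last committed micro-batch, which may be partially reprocessed, so the sink should be idempotent.

    # Minimal Structured Streaming sketch; paths and schema are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint-recovery-sketch").getOrCreate()

    # File source: streaming file sources need an explicit schema.
    stream = (
        spark.readStream
        .format("json")
        .schema("id INT, value STRING")
        .load("/incoming/events")
    )

    # The checkpointLocation records offsets and commits in persistent storage.
    # If the driver dies, rerunning this job with the same checkpoint directory
    # picks up from the last micro-batch instead of reprocessing everything.
    query = (
        stream.writeStream
        .format("parquet")
        .option("path", "/output/events")
        .option("checkpointLocation", "/checkpoints/events")
        .start()
    )

    query.awaitTermination()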