Thank you so much for the detailed explanation. I was able to recollect a few things about Spark.
Thanks for your time once again :)

On Mon, Jan 24, 2022 at 2:20 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hm,
>
> I don't see what partition failure means here.
>
> You can have a node or executor failure etc. So let us look at a scenario here, irrespective of whether the job is streaming or micro-batch.
>
> Spark distributes the partitions among multiple nodes. *If one executor fails*, it moves the processing over to another executor. However, if the data is lost, it re-executes the processing that generated the data, and it will have to go back to the source. *In case of failure, there will be a delay in getting results*. The amount of delay depends on how much reprocessing Spark has to do.
>
> - Driver executing an action
>
> When the driver executes an action, it submits a job to the Cluster Manager. The Cluster Manager starts submitting tasks to executors and monitoring them. If an executor dies, the Cluster Manager does the work of reassigning its tasks.
>
> - Scaling
>
> Spark does not add executors when executors fail. It just moves the tasks to other executors. If you are installing Spark on your own cluster, you will need to work out how to bring back failed executors. For example, Spark on Kubernetes <https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-/> will replace the failed nodes. However, if the driver dies, the Spark job dies and there is no recovery from that. The only way to recover is to run the job again. Batch jobs do not have checkpointing, so they will need to reprocess everything from the beginning and should therefore be idempotent. Streaming jobs have checkpointing (they write intermediate progress to persistent storage such as a directory), so they will restart from the last micro-batch. This means they might have to repeat the last micro-batch.
>
> - What to rerun in case of task(s) failure
>
> The RDD lineage has the full history of what was run. In a multi-stage job, Spark may have to rerun earlier stages. For example, if you have done a groupBy, you will have two stages. After the first stage, the data is shuffled by hashing the groupBy key, so that data for the same key value lands in the same partition. Now, if one of those partitions is lost during execution of the second stage, Spark will have to go back and re-execute the tasks in the first stage that produced it.
>
> HTH
>
> On Fri, 21 Jan 2022 at 17:54, Siddhesh Kalgaonkar <kalgaonkarsiddh...@gmail.com> wrote:
>
>> Hello team,
>>
>> I am aware that, in case of memory issues, when a task fails it will be retried 4 times (the default), and if it still fails the entire job fails.
>>
>> But suppose I am reading a file that is distributed across nodes in partitions. What will happen if a partition that holds some data fails? Will Spark re-read the entire file and get that specific subset of data, since the driver has the complete information? Or will it copy the data to the other working nodes or tasks and try to run it?
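To make the lineage and stage re-execution point in the quoted explanation concrete, here is a minimal PySpark sketch. The file paths and the column name "key" are assumptions for illustration only; the point is that the groupBy introduces a shuffle boundary, and a partition lost during the second stage is rebuilt by re-running the first-stage tasks that produced it rather than by restarting the whole job.

    # Minimal sketch; paths and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("lineage-recovery-sketch")
        # A failed task is retried up to 4 times by default; shown explicitly here.
        .config("spark.task.maxFailures", "4")
        .getOrCreate()
    )

    # Stage 1: read the input; each split of the file becomes a partition.
    df = spark.read.csv("/data/events.csv", header=True, inferSchema=True)

    # groupBy adds a shuffle boundary: rows are hash-partitioned on "key" so that
    # all rows with the same key end up in the same partition for stage 2.
    agg = df.groupBy("key").agg(F.count("*").alias("cnt"))

    # The action triggers the job. If a shuffle partition is lost while stage 2 is
    # running, Spark follows the lineage back, re-runs the stage-1 tasks whose
    # output is missing, and then resumes stage 2.
    agg.write.mode("overwrite").parquet("/output/counts")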
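The micro-batch recovery described above relies on a checkpoint location. Below is a minimal Structured Streaming sketch, again with assumed paths and an assumed schema, showing where that checkpoint is configured. Restarting the same query with the same checkpoint directory resumes from the last committed micro-batch, which may be partially reprocessed, so the sink should be idempotent.

    # Minimal Structured Streaming sketch; paths and schema are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint-recovery-sketch").getOrCreate()

    # File source: streaming file sources need an explicit schema.
    stream = (
        spark.readStream
        .format("json")
        .schema("id INT, value STRING")
        .load("/incoming/events")
    )

    # The checkpointLocation records offsets and commits in persistent storage.
    # If the driver dies, rerunning this job with the same checkpoint directory
    # picks up from the last micro-batch instead of reprocessing everything.
    query = (
        stream.writeStream
        .format("parquet")
        .option("path", "/output/events")
        .option("checkpointLocation", "/checkpoints/events")
        .start()
    )

    query.awaitTermination()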