Thanks for your kind words, Sri.

Well, it is true that as yet Spark on Kubernetes is not on par with Spark on
YARN in maturity; essentially, Spark on Kubernetes is still a work in
progress. So in the first place, IMO, one needs to think about why the
executors are failing. What causes this behaviour? Is it the code or some
inadequate set-up? These things come to my mind:


   - Resource Allocation: Insufficient resources (CPU, memory) can lead to
   executor failures.
   - Mis-configuration Issues: Verify that the configurations are
   appropriate for your workload.
   - External Dependencies: If your Spark job relies on external services
   or data sources, ensure they are accessible. Issues such as network
   problems or unavailability of external services can lead to executor
   failures.
   - Data Skew: Uneven distribution of data across partitions can lead to
   data skew and cause some executors to process significantly more data than
   others. This can lead to resource exhaustion on specific executors.
   - Spark Version and Kubernetes Compatibility: If Spark is running on a
   managed service such as EKS or GKE, make sure you are using a Spark
   version that is compatible with your Kubernetes environment. These
   vendors normally run older, more stable versions of Spark, and
   compatibility issues can arise when you bring your own, newer version of
   Spark.
   - Docker Image Currency: How up to date are your Docker images on the
   container registries (ECR, GCR)? Is there any incompatibility between a
   Docker image built on one Spark version and the Spark version on the host
   you are running spark-submit from? A minimal sketch illustrating some of
   these settings follows this list.
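
For illustration only, here is a minimal PySpark sketch showing where some of
these settings are declared. Every value in it (endpoint, image name,
namespace, resource sizes) is a placeholder assumption, not a recommendation;
adjust to your own workload and registry.

from pyspark.sql import SparkSession

# Minimal sketch (client mode against a k8s master) -- every value below is a
# placeholder, not a recommendation; in cluster mode you would normally pass
# these as --conf options to spark-submit instead.
spark = (
    SparkSession.builder
    .appName("executor-failure-check")
    .master("k8s://https://my-k8s-apiserver:6443")  # hypothetical endpoint
    # Resource allocation: size executors so they are not OOM-killed
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .config("spark.executor.instances", "3")
    # Keep the image's Spark version aligned with the version you submit from
    .config("spark.kubernetes.container.image", "myregistry/spark-py:3.5.0")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .getOrCreate()
)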

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice: "one test result is worth one-thousand
expert opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Mon, 19 Feb 2024 at 23:18, Sri Potluri <pssp...@gmail.com> wrote:

> Dear Mich,
>
> Thank you for your detailed response and the suggested approach to
> handling retry logic. I appreciate you taking the time to outline the
> method of embedding custom retry mechanisms directly into the application
> code.
>
> While the solution of wrapping the main logic of the Spark job in a loop
> for controlling the number of retries is technically sound and offers a
> workaround, it may not be the most efficient or maintainable solution for
> organizations running a large number of Spark jobs. Modifying each
> application to include custom retry logic can be a significant undertaking,
> introducing variability in how retries are handled across different jobs,
> and requiring additional testing and maintenance.
>
> Ideally, operational concerns like retry behavior in response to
> infrastructure failures should be decoupled from the business logic of
> Spark applications. This separation allows data engineers and scientists to
> focus on the application logic without needing to implement and test
> infrastructure resilience mechanisms.
>
> Thank you again for your time and assistance.
>
> Best regards,
> Sri Potluri
>
> On Mon, Feb 19, 2024 at 5:03 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Went through your issue with the code running on k8s
>>
>> When an executor of a Spark application fails, the system attempts to
>> maintain the desired level of parallelism by automatically recreating a new
>> executor to replace the failed one. While this behavior is beneficial for
>> transient errors, ensuring that the application continues to run, it
>> becomes problematic in cases where the failure is due to a persistent issue
>> (such as misconfiguration, inaccessible external resources, or incompatible
>> environment settings). In such scenarios, the application enters a loop,
>> continuously trying to recreate executors, which leads to resource wastage
>> and complicates application management.
>>
>> Well, fault tolerance is built in, especially in a k8s cluster. You can
>> implement your own logic to control the retry attempts. You can do this
>> by wrapping the main logic of your Spark job in a loop and controlling the
>> number of retries. If a persistent issue is detected, you can choose to
>> stop the job. Today is the third time that looping control has come up :)
>>
>> Take this code
>>
>> import time
>>
>> max_retries = 5
>> retries = 0
>> while retries < max_retries:
>>     try:
>>         # Your Spark job logic here
>>         ...
>>     except Exception as e:
>>         # Log the exception
>>         print(f"Exception in Spark job: {str(e)}")
>>         # Increment the retry count
>>         retries += 1
>>         # Sleep before the next attempt
>>         time.sleep(60)
>>     else:
>>         # Break out of the loop if the job completes successfully
>>         break
>>
>> HTH
>>
>> On Mon, 19 Feb 2024 at 19:21, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> I am not aware of any configuration parameter in Spark classic to
>>> limit executor creation. Because of fault tolerance, Spark will try to
>>> recreate failed executors. I am not really that familiar with the Spark
>>> operator for k8s; there may be something there.
>>>
>>> Have you considered custom monitoring and handling within Spark itself,
>>> using max_retries = 5 etc?
>>>
>>> HTH
>>>
>>> On Mon, 19 Feb 2024 at 18:34, Sri Potluri <pssp...@gmail.com> wrote:
>>>
>>>> Hello Spark Community,
>>>>
>>>> I am currently leveraging Spark on Kubernetes, managed by the Spark
>>>> Operator, for running various Spark applications. While the system
>>>> generally works well, I've encountered a challenge related to how Spark
>>>> applications handle executor failures, specifically in scenarios where
>>>> executors enter an error state due to persistent issues.
>>>>
>>>> *Problem Description*
>>>>
>>>> When an executor of a Spark application fails, the system attempts to
>>>> maintain the desired level of parallelism by automatically recreating a new
>>>> executor to replace the failed one. While this behavior is beneficial for
>>>> transient errors, ensuring that the application continues to run, it
>>>> becomes problematic in cases where the failure is due to a persistent issue
>>>> (such as misconfiguration, inaccessible external resources, or incompatible
>>>> environment settings). In such scenarios, the application enters a loop,
>>>> continuously trying to recreate executors, which leads to resource wastage
>>>> and complicates application management.
>>>>
>>>> *Desired Behavior*
>>>>
>>>> Ideally, I would like to have a mechanism to limit the number of
>>>> retries for executor recreation. If the system fails to successfully create
>>>> an executor more than a specified number of times (e.g., 5 attempts), the
>>>> entire Spark application should fail and stop trying to recreate the
>>>> executor. This behavior would help in efficiently managing resources and
>>>> avoiding prolonged failure states.
>>>>
>>>> *Questions for the Community*
>>>>
>>>> 1. Is there an existing configuration or method within Spark or the
>>>> Spark Operator to limit executor recreation attempts and fail the job after
>>>> reaching a threshold?
>>>>
>>>> 2. Has anyone else encountered similar challenges and found workarounds
>>>> or solutions that could be applied in this context?
>>>>
>>>>
>>>> *Additional Context*
>>>>
>>>> I have explored Spark's task and stage retry configurations
>>>> (`spark.task.maxFailures`, `spark.stage.maxConsecutiveAttempts`), but these
>>>> do not directly address the issue of limiting executor creation retries.
>>>> Implementing a custom monitoring solution to track executor failures and
>>>> manually stop the application is a potential workaround, but it would be
>>>> preferable to have a more integrated solution.
>>>>
>>>> I appreciate any guidance, insights, or feedback you can provide on
>>>> this matter.
>>>>
>>>> Thank you for your time and support.
>>>>
>>>> Best regards,
>>>> Sri P
>>>>
>>>
>
> --
> Thank you,
> Sindhu P.
>
