Good callout. Yeah, once you take your work out of Spark it's all on you. Any partition-level operation (e.g. map, flatMap, foreach) ends up as a lambda in Catalyst. I've found, however, that avoiding explode and doing things procedurally at that point, with a sufficient amount of unit testing, helps a ton with coverage and stability. If you're not using spark-testing-base, I highly recommend it. I've built many streaming and processing topologies on Spark for several orgs now, and Holden's framework has saved me more times than I can count.
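For instance, a minimal spark-testing-base suite might look like the sketch below (assuming a recent spark-testing-base with ScalaTest; the doubling logic is just a stand-in for real per-row processing):

    import com.holdenkarau.spark.testing.DataFrameSuiteBase
    import org.scalatest.funsuite.AnyFunSuite

    class EnrichmentSpec extends AnyFunSuite with DataFrameSuiteBase {
      test("enrichment is deterministic and total") {
        import spark.implicits._

        val input    = Seq(("a", 1), ("b", 2)).toDF("id", "value")
        val expected = Seq(("a", 2), ("b", 4)).toDF("id", "value")

        // Stand-in for the procedural logic you'd otherwise bury in explode/UDFs.
        val result = input.withColumn("value", $"value" * 2)

        // DataFrameSuiteBase provides a local SparkSession plus this assertion.
        assertDataFrameEquals(expected, result)
      }
    }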
-dan

On Thu, Apr 17, 2025 at 12:09 AM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:

> Just a quick note on working at the RDD level in Spark: once you go down to that level, it's entirely up to you to handle everything. You gain more control and flexibility, but Spark steps back and hands you the steering wheel. If tasks fail, it's usually because you're allowing them to, by not properly capturing or handling all potential errors.
>
> Last year, I worked on a Spark Streaming project that had to interact with various external systems (OpenSearch, Salesforce, Microsoft Office 365, a ticketing service, ...) through HTTP requests. We handled those interactions the way I described, and it's been running in production without issues since last summer.
>
> That said, a word of caution: while we didn't face any issues, some colleagues on a different project ran into a problem when calling a model inside a UDF. In certain cases, especially after using the built-in explode function, Spark ended up calling the model twice per row. Their workaround was to cache the DataFrame before invoking the model. (I don't think Spark speculation was enabled in their case, by the way.)
>
> I still need to dig deeper into the root cause, but just a heads-up: under certain conditions, Spark might trigger multiple executions. We also used the explode function in our project but didn't observe any duplicate calls... so for now, it remains a bit of a mystery.
>
> On Thu, Apr 17, 2025 at 2:02 AM daniel williams (<daniel.willi...@gmail.com>) wrote:
>
>> Apologies. I was imagining a pure Kafka application with a transactional consumer on the processing side. Ángel's suggestion is the correct one: use mapPartitions to maintain state across your broadcast Kafka producers so you can retry or introduce back pressure, letting each producer retry in its own thread pool with its built-in backoff. This would allow you to retry n times and then, upon final (hopefully not) failure, update your dataset for further processing.
>>
>> -dan
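A rough sketch of that retry pattern (the topic name, serializers, and retry settings are illustrative, and the producer here is created per partition rather than broadcast):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // Assumes `spark` is the active SparkSession and `ds` is a
    // Dataset[(String, String)] of (key, payload) -- the schema is an assumption.
    import spark.implicits._

    val statuses = ds.mapPartitions { rows =>
      val props = new Properties()
      props.put("bootstrap.servers", "broker:9092") // placeholder brokers
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("retries", "5")            // producer-level retries ("retry n times")
      props.put("retry.backoff.ms", "500") // the producer's built-in backoff
      val producer = new KafkaProducer[String, String](props)
      val out = rows.map { case (key, payload) =>
        try { producer.send(new ProducerRecord("events", key, payload)).get(); (key, "OK") }
        catch { case e: Exception => (key, s"FAILED: ${e.getClass.getSimpleName}") }
      }.toList // materialize before closing the producer
      producer.close()
      out.iterator
    }

Rows still marked FAILED after the producer has exhausted its retries are the "update your dataset for further processing" step: persist them and feed them into a follow-up pass.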
>> On Wed, Apr 16, 2025 at 5:04 PM Abhishek Singla <abhisheksingla...@gmail.com> wrote:
>>
>>> @daniel williams <daniel.willi...@gmail.com>
>>>
>>> > Operate your producer in transactional mode.
>>> Could you elaborate on this? How would I achieve it with foreachPartition and an HTTP client?
>>>
>>> > Checkpointing is an abstract concept only applicable to streaming.
>>> That means checkpointing is not supported in batch mode, right?
>>>
>>> @angel.alvarez.pas...@gmail.com <angel.alvarez.pas...@gmail.com>
>>>
>>> > What about returning the HTTP codes as a dataframe result in the foreachPartition
>>> I believe you are referring to mapPartitions; that could be done. The main concern is that the issue is not only failures in the third-party system but also task/job failures. I wanted to know whether there is an existing way in Spark batch to checkpoint the already-processed rows of a partition when using foreachPartition or mapPartitions, so that they are not processed again when a task is rescheduled after a failure or the job is retriggered.
>>>
>>> Regards,
>>> Abhishek Singla
>>>
>>> On Wed, Apr 16, 2025 at 7:38 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>>
>>>> What about returning the HTTP codes as a dataframe result in the foreachPartition, saving it to a file/table/whatever, and then performing a join to discard the rows already processed OK when you try again?
>>>>
>>>> On Wed, Apr 16, 2025 at 3:01 PM Abhishek Singla (<abhisheksingla...@gmail.com>) wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> We are using foreachPartition to send dataset row data to a third-party system via an HTTP client. The operation is not idempotent, so I want to ensure that in case of failure the previously processed rows do not get processed again.
>>>>>
>>>>> Is there a way to checkpoint in Spark batch:
>>>>> 1. Checkpoint processed partitions, so that if there are 1000 partitions and 100 were processed in the previous run, those 100 do not get processed again.
>>>>> 2. Checkpoint a partial partition in foreachPartition: if I have processed 100 records out of a partition's 1000, is there a way to checkpoint that offset so those 100 are not processed again?
>>>>>
>>>>> Regards,
>>>>> Abhishek Singla
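A minimal sketch of the status-log-and-join approach Ángel describes above (the path, schema, and httpPost helper are all hypothetical):

    // Assumes `spark` is the active SparkSession and `input` is a
    // Dataset[(String, String)] of (id, payload).
    import spark.implicits._

    // Record a per-row HTTP status instead of letting a failure kill the task.
    val results = input.mapPartitions { rows =>
      rows.map { case (id, payload) =>
        val status =
          try httpPost(payload)              // hypothetical call returning the HTTP status code
          catch { case _: Exception => 599 } // map exceptions to a failure code
        (id, status)
      }
    }.toDF("id", "http_status")

    // Persist the progress log durably (location is illustrative).
    results.write.mode("append").parquet("/data/send_status")

    // On the next attempt, discard rows that already succeeded.
    val done = spark.read.parquet("/data/send_status").where($"http_status" === 200)
    val remaining = input.toDF("id", "payload")
      .join(done.select("id"), Seq("id"), "left_anti")

One caveat under these assumptions: the log is only committed when the writing task succeeds, so a task that fails midway may still re-send the rows it had already posted before the failure. This bounds duplicates across job reruns rather than eliminating them within a single task retry, which is exactly the sub-partition checkpointing gap Abhishek is asking about.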