I have not. Most of my work and development on Spark has been on the Scala side of the house, and I've built a suite of tools for Kafka integration with Spark for stream analytics along with spark-testing-base <https://github.com/holdenk/spark-testing-base>.
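For anyone following the thread who hasn't used it, a minimal sketch of a spark-testing-base DataFrame test, mixing in its DataFrameSuiteBase trait. The suite name, columns, and transformation here are illustrative, not from the thread:

    import com.holdenkarau.spark.testing.DataFrameSuiteBase
    import org.scalatest.funsuite.AnyFunSuite

    // DataFrameSuiteBase supplies a shared SparkSession (`spark`) for the
    // suite and DataFrame equality assertions for use in tests.
    class UppercaseSuite extends AnyFunSuite with DataFrameSuiteBase {
      test("upper-casing a column matches the expected output") {
        import spark.implicits._

        val input    = Seq("a", "b").toDF("value")
        val expected = Seq("A", "B").toDF("value")
        val result   = input.selectExpr("upper(value) AS value")

        // Fails with a row-level diff if the two DataFrames differ.
        assertDataFrameEquals(expected, result)
      }
    }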
On Thu, Apr 17, 2025 at 12:13 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:

> Have you used the new equality functions introduced in Spark 3.5?
>
> https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.testing.assertDataFrameEqual.html
>
> On Thu, Apr 17, 2025 at 1:18 PM, daniel williams <daniel.willi...@gmail.com> wrote:
>
>> Good call out. Yeah, once you take your work out of Spark, it's all on
>> you. Any partition-level operation (e.g. map, flatMap, foreach) ends up
>> as a lambda in Catalyst. I've found, however, that not using explode and
>> doing things procedurally at that point, with a sufficient amount of
>> unit testing, helps a ton with coverage and stability. If you're not
>> using spark-testing-base, I highly advise it. I've built many streaming
>> and processing topologies on Spark for several orgs now, and Holden's
>> framework has saved me more times than I can count.
>>
>> -dan
>>
>> On Thu, Apr 17, 2025 at 12:09 AM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>
>>> Just a quick note on working at the RDD level in Spark — once you go
>>> down to that level, it's entirely up to you to handle everything. You
>>> gain more control and flexibility, but Spark steps back and hands you
>>> the steering wheel. If tasks fail, it's usually because you're allowing
>>> them to — by not properly capturing or handling all potential errors.
>>>
>>> Last year, I worked on a Spark Streaming project that had to interact
>>> with various external systems (OpenSearch, Salesforce, Microsoft Office
>>> 365, a ticketing service, ...) through HTTP requests. We handled those
>>> interactions the way I described, and it's been running in production
>>> without issues since last summer.
>>>
>>> That said, a word of caution: while we didn't face any issues, some
>>> colleagues on a different project ran into a problem when calling a
>>> model inside a UDF. In certain cases — especially after using the
>>> built-in explode function — Spark ended up calling the model twice per
>>> row. Their workaround was to cache the DataFrame before invoking the
>>> model. (I don't think Spark speculation was enabled in their case, by
>>> the way.)
>>>
>>> I still need to dig deeper into the root cause, but just a heads-up —
>>> under certain conditions, Spark might trigger multiple executions. We
>>> also used the explode function in our project but didn't observe any
>>> duplicate calls... so for now, it remains a bit of a mystery.
>>>
>>> On Thu, Apr 17, 2025 at 2:02 AM, daniel williams (<daniel.willi...@gmail.com>) wrote:
>>>
>>>> Apologies; I was imagining a pure Kafka application with a
>>>> transactional consumer doing the processing. Ángel's suggestion is the
>>>> correct one: use mapPartitions to maintain state across your broadcast
>>>> Kafka producers and retry or introduce back pressure, letting each
>>>> producer retry in its thread pool with its built-in backoff. This
>>>> would allow you to retry n times and then, upon final (hopefully not)
>>>> failure, update your dataset for further processing.
>>>>
>>>> -dan
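A rough sketch of that pattern: broadcast the producer config (producers themselves are not serializable), build one producer per executor JVM, and do bounded, blocking retries inside mapPartitions. The topic name, retry count, sleep-based backoff, and stand-in dataset are illustrative assumptions; in practice the producer's own retries/backoff would also be tuned via its config:

    import java.util.Properties
    import scala.annotation.tailrec
    import scala.util.{Failure, Success, Try}
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.sql.SparkSession

    object PartitionPublish {
      // One producer per executor JVM, built lazily from broadcast config.
      private var producer: KafkaProducer[String, String] = _

      def getProducer(conf: Map[String, String]): KafkaProducer[String, String] =
        synchronized {
          if (producer == null) {
            val props = new Properties()
            conf.foreach { case (k, v) => props.put(k, v) }
            producer = new KafkaProducer[String, String](props)
          }
          producer
        }

      // Block on send so failures surface here; retry up to attemptsLeft times.
      @tailrec
      def sendWithRetry(p: KafkaProducer[String, String],
                        rec: ProducerRecord[String, String],
                        attemptsLeft: Int): Boolean =
        Try(p.send(rec).get()) match {
          case Success(_) => true
          case Failure(_) if attemptsLeft > 1 =>
            Thread.sleep(500) // crude pause between attempts (illustrative)
            sendWithRetry(p, rec, attemptsLeft - 1)
          case Failure(_) => false
        }
    }

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val kafkaConf = spark.sparkContext.broadcast(Map(
      "bootstrap.servers" -> "broker:9092",
      "key.serializer"    -> "org.apache.kafka.common.serialization.StringSerializer",
      "value.serializer"  -> "org.apache.kafka.common.serialization.StringSerializer"))

    val df = Seq("m1", "m2").toDF("value") // stand-in for the real dataset

    // Keep the success flag in the result, so rows that still failed after
    // n attempts remain in the dataset for further processing.
    val results = df.as[String].mapPartitions { rows =>
      val p = PartitionPublish.getProducer(kafkaConf.value)
      rows.map { payload =>
        val ok = PartitionPublish.sendWithRetry(
          p, new ProducerRecord[String, String]("events", payload), 3)
        (payload, ok)
      }
    }.toDF("payload", "succeeded")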
>>>>
>>>> On Wed, Apr 16, 2025 at 5:04 PM Abhishek Singla <abhisheksingla...@gmail.com> wrote:
>>>>
>>>>> @daniel williams <daniel.willi...@gmail.com>
>>>>>
>>>>> > Operate your producer in transactional mode.
>>>>> Could you elaborate more on this? How would I achieve it with
>>>>> foreachPartition and an HTTP client?
>>>>>
>>>>> > Checkpointing is an abstract concept only applicable to streaming.
>>>>> That means checkpointing is not supported in batch mode, right?
>>>>>
>>>>> @angel.alvarez.pas...@gmail.com <angel.alvarez.pas...@gmail.com>
>>>>>
>>>>> > What about returning the HTTP codes as a dataframe result in the
>>>>> foreachPartition
>>>>> I believe you are referring to mapPartitions; that could be done. The
>>>>> main concern here is that the issue is not only with failures in the
>>>>> third system but also with task/job failures. I wanted to know if
>>>>> there is an existing way in Spark batch to checkpoint the
>>>>> already-processed rows of a partition when using foreachPartition or
>>>>> mapPartitions, so that they are not processed again when a task is
>>>>> rescheduled after a failure or the job is retriggered.
>>>>>
>>>>> Regards,
>>>>> Abhishek Singla
>>>>>
>>>>> On Wed, Apr 16, 2025 at 7:38 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>>>>
>>>>>> What about returning the HTTP codes as a dataframe result in the
>>>>>> foreachPartition, saving it to files/a table/whatever, and then
>>>>>> performing a join to discard the rows already processed OK when you
>>>>>> try again?
>>>>>>
>>>>>> On Wed, Apr 16, 2025 at 3:01 PM, Abhishek Singla (<abhisheksingla...@gmail.com>) wrote:
>>>>>>
>>>>>>> Hi Team,
>>>>>>>
>>>>>>> We are using foreachPartition to send dataset row data to a third
>>>>>>> system via an HTTP client. The operation is not idempotent. I want
>>>>>>> to ensure that in case of failures the previously processed data
>>>>>>> does not get processed again.
>>>>>>>
>>>>>>> Is there a way to checkpoint in Spark batch:
>>>>>>> 1. Checkpoint processed partitions, so that if there are 1000
>>>>>>> partitions and 100 were processed in the previous batch, they do
>>>>>>> not get processed again.
>>>>>>> 2. Checkpoint a partial partition in foreachPartition: if I have
>>>>>>> processed 100 records from a partition with 1000 total records, is
>>>>>>> there a way to checkpoint the offset so that those 100 do not get
>>>>>>> processed again?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Abhishek Singla

--
-dan
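A rough sketch of the join-based bookkeeping suggested in the thread: return the HTTP status per row from mapPartitions, persist it, and left-anti-join on the next run so rows that already succeeded are skipped. The paths, schema, and sendHttp stub are illustrative, and the results path is assumed to already exist from a previous run:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Input keyed by a stable id so separate runs can be compared.
    val input = spark.read.parquet("/data/requests")           // columns: id, payload

    // Discard rows that already returned 200 in an earlier run.
    val processedOk = spark.read.parquet("/data/send_results") // columns: id, status
      .where($"status" === 200)
    val todo = input.join(processedOk, Seq("id"), "left_anti")

    // Placeholder for the real (non-idempotent) HTTP call; returns the status code.
    def sendHttp(payload: String): Int = ???

    val results = todo.select($"id", $"payload").as[(String, String)]
      .mapPartitions(_.map { case (id, payload) => (id, sendHttp(payload)) })
      .toDF("id", "status")

    // Append this run's outcomes so the next attempt's anti-join sees them.
    results.write.mode("append").parquet("/data/send_results")

Note this bounds re-processing between job runs but not within a failed task: if a task dies mid-partition, Spark reruns the whole partition, so rows sent before the failure in that same attempt can still be re-sent unless the sink records each row's outcome as it happens.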