@Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> Thanks. However, I am thinking of another solution that does not involve saving the DataFrame result. I will update this thread with details soon.
@daniel williams <daniel.willi...@gmail.com> Thanks, I will surely check spark-testing-base out.

Regards,
Abhishek Singla

On Thu, Apr 17, 2025 at 11:59 PM daniel williams <daniel.willi...@gmail.com> wrote:

> I have not. Most of my work and development on Spark has been on the Scala side of the house, and I've built a suite of tools for Kafka integration with Spark for stream analytics, along with spark-testing-base <https://github.com/holdenk/spark-testing-base>.
>
> On Thu, Apr 17, 2025 at 12:13 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>
>> Have you used the new equality functions introduced in Spark 3.5?
>>
>> https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.testing.assertDataFrameEqual.html
>>
>> El jue, 17 abr 2025, 13:18, daniel williams <daniel.willi...@gmail.com> escribió:
>>
>>> Good call out. Yeah, once you take your work out of Spark it's all on you. Any partition-level operation (e.g. map, flatMap, foreach) ends up as a lambda in Catalyst. I've found, however, that not using explode and doing things procedurally at that point, with a sufficient amount of unit testing, helps a ton with coverage and stability. If you're not using spark-testing-base, I highly advise it. I've built many streaming and processing topologies on Spark for several orgs now, and Holden's framework has saved me more times than I can count.
>>>
>>> -dan
>>>
>>> On Thu, Apr 17, 2025 at 12:09 AM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>>
>>>> Just a quick note on working at the RDD level in Spark — once you go down to that level, it's entirely up to you to handle everything. You gain more control and flexibility, but Spark steps back and hands you the steering wheel. If tasks fail, it's usually because you're allowing them to — by not properly capturing or handling all potential errors.
>>>>
>>>> Last year, I worked on a Spark Streaming project that had to interact with various external systems (OpenSearch, Salesforce, Microsoft Office 365, a ticketing service, ...) through HTTP requests. We handled those interactions the way I described, and it has been running in production without issues since last summer.
>>>>
>>>> That said, a word of caution: while we didn't face any issues, some colleagues on a different project ran into a problem when calling a model inside a UDF. In certain cases — especially after using the built-in explode function — Spark ended up calling the model twice per row. Their workaround was to cache the DataFrame before invoking the model. (I don't think Spark speculation was enabled in their case, by the way.)
>>>>
>>>> I still need to dig deeper into the root cause, but just a heads-up — under certain conditions, Spark might trigger multiple executions. We also used the explode function in our project, but didn't observe any duplicate calls... so for now, it remains a bit of a mystery.
>>>>
>>>> El jue, 17 abr 2025 a las 2:02, daniel williams (<daniel.willi...@gmail.com>) escribió:
>>>>
>>>>> Apologies. I was imagining a pure Kafka application and a transactional consumer on processing. Angel's suggestion is the correct one: use mapPartitions to maintain state across your broadcast Kafka producers, to retry or introduce back pressure, retrying in the producer's thread pool given its built-in backoff. This would allow you to retry n times and then, upon final (hopefully not) failure, update your dataset for further processing.
>>>>>
>>>>> -dan
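[Editor's note: a minimal Scala sketch of the retry-inside-mapPartitions pattern described in the message above. The endpoint URL, the (id, payload) input shape, maxRetries and the 500 ms backoff are illustrative assumptions, not details from this thread.]

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}
    import java.time.Duration
    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Outcome(id: String, httpStatus: Int, attempts: Int)

    def callWithRetries(spark: SparkSession,
                        input: Dataset[(String, String)],   // (id, payload)
                        endpointUrl: String,
                        maxRetries: Int = 3): Dataset[Outcome] = {
      import spark.implicits._
      input.mapPartitions { rows =>
        // One HTTP client per partition, reused for every row in it.
        val client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(10)).build()
        rows.map { case (id, payload) =>
          var attempt = 0
          var status = -1
          while (attempt < maxRetries && status != 200) {
            attempt += 1
            try {
              val request = HttpRequest.newBuilder(URI.create(endpointUrl))
                .timeout(Duration.ofSeconds(10))
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build()
              status = client.send(request, HttpResponse.BodyHandlers.ofString()).statusCode()
            } catch {
              case _: Exception => status = -1
            }
            // Simple linear backoff before the next attempt.
            if (status != 200 && attempt < maxRetries) Thread.sleep(500L * attempt)
          }
          // Failures are returned as data instead of thrown, so one bad row does not
          // fail (and re-run) the whole task.
          Outcome(id, status, attempt)
        }
      }
    }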
>>>>> On Wed, Apr 16, 2025 at 5:04 PM Abhishek Singla <abhisheksingla...@gmail.com> wrote:
>>>>>
>>>>>> @daniel williams <daniel.willi...@gmail.com>
>>>>>>
>>>>>> > Operate your producer in transactional mode.
>>>>>> Could you elaborate more on this? How can this be achieved with foreachPartition and an HTTP client?
>>>>>>
>>>>>> > Checkpointing is an abstract concept only applicable to streaming.
>>>>>> That means checkpointing is not supported in batch mode, right?
>>>>>>
>>>>>> @angel.alvarez.pas...@gmail.com <angel.alvarez.pas...@gmail.com>
>>>>>>
>>>>>> > What about returning the HTTP codes as a dataframe result in the foreachPartition
>>>>>> I believe you are referring to mapPartitions; that could be done. The main concern here is that the issue is not only with failures due to the third-party system but also with task/job failures. I wanted to know if there is an existing way in Spark batch to checkpoint already processed rows of a partition when using foreachPartition or mapPartitions, so that they are not processed again on rescheduling of a task due to failure or retriggering of the job due to failures.
>>>>>>
>>>>>> Regards,
>>>>>> Abhishek Singla
>>>>>>
>>>>>> On Wed, Apr 16, 2025 at 7:38 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>>>>>
>>>>>>> What about returning the HTTP codes as a dataframe result in the foreachPartition, saving it to files/a table/whatever, and then performing a join to discard the already-OK processed rows when you try it again?
>>>>>>>
>>>>>>> El mié, 16 abr 2025 a las 15:01, Abhishek Singla (<abhisheksingla...@gmail.com>) escribió:
>>>>>>>
>>>>>>>> Hi Team,
>>>>>>>>
>>>>>>>> We are using foreachPartition to send dataset row data to a third-party system via an HTTP client. The operation is not idempotent. I want to ensure that in case of failures the previously processed dataset does not get processed again.
>>>>>>>>
>>>>>>>> Is there a way to checkpoint in Spark batch:
>>>>>>>> 1. Checkpoint processed partitions, so that if there are 1000 partitions and 100 were processed in the previous batch, they do not get processed again.
>>>>>>>> 2. Checkpoint a partial partition in foreachPartition: if I have processed 100 records from a partition that has 1000 records in total, is there a way to checkpoint an offset so that those 100 do not get processed again?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Abhishek Singla
>>>>>>>
>
> --
> -dan
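[Editor's note: for completeness, a sketch of the "save the HTTP codes, then join to skip already-OK rows" idea quoted above, reusing the hypothetical callWithRetries from the earlier sketch. The statusPath Parquet location, the id/payload columns and the endpoint are assumptions for illustration only.]

    import org.apache.spark.sql.{DataFrame, SparkSession}

    def runIncrementally(spark: SparkSession, input: DataFrame, statusPath: String): Unit = {
      import spark.implicits._

      // Rows already confirmed OK by a previous run (empty on the very first run).
      val alreadyOk: DataFrame =
        try spark.read.parquet(statusPath).filter($"httpStatus" === 200).select("id")
        catch { case _: org.apache.spark.sql.AnalysisException => Seq.empty[String].toDF("id") }

      // Drop rows that already succeeded, then call the external service for the rest.
      val pending = input.join(alreadyOk, Seq("id"), "left_anti")
      val outcomes = callWithRetries(
        spark,
        pending.select($"id", $"payload").as[(String, String)],
        "https://example.invalid/ingest")   // placeholder endpoint

      // Persist this run's statuses so a re-run or retried job skips the OK rows.
      outcomes.write.mode("append").parquet(statusPath)
    }

Whether the statuses land in Parquet, a table, or somewhere else is secondary; the point is that a re-run only sends the rows with no recorded 200, which is what the anti-join gives you.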