Re: Checkpointing in foreachPartition in Spark batch

2025-04-17 Thread Abhishek Singla
@Ángel Álvarez Pascua Thanks, however I am thinking of a different solution that does not involve saving the dataframe result. Will update this thread with details soon. @daniel williams Thanks, I will surely check out spark-testing-base. Regards, Abhishek Singla On Thu, Apr 17, 2025 at 11:59 

Re: Checkpointing in foreachPartition in Spark batch

2025-04-17 Thread daniel williams
I have not. Most of my work and development on Spark has been on the Scala side of the house, and I've built a suite of tools for Kafka integration with Spark for stream analytics, along with spark-testing-base On Thu, Apr 17, 2025 at 12:13 PM Ángel Ál

Re: Checkpointing in foreachPartition in Spark batch

2025-04-17 Thread Ángel Álvarez Pascua
Have you used the new equality functions introduced in Spark 3.5? https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.testing.assertDataFrameEqual.html On Thu, 17 Apr 2025, 13:18, daniel williams wrote: > Good call out. Yeah, once you take your work out of Spark it’s all on
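The `assertDataFrameEqual` helper referenced above (Spark 3.5+) compares two DataFrames, ignoring row order by default. A rough, pure-Python analogue of that default behaviour is sketched below so the semantics can be seen without a SparkSession; this is a simplification — the real API also checks schemas and supports approximate float comparison:

```python
def rows_match_any_order(actual, expected):
    """Rough analogue of pyspark.testing.assertDataFrameEqual's default:
    the two row collections hold the same multiset of rows, order ignored.
    Rows are modelled as plain dicts here (an assumption for the sketch)."""
    key = lambda r: sorted(r.items())
    return sorted(actual, key=key) == sorted(expected, key=key)

# Real usage (Spark 3.5+):
# from pyspark.testing import assertDataFrameEqual
# assertDataFrameEqual(actual_df, expected_df)   # raises on mismatch
```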

Re: Checkpointing in foreachPartition in Spark batch

2025-04-17 Thread daniel williams
Good call out. Yeah, once you take your work out of Spark it’s all on you. Any partition-level operation (e.g. map, flatMap, foreach) ends up as a lambda in Catalyst. I’ve found, however, that not using explode and doing things procedurally at this point, with a sufficient amount of unit testing, helps
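One concrete way to follow the "procedural plus unit testing" advice above is to extract the partition lambda into a named function that takes a plain iterator, so it can be tested without a SparkSession. A minimal sketch, with hypothetical names; the Spark wiring is shown in comments:

```python
def tag_with_status(rows, send):
    """mapPartitions-style function: for each row (a dict here, by
    assumption), yield (row, status) where `send` posts the row to the
    external system and returns an HTTP status code."""
    for row in rows:
        yield row, send(row)

# In Spark:   statuses = df.rdd.mapPartitions(lambda it: tag_with_status(it, send_http))
# In a test:  call it with a plain list and a stubbed `send` -- no Spark needed.
```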

Re: Checkpointing in foreachPartition in Spark batch

2025-04-16 Thread Ángel Álvarez Pascua
Just a quick note on working at the RDD level in Spark: once you go down to that level, it's entirely up to you to handle everything. You gain more control and flexibility, but Spark steps back and hands you the steering wheel. If tasks fail, it's usually because you're allowing them to, by not p

Re: Checkpointing in foreachPartition in Spark batch

2025-04-16 Thread Abhishek Singla
@daniel williams
> Operate your producer in transactional mode.
Could you elaborate more on this? How can this be achieved with foreachPartition and an HTTP client?
> Checkpointing is an abstract concept only applicable to streaming.
Does that mean checkpointing is not supported in batch mode? @ange

Re: Checkpointing in foreachPartition in Spark batch

2025-04-16 Thread daniel williams
Apologies. I was imagining a pure Kafka application with a transactional consumer on processing. Angel’s suggestion is the correct one: use mapPartitions to maintain state across your broadcast Kafka producers, to retry or introduce back pressure given your producer and retry in its threadpool giv
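The retry/back-pressure idea above can be sketched as a per-partition sender with bounded retries and a simple backoff. This is a pure-Python illustration, not the thread author's code: `send`, the status handling, and the row shape are all assumptions, and the Spark wiring appears in comments.

```python
import time

def send_with_retry(rows, send, max_tries=3, backoff_s=0.0):
    """Per-partition sender with bounded retries. `send` returns an
    HTTP status code (or raises on transport errors). Yields
    (row, final_status) so the outcome becomes inspectable data."""
    for row in rows:
        status = None
        for attempt in range(1, max_tries + 1):
            try:
                status = send(row)
                if status < 500:            # success or non-retryable error
                    break
            except Exception:
                status = None               # transport failure: retry
            if attempt < max_tries:
                time.sleep(backoff_s * attempt)   # crude back pressure
        yield row, status

# Spark wiring (statuses can then be persisted for later filtering):
# results = df.rdd.mapPartitions(lambda it: send_with_retry(it, send_http))
```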

Re: Checkpointing in foreachPartition in Spark batch

2025-04-16 Thread Ángel Álvarez Pascua
What about returning the HTTP codes as a dataframe result from the foreachPartition, saving it to files/a table/whatever, and then performing a join to discard the already-OK processed rows when you try it again? On Wed, 16 Apr 2025 at 15:01, Abhishek Singla (< abhisheksingla...@gmail.com>) wrote: >
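The save-then-join idea above amounts to an anti-join against the ids that already succeeded. A pure-Python sketch of that filter (row shape and field names are hypothetical), with the Spark equivalent in comments:

```python
def pending_rows(all_rows, ok_ids, id_field="id"):
    """Keep only rows whose id is not in the set of already-successful
    ids -- the plain-Python equivalent of a left_anti join."""
    return [r for r in all_rows if r[id_field] not in ok_ids]

# Spark equivalent, assuming statuses from a previous run were saved:
# ok = statuses.where("status = 200").select("id")
# retry_df = df.join(ok, on="id", how="left_anti")
```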

Re: Checkpointing in foreachPartition in Spark batch

2025-04-16 Thread daniel williams
Yes. Operate your producer in transactional mode. Checkpointing is an abstract concept only applicable to streaming. -dan On Wed, Apr 16, 2025 at 7:02 AM Abhishek Singla wrote: > Hi Team, > > We are using foreachPartition to send dataset row data to a third system via > HTTP client. The operatio
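For context, the foreachPartition pattern the original question describes looks roughly like the sketch below. All names are hypothetical and `send` stands in for an HTTP client call; the comments note why a mid-partition failure re-sends rows, which is what motivated the checkpointing question.

```python
def post_partition(rows, send):
    """Send every row of one partition to a third-party system.
    `send` posts one row (a dict, by assumption) and returns an
    HTTP status code."""
    failures = []
    for row in rows:
        if send(row) >= 400:
            failures.append(row)
    if failures:
        # Failing the task makes Spark retry the *whole* partition --
        # in batch mode there is no checkpoint of rows already sent.
        raise RuntimeError(f"{len(failures)} rows failed")

# Spark wiring:
# from functools import partial
# df.foreachPartition(partial(post_partition, send=post_via_http))
```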