@Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> Thanks. However, I am thinking of another solution that does not involve saving the DataFrame result. I will update this thread with details soon.
@daniel williams <daniel.willi...@gmail.com> Thanks, I will surely check spark-testing-base out.

Regards,
Abhishek Singla

On Thu, Apr 17, 2025 at 11:59 PM daniel williams <daniel.willi...@gmail.com> wrote:

> I have not. Most of my work and development on Spark has been on the Scala side of the house, and I've built a suite of tools for Kafka integration with Spark for stream analytics, along with spark-testing-base <https://github.com/holdenk/spark-testing-base>.
>
> On Thu, Apr 17, 2025 at 12:13 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>
>> Have you used the new equality functions introduced in Spark 3.5?
>>
>> https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.testing.assertDataFrameEqual.html
>>
>> El jue, 17 abr 2025, 13:18, daniel williams <daniel.willi...@gmail.com> escribió:
>>
>>> Good call out. Yeah, once you take your work out of Spark it's all on you. Any partition-level operation (e.g. map, flatMap, foreach) ends up as a lambda in Catalyst. I've found, however, that not using explode and doing things procedurally at that point, with a sufficient amount of unit testing, helps a ton with coverage and stability. If you're not using spark-testing-base, I highly advise it. I've built many streaming and processing topologies on Spark for several orgs now, and Holden's framework has saved me more times than I can count.
>>>
>>> -dan
>>>
>>> On Thu, Apr 17, 2025 at 12:09 AM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>>
>>>> Just a quick note on working at the RDD level in Spark — once you go down to that level, it's entirely up to you to handle everything. You gain more control and flexibility, but Spark steps back and hands you the steering wheel. If tasks fail, it's usually because you're allowing them to — by not properly capturing or handling all potential errors.
>>>>
>>>> Last year, I worked on a Spark Streaming project that had to interact with various external systems (OpenSearch, Salesforce, Microsoft Office 365, a ticketing service, ...) through HTTP requests. We handled those interactions the way I described, and it has been running in production without issues since last summer.
>>>>
>>>> That said, a word of caution: while we didn't face any issues, some colleagues on a different project ran into a problem when calling a model inside a UDF. In certain cases — especially after using the built-in explode function — Spark ended up calling the model twice per row. Their workaround was to cache the DataFrame before invoking the model. (I don't think Spark speculation was enabled in their case, by the way.)
>>>>
>>>> I still need to dig deeper into the root cause, but just a heads-up — under certain conditions, Spark might trigger multiple executions. We also used the explode function in our project, but didn't observe any duplicate calls... so for now, it remains a bit of a mystery.
>>>>
>>>> El jue, 17 abr 2025 a las 2:02, daniel williams (<daniel.willi...@gmail.com>) escribió:
>>>>
>>>>> Apologies. I was imagining a pure Kafka application and a transactional consumer on processing. Angel's suggestion is the correct one: use mapPartitions to maintain state across your broadcast Kafka producers, to retry or introduce back pressure, retrying in the producer's thread pool given its built-in backoff. This would allow you to retry n times and then, upon final (hopefully not) failure, update your dataset for further processing.
>>>>>
>>>>> -dan
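[Editor's note: a minimal Scala sketch of the retry-inside-mapPartitions pattern described in the message above. The endpoint URL, the (id, payload) input shape, maxRetries and the 500 ms backoff are illustrative assumptions, not details from this thread.]

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}
    import java.time.Duration
    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Outcome(id: String, httpStatus: Int, attempts: Int)

    def callWithRetries(spark: SparkSession,
                        input: Dataset[(String, String)],   // (id, payload)
                        endpointUrl: String,
                        maxRetries: Int = 3): Dataset[Outcome] = {
      import spark.implicits._
      input.mapPartitions { rows =>
        // One HTTP client per partition, reused for every row in it.
        val client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(10)).build()
        rows.map { case (id, payload) =>
          var attempt = 0
          var status = -1
          while (attempt < maxRetries && status != 200) {
            attempt += 1
            try {
              val request = HttpRequest.newBuilder(URI.create(endpointUrl))
                .timeout(Duration.ofSeconds(10))
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build()
              status = client.send(request, HttpResponse.BodyHandlers.ofString()).statusCode()
            } catch {
              case _: Exception => status = -1
            }
            // Simple linear backoff before the next attempt.
            if (status != 200 && attempt < maxRetries) Thread.sleep(500L * attempt)
          }
          // Failures are returned as data instead of thrown, so one bad row does not
          // fail (and re-run) the whole task.
          Outcome(id, status, attempt)
        }
      }
    }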
>>>>> On Wed, Apr 16, 2025 at 5:04 PM Abhishek Singla <abhisheksingla...@gmail.com> wrote:
>>>>>
>>>>>> @daniel williams <daniel.willi...@gmail.com>
>>>>>>
>>>>>> > Operate your producer in transactional mode.
>>>>>> Could you elaborate more on this? How can this be achieved with foreachPartition and an HTTP client?
>>>>>>
>>>>>> > Checkpointing is an abstract concept only applicable to streaming.
>>>>>> That means checkpointing is not supported in batch mode, right?
>>>>>>
>>>>>> @angel.alvarez.pas...@gmail.com <angel.alvarez.pas...@gmail.com>
>>>>>>
>>>>>> > What about returning the HTTP codes as a dataframe result in the foreachPartition
>>>>>> I believe you are referring to mapPartitions; that could be done. The main concern here is that the issue is not only with failures due to the third-party system but also with task/job failures. I wanted to know if there is an existing way in Spark batch to checkpoint already processed rows of a partition when using foreachPartition or mapPartitions, so that they are not processed again on rescheduling of a task due to failure or retriggering of the job due to failures.
>>>>>>
>>>>>> Regards,
>>>>>> Abhishek Singla
>>>>>>
>>>>>> On Wed, Apr 16, 2025 at 7:38 PM Ángel Álvarez Pascua <angel.alvarez.pas...@gmail.com> wrote:
>>>>>>
>>>>>>> What about returning the HTTP codes as a dataframe result in the foreachPartition, saving it to files/a table/whatever, and then performing a join to discard the already-OK processed rows when you try it again?
>>>>>>>
>>>>>>> El mié, 16 abr 2025 a las 15:01, Abhishek Singla (<abhisheksingla...@gmail.com>) escribió:
>>>>>>>
>>>>>>>> Hi Team,
>>>>>>>>
>>>>>>>> We are using foreachPartition to send dataset row data to a third-party system via an HTTP client. The operation is not idempotent. I want to ensure that in case of failures the previously processed dataset does not get processed again.
>>>>>>>>
>>>>>>>> Is there a way to checkpoint in Spark batch:
>>>>>>>> 1. Checkpoint processed partitions, so that if there are 1000 partitions and 100 were processed in the previous batch, they do not get processed again.
>>>>>>>> 2. Checkpoint a partial partition in foreachPartition: if I have processed 100 records from a partition that has 1000 records in total, is there a way to checkpoint an offset so that those 100 do not get processed again?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Abhishek Singla
>>>>>>>
>
> --
> -dan
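[Editor's note: for completeness, a sketch of the "save the HTTP codes, then join to skip already-OK rows" idea quoted above, reusing the hypothetical callWithRetries from the earlier sketch. The statusPath Parquet location, the id/payload columns and the endpoint are assumptions for illustration only.]

    import org.apache.spark.sql.{DataFrame, SparkSession}

    def runIncrementally(spark: SparkSession, input: DataFrame, statusPath: String): Unit = {
      import spark.implicits._

      // Rows already confirmed OK by a previous run (empty on the very first run).
      val alreadyOk: DataFrame =
        try spark.read.parquet(statusPath).filter($"httpStatus" === 200).select("id")
        catch { case _: org.apache.spark.sql.AnalysisException => Seq.empty[String].toDF("id") }

      // Drop rows that already succeeded, then call the external service for the rest.
      val pending = input.join(alreadyOk, Seq("id"), "left_anti")
      val outcomes = callWithRetries(
        spark,
        pending.select($"id", $"payload").as[(String, String)],
        "https://example.invalid/ingest")   // placeholder endpoint

      // Persist this run's statuses so a re-run or retried job skips the OK rows.
      outcomes.write.mode("append").parquet(statusPath)
    }

Whether the statuses land in Parquet, a table, or somewhere else is secondary; the point is that a re-run only sends the rows with no recorded 200, which is what the anti-join gives you.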