Apologies. I was imagining a pure Kafka application with a transactional consumer doing the processing. Angel's suggestion is the correct one: use mapPartitions to maintain state across your broadcast Kafka producers so you can retry or introduce back pressure, and retry in the producer's thread pool given the producer's built-in backoff. That would allow you to retry n times and then, upon final (hopefully not) failure, update your dataset for further processing.
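Something along these lines is what I have in mind; this is only a rough sketch, and the Payload/SendResult case classes, topic name, broker address, and producer settings below are placeholders, not anything from your job:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{Dataset, SparkSession}

case class Payload(key: String, value: String)
case class SendResult(key: String, success: Boolean, error: Option[String])

def sendAll(spark: SparkSession, input: Dataset[Payload]): Dataset[SendResult] = {
  import spark.implicits._
  input.mapPartitions { rows =>
    // One producer per partition; the producer itself retries with backoff
    // (retries / retry.backoff.ms) before send() ultimately fails.
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")   // placeholder brokers
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("retries", "5")
    props.put("retry.backoff.ms", "500")
    val producer = new KafkaProducer[String, String](props)

    val results = rows.map { p =>
      try {
        producer.send(new ProducerRecord("events", p.key, p.value)).get()
        SendResult(p.key, success = true, error = None)
      } catch {
        case e: Exception =>
          SendResult(p.key, success = false, error = Some(e.getMessage))
      }
    }.toList            // materialize before closing the producer
    producer.close()
    results.iterator
  }
}

Filtering the returned Dataset[SendResult] on success then gives you the rows that still need handling after the final failure.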
-dan

On Wed, Apr 16, 2025 at 5:04 PM Abhishek Singla <abhisheksingla...@gmail.com> wrote:

> @daniel williams <daniel.willi...@gmail.com>
>
> > Operate your producer in transactional mode.
> Could you elaborate more on this? How would this be achieved with
> foreachPartition and an HTTP client?
>
> > Checkpointing is an abstract concept only applicable to streaming.
> That means checkpointing is not supported in batch mode, right?
>
> @angel.alvarez.pas...@gmail.com <angel.alvarez.pas...@gmail.com>
>
> > What about returning the HTTP codes as a dataframe result in the
> > foreachPartition
> I believe you are referring to mapPartitions; that could be done. The main
> concern here is that the issue is not only with failures due to the third
> system but also with task/job failures. I wanted to know if there is an
> existing way in Spark batch to checkpoint already processed rows of a
> partition when using foreachPartition or mapPartitions, so that they are
> not processed again when a task is rescheduled due to failure or the job
> is retriggered due to failures.
>
> Regards,
> Abhishek Singla
>
> On Wed, Apr 16, 2025 at 7:38 PM Ángel Álvarez Pascua <
> angel.alvarez.pas...@gmail.com> wrote:
>
>> What about returning the HTTP codes as a dataframe result in the
>> foreachPartition, saving it to files/a table/whatever, and then
>> performing a join to discard the already successfully processed rows
>> when you try again?
>>
>> On Wed, 16 Apr 2025 at 15:01, Abhishek Singla (<
>> abhisheksingla...@gmail.com>) wrote:
>>
>>> Hi Team,
>>>
>>> We are using foreachPartition to send dataset row data to a third
>>> system via an HTTP client. The operation is not idempotent. I want to
>>> ensure that in case of failures the previously processed data does not
>>> get processed again.
>>>
>>> Is there a way to checkpoint in Spark batch:
>>> 1. Checkpoint processed partitions, so that if there are 1000
>>> partitions and 100 were processed in the previous batch, they do not
>>> get processed again.
>>> 2. Checkpoint a partial partition in foreachPartition: if I have
>>> processed 100 records from a partition which has 1000 records in total,
>>> is there a way to checkpoint the offset so that those 100 do not get
>>> processed again?
>>>
>>> Regards,
>>> Abhishek Singla
>>>
>>
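P.S. A rough sketch of Ángel's bookkeeping idea, in case it helps. The id/payload/status columns, the results path, and the endpoint URL are assumptions, and java.net.http just stands in for whatever HTTP client you already use in foreachPartition. This narrows reprocessing to rows never recorded as successful, though a retried task can still re-send rows from its own failed attempt:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import org.apache.spark.sql.{DataFrame, SparkSession}

object HttpBatchWithBookkeeping {

  // Drop rows whose id was already recorded with HTTP 200 in a previous run.
  def pending(input: DataFrame, previousResults: DataFrame): DataFrame = {
    val ok = previousResults.filter(previousResults("status") === 200).select("id")
    input.join(ok, Seq("id"), "left_anti")
  }

  // Send the remaining rows and persist one (id, status) row per record so
  // the next run can anti-join against it.
  def sendAndRecord(spark: SparkSession, toSend: DataFrame, resultsPath: String): Unit = {
    import spark.implicits._
    val results = toSend.select("id", "payload").as[(String, String)]
      .mapPartitions { rows =>
        val client = HttpClient.newHttpClient()   // one client per partition
        rows.map { case (id, payload) =>
          val status =
            try {
              val req = HttpRequest
                .newBuilder(URI.create("https://third-system.example/api")) // placeholder endpoint
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build()
              client.send(req, HttpResponse.BodyHandlers.discarding()).statusCode()
            } catch {
              case _: Exception => -1              // mark transport failures
            }
          (id, status)
        }
      }
      .toDF("id", "status")

    results.write.mode("append").parquet(resultsPath)
  }
}

On the first run previousResults can be an empty DataFrame with id/status columns (or the anti-join can be skipped); on a re-run, read resultsPath back and pass it to pending() before calling sendAndRecord again.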