Re:Re: Re: Long Initial Checkpoint Start Delay in PyFlink FlatMap Operator

Hirson Zhang Mon, 19 May 2025 01:04:45 -0700

Hello Fu, I have tried "unalign checkpoint" before I ask, but it seems to 
effect only in "AT_LEAST_ONCE" mode.


in my thought, "AT_LEAST_ONCE" mode makes slower checkpointing right?

And I restart the job still, the first checkpoint is slower indeed.




















At 2025-05-16 13:52:44, "Dian Fu" <[email protected]> wrote:
>It uses aligned checkpoint by default in Flink which needs to process
>all the data buffered in the pipeline(network and operators) during
>checkpointing. In your use case, as the process speed is very slow and
>so it may take too long to process the buffered data. You could try to
>enable unalign checkpoint (via configuration
>`execution.checkpointing.unaligned.enabled` which is false by default)
>and lower the PyFlink bundle size (via configuration
>python.fn-execution.bundle.size which is 1000 by default).
>
>
>On Wed, May 14, 2025 at 1:53 PM Hirson Zhang <[email protected]> wrote:
>>
>> Hello,
>>
>> I tried the configuration you mentioned, but it doesn't seem to work. Still, 
>> thank you for your response!
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> At 2025-05-13 17:54:03, "Sharath" <[email protected]> wrote:
>> >Hello,
>> >
>> >Have you tried enabling the buffer debloating feature to improve checkpoint
>> >times? Refer taskmanager.network.memory.buffer-debloat.enabled in
>> >https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/
>> >
>> >Regards,
>> >Sharath
>> >
>> >On Tue, May 13, 2025 at 1:59 AM 张河川 <[email protected]> wrote:
>> >
>> >> Hi Flink community,
>> >>
>> >> I’m encountering an issue with PyFlink where a FlatMap operator invokes an
>> >> external service (using a PyTorch model to generate embedding vectors). 
>> >> The
>> >> operator processes data very slowly, leading to an extremely long initial
>> >> checkpoint start delay, which eventually causes checkpoint failures.The
>> >> external service has strict concurrency limits and cannot handle increased
>> >> parallel requests,increasing the parallelism of the operator did not
>> >> improve performance due to this bottleneck.
>> >>
>> >> Besides, when I use flink1.20.0, the operator processing speed seems to be
>> >> faster than that of flink2.0.0.
>> >>
>> >> Does anyone have any clue?
>> >>
>> >> Thank you for your insights!

Re:Re: Re: Long Initial Checkpoint Start Delay in PyFlink FlatMap Operator

Reply via email to