Hello Fu, I have tried "unalign checkpoint" before I ask, but it seems to effect only in "AT_LEAST_ONCE" mode.
in my thought, "AT_LEAST_ONCE" mode makes slower checkpointing right? And I restart the job still, the first checkpoint is slower indeed. At 2025-05-16 13:52:44, "Dian Fu" <dian0511...@gmail.com> wrote: >It uses aligned checkpoint by default in Flink which needs to process >all the data buffered in the pipeline(network and operators) during >checkpointing. In your use case, as the process speed is very slow and >so it may take too long to process the buffered data. You could try to >enable unalign checkpoint (via configuration >`execution.checkpointing.unaligned.enabled` which is false by default) >and lower the PyFlink bundle size (via configuration >python.fn-execution.bundle.size which is 1000 by default). > > >On Wed, May 14, 2025 at 1:53 PM Hirson Zhang <milesian...@163.com> wrote: >> >> Hello, >> >> I tried the configuration you mentioned, but it doesn't seem to work. Still, >> thank you for your response! >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> At 2025-05-13 17:54:03, "Sharath" <dsaishar...@gmail.com> wrote: >> >Hello, >> > >> >Have you tried enabling the buffer debloating feature to improve checkpoint >> >times? Refer taskmanager.network.memory.buffer-debloat.enabled in >> >https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/ >> > >> >Regards, >> >Sharath >> > >> >On Tue, May 13, 2025 at 1:59 AM 张河川 <milesian...@163.com> wrote: >> > >> >> Hi Flink community, >> >> >> >> I’m encountering an issue with PyFlink where a FlatMap operator invokes an >> >> external service (using a PyTorch model to generate embedding vectors). >> >> The >> >> operator processes data very slowly, leading to an extremely long initial >> >> checkpoint start delay, which eventually causes checkpoint failures.The >> >> external service has strict concurrency limits and cannot handle increased >> >> parallel requests,increasing the parallelism of the operator did not >> >> improve performance due to this bottleneck. >> >> >> >> Besides, when I use flink1.20.0, the operator processing speed seems to be >> >> faster than that of flink2.0.0. >> >> >> >> Does anyone have any clue? >> >> >> >> Thank you for your insights!