Hi Ashish,

Can you check a few things?

1. Is your source broker count also 20 for both topics?
2. You can try increasing the state operation memory and reducing the disk I/O:
   - Increase the number of CU resources in a single slot.
   - Set optimization parameters (a config sketch follows this list):
     - taskmanager.memory.managed.fraction=x
     - state.backend.rocksdb.block.cache-size=x
     - state.backend.rocksdb.writebuffer.size=x
3. If possible, try a left window join for your streams (see the SQL sketch below the list).
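For point 2, here is a minimal flink-conf.yaml sketch. The values are only
illustrative placeholders for the "x" above and would need to be tuned
against your 72 GB TaskManagers; also, as far as I know the two RocksDB
sizes only take effect when managed memory for RocksDB is turned off, so
please double-check that against the docs for 1.17:

    # flink-conf.yaml -- illustrative values, not a recommendation
    taskmanager.memory.managed.fraction: 0.6        # default is 0.4; more managed memory = larger RocksDB cache
    state.backend.rocksdb.block.cache-size: 512mb   # shared block cache size
    state.backend.rocksdb.writebuffer.size: 128mb   # bigger write buffers = fewer flushes, less disk I/O
    # state.backend.rocksdb.memory.managed: false   # likely required for the two sizes above to be honoured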
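For point 3, a rough Flink SQL sketch of what I mean. The table and column
names (stream_a, stream_b, id, ts) are made up and the one-hour bound is
just an example; the idea is that an interval join lets Flink drop join
state once the time bound has passed, while a plain left join keeps both
sides in RocksDB forever. This only works if ts is a time attribute on both
tables and the bounded-time semantics fit your use case:

    -- regular left join: state for both inputs grows without bound
    SELECT *
    FROM stream_a a
    LEFT JOIN stream_b b ON a.id = b.id;

    -- interval ("windowed") left join: state can be cleaned up after the bound
    SELECT *
    FROM stream_a a
    LEFT JOIN stream_b b
      ON a.id = b.id
     AND b.ts BETWEEN a.ts - INTERVAL '1' HOUR AND a.ts + INTERVAL '1' HOUR;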
Please share what sink you are using. Also, could you share the per-operator, source, and sink throughput, if possible?

On Mon, Jun 24, 2024 at 3:32 PM Ashish Khatkar via user <user@flink.apache.org> wrote:

> Hi all,
>
> We are facing backpressure in the Flink SQL job from the sink, and the
> backpressure only comes from a single task. This causes the checkpoint to
> fail despite enabling unaligned checkpoints and using buffer debloating.
> We enabled the flame graph, and the task spends most of its time doing
> RocksDB get and put. The SQL job does a left join over two streams with a
> parallelism of 20. The topics hold 540 GB of data for one topic and
> roughly 60 GB for the second. We are running 20 taskmanagers with 1 slot
> each, with each taskmanager having 72 GB of memory and 9 CPUs.
> Can you provide any help on how to go about fixing the pipeline? We are
> using Flink 1.17.2. The issue is similar to this stackoverflow thread
> <https://stackoverflow.com/questions/77762119/flink-sql-job-stops-with-backpressure-after-a-week-of-execution>,
> but instead of after a week, it starts facing backpressure as soon as the
> lag comes down to 4-5%.
>
> [image: image.png]
>