Hi Ashish,
Can you check a few things?
1. Is your source Kafka partition count also 20 for both topics, so that
all 20 subtasks receive data?
2. You can try increasing the state operation memory and reducing disk
I/O:
   - Increase the CPU and memory resources available to a single slot.
   - Set optimization parameters (illustrative values in the example
   after this list):
      - taskmanager.memory.managed.fraction=x
      - state.backend.rocksdb.block.cache-size=x
      - state.backend.rocksdb.writebuffer.size=x
3. If possible, use an interval (windowed) left join for your streams;
see the sketch below.
4. Please share what sink you are using, and also the per-operator,
source, and sink throughput, if possible.
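
For example, in flink-conf.yaml (a minimal sketch; the values below are
illustrative starting points for your ~72 GB TaskManagers, not tuned
recommendations):

   # Give RocksDB a larger share of managed memory (default is 0.4).
   taskmanager.memory.managed.fraction: 0.6
   # Bigger block cache -> fewer disk reads on get().
   state.backend.rocksdb.block.cache-size: 256mb
   # Bigger write buffers -> fewer flushes and compactions on put().
   state.backend.rocksdb.writebuffer.size: 128mb

Note that with the default state.backend.rocksdb.memory.managed=true,
RocksDB memory is sized from the managed fraction, and the two explicit
RocksDB options above only take effect if you disable managed memory
control for RocksDB.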

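If the join can be time-bounded, an interval join keeps only a bounded
window of rows in RocksDB state instead of retaining both streams in
full. A minimal sketch, assuming hypothetical tables orders and
shipments with event-time attributes order_time and ship_time (your
schemas will differ):

   SELECT o.id, o.order_time, s.ship_time
   FROM orders o
   LEFT JOIN shipments s
     -- equality join key
     ON o.id = s.order_id
     -- the time bound is what allows Flink to expire join state
     AND s.ship_time BETWEEN o.order_time
         AND o.order_time + INTERVAL '4' HOUR;

If an interval join is not possible, table.exec.state.ttl can at least
bound state growth for the regular join, at the cost of dropping
matches older than the TTL.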

On Mon, Jun 24, 2024 at 3:32 PM Ashish Khatkar via user <
user@flink.apache.org> wrote:

> Hi all,
>
> We are facing backpressure in the Flink SQL job from the sink, and the
> backpressure comes from only a single task. This causes checkpoints to
> fail despite enabling unaligned checkpoints and using buffer debloating.
> We enabled the flame graph, and the task spends most of its time doing
> RocksDB get and put. The SQL job does a left join over two streams with
> a parallelism of 20. The topics hold 540 GB of data in one topic and
> roughly 60 GB in the second. We are running 20 TaskManagers with 1 slot
> each, and each TaskManager has 72 GB of memory and 9 CPUs.
> Can you provide any help on how to go about fixing the pipeline? We are
> using Flink 1.17.2. The issue is similar to this Stack Overflow thread
> <https://stackoverflow.com/questions/77762119/flink-sql-job-stops-with-backpressure-after-a-week-of-execution>,
> except that instead of taking a week, our job starts facing backpressure
> as soon as the lag comes down to 4-5%.
>
> [image: image.png]
>
