Hi Jake,

Thanks for the suggestion. I'm actually using PyFlink, and it seems that the flame graph can only account for Java methods. Are there any other ways to debug this?

I'm also curious about network buffer tuning. Given that there's a surge of input data at times while at other times there's a more or less constant flow of data, are there any specific network buffer settings I could try tweaking to observe the effect?
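For context, these are the kinds of options I was planning to experiment with. The values below are just placeholders on my end (not recommendations), and I understand that on a real cluster the memory/buffer options normally belong in flink-conf.yaml rather than in the job code:

    from pyflink.common import Configuration
    from pyflink.datastream import StreamExecutionEnvironment

    # Placeholder values to experiment with -- on a standalone or session cluster these
    # network options are TaskManager-level and belong in flink-conf.yaml; passing them
    # here mainly affects local/mini-cluster runs.
    config = Configuration()
    config.set_string("taskmanager.memory.network.fraction", "0.15")                 # default 0.1
    config.set_string("taskmanager.network.memory.buffers-per-channel", "4")         # default 2
    config.set_string("taskmanager.network.memory.floating-buffers-per-gate", "16")  # default 8

    env = StreamExecutionEnvironment.get_execution_environment(config)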
________________________________
From: Jake.zhang <ft20...@qq.com>
Sent: Tuesday, October 22, 2024 2:40 PM
To: Raihan Sunny <raihan.su...@selisegroup.com>; user <user@flink.apache.org>
Subject: Re: Backpressure causing operators to stop ingestion completely

Hi, you can use flame graphs to debug:
https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/ops/debugging/flame_graphs/

------------------ Original ------------------
From: "Raihan Sunny" <user@flink.apache.org>
Date: Tue, Oct 22, 2024 01:00 PM
To: "user" <user@flink.apache.org>
Subject: Backpressure causing operators to stop ingestion completely

Hello,

I have an aggregator job that experiences backpressure after running for a while and completely stops processing. It doesn't take any further input from the source. Here's a bit of context:

- There are 3 producer jobs, all of which write data to a common Kafka topic
- The aggregator job reads from that topic, aggregates the data once it finds 3 matching records for the same ID, and emits a single record into another Kafka topic
- The logic is quite simple: the aggregator stores the data in its state for each ID until all 3 matches arrive, and then the state is cleared for that ID
- 2 of the producer jobs emit records at an almost constant pace, while the 3rd producer emits data in chunks of variable length. The emission of data is also not periodic for this 3rd producer; it may hold up to 15 seconds' worth of data in one instance and 2 minutes' worth in another, for example

After running for a while, the Kafka source and the key-by operator following it are at 100% backpressure, and the actual aggregator operator is shown in the UI as 100% busy even though it's doing absolutely nothing. All the jobs have a parallelism of 1.

If anything, I think it's the 3rd producer job, which emits data chunks of variable length, that is causing the problem. However, I can't think of a valid explanation so far from what I've read about Flink's backpressure handling and how it's supposed to work.

Can anyone please provide some pointers as to where I might look to figure out the issue?

Thanks,
Sunny
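P.S. For concreteness, the aggregation logic is roughly the following simplified sketch, keyed by the record ID (class, field, and type names here are made up for illustration, not the actual code):

    from pyflink.common import Types
    from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
    from pyflink.datastream.state import ListStateDescriptor

    class ThreeWayAggregator(KeyedProcessFunction):
        """Buffers payloads per ID until all 3 producers have contributed, then emits once."""

        def open(self, runtime_context: RuntimeContext):
            self.pending = runtime_context.get_list_state(
                ListStateDescriptor("pending", Types.STRING()))

        def process_element(self, value, ctx):
            # value is assumed to be a (record_id, payload) tuple; the stream is keyed by record_id
            self.pending.add(value[1])
            buffered = list(self.pending.get())
            if len(buffered) == 3:
                # All 3 matching records have arrived: emit one aggregated record and clear the state
                yield (ctx.get_current_key(), "|".join(buffered))
                self.pending.clear()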