Hi Jake,

Thanks for the suggestion. I'm actually using PyFlink, and it seems that the 
flame graph can only account for Java methods. Are there any other ways to 
debug this? I was also curious about network buffer tuning. Given that there's 
a surge of input data at times, while at other times the flow of data is more 
or less constant, are there any specific network buffer settings that I might 
try tweaking to observe the changes?
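For reference, these are the knobs I've come across in the docs so far (set in 
flink-conf.yaml; the values below are just the documented defaults, plus buffer 
debloating switched on, nothing I've verified myself yet):

    taskmanager.memory.network.fraction: 0.1
    taskmanager.network.memory.buffers-per-channel: 2
    taskmanager.network.memory.floating-buffers-per-gate: 8
    # buffer debloating (available since 1.14), meant for fluctuating throughput
    taskmanager.network.memory.buffer-debloat.enabled: true
    taskmanager.network.memory.buffer-debloat.target: 1s

Would any of these be a sensible place to start, or is there something else I 
should look at first?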



________________________________
From: Jake.zhang <ft20...@qq.com>
Sent: Tuesday, October 22, 2024 2:40 PM
To: Raihan Sunny <raihan.su...@selisegroup.com>; user <user@flink.apache.org>
Subject: Re:Backpressure causing operators to stop ingestion completely


Hi, you can use flame graphs to debug this:

https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/ops/debugging/flame_graphs/

------------------ Original ------------------
From: "Raihan Sunny" <user@flink.apache.org>;
Date: Tue, Oct 22, 2024 01:00 PM
To: "user"<user@flink.apache.org>;
Subject: Backpressure causing operators to stop ingestion completely

Hello,

I have an aggregator job that experiences backpressure after running for a 
while and completely stops processing. It doesn't take any further input from 
the source. Here's a bit of context:
- There are 3 producer jobs, all of which write data to a common Kafka topic
- The aggregator job reads in from that topic, aggregates the data once it 
finds 3 matching records for the same ID and emits a single record into another 
Kafka topic
- The logic is quite simple: the aggregator stores the data in its state for 
each ID until all 3 matches arrive, and then the state is cleared for that ID 
(a simplified sketch follows this list)
- 2 of the producer jobs emit records at an almost constant pace, while the 3rd 
producer emits data in chunks of variable length. The emission is also not 
periodic for this 3rd producer; it may hold, for example, 15 seconds' worth of 
data in one instance and 2 minutes' worth in another
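Roughly, the aggregation looks like the following (a simplified sketch; the 
actual field names, types and serialization differ):

    from pyflink.common.typeinfo import Types
    from pyflink.datastream import KeyedProcessFunction, RuntimeContext
    from pyflink.datastream.state import ListStateDescriptor

    class Aggregate(KeyedProcessFunction):
        """Buffers records per ID and emits one combined record once 3 arrive."""

        def open(self, runtime_context: RuntimeContext):
            self.buffer = runtime_context.get_list_state(
                ListStateDescriptor("buffer", Types.STRING()))

        def process_element(self, value, ctx: KeyedProcessFunction.Context):
            # value is (id, payload); the stream is keyed by id upstream:
            #   stream.key_by(lambda r: r[0]).process(Aggregate())
            self.buffer.add(value[1])
            records = list(self.buffer.get())
            if len(records) == 3:
                self.buffer.clear()
                yield ctx.get_current_key(), records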

After running for a while, the Kafka source and the key-by operator following 
it are at 100% backpressure and the actual aggregator operator is shown in the 
UI as 100% busy even though it's doing absolutely nothing. All the jobs have a 
parallelism of 1.

If anything, I think it's the 3rd producer job, which emits data chunks of 
variable length, that is causing the problem. However, I can't come up with a 
valid explanation based on what I've read about Flink's backpressure handling 
and how it's supposed to work. Can anyone please provide some pointers as to 
where I might look to figure out the issue?


Thanks,
Sunny
