Hi Antonis,

Did you try to profile the “bad” taskmanager to see what the task thread was 
busy doing?

Another possible culprit might be GC, if you haven't checked that already. I’ve 
seen GC threads eating up 30% of CPU.
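
In practice, a couple of jstack dumps or async-profiler against the taskmanager 
pid is the simplest way to check both. Just to illustrate what to look for, a 
rough sketch of my own (not something Flink provides; it only measures the JVM 
it runs in, so it would have to run inside the taskmanager to be meaningful) 
that prints per-thread CPU usage and GC time over a short window:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

public class CpuAndGcSample {

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        long gcBefore = totalGcMillis();
        Map<Long, Long> cpuBefore = new HashMap<>();
        for (long id : threads.getAllThreadIds()) {
            cpuBefore.put(id, threads.getThreadCpuTime(id));
        }

        Thread.sleep(5_000); // sampling window of 5 seconds

        for (long id : threads.getAllThreadIds()) {
            Long before = cpuBefore.get(id);
            ThreadInfo info = threads.getThreadInfo(id);
            if (before == null || before < 0 || info == null) {
                continue; // thread started during the window, or CPU time unsupported
            }
            long deltaNanos = threads.getThreadCpuTime(id) - before;
            if (deltaNanos > 0) {
                // nanoseconds of CPU over a 5 s window -> percent of one core
                System.out.printf("%-50s %5.1f%% cpu%n", info.getThreadName(), deltaNanos / 5e7);
            }
        }
        System.out.printf("GC time during window: %d ms%n", totalGcMillis() - gcBefore);
    }

    private static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += gc.getCollectionTime();
        }
        return total;
    }
}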

Best,
Paul Lam

> On 14 Dec 2020, at 06:24, Antonis Papaioannou <papai...@ics.forth.gr> wrote:
> 
> Hi,
> 
> I am experiencing strange behaviour with our Flink application, so I created a 
> very simple sample application to demonstrate the problem.
> A simple Flink application reads data from Kafka, performs a simple 
> transformation and accesses an external Redis database to read data within a 
> FlatMap operator. When running the application with parallelism higher than 
> 1, there is unexpectedly high latency on only one operator instance (the 
> “bad” instance is not always the same; it is randomly “selected” across 
> multiple runs) that accesses the external database. There are multiple Redis 
> instances, all running in standalone mode, so each Redis request is served by 
> the local instance. To demonstrate that the latency is not related to Redis, 
> I completely removed the database access and simulated its latency with a 
> sleep of about 0.1 ms, resulting in the same strange behaviour.
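> 
> To make the setup concrete, the job is essentially the following (a minimal 
> sketch for illustration only; the topic name, bootstrap servers, key 
> selection and parallelism below are placeholders, not our actual values):
> 
> import java.util.Properties;
> import java.util.concurrent.locks.LockSupport;
> 
> import org.apache.flink.api.common.functions.FlatMapFunction;
> import org.apache.flink.api.common.functions.MapFunction;
> import org.apache.flink.api.common.serialization.SimpleStringSchema;
> import org.apache.flink.api.java.functions.KeySelector;
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
> import org.apache.flink.util.Collector;
> 
> public class SimulatedLatencyJob {
> 
>     public static void main(String[] args) throws Exception {
>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>         env.setParallelism(4); // parallelism > 1, where the problem appears
> 
>         Properties props = new Properties();
>         props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
> 
>         env.addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props))
>             // the "simple transformation" (identity here, just for illustration)
>             .map(new MapFunction<String, String>() {
>                 @Override
>                 public String map(String value) {
>                     return value;
>                 }
>             })
>             // the keyBy between the two operators; the data has no skew
>             .keyBy(new KeySelector<String, String>() {
>                 @Override
>                 public String getKey(String value) {
>                     return value;
>                 }
>             })
>             // Redis access removed; only its ~0.1 ms latency is simulated
>             .flatMap(new FlatMapFunction<String, String>() {
>                 @Override
>                 public void flatMap(String value, Collector<String> out) {
>                     // parkNanos instead of Thread.sleep, which only has millisecond granularity
>                     LockSupport.parkNanos(100_000L); // ~0.1 ms
>                     out.collect(value);
>                 }
>             })
>             .print();
> 
>         env.execute("simulated-latency-test");
>     }
> }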
> 
> Profiling the application by enabling the Flink monitoring mechanism, we see 
> that all instances of the upstream operator are backpressured and that the 
> input buffer pool (and the input exclusive buffer pool) usage on the “bad” 
> node is 100% during the whole run.
> 
> There is no skew in the dataset. I also replaced the keyBy with rebalance, 
> which uses a round-robin data distribution, but there is no difference.
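> 
> In code the only difference between the two tests is the partitioning call 
> between the operators; as a fragment of the sketch above (source, keySelector 
> and redisFlatMap here are placeholder names for the stream, the KeySelector 
> and the FlatMapFunction shown there):
> 
>     // keyBy variant: one instance shows high latency
>     source.keyBy(keySelector).flatMap(redisFlatMap);
> 
>     // rebalance variant: round-robin distribution, same behaviour
>     source.rebalance().flatMap(redisFlatMap);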
> 
> I expected all nodes to exhibit similar (either low or high) latency. So the 
> question is: why does only one operator instance exhibit high latency? Is 
> there any chance of a starvation problem due to credit-based flow control?
> 
> When I remove the keyBy between the operators, the system exhibits the 
> expected behaviour.
> 
> I also attach a PDF with more details about the application and graphs of the 
> monitoring data.
> 
> I hope someone could have an idea about this unexpected behaviour.
> 
> Thank you,
> Antonis
> 
> <unexpected_latency_report.pdf>
> 
