Hi Antonis,

Did you try to profile the “bad” taskmanager to see what the task thread was busy doing?

A possible culprit might be GC, if you haven't checked that already. I’ve seen GC threads eating up 30% of CPU.

Best,
Paul Lam

> On Dec 14, 2020, at 06:24, Antonis Papaioannou <papai...@ics.forth.gr> wrote:
>
> Hi,
>
> I am experiencing strange behaviour with our Flink application, so I created a very simple sample application to demonstrate the problem.
> The application reads data from Kafka, performs a simple transformation, and accesses an external Redis database to read data within a FlatMap operator. When running the application with parallelism higher than 1, there is unexpectedly high latency on only one instance of the operator that accesses the external database (the “bad” instance is not always the same; it is randomly “selected” across runs). There are multiple Redis instances, all running in standalone mode, so each Redis request is served by the local instance. To demonstrate that the latency is not related to Redis, I completely removed the database access and simulated its latency with a sleep operation of about 0.1 ms, which resulted in the same strange behavior.
>
> Profiling the application by enabling the Flink monitoring mechanism, we see that all instances of the upstream operator are backpressured and that the input buffer pool (and the input exclusive buffer pool) usage on the “bad” node is 100% during the whole run.
>
> There is no skew in the dataset. I also replaced the keyBy with rebalance, which follows a round-robin data distribution, but there is no difference.
>
> I expected all nodes to exhibit similar (either low or high) latency. So the question is: why does only one operator instance exhibit high latency? Is there any chance there is a starvation problem due to credit-based flow control?
>
> Removing the keyBy between the operators, the system exhibits the expected behaviour.
>
> I also attach a PDF with more details about the application and graphs with monitoring data.
>
> I hope someone has an idea about this unexpected behaviour.
>
> Thank you,
> Antonis
>
> <unexpected_latency_report.pdf>
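For anyone following along: the ~0.1 ms sleep that stood in for the Redis call might look something like the minimal sketch below. This is my own hypothetical reconstruction, not Antonis's actual code; the class and method names are made up, and the busy-park loop just guarantees a lower bound of 100 µs per record. Inside a Flink job this body would sit in the `flatMap` method of a `FlatMapFunction` on the keyed (or rebalanced) stream.

```java
import java.util.concurrent.locks.LockSupport;

public class LatencySim {
    // Emulate a ~0.1 ms external lookup (the Redis round trip in the
    // original experiment). The loop re-parks on spurious wakeups so the
    // call blocks for AT LEAST 100,000 ns.
    static void simulatedLookup() {
        long deadline = System.nanoTime() + 100_000L; // 0.1 ms
        long remaining;
        while ((remaining = deadline - System.nanoTime()) > 0) {
            LockSupport.parkNanos(remaining);
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        for (int i = 0; i < 1000; i++) {
            simulatedLookup(); // one "record" per iteration
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // 1000 records at >= 0.1 ms each must take at least ~100 ms total.
        System.out.println("elapsedMs=" + elapsedMs);
    }
}
```

With this in place, swapping `.keyBy(...)` for `.rebalance()` in the job graph (as described above) changes only how records are distributed to the parallel instances, so both variants showing the same single slow instance points away from data skew.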