Dear community,

We have a Flink job which does some parsing, a join and a window.
When we increase the load, CPU usage rises gradually with throughput, but 
at around 65% CPU it suddenly jumps to 98%.
The job starts experiencing backpressure and becomes unstable (latency 
increases and memory is no longer cleaned up properly).
When profiling CPU, we notice that most CPU time (40-60%) is spent in 
epollwait from Netty. We see this both before and after the job becomes unstable.
Does this mean it has something to do with network saturation?
We also see checkpointing take around a second at this point (160 MB).

What are some avenues we can explore to improve this?

Thank you for any help provided!

Giselle
