Dear community,

We have a Flink job that does some parsing, a join, and a window. When we increase the load, CPU usage rises gradually with throughput, but at around 65% CPU it suddenly jumps to 98%. The job then starts experiencing backpressure and becomes unstable (latency increases and memory is no longer cleaned up well).

When profiling CPU, we notice that most of the CPU time (40-60%) goes to epollwait from netty. We see this both before and after the job becomes unstable. Does this mean it has something to do with network saturation? We also see checkpointing taking around a second at this point (160MB).
What are some avenues we can explore to improve this?

Thank you for any help provided!

Giselle