Hi fellow Flink enthusiasts!

Over the past weeks, several community members have been working hard on a series of issues to make Flink more stable and scalable for demanding large-scale deployments.
It was quite a lot of work digging through masses of logs, setting up clusters to reproduce issues, debugging, and so on. Because of that intensive time commitment, some of us spent considerably less time on pull requests; as a result, we have quite a backlog of them. We are nearing the end of this effort, and I believe pull requests can expect to get more attention again in the near future.

The *good news* is that this effort really pushed Flink a lot further. We addressed many issues that now make Flink run a lot more smoothly in various setups. Here is a sample of the issues we addressed as part of that work:

 - The network stack had a quite serious issue with (un)fairness of stream transport under backpressure.
 - Checkpoints are more robust now: they decline cleanly when they cannot complete, and the amount of data that may be buffered/spilled during alignment can now be capped (see the config sketch below).
 - Cleanup of old checkpoint state happens more reliably and scales better.
 - Increased the robustness of HA recovery in the presence of ZooKeeper problems and inconsistent state in ZooKeeper (a retry sketch follows the list).
 - Improved the cancellation behavior of checkpoints and connectors, and added a safeguard against tasks and libraries that block the TaskManagers during cancellation (a sketch of that pattern is below as well).
 - Diagnosed a RocksDB bug and updated to a fixed version.
 - Deployment scalability: got rid of some RPC messages and improved the handling of the large RPC volume during deployment.
 - Fixed memory leaks in the JobManager in the presence of many restarts.
 - Fixed some bugs in the life cycle of tasks and the execution graph that could undermine reliable fault tolerance.
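For those who want to try the alignment cap: if I recall the key correctly (please double-check against the docs for your Flink version), it is exposed in flink-conf.yaml, with -1 meaning "no limit":

    # decline (skip) a checkpoint if the alignment phase buffers more than ~128 MiB
    task.checkpoint.alignment.max-size: 134217728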
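On the HA side, much of the hardening boils down to treating ZooKeeper hiccups as transient rather than fatal. A minimal sketch of that idea, assuming Apache Curator (which Flink's HA services build on); the class name and parameter values are illustrative, not Flink's actual code:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class ZooKeeperClientSketch {

        // Connect with an exponential-backoff retry policy, so a transient
        // ZooKeeper outage leads to retries instead of an immediate
        // recovery failure.
        public static CuratorFramework createClient(String zkQuorum) {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    zkQuorum,
                    new ExponentialBackoffRetry(1000 /* base sleep ms */, 5 /* max retries */));
            client.start();
            return client;
        }
    }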
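And here is the idea behind the cancellation safeguard, as a minimal sketch (illustrative names, not Flink's actual watchdog code): keep interrupting the task thread, and if user code swallows the interrupts past a deadline, take down the process so the slot is not blocked forever.

    import java.util.concurrent.TimeUnit;

    public class CancellationWatchdog implements Runnable {

        private final Thread taskThread;        // thread running the (stuck) user code
        private final long interruptIntervalMs; // how often to re-interrupt
        private final long killTimeoutMs;       // when to give up and escalate

        public CancellationWatchdog(Thread taskThread, long interruptIntervalMs, long killTimeoutMs) {
            this.taskThread = taskThread;
            this.interruptIntervalMs = interruptIntervalMs;
            this.killTimeoutMs = killTimeoutMs;
        }

        @Override
        public void run() {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(killTimeoutMs);
            try {
                while (taskThread.isAlive() && System.nanoTime() < deadline) {
                    // interrupt repeatedly: user code may swallow a single interrupt
                    taskThread.interrupt();
                    taskThread.join(interruptIntervalMs);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            if (taskThread.isAlive()) {
                // last resort: kill the process and let the cluster framework
                // restart the TaskManager, rather than leaking a blocked slot
                Runtime.getRuntime().halt(-1);
            }
        }
    }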
I think we'll all enjoy the results of this effort, especially those who have to deploy and maintain large setups.

Greetings,
Stephan