This is a stateful stream-join application using the RocksDB state backend with incremental checkpoints enabled.
- JVM heap usage is very similar between the two jobs. The main difference is in non-heap usage, probably related to RocksDB state.
- The cgroup memory failure count is non-zero for the 1.10 job, while the 1.7 job has a zero failure count (it has enough free memory).
- The 1.10 job looks stable otherwise.

Is there any default config change for the RocksDB state backend between 1.7 and 1.10 due to FLIP-49? Or maybe there are implications of the FLIP-49 change that we don't understand?

I can confirm both the 1.7.2 and 1.10.0 jobs have the same state size, via various metrics:
- rocksdb.estimate-num-keys
- rocksdb.estimate-live-data-size
- rocksdb.total-sst-files-size
- lastCheckpointSize

1.7.2 job setup
- Container memory: 128 GB
- -Xms20G -Xmx20G -XX:MaxDirectMemorySize=6G
- taskmanager.network.memory.max=4 gb

1.10.0 job setup
- Container memory: 128 GB
- -Xmx20G -Xms20G -XX:MaxDirectMemorySize=5.5G
- taskmanager.memory.network.max=4 gb
- taskmanager.memory.process.size=128 gb
- taskmanager.memory.jvm-overhead.max=10 gb
- taskmanager.memory.managed.size=90 gb

I tried different combinations of taskmanager.memory.jvm-overhead.max and taskmanager.memory.managed.size. They all lead to similar results.

1.7.2 job memory usage
- free output
- vmstat output
- top output

1.10.0 job memory usage
- free output
- vmstat output
- top output
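For what it's worth, here is a small sanity check of how the configured 1.10.0 components add up against taskmanager.memory.process.size. The component names come from the options quoted above; treating them as a simple sum is an assumption for illustration, not Flink's exact internal FLIP-49 accounting (in particular, network buffers are carved out of the direct-memory budget in 1.10, so they are not added separately here).

```python
# Sanity-check: do the explicitly configured 1.10.0 memory components
# fit inside taskmanager.memory.process.size (128 GB)?
# NOTE: a simplified sum for illustration, not Flink's exact FLIP-49 model.
GB = 1024 ** 3

components = {
    "JVM heap (-Xmx)": 20 * GB,
    "direct (-XX:MaxDirectMemorySize, incl. network)": 5.5 * GB,
    "managed (taskmanager.memory.managed.size)": 90 * GB,
    "jvm-overhead (taskmanager.memory.jvm-overhead.max)": 10 * GB,
}
process_size = 128 * GB

total = sum(components.values())
for name, size in components.items():
    print(f"{name}: {size / GB:.1f} GB")
print(f"sum of configured components: {total / GB:.1f} GB")
print(f"headroom under process.size:  {(process_size - total) / GB:.1f} GB")
```

Under this simplified view the explicit components sum to 125.5 GB, leaving only about 2.5 GB of headroom inside the 128 GB process size for metaspace and any other unconfigured pieces, which may be relevant to the cgroup failure counts observed above.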