This is a stateful stream join application using the RocksDB state backend
with incremental checkpoints enabled.

   - JVM heap usage is pretty similar. The main difference is in non-heap
     usage, probably related to RocksDB state.
   - We also observed a cgroup memory failure count showing up in the 1.10
     job, while the 1.7 job has a zero memory failure count (due to enough
     free memory).
   - The 1.10 job looks stable otherwise.
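For reference, the failure count mentioned above can be read from the cgroup
v1 memory controller. A minimal parsing sketch (the cgroup path in the
comment is hypothetical; the real path depends on how the container runtime
names the container):

```python
# Sketch: read the cgroup v1 memory failure counter for a container.
# A non-zero count means the kernel hit the cgroup memory limit at least
# once and had to reclaim (or fail) an allocation.

def parse_failcnt(text: str) -> int:
    """Parse the contents of memory.failcnt (a single integer)."""
    return int(text.strip())

print(parse_failcnt("42\n"))  # -> 42

# On a live host you would read something like (path is an assumption):
# with open("/sys/fs/cgroup/memory/<container>/memory.failcnt") as f:
#     print(parse_failcnt(f.read()))
```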


Is there any default config change for the RocksDB state backend between 1.7
and 1.10 due to FLIP-49? Or maybe there are some implications of the FLIP-49
change that we don't understand?

I can confirm that both the 1.7.2 and 1.10.0 jobs have the same state size
via various metrics:

   - rocksdb.estimate-num-keys
   - rocksdb.estimate-live-data-size
   - rocksdb.total-sst-files-size
   - lastCheckpointSize
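These metrics can be pulled from the TaskManager metrics endpoint of the
Flink REST API (/taskmanagers/<tm-id>/metrics?get=...). A small sketch of
building that query; the host and TaskManager id below are placeholders, not
values from this thread:

```python
# Sketch: build the Flink REST API query for TaskManager metrics.
# Endpoint shape: /taskmanagers/<tm-id>/metrics?get=<comma-separated names>
from urllib.parse import quote

def metrics_url(host: str, tm_id: str, metrics: list) -> str:
    # Percent-encode each metric name; join with commas as the API expects.
    names = ",".join(quote(m) for m in metrics)
    return f"http://{host}/taskmanagers/{tm_id}/metrics?get={names}"

url = metrics_url(
    "jobmanager:8081",  # placeholder host
    "tm-1",             # placeholder TaskManager id
    ["rocksdb.estimate-num-keys", "rocksdb.total-sst-files-size"],
)
print(url)
```

Note that the reported names may carry an operator scope prefix depending on
the metrics scope configuration.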


1.7.2 job setup

   - Container memory: 128 GB
   - -Xms20G -Xmx20G -XX:MaxDirectMemorySize=6G
   - taskmanager.network.memory.max=4 gb


1.10.0 job setup

   - Container memory: 128 GB
   - -Xmx20G -Xms20G -XX:MaxDirectMemorySize=5.5G
   - taskmanager.memory.network.max=4 gb
   - taskmanager.memory.process.size=128 gb
   - taskmanager.memory.jvm-overhead.max=10 gb
   - taskmanager.memory.managed.size=90 gb


I tried different combinations of "taskmanager.memory.jvm-overhead.max" and
"taskmanager.memory.managed.size". They all lead to similar results.

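As a rough sanity check on the FLIP-49 budget, the explicitly configured
1.10.0 components can be summed to see what the 128 gb process size leaves
for metaspace and framework/task off-heap memory. This treats the two .max
values as the actual sizes, which is an assumption:

```python
# Rough FLIP-49 memory budget check for the 1.10.0 setup listed above.
# All figures in GB, taken from the config in this thread.
process_size = 128   # taskmanager.memory.process.size
heap = 20            # -Xmx20G
managed = 90         # taskmanager.memory.managed.size
network = 4          # taskmanager.memory.network.max
jvm_overhead = 10    # taskmanager.memory.jvm-overhead.max

explicit = heap + managed + network + jvm_overhead
# Whatever is left must cover metaspace plus framework/task off-heap.
remainder = process_size - explicit

print(explicit)   # -> 124
print(remainder)  # -> 4
```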
1.7.2 job memory usage

free output

vmstat output

top output

1.10.0 job memory usage

free output

vmstat output

top output
