This is a stateful stream-join application using the RocksDB state backend with incremental checkpoints enabled.
- JVM heap usage is very similar between the two jobs. The main difference is in non-heap usage, probably related to RocksDB state.
- The cgroup memory failure count is non-zero for the 1.10 job, while the 1.7 job has a zero failure count (it has enough free memory).
- The 1.10 job looks stable otherwise.

Is there any default config change for the RocksDB state backend between 1.7 and 1.10 due to FLIP-49? Or maybe there are implications of the FLIP-49 change that we don't understand?

I can confirm both the 1.7.2 and 1.10.0 jobs have the same state size, via various metrics:
- rocksdb.estimate-num-keys
- rocksdb.estimate-live-data-size
- rocksdb.total-sst-files-size
- lastCheckpointSize

1.7.2 job setup
- Container memory: 128 GB
- -Xms20G -Xmx20G -XX:MaxDirectMemorySize=6G
- taskmanager.network.memory.max=4 gb

1.10.0 job setup
- Container memory: 128 GB
- -Xmx20G -Xms20G -XX:MaxDirectMemorySize=5.5G
- taskmanager.memory.network.max=4 gb
- taskmanager.memory.process.size=128 gb
- taskmanager.memory.jvm-overhead.max=10 gb
- taskmanager.memory.managed.size=90 gb

I tried different combinations of taskmanager.memory.jvm-overhead.max and taskmanager.memory.managed.size. They all lead to similar results.

1.7.2 job memory usage
- free output
- vmstat output
- top output

1.10.0 job memory usage
- free output
- vmstat output
- top output
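For what it's worth, here is a small sanity check of how the configured 1.10.0 components add up against taskmanager.memory.process.size. The component names come from the options quoted above; treating them as a simple sum is an assumption for illustration, not Flink's exact internal FLIP-49 accounting (in particular, network buffers are carved out of the direct-memory budget in 1.10, so they are not added separately here).

```python
# Sanity-check: do the explicitly configured 1.10.0 memory components
# fit inside taskmanager.memory.process.size (128 GB)?
# NOTE: a simplified sum for illustration, not Flink's exact FLIP-49 model.
GB = 1024 ** 3

components = {
    "JVM heap (-Xmx)": 20 * GB,
    "direct (-XX:MaxDirectMemorySize, incl. network)": 5.5 * GB,
    "managed (taskmanager.memory.managed.size)": 90 * GB,
    "jvm-overhead (taskmanager.memory.jvm-overhead.max)": 10 * GB,
}
process_size = 128 * GB

total = sum(components.values())
for name, size in components.items():
    print(f"{name}: {size / GB:.1f} GB")
print(f"sum of configured components: {total / GB:.1f} GB")
print(f"headroom under process.size:  {(process_size - total) / GB:.1f} GB")
```

Under this simplified view the explicit components sum to 125.5 GB, leaving only about 2.5 GB of headroom inside the 128 GB process size for metaspace and any other unconfigured pieces, which may be relevant to the cgroup failure counts observed above.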