Thanks a lot for the prompt response. Please see the information below.

1. How much memory is assigned to the JM pod?
6g for the container memory limit and 5g for jobmanager.heap.size. I think
this is the only available JM memory configuration for Flink 1.10.2.

2. Have you tried with newer Flink versions?
I am actually using Apache Beam, and the latest Flink version it supports
is 1.10.

3. Which state backend is used?
FsStateBackend. The checkpoint size is around 12MB according to the
checkpoint metrics, so I think the state is not getting inlined.

4. What is state.checkpoints.num-retained?
I did not configure this explicitly, so by default only 1 should be retained.

5. Anything suspicious in the JM log?
There is no Exception or Error. The only thing I see is the log line below,
which keeps repeating:

threads for Delete operation as thread count 0 is <=

6. JVM args obtained via jcmd

-Xms5120m -Xmx5120m -XX:MaxGCPauseMillis=20 -XX:-OmitStackTraceInFastThrow

7. Heap info returned by jcmd <pid> GC.heap_info

It suggests that only about 1G of the heap is used:

garbage-first heap   total 5242880K, used 1123073K [0x00000006c0000000,

  region size 2048K, 117 young (239616K), 15 survivors (30720K)

 Metaspace       used 108072K, capacity 110544K, committed 110720K,
reserved 1146880K

  class space    used 12963K, capacity 13875K, committed 13952K, reserved

8. top -p <pid>

It suggests that the Flink job manager Java process is using 4.8G of
physical memory:


    1 root      20   0 13.356g 4.802g  22676 S   6.0  7.6  37:48.62 java
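
As a rough cross-check of the figures in items 1, 4, 7 and 8, the arithmetic
can be sketched in Python. The variable names are only illustrative, and the
sketch assumes that the g and K suffixes in all of these outputs are binary
units (GiB and KiB), which may not hold for the container limit.

```python
# Rough cross-check of the numbers reported above.
# Assumption: "g" means GiB and "K" means KiB everywhere.

KIB_PER_GIB = 1024 * 1024


def kib_to_gib(kib):
    """Convert a KiB figure (as printed by jcmd) to GiB."""
    return kib / KIB_PER_GIB


# Item 7: jcmd <pid> GC.heap_info
heap_total_gib = kib_to_gib(5242880)   # garbage-first heap total
heap_used_gib = kib_to_gib(1123073)    # garbage-first heap used

# Item 8: RES column from top
rss_gib = 4.802

# Item 1: container memory limit
container_limit_gib = 6.0

# Resident memory that is not explained by live heap data.
rss_minus_live_heap_gib = rss_gib - heap_used_gib

# Distance between the resident set and the container limit.
headroom_gib = container_limit_gib - rss_gib

# Items 3 and 4: upper bound from the quoted reply, i.e. number of
# retained checkpoints times checkpoint size, if state were inlined.
num_retained = 1
checkpoint_size_mib = 12
retained_bound_mib = num_retained * checkpoint_size_mib

if __name__ == "__main__":
    print(f"heap total:            {heap_total_gib:.2f} GiB")
    print(f"heap used:             {heap_used_gib:.2f} GiB")
    print(f"RSS minus live heap:   {rss_minus_live_heap_gib:.2f} GiB")
    print(f"headroom to pod limit: {headroom_gib:.2f} GiB")
    print(f"retained checkpoints:  <= {retained_bound_mib} MiB")
```

If these numbers are right, most of the resident memory is not live heap
data, and the retained-checkpoint bound is far too small to account for it.
Since -Xms equals -Xmx, part of the difference may simply be heap pages the
JVM has already touched and kept, so this alone does not show where the
growth comes from.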

> how much memory did you assign to the JM pod? Maybe the limit is so high
> that it takes a bit of time until GC is triggered. Have you tried whether
> the same problem also occurs with newer Flink versions?
> The difference between checkpoints enabled and disabled is that the JM
> needs to do a bit more bookkeeping in order to track the completed
> checkpoints. If you are using the HeapStateBackend, then all states smaller
> than state.backend.fs.memory-threshold will get inlined, meaning that they
> are sent to the JM and stored in the checkpoint meta file. This can
> increase the memory usage of the JM process. Depending on
> state.checkpoints.num-retained this can grow as large as number retained
> checkpoints times the checkpoint size. However, I doubt that this adds up
> to several GB of additional space.
> In order to better understand the problem, the debug logs of your JM could
> be helpful. Also a heap dump might be able to point us towards the
> component which is eating up so much memory.
>> I have a flink job running version 1.10.2, it simply read from a kafka
>> topic with 96 partitions, and output to another kafka topic.
>> It is running in k8s, with 1 JM (not in HA mode), 12 task managers each
>> has 4 slots.
>> The checkpoint persists the snapshot to azure blob storage, checkpoints
>> interval every 3 seconds, with 10 seconds timeout and minimum pause of 1
>> second.
>> I observed that the job manager pod memory usage grows over time, any
>> hints on why this is the case? And the memory usage for JM is significantly
>> more compared to no checkpoint enabled.
