Hi,

Thanks for the answer. I will go through the documents you shared. Still, it would be great if you could take a look at the numbers below and give some tips.

At the moment RSS is 46.6 GB, although taskmanager.memory.process.size is set to 40000m.

GC statistics:

2023-09-06 15:15:03,785 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Memory usage stats: [HEAP: 3703/18208/18208 MB, NON HEAP: 154/175/744 MB (used/committed/max)]
2023-09-06 15:15:03,785 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Direct memory stats: Count: 33620, Total Capacity: 1102003811, Used Memory: 1102003812
2023-09-06 15:15:03,785 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Off-heap pool stats: [CodeHeap 'non-nmethods': 1/3/7 MB (used/committed/max)], [Metaspace: 87/99/256 MB (used/committed/max)], [CodeHeap 'profiled nmethods': 32/35/116 MB (used/committed/max)], [Compressed Class Space: 10/14/248 MB (used/committed/max)], [CodeHeap 'non-profiled nmethods': 21/22/116 MB (used/committed/max)]
2023-09-06 15:15:03,785 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Garbage collector stats: [G1 Young Generation, GC TIME (ms): 30452, GC COUNT: 351], [G1 Old Generation, GC TIME (ms): 0, GC COUNT: 0]

My configuration:

INFO  [] - Final TaskExecutor Memory configuration:
INFO  [] -   Total Process Memory:          39.063gb (41943040000 bytes)
INFO  [] -     Total Flink Memory:          37.813gb (40600862720 bytes)
INFO  [] -       Total JVM Heap Memory:     17.781gb (19092471808 bytes)
INFO  [] -         Framework:               128.000mb (134217728 bytes)
INFO  [] -         Task:                    17.656gb (18958254080 bytes)
INFO  [] -       Total Off-heap Memory:     20.031gb (21508390912 bytes)
INFO  [] -         Managed:                 18.906gb (20300431360 bytes)
INFO  [] -         Total JVM Direct Memory: 1.125gb (1207959552 bytes)
INFO  [] -           Framework:             128.000mb (134217728 bytes)
INFO  [] -           Task:                  0 bytes
INFO  [] -           Network:               1024.000mb (1073741824 bytes)
INFO  [] -     JVM Metaspace:               256.000mb (268435456 bytes)
INFO  [] -     JVM Overhead:                1024.000mb (1073741824 bytes)

jcmd output:

ubuntu@dzs-tef-test-01:~/flink/log$ jcmd 1035173 VM.native_memory summary
1035173:

Native Memory Tracking:

Total: reserved=23615554KB, committed=21049538KB

-  Java Heap (reserved=18644992KB, committed=18644992KB)
             (mmap: reserved=18644992KB, committed=18644992KB)

-  Class (reserved=347038KB, committed=106970KB)
         (classes #15959)
         (  instance classes #15140, array classes #819)
         (malloc=5022KB #72815)
         (mmap: reserved=342016KB, committed=101948KB)
         (  Metadata:   )
         (    reserved=88064KB, committed=86948KB)
         (    used=79128KB)
         (    free=7820KB)
         (    waste=0KB =0.00%)
         (  Class space:)
         (    reserved=253952KB, committed=15000KB)
         (    used=11278KB)
         (    free=3722KB)
         (    waste=0KB =0.00%)

-  Thread (reserved=2404259KB, committed=262791KB)
          (thread #2328)
          (stack: reserved=2393052KB, committed=251584KB)
          (malloc=8481KB #13970)
          (arena=2726KB #4654)

-  Code (reserved=252334KB, committed=67866KB)
        (malloc=4650KB #21507)
        (mmap: reserved=247684KB, committed=63216KB)

-  GC (reserved=800181KB, committed=800181KB)
      (malloc=74637KB #63221)
      (mmap: reserved=725544KB, committed=725544KB)

-  Compiler (reserved=20432KB, committed=20432KB)
            (malloc=20300KB #8557)
            (arena=133KB #5)

-  Internal (reserved=21883KB, committed=21871KB)
            (malloc=21839KB #29146)
            (mmap: reserved=44KB, committed=32KB)

-  Other (reserved=1082212KB, committed=1082212KB)
         (malloc=1082212KB #34463)

-  Symbol (reserved=17581KB, committed=17581KB)
          (malloc=16678KB #187368)
          (arena=903KB #1)

-  Native Memory Tracking (reserved=9173KB, committed=9173KB)
                          (malloc=1656KB #23012)
                          (tracking overhead=7517KB)

-  Shared class space (reserved=10904KB, committed=10904KB)
                      (mmap: reserved=10904KB, committed=10904KB)

-  Arena Chunk (reserved=288KB, committed=288KB)
               (malloc=288KB)

-  Logging (reserved=4KB, committed=4KB)
           (malloc=4KB #193)

-  Arguments (reserved=22KB, committed=22KB)
             (malloc=22KB #534)

-  Module (reserved=2726KB, committed=2726KB)
          (malloc=2726KB #9625)

-  Synchronizer (reserved=1515KB, committed=1515KB)
                (malloc=1515KB #12006)

-  Safepoint (reserved=8KB, committed=8KB)
             (mmap: reserved=8KB, committed=8KB)

On Wed, Sep 6, 2023 at 5:06 PM Biao Geng <biaoge...@gmail.com> wrote:

> Hi Kenan,
> If you have confirmed the heap memory is OK (e.g. no Java OOM exception and
> no frequent GC), then the cause may be off-heap memory overuse,
> especially when your Flink job uses some native library.
> To diagnose such a problem, you can refer to [1][2] for more details about
> using NMT and jeprof.
>
> [1]
> https://erikwramner.files.wordpress.com/2017/10/native-memory-leaks-in-java.pdf
> [2] https://www.evanjones.ca/java-native-leak-bug.html
> Best,
> Biao Geng
>
> Kenan Kılıçtepe <kkilict...@gmail.com> wrote on Wed, Sep 6, 2023 at 20:32:
>
>> Hi,
>>
>> I have Flink 1.16.2 on a single server with 64 GB RAM.
>>
>> Although taskmanager.memory.process.size is set to 40000m, I can see
>> memory usage of the task manager exceed 59 GB, and the OS kills it because of
>> OOM.
>> I check the RSS column of the top application for memory usage.
>>
>> I don't see any heap memory problem.
>>
>> taskmanager.memory.process.size: 40000m
>> taskmanager.memory.managed.fraction: 0.53
>> state.backend.rocksdb.memory.managed: true
>>
>> Any help is appreciated for analyzing the problem.
>>
>> Thanks
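
For reference, the figures at the top of this message can be lined up with a few lines of Python. This is only a rough sketch, not a diagnosis. It assumes that the RSS value from top is in GiB, that RocksDB's native allocations are not visible to NMT, and that RocksDB uses its whole managed memory budget. The variable and function names are made up for illustration.

```python
KIB = 1024
GIB = 1024 ** 3

# Values copied from the logs in this message.
process_size_bytes = 41_943_040_000   # "Total Process Memory" (Flink log)
managed_bytes = 20_300_431_360        # "Managed" (Flink log)
nmt_committed_kib = 21_049_538        # "Total: ... committed" (jcmd NMT summary)
rss_gib = 46.6                        # RSS from top; ASSUMED to be GiB


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / GIB


def unexplained_rss_gib(rss, nmt_committed_kib, managed_bytes):
    """RSS minus what NMT reports and minus the managed memory budget.

    ASSUMPTION: managed memory (RocksDB) is allocated natively, outside
    NMT's view, and is fully used. If RocksDB uses less than its budget,
    the real unexplained gap is larger than this estimate.
    """
    accounted = to_gib(nmt_committed_kib * KIB) + to_gib(managed_bytes)
    return rss - accounted


process_size_gib = to_gib(process_size_bytes)
nmt_committed_gib = to_gib(nmt_committed_kib * KIB)
managed_gib = to_gib(managed_bytes)
gap_gib = unexplained_rss_gib(rss_gib, nmt_committed_kib, managed_bytes)

print(f"configured process size : {process_size_gib:.2f} GiB")
print(f"NMT committed           : {nmt_committed_gib:.2f} GiB")
print(f"managed memory budget   : {managed_gib:.2f} GiB")
print(f"RSS                     : {rss_gib:.2f} GiB")
print(f"RSS not accounted for   : {gap_gib:.2f} GiB")
```

Under these assumptions, whatever remains in the last line is memory that neither the JVM nor the managed budget explains, which is the part that tools such as jeprof would have to attribute.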