A Cassandra node went down due to OOM, and checking /var/log/messages I see the following.

```
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java cpuset=/ mems_allowed=0
....
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA32: 1294*4kB (UM) 932*8kB (UEM) 897*16kB (UEM) 483*32kB (UEM) 224*64kB (UEM) 114*128kB (UEM) 41*256kB (UEM) 12*512kB (UEM) 7*1024kB (UEM) 2*2048kB (EM) 35*4096kB (UM) = 242632kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 Normal: 5319*4kB (UE) 3233*8kB (UEM) 960*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 62500kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 38109 total pagecache pages
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages in swap cache
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Swap cache stats: add 0, delete 0, find 0/0
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Free swap  = 0kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Total swap = 0kB
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 16394647 pages RAM
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages HighMem/MovableOnly
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 310559 pages reserved
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2634]     0  2634    41614      326      82        0             0 systemd-journal
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2690]     0  2690    29793      541      27        0             0 lvmetad
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2710]     0  2710    11892      762      25        0         -1000 systemd-udevd
.....
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [13774]     0 13774   459778    97729     429        0             0 Scan Factory
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14506]     0 14506    21628     5340      24        0             0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14586]     0 14586    21628     5340      24        0             0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14588]     0 14588    21628     5340      24        0             0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14589]     0 14589    21628     5340      24        0             0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14598]     0 14598    21628     5340      24        0             0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14599]     0 14599    21628     5340      24        0             0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14600]     0 14600    21628     5340      24        0             0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14601]     0 14601    21628     5340      24        0             0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19679]     0 19679    21628     5340      24        0             0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19680]     0 19680    21628     5340      24        0             0 macompatsvc
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 9084]  1007  9084  2822449   260291     810        0             0 java
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 8509]  1007  8509 17223585 14908485   32510        0             0 java
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21877]     0 21877   461828    97716     318        0             0 ScanAction Mgr
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21884]     0 21884   496653    98605     340        0             0 OAS Manager
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [31718]    89 31718    25474      486      48        0             0 pickup
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4891]  1007  4891    26999      191       9        0             0 iostat
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4957]  1007  4957    26999      192      10        0             0 iostat
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Out of memory: Kill process 8509 (java) score 928 or sacrifice child
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Killed process 8509 (java) total-vm:68894340kB, anon-rss:59496344kB, file-rss:137596kB, shmem-rss:0kB
```

Nothing else runs on this host except DSE Cassandra (with search enabled) and monitoring agents. Max heap size is set to 31g, and the Cassandra java process appears to have been using ~57 GB (RAM is 62 GB) at the time of the error. So my guess is that the JVM's memory use kept growing until it triggered the OOM killer. Is my understanding correct, i.e. that this is a Linux-triggered kill of the JVM because it was consuming more than the available memory?
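
For reference, this is how I'm reading the OOM killer's numbers for pid 8509 (straight unit conversion, RSS is counted in 4 kB pages), which matches the ~57 GB figure:

```
rss       = 14908485 pages * 4 kB = 59,633,940 kB ≈ 56.9 GiB
anon-rss  = 59,496,344 kB                         ≈ 56.7 GiB   (from the "Killed process" line)
total-vm  = 68,894,340 kB                         ≈ 65.7 GiB
```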

So in this case the JVM would have been using at most 31 GB of heap, and the remaining ~26 GB it was using must be non-heap (off-heap/native) memory. Normally this process takes around 42 GB, and given that at the moment of the OOM it was consuming 57 GB, I suspect the java process is the culprit rather than the victim.
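
For what it's worth, this is roughly how I've been comparing heap occupancy against whole-process memory (the pid is just a placeholder):

```
# Heap and metaspace usage as the JVM reports it (sizes in KB):
jstat -gc <cassandra_pid>

# Total resident memory as the kernel sees it:
grep VmRSS /proc/<cassandra_pid>/status
```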

At the time of the issue no heap dump was taken; I have configured that now. But even if a heap dump had been taken, would it have helped figure out what was consuming the extra memory? A heap dump only covers the heap area, so what should be used to inspect non-heap memory? Native Memory Tracking is one thing I came across.
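
In case it matters, this is how I understand NMT would be enabled and queried (a sketch only, assuming OpenJDK/Oracle JDK 8+; <pid> is a placeholder, and as far as I can tell NMT only covers memory the JVM itself allocates):

```
# JVM option, must be set at startup (summary mode has modest overhead):
-XX:NativeMemoryTracking=summary      # or =detail

# While the process is running, snapshot and diff native allocations:
jcmd <pid> VM.native_memory summary
jcmd <pid> VM.native_memory baseline
jcmd <pid> VM.native_memory summary.diff
```
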
Is there any way to have native memory dumped when an OOM occurs?
What's the best way to monitor JVM memory to diagnose OOM errors?
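
To clarify what I mean by monitoring from inside the JVM: something like the minimal sketch below, using the standard java.lang.management MXBeans (the class name and output format are just illustrative), is the kind of heap vs. non-heap view I have in mind.

```
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Minimal sketch: print heap, non-heap and buffer-pool usage from inside the
// JVM. Note that MemoryMXBean's "non-heap" figure covers metaspace/code cache,
// not direct buffers or other native allocations, hence the buffer pools below.
public class MemorySnapshot {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        MemoryUsage nonHeap = mem.getNonHeapMemoryUsage();
        System.out.printf("heap used/committed:     %d / %d MB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20);
        System.out.printf("non-heap used/committed: %d / %d MB%n",
                nonHeap.getUsed() >> 20, nonHeap.getCommitted() >> 20);

        // Direct and mapped ByteBuffers (a common source of off-heap growth)
        // are reported by the platform buffer pool MXBeans.
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("buffer pool %-8s count=%d used=%d MB%n",
                    pool.getName(), pool.getCount(), pool.getMemoryUsed() >> 20);
        }
    }
}
```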
