One thing that doesn't show up is fs cache, which is likely the cause here. We went through this on our SSDs and had to add the following to stop the crashes. I believe vm.vfs_cache_pressure and min_free_kbytes were the really helpful things in getting the crashes to stop. HTH!
sysctl_param 'vm.vfs_cache_pressure' do value 400 end sysctl_param 'vm.dirty_ratio' do value 20 end sysctl_param 'vm.dirty_background_ratio' do value 2 end sysctl_param 'vm.min_free_kbytes' do value 4194304 end On Fri, Nov 17, 2017 at 4:24 PM, Sam Huracan <nowitzki.sa...@gmail.com> wrote: > I see some more logs about memory in syslog: > > Nov 17 10:47:17 ceph1 kernel: [2810698.553749] Node 0 DMA free:14828kB > min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB > active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB > isolated(file):0kB present:15996kB managed:15896kB mlocked:0kB dirty:0kB > writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB > slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB > bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB > pages_scanned:0 all_unreclaimable? yes > Nov 17 10:47:17 ceph1 kernel: [2810698.553753] lowmem_reserve[]: 0 1830 > 32011 32011 32011 > Nov 17 10:47:17 ceph1 kernel: [2810698.553755] Node 0 DMA32 free:134848kB > min:3860kB low:4824kB high:5788kB active_anon:120628kB > inactive_anon:121792kB active_file:653404kB inactive_file:399272kB > unevictable:0kB isolated(anon):0kB isolated(file):36kB present:1981184kB > managed:1900752kB mlocked:0kB dirty:344kB writeback:0kB mapped:1200kB > shmem:0kB slab_reclaimable:239900kB slab_unreclaimable:154560kB > kernel_stack:43376kB pagetables:1176kB unstable:0kB bounce:0kB free_pcp:0kB > local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 > all_unreclaimable? no > Nov 17 10:47:17 ceph1 kernel: [2810698.553758] lowmem_reserve[]: 0 0 30180 > 30180 30180 > Nov 17 10:47:17 ceph1 kernel: [2810698.553760] Node 0 Normal free:237232kB > min:63684kB low:79604kB high:95524kB active_anon:3075228kB > inactive_anon:629052kB active_file:12544336kB inactive_file:12570716kB > unevictable:0kB isolated(anon):0kB isolated(file):0kB present:31457280kB > managed:30905084kB mlocked:0kB dirty:74368kB writeback:3796kB > mapped:48516kB shmem:1276kB slab_reclaimable:713684kB > slab_unreclaimable:416404kB kernel_stack:40896kB pagetables:25288kB > unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB > writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no > Nov 17 10:47:17 ceph1 kernel: [2810698.553763] lowmem_reserve[]: 0 0 0 0 0 > Nov 17 10:47:17 ceph1 kernel: [2810698.553765] Node 0 DMA: 1*4kB (U) 1*8kB > (U) 0*16kB 1*32kB (U) 3*64kB (U) 2*128kB (U) 0*256kB 0*512kB 0*1024kB > 1*2048kB (M) 3*4096kB (M) = 14828kB > Nov 17 10:47:17 ceph1 kernel: [2810698.553772] Node 0 DMA32: 10756*4kB > (UME) 11442*8kB (UME) 25*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB > 0*1024kB 0*2048kB 0*4096kB = 134960kB > Nov 17 10:47:17 ceph1 kernel: [2810698.553777] Node 0 Normal: 59473*4kB > (UME) 77*8kB (U) 10*16kB (H) 8*32kB (H) 6*64kB (H) 2*128kB (H) 1*256kB (H) > 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 240332kB > Nov 17 10:47:17 ceph1 kernel: [2810698.553784] Node 0 hugepages_total=0 > hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB > Nov 17 10:47:17 ceph1 kernel: [2810698.553784] Node 0 hugepages_total=0 > hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB > Nov 17 10:47:17 ceph1 kernel: [2810698.553785] 6544612 total pagecache > pages > Nov 17 10:47:17 ceph1 kernel: [2810698.553786] 2347 pages in swap cache > Nov 17 10:47:17 ceph1 kernel: [2810698.553787] Swap cache stats: add > 2697126, delete 2694779, find 38874122/39241548 > Nov 17 10:47:17 ceph1 kernel: [2810698.553788] Free swap = 61498092kB > Nov 17 10:47:17 ceph1 kernel: [2810698.553788] Total swap = 62498812kB > Nov 17 10:47:17 ceph1 kernel: [2810698.553789] 8363615 pages RAM > Nov 17 10:47:17 ceph1 kernel: [2810698.553790] 0 pages HighMem/MovableOnly > Nov 17 10:47:17 ceph1 kernel: [2810698.553790] 158182 pages reserved > Nov 17 10:47:17 ceph1 kernel: [2810698.553791] 0 pages cma reserved > > Is it relate to page caches? > > 2017-11-18 7:22 GMT+07:00 Sam Huracan <nowitzki.sa...@gmail.com>: > >> Today, one of our Ceph OSDs was down, I've check syslog and see this OSD >> process was killed by OMM >> >> >> Nov 17 10:01:06 ceph1 kernel: [2807926.762304] Out of memory: Kill >> process 3330 (ceph-osd) score 7 or sacrifice child >> Nov 17 10:01:06 ceph1 kernel: [2807926.763745] Killed process 3330 >> (ceph-osd) total-vm:2372392kB, anon-rss:559084kB, file-rss:7268kB >> Nov 17 10:01:06 ceph1 kernel: [2807926.985830] init: ceph-osd (ceph/6) >> main process (3330) killed by KILL signal >> Nov 17 10:01:06 ceph1 kernel: [2807926.985844] init: ceph-osd (ceph/6) >> main process ended, respawning >> Nov 17 10:03:39 ceph1 bash: root [15524]: sudo ceph health detail [0] >> Nov 17 10:03:40 ceph1 bash: root [15524]: sudo ceph health detail [0] >> Nov 17 10:17:01 ceph1 CRON[75167]: (root) CMD ( cd / && run-parts >> --report /etc/cron.hourly) >> Nov 17 10:47:17 ceph1 kernel: [2810698.553690] ceph-status.sh invoked >> oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553693] ceph-status.sh cpuset=/ >> mems_allowed=0 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553697] CPU: 5 PID: 194271 Comm: >> ceph-status.sh Not tainted 4.4.0-62-generic #83~14.04.1-Ubuntu >> Nov 17 10:47:17 ceph1 kernel: [2810698.553698] Hardware name: Dell Inc. >> PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553699] 0000000000000000 >> ffff88001a857b38 ffffffff813dc4ac ffff88001a857cf0 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553701] 0000000000000000 >> ffff88001a857bc8 ffffffff811fb066 ffff88083d29e200 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553703] ffff88001a857cf0 >> ffff88001a857c00 ffff8808453cf000 0000000000000000 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553704] Call Trace: >> Nov 17 10:47:17 ceph1 kernel: [2810698.553709] [<ffffffff813dc4ac>] >> dump_stack+0x63/0x87 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553713] [<ffffffff811fb066>] >> dump_header+0x5b/0x1d5 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553717] [<ffffffff81188b75>] >> oom_kill_process+0x205/0x3d0 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553718] [<ffffffff8118857e>] ? >> oom_unkillable_task+0x9e/0xc0 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553720] [<ffffffff811891ab>] >> out_of_memory+0x40b/0x460 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553722] [<ffffffff811fbb1f>] >> __alloc_pages_slowpath.constprop.87+0x742/0x7ad >> Nov 17 10:47:17 ceph1 kernel: [2810698.553725] [<ffffffff8118e1a7>] >> __alloc_pages_nodemask+0x237/0x240 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553727] [<ffffffff8118e36d>] >> alloc_kmem_pages_node+0x4d/0xd0 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553730] [<ffffffff8107c125>] >> copy_process+0x185/0x1ce0 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553732] [<ffffffff813320f3>] ? >> security_file_alloc+0x33/0x50 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553734] [<ffffffff8107de1a>] >> _do_fork+0x8a/0x310 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553737] [<ffffffff8108dc01>] ? >> sigprocmask+0x51/0x80 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553738] [<ffffffff8107e149>] >> SyS_clone+0x19/0x20 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553743] [<ffffffff81802d36>] >> entry_SYSCALL_64_fastpath+0x16/0x75 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553744] Mem-Info: >> Nov 17 10:47:17 ceph1 kernel: [2810698.553747] active_anon:798964 >> inactive_anon:187711 isolated_anon:0 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553747] active_file:3299435 >> inactive_file:3242497 isolated_file:9 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553747] unevictable:0 dirty:18678 >> writeback:949 unstable:0 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553747] slab_reclaimable:238396 >> slab_unreclaimable:142741 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553747] mapped:12429 shmem:319 >> pagetables:6616 bounce:0 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553747] free:96727 free_pcp:0 >> free_cma:0 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553749] Node 0 DMA free:14828kB >> min:32kB low:40kB high:48kB active_anon:0kB inactive_anon:0kB >> active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB >> isolated(file):0kB present:15996kB managed:15896kB mlocked:0kB dirty:0kB >> writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB >> slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB >> bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB >> pages_scanned:0 all_unreclaimable? yes >> Nov 17 10:47:17 ceph1 kernel: [2810698.553753] lowmem_reserve[]: 0 1830 >> 32011 32011 32011 >> Nov 17 10:47:17 ceph1 kernel: [2810698.553755] Node 0 DMA32 free:134848kB >> min:3860kB low:4824kB high:5788kB active_anon:120628kB >> inactive_anon:121792kB active_file:653404kB inactive_file:399272kB >> unevictable:0kB isolated(anon):0kB isolated(file):36kB present:1981184kB >> managed:1900752kB mlocked:0kB dirty:344kB writeback:0kB mapped:1200kB >> shmem:0kB slab_reclaimable:239900kB slab_unreclaimable:154560kB >> kernel_stack:43376kB pagetables:1176kB unstable:0kB bounce:0kB free_pcp:0kB >> local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 >> all_unreclaimable? no >> >> >> We check free Memory, Cache is free 24,705 MB >> >> root@ceph1:/var/log# free -m >> total used free shared buffers cached >> Mem: 32052 31729 323 1 125 24256 >> -/+ buffers/cache: 7347 24705 >> Swap: 61033 120 60913 >> >> >> >> It's so weird. Could you help me solve this problem, I'm afraid it 'll >> come again. >> >> Thanks in advance. >> >> >> >> > > _______________________________________________ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com