Hello,

(please let me know if this is more appropriate somewhere else, e.g. on debian-kernel)

I need help debugging/solving a weird memory problem. The symptoms are the usual ones for high memory usage: free/available memory gets low, the system starts swapping, disk I/O increases, and performance drops. However, from what I can see, the memory is not used up by user-space processes but by the kernel (NOT caches/buffers); see the command output at the end. I'm still puzzled about what exactly eats all the RAM and how to reclaim it (without rebooting the machine, of course!).

Any help would be highly appreciated!

Some findings so far:

- same problem on many systems, all Debian 9 Stretch, all running the stock 4.9 kernel from the official package, all amd64 virtual machines on several (different) VMware ESXi hosts.
- not all Stretch systems seem to be affected, but we haven't yet found the common ground.
- problem can occur after some days or some weeks, not at the same time on all affected machines, and not at the same time for all VMs on the same host.
- problem only occurs on Stretch systems, not Jessie, even when running on the same host.
- we haven't yet seen the problem on real hardware machines, only on VMs (but since the vast majority of our systems are VMs, this may not be relevant).
- problem does not seem directly related to the machine's load: it occurs on machines that are mostly idle as well as on more heavily loaded systems.
- problem occurs the same on single-core VMs as well as on multi-core VMs.
- problem occurs the same on VMs running on single-socket hosts as well as on multi-socket hosts.
- problem occurs the same on VMs running on hosts with different hypervisor releases, both VMware ESXi 5.5 and 6.5, both standalone and in a vSphere cluster.

Here's the output from some commands that I hope will be helpful. The machine in this example is a RADIUS server but has not even gone into production ... no incoming client requests yet.
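
As a rough cross-check of the "kernel, not caches/buffers" observation, the counters in /proc/meminfo can be summed and compared against MemTotal. The sketch below does that for the problem-state values shown further down. Which counters count as "accounted for" is my own assumption rather than an official formula, and `parse_meminfo`, `unaccounted_kb` and `ACCOUNTED` are just illustrative names.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {name: int}; values are kB where a unit is given."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[name.strip()] = int(fields[0])
    return info


# Counters treated as "accounted for" here. This selection is an assumption
# made for illustration, not an official kernel formula.
ACCOUNTED = ("MemFree", "Buffers", "Cached", "SwapCached",
             "AnonPages", "Slab", "PageTables", "KernelStack")


def unaccounted_kb(info):
    """MemTotal minus everything the counters above explain, in kB."""
    return info["MemTotal"] - sum(info.get(key, 0) for key in ACCOUNTED)


# Values copied from the problem-state /proc/meminfo output in this mail.
SAMPLE = """\
MemTotal: 1010976 kB
MemFree: 61068 kB
Buffers: 1020 kB
Cached: 5748 kB
SwapCached: 3988 kB
AnonPages: 6608 kB
Slab: 79680 kB
PageTables: 5804 kB
KernelStack: 2992 kB
"""

gap = unaccounted_kb(parse_meminfo(SAMPLE))
print(f"unaccounted: {gap} kB ({gap / 1024:.0f} MiB)")
# prints: unaccounted: 844068 kB (824 MiB)
```

On a live system the same functions can be fed the contents of /proc/meminfo instead of SAMPLE. For comparison, the same sum over the post-reboot values leaves only 38260 kB unexplained.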
(But the problem is not related to the RADIUS server software - OSC Radiator - since the same symptoms show on different machines: not only RADIUS servers but also nameservers, shell servers or jumphosts, etc.)

[values while the problem persists:]
------------------------------------------------------------------------
root@rad-m2m-srv02:~# free -thwl
              total        used        free      shared     buffers       cache   available
Mem:           987M        910M         59M          0B        704K         16M         13M
Low:           987M        927M         59M
High:            0B          0B          0B
Swap:          2,0G        345M        1,7G
Total:         3,0G        1,2G        1,7G

root@rad-m2m-srv02:~# smem -twk
Area                           Used      Cache   Noncache
firmware/hardware                 0          0          0
kernel image                      0          0          0
kernel dynamic memory        914.9M      11.1M     903.8M
userspace memory              13.0M       5.5M       7.4M
free memory                   59.4M      59.4M          0
----------------------------------------------------------
                             987.3M      76.1M     911.2M

root@rad-m2m-srv02:~# smem -uktr
User        Count     Swap      USS      PSS      RSS
root           39   332.8M    10.4M    12.4M    44.7M
msch            6     7.0M        0   607.0K     8.3M
_chrony         1   360.0K     4.0K    20.0K   572.0K
messagebus      1   580.0K     4.0K    17.0K   480.0K
postfix         2     1.6M        0    13.0K   568.0K
daemon          1   208.0K     4.0K     6.0K    72.0K
---------------------------------------------------
               50   342.5M    10.4M    13.0M    54.7M

root@rad-m2m-srv02:~# sort -k2,2nr /proc/meminfo
VmallocTotal: 34359738367 kB
CommitLimit: 2602636 kB
SwapTotal: 2097148 kB
SwapFree: 1741028 kB
MemTotal: 1010976 kB
DirectMap4k: 1007488 kB
Committed_AS: 465128 kB
Slab: 79680 kB
SUnreclaim: 69268 kB
MemFree: 61068 kB
DirectMap2M: 40960 kB
SReclaimable: 10412 kB
Active: 6944 kB
Inactive: 6660 kB
AnonPages: 6608 kB
PageTables: 5804 kB
Cached: 5748 kB
Mapped: 4660 kB
SwapCached: 3988 kB
Active(file): 3920 kB
Inactive(anon): 3828 kB
Active(anon): 3024 kB
KernelStack: 2992 kB
Inactive(file): 2832 kB
Hugepagesize: 2048 kB
Buffers: 1020 kB
Dirty: 8 kB
AnonHugePages: 0 kB
Bounce: 0 kB
HardwareCorrupted: 0 kB
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
HugePages_Total: 0
MemAvailable: 0 kB
Mlocked: 0 kB
NFS_Unstable: 0 kB
Shmem: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
Unevictable: 0 kB
VmallocChunk: 0 kB
VmallocUsed: 0 kB
Writeback: 0 kB
WritebackTmp: 0 kB

root@rad-m2m-srv02:~# ps aux --sort=-rss | head -15
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     34718 12.0  0.5  29596  5672 ?        D    09:01   0:00 /usr/bin/python3 -Es /usr/bin/lsb_release --short --description
root     26491  3.1  0.2  79328  2860 ?        D    08:04   1:50 apt-get update -qq
root     32551  6.8  0.2 119036  2800 ?        D    08:51   0:43 /usr/bin/python3 /usr/bin/unattended-upgrade
root     34719  0.0  0.2  41164  2232 pts/1    R+   09:02   0:00 ps aux --sort=-rss
msch     33960  0.1  0.1  23720  1844 pts/0    Ss   08:58   0:00 -bash
root     34492  0.2  0.1  23816  1812 pts/1    S    09:00   0:00 -bash
msch     33996  0.0  0.1  23576  1768 pts/1    Ss   08:58   0:00 bash -i
root     12792  2.2  0.1 159720  1748 ?        D    06:06   3:54 /usr/bin/perl -w /usr/bin/apt-show-versions -i
root     34521  0.7  0.1  95180  1712 ?        Ss   09:01   0:00 sshd: root@notty
root     15502  2.4  0.1 167660  1608 ?        D    06:25   3:51 /usr/bin/perl -w /usr/bin/apt-show-versions -i
root     34527  1.7  0.1  14096  1596 ?        Ss   09:01   0:00 /bin/bash /usr/bin/check_mk_agent
root     33947  0.0  0.1  95180  1564 ?        Ss   08:58   0:00 sshd: msch [priv]
root     26486  0.0  0.1   9600  1436 ?        S    08:04   0:00 /bin/bash 3600/mk_apt
root     26483  0.0  0.1   9588  1424 ?        S    08:04   0:00 /bin/bash

root@rad-m2m-srv02:~# lsof | wc -l
1943

root@rad-m2m-srv02:~# df -Th -t tmpfs
Filesystem     Type   Size  Used Avail Use% Mounted on
tmpfs          tmpfs   99M   12M   87M  12% /run
tmpfs          tmpfs  494M     0  494M   0% /dev/shm
tmpfs          tmpfs  5,0M     0  5,0M   0% /run/lock
tmpfs          tmpfs  494M     0  494M   0% /sys/fs/cgroup
tmpfs          tmpfs  1,0G     0  1,0G   0% /tmp
tmpfs          tmpfs   99M     0   99M   0% /run/user/0
tmpfs          tmpfs   99M     0   99M   0% /run/user/2029

root@rad-m2m-srv02:~# vmware-toolbox-cmd stat balloon
0 MB

root@rad-m2m-srv02:~# cat /sys/kernel/debug/vmmemctl
balloon capabilities:   0x1e
used capabilities:      0x1e
is resetting:           n
target:                       0 pages
current:                      0 pages
rateSleepAlloc:            2048 pages/sec
timer:                  3968363
doorbell:                     0
start:                        7 (   0 failed)
guestType:                    7 (   0 failed)
2m-lock:                      0 (   0 failed)
lock:                         0 (   0 failed)
2m-unlock:                    0 (   0 failed)
unlock:                       0 (   0 failed)
target:                 3968363 (   6 failed)
prim2mAlloc:                  0 (   0 failed)
primNoSleepAlloc:             0 (   0 failed)
primCanSleepAlloc:            0 (   0 failed)
prim2mFree:                   0
primFree:                     0
err2mAlloc:                   0
errAlloc:                     0
err2mFree:                    0
errFree:                      0
doorbellSet:                  6
doorbellUnset:                7

root@rad-m2m-srv02:~# nice vmstat -w 1 10
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
 r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st
 0  5       356620        60868         1140        16280   37   19   704    31    3    2   1   2  97   1   0
 1  4       356180        60372          320        16224 3008  624  6180  1236 1109 1915   2  18   0  80   0
 2  5       356632        61476          320        15568 2776 1452  3128  2012 1146 1802   1  14   0  85   0
 1  3       356592        62228          324        15244 2848  952  3784  1564 1029 1780   0  11   0  89   0
 2  4       356732        61492          612        15544 2864 1144  3932  1720 1164 1839   2   9   0  89   0
 1  4       357252        62836          556        15248 4000 1800  4432  3048 1398 2359   1  15   0  84   0
 0  4       356700        61744          448        15248 3368  668  3368  1276 1093 2039   0   9   0  91   0
 2  4       356708        61372          456        16272 1940  868  4744   888  876 1377   0  12   0  88   0
 0  4       356704        61744         1156        14700 2740  660  4828  1940 1123 1768   0  14   0  86   0
 0  4       357556        62240          680        15568 2908 1476  5436  2064 1062 1804   1  15   0  84   0

root@rad-m2m-srv02:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 9.8 (stretch)
Release:        9.8
Codename:       stretch

root@rad-m2m-srv02:~# uname -a
Linux rad-m2m-srv02 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3 (2019-02-02) x86_64 GNU/Linux

root@rad-m2m-srv02:~# w
 09:02:30 up 45 days, 22:20, 1 user, load average: 5,13, 5,03, 6,58
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
msch     pts/0    10.208.105.87    08:58    4.00s  0.26s  0.03s script memdebug
root@rad-m2m-srv02:~#

[values directly after rebooting:]
------------------------------------------------------------------------
root@rad-m2m-srv02:~# w
 09:23:02 up 4 min, 1 user, load average: 0,01, 0,08, 0,04
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
msch     pts/0    10.208.105.87    09:21    6.00s  0.26s  0.02s sshd: msch [priv]

root@rad-m2m-srv02:~# free -thwl
              total        used        free      shared     buffers       cache   available
Mem:           987M        112M        610M        4,3M         16M        247M        735M
Low:           987M        377M        610M
High:            0B          0B          0B
Swap:          2,0G          0B        2,0G
Total:         3,0G        112M        2,6G

root@rad-m2m-srv02:~# smem -twk
Area                           Used      Cache   Noncache
firmware/hardware                 0          0          0
kernel image                      0          0          0
kernel dynamic memory        287.1M     226.6M      60.5M
userspace memory              93.8M      37.8M      56.0M
free memory                  606.4M     606.4M          0
----------------------------------------------------------
                             987.3M     870.8M     116.5M

root@rad-m2m-srv02:~# smem -uktr
User        Count     Swap      USS      PSS      RSS
root           19        0    62.9M    72.8M   128.7M
postfix         6        0     7.9M    12.1M    42.9M
msch            4        0     3.7M     7.3M    19.4M
messagebus      1        0     1.2M     1.5M     3.8M
_chrony         1        0   896.0K  1020.0K     2.8M
daemon          1        0   228.0K   309.0K     2.1M
---------------------------------------------------
               32        0    76.9M    95.0M   199.7M

root@rad-m2m-srv02:~# sort -k2,2nr /proc/meminfo
VmallocTotal: 34359738367 kB
CommitLimit: 2602636 kB
SwapFree: 2097148 kB
SwapTotal: 2097148 kB
MemTotal: 1010976 kB
DirectMap2M: 983040 kB
MemAvailable: 753520 kB
MemFree: 624520 kB
Cached: 234508 kB
Active: 161672 kB
Inactive: 142964 kB
Inactive(file): 138936 kB
Committed_AS: 124808 kB
Active(file): 108028 kB
DirectMap4k: 65408 kB
Active(anon): 53644 kB
AnonPages: 53300 kB
Slab: 36968 kB
Mapped: 36760 kB
SReclaimable: 19424 kB
SUnreclaim: 17544 kB
Buffers: 16836 kB
Shmem: 4392 kB
Inactive(anon): 4028 kB
PageTables: 3836 kB
KernelStack: 2748 kB
Hugepagesize: 2048 kB
Dirty: 60 kB
AnonHugePages: 0 kB
Bounce: 0 kB
HardwareCorrupted: 0 kB
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
HugePages_Total: 0
Mlocked: 0 kB
NFS_Unstable: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
SwapCached: 0 kB
Unevictable: 0 kB
VmallocChunk: 0 kB
VmallocUsed: 0 kB
Writeback: 0 kB
WritebackTmp: 0 kB

root@rad-m2m-srv02:~# ps aux --sort=-rss | head -15
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       651  0.1  2.6  78748 26992 ?        S    09:18   0:00 /usr/bin/perl /opt/radiator/bin/radiusd -daemon -pid_file /var/run/radiator.pid -config_file /opt/radiator/etc/radiator.cfg -I /opt/radiator/share/perl/5.24.1/
root       411  0.0  1.7 153488 18144 ?        Ss   09:18   0:00 /usr/bin/VGAuthService
root       221  0.1  1.0 136488 10464 ?        Ss   09:18   0:00 /usr/bin/vmtoolsd
postfix   2033  0.1  0.8  88652  8968 ?        S    09:22   0:00 smtp -t unix -u
postfix   2034  0.0  0.8  87480  8132 ?        S    09:22   0:00 tlsmgr -l -t unix -u
root         1  0.3  0.6  57052  6736 ?        Ss   09:18   0:00 /sbin/init
root      1462  0.0  0.6  95180  6736 ?        Ss   09:21   0:00 sshd: msch [priv]
postfix   2031  0.0  0.6  83352  6700 ?        S    09:22   0:00 cleanup -z -t unix -u
postfix    649  0.0  0.6  83296  6600 ?        S    09:18   0:00 qmgr -l -t unix -u
postfix   2032  0.0  0.6  83260  6600 ?        S    09:22   0:00 trivial-rewrite -n rewrite -t unix -u
postfix    648  0.0  0.6  83248  6284 ?        S    09:18   0:00 pickup -l -t unix -u
root       527  0.0  0.6  69952  6168 ?        Ss   09:18   0:00 /usr/sbin/sshd -D
msch      1464  0.0  0.6  64832  6144 ?        Ss   09:21   0:00 /lib/systemd/systemd --user
root       251  0.0  0.5  47844  5872 ?        Ss   09:18   0:00 /lib/systemd/systemd-udevd

root@rad-m2m-srv02:~# lsof | wc -l
1605

root@rad-m2m-srv02:~# df -Th -t tmpfs
Filesystem     Type   Size  Used Avail Use% Mounted on
tmpfs          tmpfs   99M  4,3M   95M   5% /run
tmpfs          tmpfs  494M     0  494M   0% /dev/shm
tmpfs          tmpfs  5,0M     0  5,0M   0% /run/lock
tmpfs          tmpfs  494M     0  494M   0% /sys/fs/cgroup
tmpfs          tmpfs  1,0G     0  1,0G   0% /tmp
tmpfs          tmpfs   99M     0   99M   0% /run/user/2029

root@rad-m2m-srv02:~# vmware-toolbox-cmd stat balloon
0 MB

root@rad-m2m-srv02:~# cat /sys/kernel/debug/vmmemctl
balloon capabilities:   0x1e
used capabilities:      0x1e
is resetting:           n
target:                       0 pages
current:                      0 pages
rateSleepAlloc:            2048 pages/sec
timer:                      292
doorbell:                     0
start:                        1 (   0 failed)
guestType:                    1 (   0 failed)
2m-lock:                      0 (   0 failed)
lock:                         0 (   0 failed)
2m-unlock:                    0 (   0 failed)
unlock:                       0 (   0 failed)
target:                     292 (   0 failed)
prim2mAlloc:                  0 (   0 failed)
primNoSleepAlloc:             0 (   0 failed)
primCanSleepAlloc:            0 (   0 failed)
prim2mFree:                   0
primFree:                     0
err2mAlloc:                   0
errAlloc:                     0
err2mFree:                    0
errFree:                      0
doorbellSet:                  1
doorbellUnset:                1

root@rad-m2m-srv02:~# nice vmstat -w 1 10
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
 r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st
 0  0            0       622948        16868       254624    0    0   728   254  104  231   4   2  88   5   0
 0  0            0       622948        16868       254624    0    0     0     0   53   98   0   0 100   0   0
 0  0            0       622948        16876       254600    0    0     0    20   50   96   0   0 100   0   0
 0  0            0       622948        16876       254600    0    0     0     0   50   91   0   0 100   0   0
 0  0            0       622948        16876       254600    0    0     0     0   43   84   0   0 100   0   0
 0  0            0       622948        16876       254604    0    0     0     0   57  105   1   0  99   0   0
 0  0            0       622948        16876       254600    0    0     0     0   53  106   0   1  99   0   0
 0  0            0       622948        16876       254600    0    0     0     0   50   91   1   0  99   0   0
 1  0            0       622948        16876       254600    0    0     0     0   49   96   0   0 100   0   0
 0  0            0       622948        16876       254600    0    0     0    12   50   94   0   1  99   0   0
root@rad-m2m-srv02:~#
------------------------------------------------------------------------

Anything else I could check to help pinpoint the memory hog?

Thanks in advance!
Martin

--
Martin Schwarz * Karlsruhe, Germany * http://kuroi.de/