PANIC: could not flush dirty data: Cannot allocate memory
Hi all!

We have a setup with a master and plenty of logical replication slaves. Master and slaves run PostgreSQL 12.12-1.pgdg22.04+1 on Ubuntu 22.04.

The database size, as reported by SELECT pg_size_pretty( pg_database_size('regdns') );, ranges from 25 GB (freshly installed slave) to 42 GB (probably bloat).

The replication slave VMs have between 22G and 48G RAM; most have 48G RAM. We are using:

maintenance_work_mem = 128MB
work_mem = 64MB

and on the VMs with 48G RAM:

effective_cache_size = 8192MB
shared_buffers = 6144MB

and on the VMs with 22G RAM:

effective_cache_size = 4096MB
shared_buffers = 2048MB

On several servers we see the error message:

PANIC: could not flush dirty data: Cannot allocate memory

Unfortunately I cannot find any reference to this kind of error. Can you please describe in detail what happens here? Is it related to server memory, or to our memory settings?

I am not so surprised that it happens on the 22G RAM VMs, and it is not happening on our 32G RAM VMs. But it also happens on some of the 48G RAM VMs, which should have plenty of RAM available:

# free -h
               total        used        free      shared  buff/cache   available
Mem:            47Gi         9Gi       1.2Gi       6.1Gi        35Gi        30Gi
Swap:          7.8Gi       3.0Gi       4.9Gi

Of course I could upgrade all our VMs and then wait and see if that solves the problem, but I would like to understand what is happening here before spending $$$.

Thanks
Klaus
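For reference, the values the running server actually uses can be confirmed with a query along these lines (a generic sketch; the postgres OS user and the regdns database name are taken from the post above):

  sudo -u postgres psql -d regdns -c "
    SELECT name, setting, unit
      FROM pg_settings
     WHERE name IN ('shared_buffers', 'work_mem',
                    'maintenance_work_mem', 'effective_cache_size');"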
Re: PANIC: could not flush dirty data: Cannot allocate memory
Thanks all for digging into this problem. AFAIU the problem is not related to the memory settings in postgresql.conf; it is the kernel that for whatever reason reports ENOMEM. Correct?

On 2022-11-14 22:54, Christoph Moench-Tegeder wrote:
> ## klaus.mailingli...@pernau.at (klaus.mailingli...@pernau.at):
>
>> On several servers we see the error message:
>> PANIC: could not flush dirty data: Cannot allocate memory
>
> As far as I can see, that "could not flush dirty data" happens a total of
> three times in the code - there are other places where postgresql could
> PANIC on fsync()-and-stuff-related issues, but they have different
> messages. Of these three places, there's a sync_file_range(), a
> posix_fadvise() and an msync(), all in src/backend/storage/file/fd.c.
> "Cannot allocate memory" would be ENOMEM, which posix_fadvise() does not
> return (as per its docs). So this would be sync_file_range(), which could
> run out of memory (as per the manual), or msync(), where ENOMEM actually
> means "The indicated memory (or part of it) was not mapped". Both cases
> are somewhat WTF for this setup.
>
> What filesystem are you running?

The filesystem is ext4. The VM technology is mixed: VMware, KVM and XEN PV. The kernel is 5.15.0-52-generic. We have not seen this with Ubuntu 18.04 and 20.04 (although we might not have noticed it).

I guess upgrading to PostgreSQL 13/14/15 does not help, as the problem happens in the kernel.

Do you have any advice on how to go further? Should I look out for certain kernel changes, either in the kernel itself or in the ext4 changelog?

Thanks
Klaus
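One way to narrow down which of the three calls actually returns ENOMEM would be to trace just those syscalls on an affected server (a suggestion only, not something reported as tried in this thread; <pid> is a placeholder for the checkpointer or a backend that later PANICs):

  # Trace only the flush-related syscalls of one PostgreSQL process.
  # On Linux, posix_fadvise() shows up as the fadvise64 syscall.
  strace -f -tt -e trace=sync_file_range,fadvise64,msync -p <pid> -o /tmp/flush-trace.log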
Re: PANIC: could not flush dirty data: Cannot allocate memory
Hello all!

Thanks for the many hints to look for. We did some tuning and further debugging; here are the outcomes, answering all questions in a single email.

> In the meantime, you could experiment with setting checkpoint_flush_after to 0.

We did this:

# SHOW checkpoint_flush_after;
 checkpoint_flush_after
------------------------
 0
(1 row)

But we STILL have PANICs. I tried to understand the code but failed. I guess that there are some code paths which call pg_flush_data() without checking this setting, or the check does not work.

> Did this start after upgrading to 22.04? Or after a certain kernel upgrade?

For sure it only started with Ubuntu 22.04. We did not have and still do not have any issues on servers with Ubuntu 20.04 and 18.04.

> I would believe that the kernel would raise a bunch of printks if it hit
> ENOMEM in the commonly used paths, so you would see something in dmesg or
> wherever you collect your kernel log if it happened where it was expected.

There is nothing in the kernel logs (dmesg).

> Do you use cgroups or such to limit memory usage of postgres?

No.

> Any uncommon options on the filesystem or the mount point?

No. Also no antivirus:

/dev/xvda2 / ext4 noatime,nodiratime,errors=remount-ro 0 1

or

LABEL=cloudimg-rootfs / ext4 discard,errors=remount-ro 0 1

> Does this happen on all the hosts, or is it limited to one host or one technology?

It happens on XEN VMs, KVM VMs and VMware VMs, on Intel and AMD platforms.

> Another interesting thing would be to know the mount and file system
> options for the FS that triggers the failures.

E.g.:

# tune2fs -l /dev/sda1
tune2fs 1.46.5 (30-Dec-2021)
Filesystem volume name:   cloudimg-rootfs
Last mounted on:          /
Filesystem UUID:          0522e6b3-8d40-4754-a87e-5678a6921e37
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg encrypt sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              12902400
Block count:              26185979
Reserved block count:     0
Overhead clusters:        35096
Free blocks:              18451033
Free inodes:              12789946
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      243
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16128
Inode blocks per group:   1008
Flex block group size:    16
Filesystem created:       Wed Apr 20 18:31:24 2022
Last mount time:          Thu Nov 10 09:49:34 2022
Last write time:          Thu Nov 10 09:49:34 2022
Mount count:              7
Maximum mount count:      -1
Last checked:             Wed Apr 20 18:31:24 2022
Check interval:           0 (<none>)
Lifetime writes:          252 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
First orphan inode:       42571
Default directory hash:   half_md4
Directory Hash Seed:      c5ef129b-fbee-4f35-8f28-ad7cc93c1c43
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0xb74ebbc3

Thanks
Klaus
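One detail that may be relevant to the guess about unchecked pg_flush_data() paths (my reading, not confirmed in this thread): checkpoint_flush_after only governs writeback hinting by the checkpointer, while PostgreSQL 12 has the sibling settings bgwriter_flush_after and backend_flush_after for the background writer and ordinary backends. A possible follow-up experiment, assuming the configuration is managed via ALTER SYSTEM:

  # Run as the postgres OS user; disables writeback hinting for the other
  # *_flush_after paths and reloads the configuration.
  # (backend_flush_after already defaults to 0; included for completeness.)
  sudo -u postgres psql -c "ALTER SYSTEM SET bgwriter_flush_after = 0;" \
                        -c "ALTER SYSTEM SET backend_flush_after = 0;" \
                        -c "SELECT pg_reload_conf();"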
Re: PANIC: could not flush dirty data: Cannot allocate memory
Some more updates.

>> Did this start after upgrading to 22.04? Or after a certain kernel upgrade?
>
> For sure it only started with Ubuntu 22.04. We did not have and still do
> not have any issues on servers with Ubuntu 20.04 and 18.04.

It also happens with Ubuntu 22.10 (kernel 5.19.0-23-generic). We are now trying the 6.0 mainline and 5.15 mainline kernels on some servers.

I also forgot to mention that the /var/lib/postgresql/12 directory is encrypted with fscrypt (ext4 encryption). So we have also deactivated the directory encryption on one server to see if the problem is related to encryption.

Thanks
Klaus
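The encryption status of the data directory can be checked with something like the following (the path is the one from this thread; the first command requires the fscrypt userspace tool to be installed):

  # Show the ext4/fscrypt encryption status of the PostgreSQL data directory
  fscrypt status /var/lib/postgresql/12
  # Encrypted files and directories also carry the 'E' attribute flag
  lsattr -d /var/lib/postgresql/12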