PANIC: could not flush dirty data: Cannot allocate memory

2022-11-14 Thread klaus . mailinglists

Hi all!

We have a setup with a master and plenty of logical replication slaves. 
Master and slaves are 12.12-1.pgdg22.04+1 running on Ubuntu 22.04.
The database size (SELECT pg_size_pretty( pg_database_size('regdns') );) 
ranges from 25GB (freshly installed slave) to 42GB (probably bloat).


The replication slave VMs have between 22G and 48G RAM; most have 48G RAM.

We are using:
maintenance_work_mem = 128MB
work_mem = 64MB

and on VMs with 48G RAM:
effective_cache_size = 8192MB
shared_buffers = 6144MB

and on VMs with 22G RAM:
effective_cache_size = 4096MB
shared_buffers = 2048MB

On several servers we see the error message: PANIC:  could not flush 
dirty data: Cannot allocate memory


Unfortunately I cannot find any reference to this kind of error. Can you 
please describe in detail what happens here? Is it related to server 
memory, or to our memory settings? I am not so surprised that it happens 
on the 22G RAM VM, and it does not happen on our 32G RAM VMs. But it also 
happens on some of the 48G RAM VMs, which should have plenty of RAM 
available:

# free -h
               total        used        free      shared  buff/cache   available
Mem:            47Gi         9Gi       1.2Gi       6.1Gi        35Gi        30Gi
Swap:          7.8Gi       3.0Gi       4.9Gi

Of course I could upgrade all our VMs and then wait and see whether that 
solves the problem. But I would like to understand what is happening here 
before spending $$$.


Thanks
Klaus





Re: PANIC: could not flush dirty data: Cannot allocate memory

2022-11-15 Thread klaus . mailinglists

Thanks all for digging into this problem.

AFAIU the problem is not related to the memory settings in 
postgresql.conf. It is the kernel that, for whatever reason, reports 
ENOMEM. Correct?

On 2022-11-14 22:54, Christoph Moench-Tegeder wrote:

## klaus.mailingli...@pernau.at (klaus.mailingli...@pernau.at):


On several servers we see the error message: PANIC:  could not flush
dirty data: Cannot allocate memory


As far as I can see, that "could not flush dirty data" appears a total of
three times in the code - there are other places where postgresql could
PANIC on fsync()-and-stuff-related issues, but they have different
messages.
Of these three places, there's a sync_file_range(), a posix_fadvise()
and an msync(), all in src/backend/storage/file/fd.c. "Cannot allocate
memory" would be ENOMEM, which posix_fadvise() does not return (as per
its docs). So this would be sync_file_range(), which could run out
of memory (as per the manual), or msync(), where ENOMEM actually means
"The indicated memory (or part of it) was not mapped". Both cases are
somewhat WTF for this setup.
What filesystem are you running?


Filesystem is ext4. VM technology is mixed: VMware, KVM and XEN PV. 
Kernel is 5.15.0-52-generic.
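
To narrow this down and take PostgreSQL out of the picture, I could try a 
small standalone reproducer that flushes a file the same way pg_flush_data() 
appears to do on Linux (sync_file_range() with SYNC_FILE_RANGE_WRITE), run 
directly on the suspect filesystem. This is only a sketch, not PostgreSQL 
code, and the probe file path is just an example:

/*
 * syncrange_probe.c - hypothetical standalone reproducer, NOT PostgreSQL
 * code. Writes a small file and asks the kernel to start writeback of the
 * dirty range; an ENOMEM here would match the PANIC message.
 *
 * Build: gcc -O2 -o syncrange_probe syncrange_probe.c
 * Run:   ./syncrange_probe /var/lib/postgresql/12/probe.tmp   (example path)
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "probe.tmp";
    char buf[8192];
    memset(buf, 'x', sizeof(buf));

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf)) {
        perror("write");
        return 1;
    }

    /* Request asynchronous writeback of the dirty range. */
    if (sync_file_range(fd, 0, sizeof(buf), SYNC_FILE_RANGE_WRITE) != 0)
        fprintf(stderr, "sync_file_range: %s\n", strerror(errno));
    else
        printf("sync_file_range: ok\n");

    close(fd);
    unlink(path);
    return 0;
}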


We have not seen this with Ubuntu 18.04 and 20.04 (although we might not 
have noticed it).


I guess upgrading to PostgreSQL 13/14/15 will not help, as the problem 
happens in the kernel.


Do you have any advice on how to proceed? Should I look out for certain 
kernel changes, either in the kernel itself or in the ext4 changelog?


Thanks
Klaus






Re: PANIC: could not flush dirty data: Cannot allocate memory

2022-11-29 Thread klaus . mailinglists

Hello all!


Thanks for the many hints on what to look for. We did some tuning and 
further debugging; here are the outcomes, answering all questions in a 
single email.



In the meantime, you could experiment with setting 
checkpoint_flush_after to 0

We did this:
# SHOW checkpoint_flush_after;
 checkpoint_flush_after
------------------------
 0
(1 row)

But we STILL have PANICs. I tried to understand the code but failed. I 
guess that there are some code paths which call pg_flush_data() without 
checking this setting, or the check does not work.
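
If I read the code correctly, checkpoint_flush_after is not the only setting 
that ends up in pg_flush_data(): bgwriter_flush_after and backend_flush_after 
feed the same writeback path, and some operations seem to flush without 
consulting any of these settings, which would explain why zeroing 
checkpoint_flush_after alone does not make the PANICs go away. To see which 
of the other two syscalls mentioned earlier (posix_fadvise() and msync()) 
misbehaves on the affected filesystem, I could run a similar standalone probe 
(again just a sketch, not PostgreSQL code; the test path is an example):

/*
 * flush_probe2.c - hypothetical probe, NOT PostgreSQL code. Exercises the
 * two remaining flush paths from fd.c: posix_fadvise(POSIX_FADV_DONTNEED)
 * and msync() on a writable mapping, printing any error so the failing
 * syscall can be identified without involving the server.
 *
 * Build: gcc -O2 -o flush_probe2 flush_probe2.c
 * Run:   ./flush_probe2 /var/lib/postgresql/12/probe.tmp   (example path)
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "probe.tmp";
    const size_t len = 8192;

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, (off_t) len) != 0) { perror("ftruncate"); return 1; }

    /* posix_fadvise() returns the error number directly, not via errno. */
    int rc = posix_fadvise(fd, 0, (off_t) len, POSIX_FADV_DONTNEED);
    printf("posix_fadvise: %s\n", rc ? strerror(rc) : "ok");

    /* Dirty the file through a shared mapping, then flush it with msync();
     * ENOMEM here would mean "the indicated memory was not mapped". */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    memset(p, 'x', len);
    if (msync(p, len, MS_ASYNC) != 0)
        perror("msync");
    else
        printf("msync: ok\n");

    munmap(p, len);
    close(fd);
    unlink(path);
    return 0;
}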




Did this start after upgrading to 22.04? Or after a certain kernel 
upgrade?


It definitely only started with Ubuntu 22.04. We did not have, and still 
do not have, any issues on servers with Ubuntu 20.04 and 18.04.




I would believe that the kernel would raise
a bunch of printks if it hit ENOMEM in the commonly used paths, so
you would see something in dmesg or wherever you collect your kernel
log if it happened where it was expected.


There is nothing in the kernel logs (dmesg).



Do you use cgroups or such to limit memory usage of postgres?


No



Any uncommon options on the filesystem or the mount point?

No. Also no Antivirus:
/dev/xvda2 / ext4 noatime,nodiratime,errors=remount-ro 0 1
or
LABEL=cloudimg-rootfs   /   ext4   discard,errors=remount-ro   0 1



does this happen on all the hosts, or is it limited to one host or one 
technology?


It happens on XEN VMs, KVM VMs and VMware VMs, on both Intel and AMD 
platforms.



Another interesting thing would be to know the mount and file system 
options for the FS that triggers the failures. E.g.


# tune2fs -l /dev/sda1
tune2fs 1.46.5 (30-Dec-2021)
Filesystem volume name:   cloudimg-rootfs
Last mounted on:  /
Filesystem UUID:  0522e6b3-8d40-4754-a87e-5678a6921e37
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg encrypt sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags: signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state: clean
Errors behavior:  Continue
Filesystem OS type:   Linux
Inode count:  12902400
Block count:  26185979
Reserved block count: 0
Overhead clusters:        35096
Free blocks:  18451033
Free inodes:  12789946
First block:  0
Block size:   4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:  243
Blocks per group: 32768
Fragments per group:  32768
Inodes per group: 16128
Inode blocks per group:   1008
Flex block group size:    16
Filesystem created:   Wed Apr 20 18:31:24 2022
Last mount time:  Thu Nov 10 09:49:34 2022
Last write time:  Thu Nov 10 09:49:34 2022
Mount count:  7
Maximum mount count:  -1
Last checked: Wed Apr 20 18:31:24 2022
Check interval:           0 (<none>)
Lifetime writes:  252 GB
Reserved blocks uid:  0 (user root)
Reserved blocks gid:  0 (group root)
First inode:  11
Inode size:   256
Required extra isize: 32
Desired extra isize:  32
Journal inode:8
First orphan inode:   42571
Default directory hash:   half_md4
Directory Hash Seed:  c5ef129b-fbee-4f35-8f28-ad7cc93c1c43
Journal backup:   inode blocks
Checksum type:            crc32c
Checksum: 0xb74ebbc3


Thanks
Klaus





Re: PANIC: could not flush dirty data: Cannot allocate memory

2022-12-05 Thread klaus . mailinglists

Some more updates 

Did this start after upgrading to 22.04? Or after a certain kernel 
upgrade?


It definitely only started with Ubuntu 22.04. We did not have, and still 
do not have, any issues on servers with Ubuntu 20.04 and 18.04.


It also happens with Ubuntu 22.10 (kernel 5.19.0-23-generic). We are now 
trying the 6.0 mainline and 5.15 mainline kernels on some servers.


I also forgot to mention that the /var/lib/postgresql/12 directory is 
encrypted with fscrypt (ext4 encryption). So we have also deactivated the 
directory encryption on one server to see whether the problem is related 
to encryption.
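
In case it matters for reproducing this: whether a given file or directory 
under the data directory really carries an fscrypt policy can be checked 
with the FS_IOC_GET_ENCRYPTION_POLICY ioctl (the fscrypt tool's 
"fscrypt status <path>" should report the same). A minimal sketch, assuming 
kernel headers new enough to provide <linux/fscrypt.h>:

/*
 * fscrypt_check.c - hypothetical helper, checks whether a path carries an
 * fscrypt (ext4) encryption policy.
 *
 * Build: gcc -O2 -o fscrypt_check fscrypt_check.c
 * Run:   ./fscrypt_check /var/lib/postgresql/12
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fscrypt.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : ".";

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct fscrypt_policy_v1 policy;
    if (ioctl(fd, FS_IOC_GET_ENCRYPTION_POLICY, &policy) == 0)
        printf("%s: encrypted (v1 policy)\n", path);
    else if (errno == ENODATA)
        printf("%s: not encrypted\n", path);
    else if (errno == EINVAL)
        printf("%s: encrypted, but with a newer policy version (e.g. v2)\n", path);
    else
        printf("%s: FS_IOC_GET_ENCRYPTION_POLICY failed: %s\n", path, strerror(errno));

    close(fd);
    return 0;
}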


thanks
Klaus