On Wed, Dec 11, 2024 at 4:00 AM Michael Harris <har...@gmail.com> wrote:
> Hi Jakub
>
> On Tue, 10 Dec 2024 at 22:36, Jakub Wartak
> <jakub.war...@enterprisedb.com> wrote:
[..]
> > 3. Maybe somehow there is a bigger interaction between posix_fallocate()
> > and XFS's delayed dynamic speculative preallocation when many processes
> > are all writing into different partitions? Maybe try the "allocsize=1m"
> > mount option for that fs and see if that helps. I'm going to speculate
> > about XFS speculative :) preallocations, but if we have the fd cache and
> > are *not* closing fds, how would XFS know to abort its own speculation
> > about a streaming write? (Multiply that by potentially the number of
> > opened fds to get an avalanche of "preallocations".)
>
> I will try to organize that. They are production systems so it might
> take some time.

Cool.

> > 4. You can also try compiling with the patch from Alvaro in [2],
> > "0001-Add-some-debugging-around-mdzeroextend.patch", so we might end up
> > having more clarity about the offsets involved. If not, you could use
> > 'strace -e fallocate -p <pid>' to get the exact syscall.
>
> I'll take a look at Alvaro's patch. strace sounds good, but how to
> arrange to start it on the correct PG backends? There will be a
> large-ish number of PG backends going at a time, only some of which
> are performing imports, and they will be coming and going every so
> often as the ETL application scales up and down with the load.

Yes, it sounds like mission impossible. Is there any chance you can get
it reproduced with just one or a small number of postgres backends doing
the writes?

> > 5. Another idea could be catching the kernel-side stacktrace of
> > fallocate() when it is hitting ENOSPC. E.g. with an XFS fs and the
> > attached bpftrace eBPF tracer I could get to the source of the problem
> > in my artificial reproducer.
>
> OK, I will look into that also.

Hopefully that reveals some more. Unfortunately UNIX error reporting lumps
one big pile of failure modes into the single ENOSPC error, which is not
helpful at all (inode, extent and block allocation problems are all
squeezed into one errno).
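On the strace front: rather than trying to guess which backend PIDs to
attach to, a system-wide probe filtered on the process name might be
enough. Something along these lines (a rough, untested sketch, much simpler
than the attached tracer; it needs root, bpftrace and the fallocate syscall
tracepoints, and assumes the backends' comm is "postgres"; -28 is -ENOSPC
on Linux) should print the fd/offset/length of whichever fallocate() call
fails, no matter which backend issues it:

bpftrace -e '
tracepoint:syscalls:sys_enter_fallocate
/comm == "postgres"/
{
        /* remember the arguments per thread */
        @fd[tid] = args->fd;
        @off[tid] = args->offset;
        @len[tid] = args->len;
}
tracepoint:syscalls:sys_exit_fallocate
/@fd[tid]/
{
        /* -28 == -ENOSPC */
        if (args->ret == -28) {
                printf("pid %d: fallocate(fd=%d, offset=%ld, len=%ld) = ENOSPC\n",
                    pid, @fd[tid], @off[tid], @len[tid]);
        }
        delete(@fd[tid]);
        delete(@off[tid]);
        delete(@len[tid]);
}'

That way nobody has to chase PIDs around, and the exact offsets/lengths can
be correlated with the server log afterwards.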
Anyway, in case it helps others, here are my notes so far on this thread,
including that useful file from the subthread (hopefully I have not
misinterpreted anything):

- works on < PG16, but fails on >= PG16, which extends via posix_fallocate()
  rather than via multiple separate (but adjacent) iovectors passed to
  pg_writev; posix_fallocate() is used only by mdzeroextend() and only when
  numblocks > 8
- 179k or 414k files in a single directory (0.3 - 0.5 s just to list them)
- the OS/filesystem was upgraded from an earlier release
- one AG has extremely small free extents compared to the other AGs (I bet
  the 22.73% in the 2-3 bucket below means small 8192-byte pg files in
  $PGDATA), and there are no large free extents in that AG (extent sizes are
  in 4096-byte fs blocks):

   from      to extents  blocks    pct
      1       1    4949    4949   0.65
      2       3   86113  173452  22.73
      4       7   19399   94558  12.39
      8      15   23233  248602  32.58
     16      31   12425  241421  31.64
total free extents 146119
total free blocks 762982
average free extent size 5.22165 (!)

- note that the maximum free extent size above (31 blocks) is very low
  compared to the other AGs, which have 1024-8192. So it looks like there is
  no contiguous free space in that AG for any request larger than
  31*4096 = 126976 bytes (??). (The PS below shows how to dump the same
  histogram for the other AGs.)
- we have the `extend_by_pages += extend_by_pages * waitcount;` logic, capped
  at 64 pg blocks maximum, and 64 * 8192 = 512kB is well above the ~124kB of
  contiguous free space available there
- but the failures were also observed with pg_upgrade --link -j /
  pg_restore -j (also concurrent posix_fallocate() against many independent
  files sharing the same AG, but that is 1 backend : 1 file, so no contention
  driving waitcount in RelationAddBlocks())
- so maybe it is lots of backends doing independent, concurrent
  posix_fallocate() calls that somehow end up coalesced? Or, hypothetically,
  say 16-32 fallocate() calls hit the same AG at once; maybe there is some
  concurrency semi-race-condition inside XFS where one of the fallocate()
  calls fails to find space in that one AG, although according to [1] it
  should fall back to other AGs.
- and on top of that there is XFS's dynamic speculative preallocation, which
  might cause additional space pressure during our normal writes.

Another workaround idea/test: create a tablespace on the same XFS fs (but in
a somewhat different directory if possible) and see if it still fails.

-J.

[1] - https://blogs.oracle.com/linux/post/extent-allocation-in-xfs
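PS: for comparing the suspect AG against the healthy ones, I believe the
per-AG free-extent histogram above can be regenerated read-only with
xfs_db's freesp command, something like the following (the device path is
just a placeholder and the AG count for the loop should come from xfs_info
for that filesystem):

# dump the free-extent histogram for every AG; get agcount from xfs_info first
for ag in $(seq 0 3); do
        echo "=== AG $ag ==="
        xfs_db -r -c "freesp -s -a $ag" /dev/mapper/vg0-pgdata
done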