On Wed, Dec 11, 2024 at 4:00 AM Michael Harris <har...@gmail.com> wrote:
> Hi Jakub
>
> On Tue, 10 Dec 2024 at 22:36, Jakub Wartak
> <jakub.war...@enterprisedb.com> wrote:
[..]
> > 3. Maybe somehow there is a bigger interaction between posix_fallocate()
> > and XFS's delayed dynamic speculative preallocation when many processes
> > are all writing into different partitions? Maybe try the "allocsize=1m"
> > mount option for that fs and see if that helps. I'm going to speculate
> > about XFS speculative :) preallocations, but if we have the fd cache and
> > are *not* closing fds, how would XFS know to abort its own speculation
> > about a streaming write? (Multiply that by potentially the number of
> > opened fds to get an avalanche of "preallocations".)
>
> I will try to organize that. They are production systems so it might
> take some time.

Cool.

> > 4. You can also try compiling with the patch from Alvaro in [2],
> > "0001-Add-some-debugging-around-mdzeroextend.patch", so we might end up
> > having more clarity about the offsets involved. If not, you could use
> > 'strace -e fallocate -p <pid>' to get the exact syscall.
>
> I'll take a look at Alvaro's patch. strace sounds good, but how to
> arrange to start it on the correct PG backends? There will be a
> large-ish number of PG backends going at a time, only some of which
> are performing imports, and they will be coming and going every so
> often as the ETL application scales up and down with the load.

Yes, it sounds like mission impossible. Is there any chance you can get
it reproduced with just one or a small number of postgres backends doing
the writes?

> > 5. Another idea could be catching the kernel-side stacktrace of
> > fallocate() when it is hitting ENOSPC. E.g. with an XFS fs and the
> > attached bpftrace eBPF tracer I could get to the source of the problem
> > in my artificial reproducer.
>
> OK, I will look into that also.

Hopefully that reveals some more. Unfortunately UNIX error reporting lumps
one big pile of failure modes into the single ENOSPC error, which is not
helpful at all (inode, extent and block allocation problems are all
squeezed into one errno).
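On the strace front: rather than trying to guess which backend PIDs to
attach to, a system-wide probe filtered on the process name might be
enough. Something along these lines (a rough, untested sketch, much simpler
than the attached tracer; it needs root, bpftrace and the fallocate syscall
tracepoints, and assumes the backends' comm is "postgres"; -28 is -ENOSPC
on Linux) should print the fd/offset/length of whichever fallocate() call
fails, no matter which backend issues it:

bpftrace -e '
tracepoint:syscalls:sys_enter_fallocate
/comm == "postgres"/
{
        /* remember the arguments per thread */
        @fd[tid] = args->fd;
        @off[tid] = args->offset;
        @len[tid] = args->len;
}
tracepoint:syscalls:sys_exit_fallocate
/@fd[tid]/
{
        /* -28 == -ENOSPC */
        if (args->ret == -28) {
                printf("pid %d: fallocate(fd=%d, offset=%ld, len=%ld) = ENOSPC\n",
                    pid, @fd[tid], @off[tid], @len[tid]);
        }
        delete(@fd[tid]);
        delete(@off[tid]);
        delete(@len[tid]);
}'

That way nobody has to chase PIDs around, and the exact offsets/lengths can
be correlated with the server log afterwards.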
Anyway, in case it helps others, here are my notes so far on this thread,
including that useful file from the subthread (hopefully I have not
misinterpreted anything):

- works on < PG16, but fails on >= PG16, which extends via posix_fallocate()
  rather than via multiple separate (but adjacent) iovectors passed to
  pg_writev; posix_fallocate() is used only by mdzeroextend() and only when
  numblocks > 8
- 179k or 414k files in a single directory (0.3 - 0.5 s just to list them)
- the OS/filesystem was upgraded from an earlier release
- one AG has extremely small free extents compared to the other AGs (I bet
  the 22.73% in the 2-3 bucket below means small 8192-byte pg files in
  $PGDATA), and there are no large free extents in that AG (extent sizes are
  in 4096-byte fs blocks):

   from      to extents  blocks    pct
      1       1    4949    4949   0.65
      2       3   86113  173452  22.73
      4       7   19399   94558  12.39
      8      15   23233  248602  32.58
     16      31   12425  241421  31.64
total free extents 146119
total free blocks 762982
average free extent size 5.22165 (!)

- note that the maximum free extent size above (31 blocks) is very low
  compared to the other AGs, which have 1024-8192. So it looks like there is
  no contiguous free space in that AG for any request larger than
  31*4096 = 126976 bytes (??). (The PS below shows how to dump the same
  histogram for the other AGs.)
- we have the `extend_by_pages += extend_by_pages * waitcount;` logic, capped
  at 64 pg blocks maximum, and 64 * 8192 = 512kB is well above the ~124kB of
  contiguous free space available there
- but the failures were also observed with pg_upgrade --link -j /
  pg_restore -j (also concurrent posix_fallocate() against many independent
  files sharing the same AG, but that is 1 backend : 1 file, so no contention
  driving waitcount in RelationAddBlocks())
- so maybe it is lots of backends doing independent, concurrent
  posix_fallocate() calls that somehow end up coalesced? Or, hypothetically,
  say 16-32 fallocate() calls hit the same AG at once; maybe there is some
  concurrency semi-race-condition inside XFS where one of the fallocate()
  calls fails to find space in that one AG, although according to [1] it
  should fall back to other AGs.
- and on top of that there is XFS's dynamic speculative preallocation, which
  might cause additional space pressure during our normal writes.

Another workaround idea/test: create a tablespace on the same XFS fs (but in
a somewhat different directory if possible) and see if it still fails.

-J.

[1] - https://blogs.oracle.com/linux/post/extent-allocation-in-xfs
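PS: for comparing the suspect AG against the healthy ones, I believe the
per-AG free-extent histogram above can be regenerated read-only with
xfs_db's freesp command, something like the following (the device path is
just a placeholder and the AG count for the loop should come from xfs_info
for that filesystem):

# dump the free-extent histogram for every AG; get agcount from xfs_info first
for ag in $(seq 0 3); do
        echo "=== AG $ag ==="
        xfs_db -r -c "freesp -s -a $ag" /dev/mapper/vg0-pgdata
done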