Thinking back, I can see now why disabling WAL writes with wal_level=minimal in COPY resulted in 3x better write performance instead of the expected 2x:
With wal_level=minimal only the heap page writes were needed, whereas with WAL writes the same page was written 3x (heap + WAL zero-fill + WAL).

--
Hannu

On Mon, Jan 20, 2025 at 12:06 PM Hannu Krosing <han...@google.com> wrote:
>
> On Fri, Jan 17, 2025 at 10:29 PM Andres Freund <and...@anarazel.de> wrote:
> ...
> > > I see, PG once had fallocate [1] (which was reverted by [2] due to some
> > > performance regression concern). The original OSS discussion was in [3].
> > > The perf regression was reported in [4]. Looks like this was due to how
> > > ext4 handled extents and uninitialized data [5] and that seems to be fixed
> > > in [6]. I'll check with Theodore Ts'o to confirm on [6].
> > >
> > > Could we consider adding back fallocate?
> >
> > Fallocate doesn't really help unfortunately. On common filesystems (like
> > ext4/xfs) it just allocates filespace without zeroing out the underlying
> > blocks.
>
> @Theodore Tso - can you confirm that ext4 (and xfs?) does not use the
> low-level WRITE ZEROS commands for initializing the newly allocated
> blocks?
>
> And that the new blocks will be written twice - once for zero-filling
> and then with the actual data.
>
> For WAL we really don't need to zero out anything - we already do WAL
> file recycling without zero-filling the recycled segments, so
> obviously it is all right to have random garbage in the pages.
>
> > To make that correct, those filesystems keep a bitmap indicating which
> > blocks in the range are not yet written. Unfortunately updating those blocks
> > is a metadata operation and thus requires journaling.
> >
> > I've seen some mild speedups by first using fallocate and then zeroing out
> > the file, particularly with larger segment sizes.
>
> Did you just write a single zero page per file page to avoid
> duplicating the work?
>
> > I think mainly due to avoiding
> > delayed allocation in the filesystem, rather than actually reducing
> > fragmentation. But it really isn't a whole lot.
> >
> > I've in the past tried to get the linux filesystem developers to add an
> > fallocate mode that doesn't utilize the "unwritten extents" "optimization",
> > but didn't have luck with that.
>
> Are you saying that the first write to a newly allocated empty block
> currently will do two writes to the disk - first writing the zeros and
> then writing the actual data?
>
> Or just that the overhead from journalling the change to the
> not-yet-written bitmap cancels out the win from not writing the page
> twice?
>
> > The block layer in linux actually does have
> > support for zeroing out regions of blocks without having to do actually
> > write the data, but it's only used in some narrow cases (don't remember the
> > details).
>
> For WAL files we should be ok by either using the declarative no-write
> zero fill in the block layer, or just using the pages as-is without
> any zero-filling at all (though this is likely not possible because of
> required Linux filesystem semantics).
>
> > Greetings,
> >
> > Andres Freund
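
For anyone who wants to experiment with the "fallocate first, then zero out the file" sequence quoted above, here is a minimal sketch in Python. It is not PostgreSQL's code: the function name preallocate_and_zero and the 8192-byte BLOCK_SIZE are assumptions made for illustration only. It reserves the space where the platform offers posix_fallocate, then writes real zeros over the whole range so that later writes land on already-written blocks.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed block size for this sketch, not read from any config


def preallocate_and_zero(path, size, block_size=BLOCK_SIZE):
    """Reserve `size` bytes for `path`, then overwrite the range with zeros.

    Hypothetical helper. Returns the number of zero bytes explicitly written.
    """
    zeros = bytes(block_size)
    written = 0
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Step 1: ask the filesystem to reserve the space, where supported.
        fallocate = getattr(os, "posix_fallocate", None)
        if fallocate is not None:
            try:
                fallocate(fd, 0, size)
            except OSError:
                # Preallocation not possible here; the zero-fill below
                # still extends the file to the requested size.
                pass
        # Step 2: write actual zeros, one block at a time.
        offset = 0
        while offset < size:
            chunk = min(block_size, size - offset)
            n = os.pwrite(fd, zeros[:chunk], offset)
            offset += n
            written += n
        os.fsync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "segment")
    print(preallocate_and_zero(target, 16 * BLOCK_SIZE))
```

Timing this against a plain zero-fill (skipping step 1) on ext4 or xfs would be one way to check how much of the mild speedup mentioned above is reproducible on a given setup.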