On Sun, Aug 7, 2022 at 7:56 PM Michael Paquier <mich...@paquier.xyz> wrote: > FWIW, when it comes to that we have a couple of routines that just use > '0' to mean such a thing, aka palloc0(). I find 0002 confusing, as it > introduces in fe_utils.c a new wrapper > (pg_pwritev_with_retry_and_write_zeros) on what's already a wrapper > (pg_pwritev_with_retry) for pwrite().
What about pg_write_initial_zeros(fd, size)? How it writes zeros (ie using vector I/O, and retrying) seems to be an internal detail, no? "initial" to make it clearer that it's at offset 0, or maybe pg_pwrite_zeros(fd, size, offset). > A second thing is that pg_pwritev_with_retry_and_write_zeros() is > designed to work on WAL segments initialization and it uses > XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing > in its name that tells us so. This makes me question whether > file_utils.c is a good location for this second thing. Could a new > file be a better location? We have a xlogutils.c in the backend, and > a name similar to that in src/common/ would be one possibility. Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or maybe it's OK to use BLCKSZ with a comment to say that it's a bit arbitrary, or maybe it's better to define a new zero buffer of some arbitrary size just in this code if that is too strange. We could experiment with different size buffers to see how it performs, bearing in mind that every time we double it you halve the number of system calls, but also bearing in mind that at some point it's too much for the stack. I can tell you that the way that code works today was not really written with performance in mind (unlike, say, the code reverted from 9.4 that tried to do this with posix_fallocate()), it was just finding an excuse to call pwritev(), to exercise new fallback code being committed for use by later AIO stuff (more patches coming soon). The retry support was added because it seemed plausible that some system out there would start to do short writes as we cranked up the sizes for some implementation reason other than ENOSPC, so we should make a reusable retry routine. I think this should also handle the remainder after processing whole blocks, just for completeness. If I call the code as presented with size 8193, I think this code will only write 8192 bytes. I think if this ever needs to work on O_DIRECT files there would be an alignment constraint on the buffer and size, but I don't think we have to worry about that for now.