On Wed, Nov 2, 2022 at 11:54 AM Andres Freund <and...@anarazel.de> wrote: > On 2022-11-02 09:44:30 +1300, Thomas Munro wrote: > > On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pry...@telsasoft.com> wrote: > > > On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote: > > > > io_data_direct = whether to use O_DIRECT for main data files > > > > io_wal_direct = ... for WAL > > > > io_wal_init_direct = ... for WAL-file initialisation > > > > > > You added 3 booleans, but I wonder if it's better to add a string GUC > > > which is parsed for comma separated strings.
Done as io_direct=data,wal,wal_init. Thanks Justin, this is better. I resisted the urge to invent a meaning for 'on' and 'off', mainly because it's not clear what values 'on' should enable and it'd be strange to have off without on, so for now an empty string means off. I suppose the meaning of this string could evolve over time: the names of forks, etc. > Perhaps we could use the guc assignment hook to transform the input value into > a bitmask? Makes sense. The only tricky question was where to store the GUC. I went for fd.c for now, but it doesn't seem quite right... > > > DIO is slower, but not so much that it can't run under CI. I suggest to > > > add an 099 commit to enable the feature during development. > > > > Good idea, will do. Done. The tests take 2-3x as long depending on the OS. > Might be worth to additionally have a short tap test that does some basic > stuff with DIO and leave that enabled? I think it'd be good to have > check-world exercise DIO on dev machines, to reduce the likelihood of finding > problems only in CI, which is somewhat painful. Done. > > > Note that this fails under linux with fsanitize=align: > > > ../src/backend/storage/file/buffile.c:117:17: runtime error: member > > > access within misaligned address 0x561a4a8e40f8 for type 'struct > > > BufFile', which requires 4096 byte alignment > > > > Oh, so BufFile is palloc'd and contains one of these. BufFile is not > > even using direct I/O, but by these rules it would need to be > > palloc_io_align'd. I will think about what to do about that... > > It might be worth having two different versions of the struct, so we don't > impose unnecessarily high alignment everywhere? Done. I now have PGAlignedBlock (unchanged) and PGIOAlignedBlock. You have to use the latter for SMgr, because I added alignment assertions there. We might as well use it for any other I/O such as frontend code too for a chance of a small performance boost as you showed. For now I have not use PGIOAlignedBlock for BufFile, even though it would be a great candidate for a potential speedup, only because I am afraid of adding padding to every BufFile in scenarios where we allocate many (could be avoided, a subject for separate research). V2 comprises: 0001 -- David's palloc_aligned() patch https://commitfest.postgresql.org/41/3999/ 0002 -- I/O-align almost all buffers used for I/O 0003 -- Add the GUCs 0004 -- Throwaway hack to make cfbot turn the GUCs on
From 9af1dcc3ce36ce18e011183d5f2a97cdc07fe396 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Tue, 18 Oct 2022 09:47:45 -0700 Subject: [PATCH v2 1/4] Add allocator support for larger allocation alignment & use for IO --- src/backend/utils/cache/catcache.c | 5 +- src/backend/utils/mmgr/Makefile | 1 + src/backend/utils/mmgr/alignedalloc.c | 110 ++++++++++++++++++ src/backend/utils/mmgr/mcxt.c | 141 +++++++++++++++++++++-- src/backend/utils/mmgr/meson.build | 1 + src/include/utils/memutils_internal.h | 13 ++- src/include/utils/memutils_memorychunk.h | 2 +- src/include/utils/palloc.h | 3 + 8 files changed, 263 insertions(+), 13 deletions(-) create mode 100644 src/backend/utils/mmgr/alignedalloc.c diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 30ef0ba39c..9e635177c8 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -763,7 +763,6 @@ InitCatCache(int id, { CatCache *cp; MemoryContext oldcxt; - size_t sz; int i; /* @@ -807,8 +806,8 @@ InitCatCache(int id, * * Note: we rely on zeroing to initialize all the dlist headers correctly */ - sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE; - cp = (CatCache *) CACHELINEALIGN(palloc0(sz)); + cp = (CatCache *) palloc_aligned(sizeof(CatCache), PG_CACHE_LINE_SIZE, + MCXT_ALLOC_ZERO); cp->cc_bucket = palloc0(nbuckets * sizeof(dlist_head)); /* diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile index 3b4cfdbd52..dae3432c98 100644 --- a/src/backend/utils/mmgr/Makefile +++ b/src/backend/utils/mmgr/Makefile @@ -13,6 +13,7 @@ top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global OBJS = \ + alignedalloc.o \ aset.o \ dsa.o \ freepage.o \ diff --git a/src/backend/utils/mmgr/alignedalloc.c b/src/backend/utils/mmgr/alignedalloc.c new file mode 100644 index 0000000000..97cb1d2b0d --- /dev/null +++ b/src/backend/utils/mmgr/alignedalloc.c @@ -0,0 +1,110 @@ +/*------------------------------------------------------------------------- + * + * alignedalloc.c + * Allocator functions to implement palloc_aligned + * + * This is not a fully fledged MemoryContext type as there is no means to + * create a MemoryContext of this type. The code here only serves to allow + * operations such as pfree() and repalloc() to work correctly on a memory + * chunk that was allocated by palloc_aligned(). + * + * Portions Copyright (c) 2022, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/backend/utils/mmgr/alignedalloc.c + * + *------------------------------------------------------------------------- + */ + +#include "postgres.h" + +#include "utils/memdebug.h" +#include "utils/memutils_memorychunk.h" + +void +AlignedAllocFree(void *pointer) +{ + MemoryChunk *chunk = PointerGetMemoryChunk(pointer); + void *unaligned; + +#ifdef MEMORY_CONTEXT_CHECKING + /* + * Test for someone scribbling on unused space in chunk. We don't have + * the ability to include the context name here, so just mention that it's + * an aligned chunk. + */ + if (!sentinel_ok(pointer, chunk->requested_size)) + elog(WARNING, "detected write past %zu-byte aligned chunk end at %p", + MemoryChunkGetValue(chunk), chunk); +#endif + + Assert(!MemoryChunkIsExternal(chunk)); + + /* obtain the original (unaligned) allocated pointer */ + unaligned = MemoryChunkGetBlock(chunk); + + pfree(unaligned); +} + +void * +AlignedAllocRealloc(void *pointer, Size size) +{ + MemoryChunk *redirchunk = PointerGetMemoryChunk(pointer); + Size alignto = MemoryChunkGetValue(redirchunk); + void *unaligned = MemoryChunkGetBlock(redirchunk); + MemoryContext ctx; + Size old_size; + void *newptr; + + /* sanity check this is a power of 2 value */ + Assert((alignto & (alignto - 1)) == 0); + + /* + * Determine the size of the original allocation. We can't determine this + * exactly as GetMemoryChunkSpace() returns the total space used for the + * allocation, which for contexts like aset includes rounding up to the + * next power of 2. However, this value is just used to memcpy() the old + * data into the new allocation, so we only need to concern ourselves with + * not reading beyond the end of the original allocation's memory. The + * drawback here is that we may copy more bytes than we need to, which + * amounts only to wasted effort. + */ +#ifndef MEMORY_CONTEXT_CHECKING + old_size = GetMemoryChunkSpace(unaligned) - + ((char *) pointer - (char *) PointerGetMemoryChunk(unaligned)); +#else + old_size = redirchunk->requested_size; +#endif + + ctx = GetMemoryChunkContext(unaligned); + newptr = MemoryContextAllocAligned(ctx, size, alignto, 0); + + /* + * We may memcpy beyond the end of the orignal allocation request size, so + * we must mark the entire allocation as defined. + */ + VALGRIND_MAKE_MEM_DEFINED(pointer, old_size); + memcpy(newptr, pointer, Min(size, old_size)); + pfree(unaligned); + + return newptr; +} + +MemoryContext +AlignedAllocGetChunkContext(void *pointer) +{ + MemoryChunk *chunk = PointerGetMemoryChunk(pointer); + + Assert(!MemoryChunkIsExternal(chunk)); + + return GetMemoryChunkContext(MemoryChunkGetBlock(chunk)); +} + +Size +AlignedGetChunkSpace(void *pointer) +{ + MemoryChunk *redirchunk = PointerGetMemoryChunk(pointer); + void *unaligned = MemoryChunkGetBlock(redirchunk); + + return GetMemoryChunkSpace(unaligned); +} diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c index 57bd6690ca..c1e3e88b49 100644 --- a/src/backend/utils/mmgr/mcxt.c +++ b/src/backend/utils/mmgr/mcxt.c @@ -30,6 +30,7 @@ #include "utils/memdebug.h" #include "utils/memutils.h" #include "utils/memutils_internal.h" +#include "utils/memutils_memorychunk.h" static void BogusFree(void *pointer); @@ -84,6 +85,21 @@ static const MemoryContextMethods mcxt_methods[] = { [MCTX_SLAB_ID].check = SlabCheck, #endif + /* alignedalloc.c */ + [MCTX_ALIGNED_REDIRECT_ID].alloc = NULL, /* not required */ + [MCTX_ALIGNED_REDIRECT_ID].free_p = AlignedAllocFree, + [MCTX_ALIGNED_REDIRECT_ID].realloc = AlignedAllocRealloc, + [MCTX_ALIGNED_REDIRECT_ID].reset = NULL, /* not required */ + [MCTX_ALIGNED_REDIRECT_ID].delete_context = NULL, /* not required */ + [MCTX_ALIGNED_REDIRECT_ID].get_chunk_context = AlignedAllocGetChunkContext, + [MCTX_ALIGNED_REDIRECT_ID].get_chunk_space = AlignedGetChunkSpace, + [MCTX_ALIGNED_REDIRECT_ID].is_empty = NULL, /* not required */ + [MCTX_ALIGNED_REDIRECT_ID].stats = NULL, /* not required */ +#ifdef MEMORY_CONTEXT_CHECKING + [MCTX_ALIGNED_REDIRECT_ID].check = NULL, /* not required */ +#endif + + /* * Unused (as yet) IDs should have dummy entries here. This allows us to * fail cleanly if a bogus pointer is passed to pfree or the like. It @@ -110,11 +126,6 @@ static const MemoryContextMethods mcxt_methods[] = { [MCTX_UNUSED4_ID].realloc = BogusRealloc, [MCTX_UNUSED4_ID].get_chunk_context = BogusGetChunkContext, [MCTX_UNUSED4_ID].get_chunk_space = BogusGetChunkSpace, - - [MCTX_UNUSED5_ID].free_p = BogusFree, - [MCTX_UNUSED5_ID].realloc = BogusRealloc, - [MCTX_UNUSED5_ID].get_chunk_context = BogusGetChunkContext, - [MCTX_UNUSED5_ID].get_chunk_space = BogusGetChunkSpace, }; /* @@ -1298,6 +1309,111 @@ palloc_extended(Size size, int flags) return ret; } +/* + * MemoryContextAllocAligned + * Allocate 'size' bytes of memory in 'context' aligned to 'alignto' + * bytes. + * + * 'alignto' must be a power of 2. + * 'flags' may be 0 or set the same as MemoryContextAllocExtended(). + */ +void * +MemoryContextAllocAligned(MemoryContext context, + Size size, Size alignto, int flags) +{ + MemoryChunk *alignedchunk; + Size alloc_size; + void *unaligned; + void *aligned; + + /* wouldn't make much sense to waste that much space */ + Assert(alignto < (128 * 1024 * 1024)); + + /* ensure alignto is a power of 2 */ + Assert((alignto & (alignto - 1)) == 0); + + /* + * If the alignment requirements are less than what we already guarantee + * then just use the standard allocation function. + */ + if (unlikely(alignto <= MAXIMUM_ALIGNOF)) + return MemoryContextAllocExtended(context, size, flags); + + /* + * We implement aligned pointers by simply allocating enough memory for + * the requested size plus the alignment and an additional "redirection" + * MemoryChunk. This additional MemoryChunk is required for operations + * such as pfree when used on the pointer returned by this function. We + * use this redirection MemoryChunk in order to find the pointer to the + * memory that was returned by the MemoryContextAllocExtended call below. + * We do that by "borrowing" the block offset field and instead of using + * that to find the offset into the owning block, we use it to find the + * original allocated address. + * + * Here we must allocate enough extra memory so that we can still align + * the pointer returned by MemoryContextAllocExtended and also have enough + * space for the redirection MemoryChunk. Since allocations will already + * be at least aligned by MAXIMUM_ALIGNOF, we can subtract that amount + * from the allocation size to save a little memory. + */ + alloc_size = size + alignto + sizeof(MemoryChunk) - MAXIMUM_ALIGNOF; + +#ifdef MEMORY_CONTEXT_CHECKING + /* ensure there's space for a sentinal byte */ + alloc_size += 1; +#endif + + /* perform the actual allocation */ + unaligned = MemoryContextAllocExtended(context, alloc_size, flags); + + /* set the aligned pointer */ + aligned = (void *) TYPEALIGN(alignto, (char *) unaligned + + sizeof(MemoryChunk)); + + alignedchunk = PointerGetMemoryChunk(aligned); + + /* + * We set the redirect MemoryChunk so that the block offset calculation is + * used to point back to the 'unaligned' allocated chunk. This allows us + * to use MemoryChunkGetBlock() to find the unaligned chunk when we need + * to perform operations such as pfree() and repalloc(). + * + * We store 'alignto' in the MemoryChunk's 'value' so that we know what + * the alignment was set to should we ever be asked to realloc this + * pointer. + */ + MemoryChunkSetHdrMask(alignedchunk, unaligned, alignto, + MCTX_ALIGNED_REDIRECT_ID); + + /* double check we produced a correctly aligned pointer */ + Assert((char *) TYPEALIGN(alignto, aligned) == aligned); + +#ifdef MEMORY_CONTEXT_CHECKING + alignedchunk->requested_size = size; + /* set mark to catch clobber of "unused" space */ + set_sentinel(aligned, size); +#endif + + /* Mark the bytes before the redirection header as noaccess */ + VALGRIND_MAKE_MEM_NOACCESS(unaligned, + (char *) alignedchunk - (char *) unaligned); + return aligned; +} + +/* + * palloc_aligned + * Allocate 'size' bytes returning a pointer that's aligned to the + * 'alignto' boundary. + * + * 'alignto' must be a power of 2. + * 'flags' may be 0 or set the same as MemoryContextAllocExtended(). + */ +void * +palloc_aligned(Size size, Size alignto, int flags) +{ + return MemoryContextAllocAligned(CurrentMemoryContext, size, alignto, flags); +} + /* * pfree * Release an allocated chunk. @@ -1306,11 +1422,16 @@ void pfree(void *pointer) { #ifdef USE_VALGRIND + MemoryContextMethodID method = GetMemoryChunkMethodID(pointer); MemoryContext context = GetMemoryChunkContext(pointer); #endif MCXT_METHOD(pointer, free_p) (pointer); - VALGRIND_MEMPOOL_FREE(context, pointer); + +#ifdef USE_VALGRIND + if (method != MCTX_ALIGNED_REDIRECT_ID) + VALGRIND_MEMPOOL_FREE(context, pointer); +#endif } /* @@ -1320,6 +1441,9 @@ pfree(void *pointer) void * repalloc(void *pointer, Size size) { +#ifdef USE_VALGRIND + MemoryContextMethodID method = GetMemoryChunkMethodID(pointer); +#endif #if defined(USE_ASSERT_CHECKING) || defined(USE_VALGRIND) MemoryContext context = GetMemoryChunkContext(pointer); #endif @@ -1346,7 +1470,10 @@ repalloc(void *pointer, Size size) size, cxt->name))); } - VALGRIND_MEMPOOL_CHANGE(context, pointer, ret, size); +#ifdef USE_VALGRIND + if (method != MCTX_ALIGNED_REDIRECT_ID) + VALGRIND_MEMPOOL_CHANGE(context, pointer, ret, size); +#endif return ret; } diff --git a/src/backend/utils/mmgr/meson.build b/src/backend/utils/mmgr/meson.build index 641bb181ba..7cf4d6dcc8 100644 --- a/src/backend/utils/mmgr/meson.build +++ b/src/backend/utils/mmgr/meson.build @@ -1,4 +1,5 @@ backend_sources += files( + 'alignedalloc.c', 'aset.c', 'dsa.c', 'freepage.c', diff --git a/src/include/utils/memutils_internal.h b/src/include/utils/memutils_internal.h index bc2cbdd506..450bcba3ed 100644 --- a/src/include/utils/memutils_internal.h +++ b/src/include/utils/memutils_internal.h @@ -70,6 +70,15 @@ extern void SlabStats(MemoryContext context, extern void SlabCheck(MemoryContext context); #endif +/* + * These functions support the implementation of palloc_aligned() and are not + * part of a fully-fledged MemoryContext type. + */ +extern void AlignedAllocFree(void *pointer); +extern void *AlignedAllocRealloc(void *pointer, Size size); +extern MemoryContext AlignedAllocGetChunkContext(void *pointer); +extern Size AlignedGetChunkSpace(void *pointer); + /* * MemoryContextMethodID * A unique identifier for each MemoryContext implementation which @@ -92,8 +101,8 @@ typedef enum MemoryContextMethodID MCTX_ASET_ID, MCTX_GENERATION_ID, MCTX_SLAB_ID, - MCTX_UNUSED4_ID, /* available */ - MCTX_UNUSED5_ID /* 111 occurs in wipe_mem'd memory */ + MCTX_ALIGNED_REDIRECT_ID, + MCTX_UNUSED4_ID /* 111 occurs in wipe_mem'd memory */ } MemoryContextMethodID; /* diff --git a/src/include/utils/memutils_memorychunk.h b/src/include/utils/memutils_memorychunk.h index 2eefc138e3..38702efc58 100644 --- a/src/include/utils/memutils_memorychunk.h +++ b/src/include/utils/memutils_memorychunk.h @@ -156,7 +156,7 @@ MemoryChunkSetHdrMask(MemoryChunk *chunk, void *block, { Size blockoffset = (char *) chunk - (char *) block; - Assert((char *) chunk > (char *) block); + Assert((char *) chunk >= (char *) block); Assert(blockoffset <= MEMORYCHUNK_MAX_BLOCKOFFSET); Assert(value <= MEMORYCHUNK_MAX_VALUE); Assert((int) methodid <= MEMORY_CONTEXT_METHODID_MASK); diff --git a/src/include/utils/palloc.h b/src/include/utils/palloc.h index 72d4e70dc6..b1ac63b2ee 100644 --- a/src/include/utils/palloc.h +++ b/src/include/utils/palloc.h @@ -73,10 +73,13 @@ extern void *MemoryContextAllocZero(MemoryContext context, Size size); extern void *MemoryContextAllocZeroAligned(MemoryContext context, Size size); extern void *MemoryContextAllocExtended(MemoryContext context, Size size, int flags); +extern void *MemoryContextAllocAligned(MemoryContext context, + Size size, Size alignto, int flags); extern void *palloc(Size size); extern void *palloc0(Size size); extern void *palloc_extended(Size size, int flags); +extern void *palloc_aligned(Size size, Size alignto, int flags); extern pg_nodiscard void *repalloc(void *pointer, Size size); extern pg_nodiscard void *repalloc_extended(void *pointer, Size size, int flags); -- 2.35.1
From caa6cbeb3b3f86c48c90513ee184aca500b1f703 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Tue, 13 Dec 2022 16:25:59 +1300 Subject: [PATCH v2 2/4] Introduce PG_IO_ALIGN_SIZE and align all I/O buffers. In order to be allowed to use O_DIRECT in a later commit, we need to align buffers to the virtual memory page size. O_DIRECT would either fail to work or fail to work efficiently without that on various platforms. Even without O_DIRECT, aligning on memory pages improves traditional buffered I/O performance. The alignment size is set to 4096, which is enough for currently known systems. There is no standard governing the requirements for O_DIRECT so it's possible we might have to reconsider this approach or fail to work on some exotic system, but for now this simplistic approach works and it can be changed at compile time. Adjust all call sites that allocate heap memory for file I/O to use the new palloc_aligned() or MemoryContextAllocAligned() functions. For stack-allocated buffers, introduce PGIOAlignedBlock to respect PG_IO_ALIGN_SIZE, if possible with this compiler. Also align the main buffer pool in shared memory. If arbitrary alignment of stack objects is not possible with this compiler, then completely disable the use of O_DIRECT by setting PG_O_DIRECT to 0. (This avoids the need to consider systems that have O_DIRECT but don't have a compiler with an extension that can align stack objects the way we want; that could be done but we don't currently know of any such system, so it's easier to pretend there is no O_DIRECT support instead: that's an existing and tested class of system.) Add assertions that all buffers passed into smgrread(), smgrwrite(), smgrextend() are correctly aligned, if PG_O_DIRECT isn't 0. Author: Thomas Munro <thomas.mu...@gmail.com> Author: Andres Freund <and...@anarazel.de> Reviewed-by: Justin Pryzby <pry...@telsasoft.com> Discussion: https://postgr.es/m/ca+hukgk1x532hyqj_mzfwt0n1zt8trz980d79wbjwnt-yyl...@mail.gmail.com --- contrib/bloom/blinsert.c | 2 +- contrib/pg_prewarm/pg_prewarm.c | 2 +- src/backend/access/gist/gistbuild.c | 9 +++--- src/backend/access/hash/hashpage.c | 2 +- src/backend/access/heap/rewriteheap.c | 2 +- src/backend/access/heap/visibilitymap.c | 2 +- src/backend/access/nbtree/nbtree.c | 2 +- src/backend/access/nbtree/nbtsort.c | 8 ++++-- src/backend/access/spgist/spginsert.c | 2 +- src/backend/access/transam/generic_xlog.c | 13 ++++++--- src/backend/access/transam/xlog.c | 9 +++--- src/backend/catalog/storage.c | 2 +- src/backend/storage/buffer/buf_init.c | 10 +++++-- src/backend/storage/buffer/bufmgr.c | 2 +- src/backend/storage/buffer/localbuf.c | 7 +++-- src/backend/storage/file/buffile.c | 6 ++++ src/backend/storage/freespace/freespace.c | 2 +- src/backend/storage/page/bufpage.c | 5 +++- src/backend/storage/smgr/md.c | 15 +++++++++- src/backend/utils/sort/logtape.c | 2 +- src/bin/pg_checksums/pg_checksums.c | 2 +- src/bin/pg_rewind/local_source.c | 4 +-- src/bin/pg_upgrade/file.c | 4 +-- src/common/file_utils.c | 2 +- src/include/c.h | 34 +++++++++++++++++------ src/include/pg_config_manual.h | 7 +++++ src/include/storage/fd.h | 5 ++-- src/tools/pgindent/typedefs.list | 1 + 28 files changed, 114 insertions(+), 49 deletions(-) diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c index dd26d6ac29..53cc617a66 100644 --- a/contrib/bloom/blinsert.c +++ b/contrib/bloom/blinsert.c @@ -166,7 +166,7 @@ blbuildempty(Relation index) Page metapage; /* Construct metapage. */ - metapage = (Page) palloc(BLCKSZ); + metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0); BloomFillMetapage(index, metapage); /* diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c index caff5c4a80..f50aa69eb2 100644 --- a/contrib/pg_prewarm/pg_prewarm.c +++ b/contrib/pg_prewarm/pg_prewarm.c @@ -36,7 +36,7 @@ typedef enum PREWARM_BUFFER } PrewarmType; -static PGAlignedBlock blockbuffer; +static PGIOAlignedBlock blockbuffer; /* * pg_prewarm(regclass, mode text, fork text, diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c index fb0f466708..d3d7d836e9 100644 --- a/src/backend/access/gist/gistbuild.c +++ b/src/backend/access/gist/gistbuild.c @@ -415,7 +415,7 @@ gist_indexsortbuild(GISTBuildState *state) * Write an empty page as a placeholder for the root page. It will be * replaced with the real root page at the end. */ - page = palloc0(BLCKSZ); + page = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO); smgrextend(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO, page, true); state->pages_allocated++; @@ -509,7 +509,8 @@ gist_indexsortbuild_levelstate_add(GISTBuildState *state, levelstate->current_page++; if (levelstate->pages[levelstate->current_page] == NULL) - levelstate->pages[levelstate->current_page] = palloc(BLCKSZ); + levelstate->pages[levelstate->current_page] = + palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0); newPage = levelstate->pages[levelstate->current_page]; gistinitpage(newPage, old_page_flags); @@ -579,7 +580,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state, /* Create page and copy data */ data = (char *) (dist->list); - target = palloc0(BLCKSZ); + target = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, MCXT_ALLOC_ZERO); gistinitpage(target, isleaf ? F_LEAF : 0); for (int i = 0; i < dist->block.num; i++) { @@ -630,7 +631,7 @@ gist_indexsortbuild_levelstate_flush(GISTBuildState *state, if (parent == NULL) { parent = palloc0(sizeof(GistSortedBuildLevelState)); - parent->pages[0] = (Page) palloc(BLCKSZ); + parent->pages[0] = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0); parent->parent = NULL; gistinitpage(parent->pages[0], 0); diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c index 55b2929ad5..147af95e92 100644 --- a/src/backend/access/hash/hashpage.c +++ b/src/backend/access/hash/hashpage.c @@ -992,7 +992,7 @@ static bool _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks) { BlockNumber lastblock; - PGAlignedBlock zerobuf; + PGIOAlignedBlock zerobuf; Page page; HashPageOpaque ovflopaque; diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index 2fe9e48e50..23d966940e 100644 --- a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -255,7 +255,7 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm state->rs_old_rel = old_heap; state->rs_new_rel = new_heap; - state->rs_buffer = (Page) palloc(BLCKSZ); + state->rs_buffer = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0); /* new_heap needn't be empty, just locked */ state->rs_blockno = RelationGetNumberOfBlocks(new_heap); state->rs_buffer_valid = false; diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c index 4ed70275e2..3bd65b275b 100644 --- a/src/backend/access/heap/visibilitymap.c +++ b/src/backend/access/heap/visibilitymap.c @@ -620,7 +620,7 @@ static void vm_extend(Relation rel, BlockNumber vm_nblocks) { BlockNumber vm_nblocks_now; - PGAlignedBlock pg; + PGIOAlignedBlock pg; SMgrRelation reln; PageInit((Page) pg.data, BLCKSZ, 0); diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index b52eca8f38..e8ac7390ae 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -153,7 +153,7 @@ btbuildempty(Relation index) Page metapage; /* Construct metapage. */ - metapage = (Page) palloc(BLCKSZ); + metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0); _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false)); /* diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c index 501e011ce1..5e3c461f6f 100644 --- a/src/backend/access/nbtree/nbtsort.c +++ b/src/backend/access/nbtree/nbtsort.c @@ -619,7 +619,7 @@ _bt_blnewpage(uint32 level) Page page; BTPageOpaque opaque; - page = (Page) palloc(BLCKSZ); + page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0); /* Zero the page and set up standard page header info */ _bt_pageinit(page, BLCKSZ); @@ -660,7 +660,9 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno) while (blkno > wstate->btws_pages_written) { if (!wstate->btws_zeropage) - wstate->btws_zeropage = (Page) palloc0(BLCKSZ); + wstate->btws_zeropage = (Page) palloc_aligned(BLCKSZ, + PG_IO_ALIGN_SIZE, + MCXT_ALLOC_ZERO); /* don't set checksum for all-zero page */ smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM, wstate->btws_pages_written++, @@ -1170,7 +1172,7 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state) * set to point to "P_NONE"). This changes the index to the "valid" state * by filling in a valid magic number in the metapage. */ - metapage = (Page) palloc(BLCKSZ); + metapage = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0); _bt_initmetapage(metapage, rootblkno, rootlevel, wstate->inskey->allequalimage); _bt_blwritepage(wstate, metapage, BTREE_METAPAGE); diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c index c6821b5952..6f14b41329 100644 --- a/src/backend/access/spgist/spginsert.c +++ b/src/backend/access/spgist/spginsert.c @@ -158,7 +158,7 @@ spgbuildempty(Relation index) Page page; /* Construct metapage. */ - page = (Page) palloc(BLCKSZ); + page = (Page) palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, 0); SpGistInitMetapage(page); /* diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c index 6db9a1fca1..458e270d55 100644 --- a/src/backend/access/transam/generic_xlog.c +++ b/src/backend/access/transam/generic_xlog.c @@ -58,14 +58,17 @@ typedef struct char delta[MAX_DELTA_SIZE]; /* delta between page images */ } PageData; -/* State of generic xlog record construction */ +/* + * State of generic xlog record construction. Must be allocated at an I/O + * aligned address. + */ struct GenericXLogState { + /* Page images (properly aligned, must be first) */ + PGIOAlignedBlock images[MAX_GENERIC_XLOG_PAGES]; /* Info about each page, see above */ PageData pages[MAX_GENERIC_XLOG_PAGES]; bool isLogged; - /* Page images (properly aligned) */ - PGAlignedBlock images[MAX_GENERIC_XLOG_PAGES]; }; static void writeFragment(PageData *pageData, OffsetNumber offset, @@ -269,7 +272,9 @@ GenericXLogStart(Relation relation) GenericXLogState *state; int i; - state = (GenericXLogState *) palloc(sizeof(GenericXLogState)); + state = (GenericXLogState *) palloc_aligned(sizeof(GenericXLogState), + PG_IO_ALIGN_SIZE, + 0); state->isLogged = RelationNeedsWAL(relation); for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index a31fbbff78..a75c6813a4 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4511,7 +4511,7 @@ XLOGShmemSize(void) /* xlblocks array */ size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers)); /* extra alignment padding for XLOG I/O buffers */ - size = add_size(size, XLOG_BLCKSZ); + size = add_size(size, Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE)); /* and the buffers themselves */ size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers)); @@ -4608,10 +4608,11 @@ XLOGShmemInit(void) /* * Align the start of the page buffers to a full xlog block size boundary. - * This simplifies some calculations in XLOG insertion. It is also - * required for O_DIRECT. + * This simplifies some calculations in XLOG insertion. We also need I/O + * alignment for O_DIRECT, but that's also a power of two and usually + * smaller. Take the larger of the two alignment requirements. */ - allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr); + allocptr = (char *) TYPEALIGN(Max(XLOG_BLCKSZ, PG_IO_ALIGN_SIZE), allocptr); XLogCtl->pages = allocptr; memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers); diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c index d708af19ed..0c5ac1f94b 100644 --- a/src/backend/catalog/storage.c +++ b/src/backend/catalog/storage.c @@ -451,7 +451,7 @@ void RelationCopyStorage(SMgrRelation src, SMgrRelation dst, ForkNumber forkNum, char relpersistence) { - PGAlignedBlock buf; + PGIOAlignedBlock buf; Page page; bool use_wal; bool copying_initfork; diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c index 6b6264854e..76a30d44b7 100644 --- a/src/backend/storage/buffer/buf_init.c +++ b/src/backend/storage/buffer/buf_init.c @@ -78,9 +78,12 @@ InitBufferPool(void) NBuffers * sizeof(BufferDescPadded), &foundDescs); + /* Align buffer pool on IO page size boundary. */ BufferBlocks = (char *) - ShmemInitStruct("Buffer Blocks", - NBuffers * (Size) BLCKSZ, &foundBufs); + TYPEALIGN(PG_IO_ALIGN_SIZE, + ShmemInitStruct("Buffer Blocks", + NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE, + &foundBufs)); /* Align condition variables to cacheline boundary. */ BufferIOCVArray = (ConditionVariableMinimallyPadded *) @@ -163,7 +166,8 @@ BufferShmemSize(void) /* to allow aligning buffer descriptors */ size = add_size(size, PG_CACHE_LINE_SIZE); - /* size of data pages */ + /* size of data pages, plus alignment padding */ + size = add_size(size, PG_IO_ALIGN_SIZE); size = add_size(size, mul_size(NBuffers, BLCKSZ)); /* size of stuff controlled by freelist.c */ diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 73d30bf619..aba07e94c9 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -3717,7 +3717,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator, bool use_wal; BlockNumber nblocks; BlockNumber blkno; - PGAlignedBlock buf; + PGIOAlignedBlock buf; BufferAccessStrategy bstrategy_src; BufferAccessStrategy bstrategy_dst; diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index 30d67d1c40..735f7c6018 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -546,8 +546,11 @@ GetLocalBufferStorage(void) /* And don't overflow MaxAllocSize, either */ num_bufs = Min(num_bufs, MaxAllocSize / BLCKSZ); - cur_block = (char *) MemoryContextAlloc(LocalBufferContext, - num_bufs * BLCKSZ); + /* Buffers should be I/O aligned. */ + cur_block = (char *) + TYPEALIGN(PG_IO_ALIGN_SIZE, + MemoryContextAlloc(LocalBufferContext, + num_bufs * BLCKSZ + PG_IO_ALIGN_SIZE)); next_buf_in_block = 0; num_bufs_in_block = num_bufs; } diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c index b0b4eeb3bd..2261c3ebe3 100644 --- a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c @@ -95,6 +95,12 @@ struct BufFile off_t curOffset; /* offset part of current pos */ int pos; /* next read/write position in buffer */ int nbytes; /* total # of valid bytes in buffer */ + + /* + * XXX Should ideally us PGIOAlignedBlock, but might need a way to avoid + * wasting per-file alignment padding when some users create many + * files. + */ PGAlignedBlock buffer; }; diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c index a6b0533103..7230d538fd 100644 --- a/src/backend/storage/freespace/freespace.c +++ b/src/backend/storage/freespace/freespace.c @@ -608,7 +608,7 @@ static void fsm_extend(Relation rel, BlockNumber fsm_nblocks) { BlockNumber fsm_nblocks_now; - PGAlignedBlock pg; + PGIOAlignedBlock pg; SMgrRelation reln; PageInit((Page) pg.data, BLCKSZ, 0); diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c index 8b617c7e79..0728ce30c0 100644 --- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c @@ -1522,7 +1522,10 @@ PageSetChecksumCopy(Page page, BlockNumber blkno) * and second to avoid wasting space in processes that never call this. */ if (pageCopy == NULL) - pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ); + pageCopy = MemoryContextAllocAligned(TopMemoryContext, + BLCKSZ, + PG_IO_ALIGN_SIZE, + 0); memcpy(pageCopy, (char *) page, BLCKSZ); ((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno); diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 14b6fa0fd9..3e034afdf1 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -453,6 +453,10 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, int nbytes; MdfdVec *v; + /* If this build supports direct I/O, the buffer must be I/O aligned. */ + if (PG_O_DIRECT != 0) + Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)); + /* This assert is too expensive to have on normally ... */ #ifdef CHECK_WRITE_VS_EXTEND Assert(blocknum >= mdnblocks(reln, forknum)); @@ -675,6 +679,10 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, int nbytes; MdfdVec *v; + /* If this build supports direct I/O, the buffer must be I/O aligned. */ + if (PG_O_DIRECT != 0) + Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)); + TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum, reln->smgr_rlocator.locator.spcOid, reln->smgr_rlocator.locator.dbOid, @@ -740,6 +748,10 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, int nbytes; MdfdVec *v; + /* If this build supports direct I/O, the buffer must be I/O aligned. */ + if (PG_O_DIRECT != 0) + Assert((uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)); + /* This assert is too expensive to have on normally ... */ #ifdef CHECK_WRITE_VS_EXTEND Assert(blocknum < mdnblocks(reln, forknum)); @@ -1294,7 +1306,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno, */ if (nblocks < ((BlockNumber) RELSEG_SIZE)) { - char *zerobuf = palloc0(BLCKSZ); + char *zerobuf = palloc_aligned(BLCKSZ, PG_IO_ALIGN_SIZE, + MCXT_ALLOC_ZERO); mdextend(reln, forknum, nextsegno * ((BlockNumber) RELSEG_SIZE) - 1, diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c index c384f98e13..6ba5030a5f 100644 --- a/src/backend/utils/sort/logtape.c +++ b/src/backend/utils/sort/logtape.c @@ -252,7 +252,7 @@ ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer) */ while (blocknum > lts->nBlocksWritten) { - PGAlignedBlock zerobuf; + PGIOAlignedBlock zerobuf; MemSet(zerobuf.data, 0, sizeof(zerobuf)); diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c index 7f3d5fc040..be6bae2b78 100644 --- a/src/bin/pg_checksums/pg_checksums.c +++ b/src/bin/pg_checksums/pg_checksums.c @@ -183,7 +183,7 @@ skipfile(const char *fn) static void scan_file(const char *fn, int segmentno) { - PGAlignedBlock buf; + PGIOAlignedBlock buf; PageHeader header = (PageHeader) buf.data; int f; BlockNumber blockno; diff --git a/src/bin/pg_rewind/local_source.c b/src/bin/pg_rewind/local_source.c index 2e50485c39..83b37a1e91 100644 --- a/src/bin/pg_rewind/local_source.c +++ b/src/bin/pg_rewind/local_source.c @@ -77,7 +77,7 @@ static void local_queue_fetch_file(rewind_source *source, const char *path, size_t len) { const char *datadir = ((local_source *) source)->datadir; - PGAlignedBlock buf; + PGIOAlignedBlock buf; char srcpath[MAXPGPATH]; int srcfd; size_t written_len; @@ -129,7 +129,7 @@ local_queue_fetch_range(rewind_source *source, const char *path, off_t off, size_t len) { const char *datadir = ((local_source *) source)->datadir; - PGAlignedBlock buf; + PGIOAlignedBlock buf; char srcpath[MAXPGPATH]; int srcfd; off_t begin = off; diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c index 079fbda838..b5809236f6 100644 --- a/src/bin/pg_upgrade/file.c +++ b/src/bin/pg_upgrade/file.c @@ -178,8 +178,8 @@ rewriteVisibilityMap(const char *fromfile, const char *tofile, { int src_fd; int dst_fd; - PGAlignedBlock buffer; - PGAlignedBlock new_vmbuf; + PGIOAlignedBlock buffer; + PGIOAlignedBlock new_vmbuf; ssize_t totalBytesRead = 0; ssize_t src_filesize; int rewriteVmBytesPerPage; diff --git a/src/common/file_utils.c b/src/common/file_utils.c index d8507d88a5..83ef0609a2 100644 --- a/src/common/file_utils.c +++ b/src/common/file_utils.c @@ -539,7 +539,7 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) ssize_t pg_pwrite_zeros(int fd, size_t size) { - PGAlignedBlock zbuffer; /* worth BLCKSZ */ + PGIOAlignedBlock zbuffer; /* worth BLCKSZ */ size_t zbuffer_sz; struct iovec iov[PG_IOV_MAX]; int blocks; diff --git a/src/include/c.h b/src/include/c.h index 98cdd285dd..9df92fb40e 100644 --- a/src/include/c.h +++ b/src/include/c.h @@ -1068,14 +1068,11 @@ extern void ExceptionalCondition(const char *conditionName, /* * Use this, not "char buf[BLCKSZ]", to declare a field or local variable - * holding a page buffer, if that page might be accessed as a page and not - * just a string of bytes. Otherwise the variable might be under-aligned, - * causing problems on alignment-picky hardware. (In some places, we use - * this to declare buffers even though we only pass them to read() and - * write(), because copying to/from aligned buffers is usually faster than - * using unaligned buffers.) We include both "double" and "int64" in the - * union to ensure that the compiler knows the value must be MAXALIGN'ed - * (cf. configure's computation of MAXIMUM_ALIGNOF). + * holding a page buffer, if that page might be accessed as a page. Otherwise + * the variable might be under-aligned, causing problems on alignment-picky + * hardware. We include both "double" and "int64" in the union to ensure that + * the compiler knows the value must be MAXALIGN'ed (cf. configure's + * computation of MAXIMUM_ALIGNOF). */ typedef union PGAlignedBlock { @@ -1084,9 +1081,30 @@ typedef union PGAlignedBlock int64 force_align_i64; } PGAlignedBlock; +/* + * Use this to declare a field or local variable holding a page buffer, if that + * page might be accessed as a page or passed to an SMgr I/O function. If + * allocating using the MemoryContext API, the aligned allocation functions + * should be used with PG_IO_ALIGN_SIZE. This alignment may be more efficient + * for I/O in general, but may be strictly required on some platforms when + * using direct I/O. + */ +typedef union PGIOAlignedBlock +{ +#ifdef pg_attribute_aligned + pg_attribute_aligned(PG_IO_ALIGN_SIZE) +#endif + char data[BLCKSZ]; + double force_align_d; + int64 force_align_i64; +} PGIOAlignedBlock; + /* Same, but for an XLOG_BLCKSZ-sized buffer */ typedef union PGAlignedXLogBlock { +#ifdef pg_attribute_aligned + pg_attribute_aligned(PG_IO_ALIGN_SIZE) +#endif char data[XLOG_BLCKSZ]; double force_align_d; int64 force_align_i64; diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h index f2a106f983..323a4cfb4f 100644 --- a/src/include/pg_config_manual.h +++ b/src/include/pg_config_manual.h @@ -227,6 +227,13 @@ */ #define PG_CACHE_LINE_SIZE 128 +/* + * Assumed memory alignment requirement for direct I/O. On currently known + * systems this size applies, even for memory that is backed by larger virtual + * memory pages. + */ +#define PG_IO_ALIGN_SIZE 4096 + /* *------------------------------------------------------------------------ * The following symbols are for enabling debugging code, not for diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index 7144fc9f60..85ef12c440 100644 --- a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -82,9 +82,10 @@ extern PGDLLIMPORT int max_safe_fds; * to the appropriate Windows flag in src/port/open.c. We simulate it with * fcntl(F_NOCACHE) on macOS inside fd.c's open() wrapper. We use the name * PG_O_DIRECT rather than defining O_DIRECT in that case (probably not a good - * idea on a Unix). + * idea on a Unix). We can only use it if the compiler will correctly align + * PGIOAlignedBlock for us, though. */ -#if defined(O_DIRECT) +#if defined(O_DIRECT) && defined(pg_attribute_aligned) #define PG_O_DIRECT O_DIRECT #elif defined(F_NOCACHE) #define PG_O_DIRECT 0x80000000 diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list index 60c71d05fe..9a77664154 100644 --- a/src/tools/pgindent/typedefs.list +++ b/src/tools/pgindent/typedefs.list @@ -1687,6 +1687,7 @@ PGEventResultDestroy PGFInfoFunction PGFileType PGFunction +PGIOAlignedBlock PGLZ_HistEntry PGLZ_Strategy PGMessageField -- 2.35.1
From e6692d744a7d041519e9c0998ce9f34aabc63c1e Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Tue, 13 Dec 2022 16:54:18 +1300 Subject: [PATCH v2 3/4] Add io_direct setting (developer-only). Provide a way to ask the kernel to use O_DIRECT (or local equivalent) for data and WAL files. This hurts performance currently and is not intended for end-users yet. Later proposed work would introduce our own I/O clustering, read-ahead, etc to replace the kernel features that are disabled with this option. This replaces the previous logic that would use O_DIRECT for the WAL in limited and obscure cases, now that there is an explicit setting. Author: Thomas Munro <thomas.mu...@gmail.com> Author: Andres Freund <and...@anarazel.de> Reviewed-by: Justin Pryzby <pry...@telsasoft.com> Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com --- doc/src/sgml/config.sgml | 33 +++++++ src/backend/access/transam/xlog.c | 37 ++++---- src/backend/access/transam/xlogprefetcher.c | 2 +- src/backend/storage/buffer/bufmgr.c | 16 ++-- src/backend/storage/buffer/localbuf.c | 7 +- src/backend/storage/file/fd.c | 88 +++++++++++++++++++ src/backend/storage/smgr/md.c | 24 +++-- src/backend/storage/smgr/smgr.c | 1 + src/backend/utils/misc/guc_tables.c | 12 +++ src/include/storage/fd.h | 8 ++ src/include/storage/smgr.h | 1 + src/include/utils/guc_hooks.h | 3 + src/test/modules/test_misc/meson.build | 1 + src/test/modules/test_misc/t/004_io_direct.pl | 40 +++++++++ 14 files changed, 239 insertions(+), 34 deletions(-) create mode 100644 src/test/modules/test_misc/t/004_io_direct.pl diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 8e4145979d..766d20f2ea 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -11033,6 +11033,39 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' </listitem> </varlistentry> + <varlistentry id="guc-io-direct" xreflabel="io_direct"> + <term><varname>io_direct</varname> (<type>string</type>) + <indexterm> + <primary><varname>io_direct</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Ask the kernel to minimize caching effects for relation data and WAL + files using <literal>O_DIRECT</literal> (most Unix-like systems), + <literal>F_NOCACHE</literal> (macOS) or + <literal>FILE_FLAG_NO_BUFFERING</literal> (Windows). + </para> + <para> + May be set to an empty string (the default) to disable use of direct + I/O, or a comma-separated list of types of files for which direct I/O + is enabled. The valid types of file are <literal>data</literal> for + main data files, <literal>wal</literal> for WAL files, and + <literal>wal_init</literal> for WAL files when being initially + allocated. + </para> + <para> + Some operating systems and file systems do not support direct I/O, so + non-default settings may be rejected at startup, or produce I/O errors + at runtime. + </para> + <para> + Currently this feature reduces performance, and is intended for + developer testing only. + </para> + </listitem> + </varlistentry> + <varlistentry id="guc-post-auth-delay" xreflabel="post_auth_delay"> <term><varname>post_auth_delay</varname> (<type>integer</type>) <indexterm> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index a75c6813a4..08a2f7a558 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2925,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, XLogSegNo max_segno; int fd; int save_errno; + int open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY; Assert(logtli != 0); @@ -2957,8 +2958,11 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, unlink(tmppath); + if (io_direct_flags & IO_DIRECT_WAL_INIT) + open_flags |= PG_O_DIRECT; + /* do not use get_sync_bit() here --- want to fsync only at end of fill */ - fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY); + fd = BasicOpenFile(tmppath, open_flags); if (fd < 0) ereport(ERROR, (errcode_for_file_access(), @@ -3350,7 +3354,7 @@ XLogFileClose(void) * use the cache to read the WAL segment. */ #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) - if (!XLogIsNeeded()) + if (!XLogIsNeeded() && (io_direct_flags & IO_DIRECT_WAL) == 0) (void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED); #endif @@ -4450,7 +4454,6 @@ show_in_hot_standby(void) return RecoveryInProgress() ? "on" : "off"; } - /* * Read the control file, set respective GUCs. * @@ -8034,35 +8037,27 @@ xlog_redo(XLogReaderState *record) } /* - * Return the (possible) sync flag used for opening a file, depending on the - * value of the GUC wal_sync_method. + * Return the extra open flags used for opening a file, depending on the + * value of the GUCs wal_sync_method, fsync and io_direct. */ static int get_sync_bit(int method) { int o_direct_flag = 0; - /* If fsync is disabled, never open in sync mode */ - if (!enableFsync) - return 0; - /* - * Optimize writes by bypassing kernel cache with O_DIRECT when using - * O_SYNC and O_DSYNC. But only if archiving and streaming are disabled, - * otherwise the archive command or walsender process will read the WAL - * soon after writing it, which is guaranteed to cause a physical read if - * we bypassed the kernel cache. We also skip the - * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same - * reason. - * - * Never use O_DIRECT in walreceiver process for similar reasons; the WAL + * Use O_DIRECT if requested, except in walreceiver process. The WAL * written by walreceiver is normally read by the startup process soon - * after it's written. Also, walreceiver performs unaligned writes, which + * after it's written. Also, walreceiver performs unaligned writes, which * don't work with O_DIRECT, so it is required for correctness too. */ - if (!XLogIsNeeded() && !AmWalReceiverProcess()) + if ((io_direct_flags & IO_DIRECT_WAL) && !AmWalReceiverProcess()) o_direct_flag = PG_O_DIRECT; + /* If fsync is disabled, never open in sync mode */ + if (!enableFsync) + return o_direct_flag; + switch (method) { /* @@ -8074,7 +8069,7 @@ get_sync_bit(int method) case SYNC_METHOD_FSYNC: case SYNC_METHOD_FSYNC_WRITETHROUGH: case SYNC_METHOD_FDATASYNC: - return 0; + return o_direct_flag; #ifdef O_SYNC case SYNC_METHOD_OPEN: return O_SYNC | o_direct_flag; diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c index 0cf03945ee..992256dd09 100644 --- a/src/backend/access/transam/xlogprefetcher.c +++ b/src/backend/access/transam/xlogprefetcher.c @@ -785,7 +785,7 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn) block->prefetch_buffer = InvalidBuffer; return LRQ_NEXT_IO; } - else + else if ((io_direct_flags & IO_DIRECT_DATA) == 0) { /* * This shouldn't be possible, because we already determined diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index aba07e94c9..11c8187a55 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -535,8 +535,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln, * Try to initiate an asynchronous read. This returns false in * recovery if the relation file doesn't exist. */ - if (smgrprefetch(smgr_reln, forkNum, blockNum)) + if ((io_direct_flags & IO_DIRECT_DATA) == 0 && + smgrprefetch(smgr_reln, forkNum, blockNum)) + { result.initiated_io = true; + } #endif /* USE_PREFETCH */ } else @@ -582,11 +585,11 @@ PrefetchSharedBuffer(SMgrRelation smgr_reln, * the kernel and therefore didn't really initiate I/O, and no way to know when * the I/O completes other than using synchronous ReadBuffer(). * - * 3. Otherwise, the buffer wasn't already cached by PostgreSQL, and either + * 3. Otherwise, the buffer wasn't already cached by PostgreSQL, and * USE_PREFETCH is not defined (this build doesn't support prefetching due to - * lack of a kernel facility), or the underlying relation file wasn't found and - * we are in recovery. (If the relation file wasn't found and we are not in - * recovery, an error is raised). + * lack of a kernel facility), direct I/O is enabled, or the underlying + * relation file wasn't found and we are in recovery. (If the relation file + * wasn't found and we are not in recovery, an error is raised). */ PrefetchBufferResult PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum) @@ -4908,6 +4911,9 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag) { PendingWriteback *pending; + if (io_direct_flags & IO_DIRECT_DATA) + return; + /* * Add buffer to the pending writeback array, unless writeback control is * disabled. diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c index 735f7c6018..b01e319641 100644 --- a/src/backend/storage/buffer/localbuf.c +++ b/src/backend/storage/buffer/localbuf.c @@ -87,8 +87,11 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum, { #ifdef USE_PREFETCH /* Not in buffers, so initiate prefetch */ - smgrprefetch(smgr, forkNum, blockNum); - result.initiated_io = true; + if ((io_direct_flags & IO_DIRECT_DATA) == 0 && + smgrprefetch(smgr, forkNum, blockNum)) + { + result.initiated_io = true; + } #endif /* USE_PREFETCH */ } diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index f6c9382023..0829e9b8df 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -98,7 +98,9 @@ #include "storage/fd.h" #include "storage/ipc.h" #include "utils/guc.h" +#include "utils/guc_hooks.h" #include "utils/resowner_private.h" +#include "utils/varlena.h" /* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */ #if defined(HAVE_SYNC_FILE_RANGE) @@ -162,6 +164,9 @@ bool data_sync_retry = false; /* How SyncDataDirectory() should do its job. */ int recovery_init_sync_method = RECOVERY_INIT_SYNC_METHOD_FSYNC; +/* Which kinds of files should be opened with PG_O_DIRECT. */ +int io_direct_flags; + /* Debugging.... */ #ifdef FDDEBUG @@ -2021,6 +2026,11 @@ FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info) if (nbytes <= 0) return; +#ifdef PG_O_DIRECT + if (VfdCache[file].fileFlags & PG_O_DIRECT) + return; +#endif + returnCode = FileAccess(file); if (returnCode < 0) return; @@ -3737,3 +3747,81 @@ data_sync_elevel(int elevel) { return data_sync_retry ? elevel : PANIC; } + +bool +check_io_direct(char **newval, void **extra, GucSource source) +{ +#if PG_O_DIRECT == 0 + if (*newval) + { + GUC_check_errdetail("io_direct is not supported on this platform."); + return false; + } +#else + List *list; + ListCell *l; + int *flags; + + if (!SplitGUCList(*newval, ',', &list)) + { + GUC_check_errdetail("invalid list syntax in parameter \"%s\"", + "io_direct"); + return false; + } + + flags = guc_malloc(ERROR, sizeof(*flags)); + *flags = 0; + foreach (l, list) + { + char *item = (char *) lfirst(l); + + if (pg_strcasecmp(item, "data") == 0) + *flags |= IO_DIRECT_DATA; + else if (pg_strcasecmp(item, "wal") == 0) + *flags |= IO_DIRECT_WAL; + else if (pg_strcasecmp(item, "wal_init") == 0) + *flags |= IO_DIRECT_WAL_INIT; + else + { + GUC_check_errdetail("invalid option \"%s\"", item); + return false; + } + } + + *extra = flags; + + return true; +#endif +} + +extern void +assign_io_direct(const char *newval, void *extra) +{ + int *flags = (int *) extra; + + io_direct_flags = *flags; +} + +extern const char * +show_io_direct(void) +{ + static char result[80]; + + result[0] = 0; + if (io_direct_flags & IO_DIRECT_DATA) + strcat(result, "data"); + if (io_direct_flags & IO_DIRECT_WAL) + { + if (result[0]) + strcat(result, ", "); + strcat(result, "wal"); + } + if (io_direct_flags & IO_DIRECT_WAL_INIT) + { + if (result[0]) + strcat(result, ", "); + strcat(result, "wal_init"); + } + + return result; +} diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 3e034afdf1..38263f3d0f 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -142,6 +142,16 @@ static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forknum, static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg); +static inline int +_mdfd_open_flags(ForkNumber forkNum) +{ + int flags = O_RDWR | PG_BINARY; + + if (io_direct_flags & IO_DIRECT_DATA) + flags |= PG_O_DIRECT; + + return flags; +} /* * mdinit() -- Initialize private state for magnetic disk storage manager. @@ -205,14 +215,14 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo) path = relpath(reln->smgr_rlocator, forknum); - fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY); + fd = PathNameOpenFile(path, _mdfd_open_flags(forknum) | O_CREAT | O_EXCL); if (fd < 0) { int save_errno = errno; if (isRedo) - fd = PathNameOpenFile(path, O_RDWR | PG_BINARY); + fd = PathNameOpenFile(path, _mdfd_open_flags(forknum)); if (fd < 0) { /* be sure to report the error reported by create, not open */ @@ -527,7 +537,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior) path = relpath(reln->smgr_rlocator, forknum); - fd = PathNameOpenFile(path, O_RDWR | PG_BINARY); + fd = PathNameOpenFile(path, _mdfd_open_flags(forknum)); if (fd < 0) { @@ -598,6 +608,8 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) off_t seekpos; MdfdVec *v; + Assert((io_direct_flags & IO_DIRECT_DATA) == 0); + v = _mdfd_getseg(reln, forknum, blocknum, false, InRecovery ? EXTENSION_RETURN_NULL : EXTENSION_FAIL); if (v == NULL) @@ -623,6 +635,8 @@ void mdwriteback(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, BlockNumber nblocks) { + Assert((io_direct_flags & IO_DIRECT_DATA) == 0); + /* * Issue flush requests in as few requests as possible; have to split at * segment boundaries though, since those are actually separate files. @@ -1200,7 +1214,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno, fullpath = _mdfd_segpath(reln, forknum, segno); /* open the file */ - fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags); + fd = PathNameOpenFile(fullpath, _mdfd_open_flags(forknum) | oflags); pfree(fullpath); @@ -1410,7 +1424,7 @@ mdsyncfiletag(const FileTag *ftag, char *path) strlcpy(path, p, MAXPGPATH); pfree(p); - file = PathNameOpenFile(path, O_RDWR | PG_BINARY); + file = PathNameOpenFile(path, _mdfd_open_flags(ftag->forknum)); if (file < 0) return -1; need_to_close = true; diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c index c1a5febcbf..4892920812 100644 --- a/src/backend/storage/smgr/smgr.c +++ b/src/backend/storage/smgr/smgr.c @@ -20,6 +20,7 @@ #include "access/xlogutils.h" #include "lib/ilist.h" #include "storage/bufmgr.h" +#include "storage/fd.h" #include "storage/ipc.h" #include "storage/md.h" #include "storage/smgr.h" diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 1bf14eec66..1de30ebbf1 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -543,6 +543,7 @@ static char *locale_ctype; static char *server_encoding_string; static char *server_version_string; static int server_version_num; +static char *io_direct_string; #ifdef HAVE_SYSLOG #define DEFAULT_SYSLOG_FACILITY LOG_LOCAL0 @@ -4468,6 +4469,17 @@ struct config_string ConfigureNamesString[] = check_backtrace_functions, assign_backtrace_functions, NULL }, + { + {"io_direct", PGC_POSTMASTER, DEVELOPER_OPTIONS, + gettext_noop("Use direct I/O for file access."), + NULL, + GUC_NOT_IN_SAMPLE + }, + &io_direct_string, + "", + check_io_direct, assign_io_direct, show_io_direct + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h index 85ef12c440..0d65cb3c80 100644 --- a/src/include/storage/fd.h +++ b/src/include/storage/fd.h @@ -44,6 +44,8 @@ #define FD_H #include <dirent.h> +#include <fcntl.h> + typedef enum RecoveryInitSyncMethod { @@ -54,10 +56,16 @@ typedef enum RecoveryInitSyncMethod typedef int File; +#define IO_DIRECT_DATA 0x01 +#define IO_DIRECT_WAL 0x02 +#define IO_DIRECT_WAL_INIT 0x04 + + /* GUC parameter */ extern PGDLLIMPORT int max_files_per_process; extern PGDLLIMPORT bool data_sync_retry; extern PGDLLIMPORT int recovery_init_sync_method; +extern PGDLLIMPORT int io_direct_flags; /* * This is private to fd.c, but exported for save/restore_backend_variables() diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h index a07715356b..ea7b3ff8dd 100644 --- a/src/include/storage/smgr.h +++ b/src/include/storage/smgr.h @@ -17,6 +17,7 @@ #include "lib/ilist.h" #include "storage/block.h" #include "storage/relfilelocator.h" +#include "utils/guc.h" /* * smgr.c maintains a table of SMgrRelation objects, which are essentially diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h index f1a9a183b4..61a7fd77b8 100644 --- a/src/include/utils/guc_hooks.h +++ b/src/include/utils/guc_hooks.h @@ -154,5 +154,8 @@ extern bool check_wal_consistency_checking(char **newval, void **extra, GucSource source); extern void assign_wal_consistency_checking(const char *newval, void *extra); extern void assign_xlog_sync_method(int new_sync_method, void *extra); +extern bool check_io_direct(char **newval, void **extra, GucSource source); +extern void assign_io_direct(const char *newval, void *extra); +extern const char *show_io_direct(void); #endif /* GUC_HOOKS_H */ diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build index cfc830ff39..97162d2b8f 100644 --- a/src/test/modules/test_misc/meson.build +++ b/src/test/modules/test_misc/meson.build @@ -7,6 +7,7 @@ tests += { 't/001_constraint_validation.pl', 't/002_tablespace.pl', 't/003_check_guc.pl', + 't/004_io_direct.pl', ], }, } diff --git a/src/test/modules/test_misc/t/004_io_direct.pl b/src/test/modules/test_misc/t/004_io_direct.pl new file mode 100644 index 0000000000..9a79fc8f9d --- /dev/null +++ b/src/test/modules/test_misc/t/004_io_direct.pl @@ -0,0 +1,40 @@ +# Very simple exercise of direct I/O GUC. + +use strict; +use warnings; +use PostgreSQL::Test::Cluster; +use PostgreSQL::Test::Utils; +use Test::More; + +# Systems that we know to have direct I/O support, and whose typical local +# filesystems support it or at least won't fail with an error. (illumos should +# probably be in this list, but perl reports it as solaris. Solaris should not +# be in the list because we don't support its way of turning on direct I/O, and +# even if we did, its version of ZFS rejects it) and OpenBSD just doesn't have +# it.) +if (!grep { $^O eq $_ } qw(aix darwin dragonfly freebsd linux MSWin32 netbsd)) +{ + plan skip_all => "no direct I/O support"; +} + +my $node = PostgreSQL::Test::Cluster->new('main'); +$node->init; +$node->append_conf('io_direct', 'data,wal,wal_init'); +$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O +$node->start; + +# Do some work that is bound to generate shared and local writes and reads as a +# simple exercise. +$node->safe_psql('postgres', 'create table t1 as select 1 as i from generate_series(1, 10000)'); +$node->safe_psql('postgres', 'create table t2count (i int)'); +$node->safe_psql('postgres', 'begin; create temporary table t2 as select 1 as i from generate_series(1, 10000); update t2 set i = i; insert into t2count select count(*) from t2; commit;'); +$node->safe_psql('postgres', 'update t1 set i = i'); +is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared"); +is('10000', $node->safe_psql('postgres', 'select * from t2count'), "read back from local"); +$node->stop('immediate'); + +$node->start; +is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared after crash recovery"); +$node->stop; + +done_testing(); -- 2.35.1
From 75dec1b3ffa91ca1279267092187191ca99fb713 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Tue, 13 Dec 2022 16:55:09 +1300 Subject: [PATCH v2 4/4] XXX turn on direct I/O by default, just for CI --- src/backend/utils/misc/guc_tables.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 1de30ebbf1..0fc2185568 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -4476,7 +4476,7 @@ struct config_string ConfigureNamesString[] = GUC_NOT_IN_SAMPLE }, &io_direct_string, - "", + "data,wal,wal_init", check_io_direct, assign_io_direct, show_io_direct }, -- 2.35.1