On 2015-11-27 16:59:20 +0900, Michael Paquier wrote: > Attached is a patch that fixes the issue for me in master and 9.5. > Actually in the last patch I forgot a call to smgrwrite to ensure that > the INIT_FORKNUM is correctly synced to disk when those pages are > replayed at recovery, letting the reset routines for unlogged > relations do their job correctly. I have noticed as well that we need > to do the same for gin and brin relations. In this case I think that > we could limit the flush to unlogged relations, my patch does it > unconditionally though to generalize the logic. Thoughts?
I think it's a good idea to limit this to unlogged relations. For a dataset that fits into shared_buffers and creates many relations, the additional disk writes could be noticeable. > diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c > index 99337b0..d7964ac 100644 > --- a/src/backend/access/brin/brin.c > +++ b/src/backend/access/brin/brin.c > @@ -682,7 +682,12 @@ brinbuildempty(PG_FUNCTION_ARGS) > brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index), > BRIN_CURRENT_VERSION); > MarkBufferDirty(metabuf); > - log_newpage_buffer(metabuf, false); > + /* > + * When this full page image is replayed, there is no guarantee that > + * this page will be present to disk when replayed particularly for > + * unlogged relations, hence enforce it to be flushed to disk. > + */ The grammar seems a bit off here. > + /* > + * Initialize and xlog metabuffer and root buffer. When those full > + * pages are replayed, it is not guaranteed that those relation > + * init forks will be flushed to disk after replaying them, hence > + * enforce those pages to be flushed to disk at replay, only the > + * last record will enforce a flush for performance reasons and > + * because it is actually unnecessary to do it multiple times. > + */ That comment needs some love too. I think it really only makes sense if you know the current state. There's also some language polishing needed. > @@ -450,14 +450,21 @@ ginbuildempty(PG_FUNCTION_ARGS) > START_CRIT_SECTION(); > GinInitMetabuffer(MetaBuffer); > MarkBufferDirty(MetaBuffer); > - log_newpage_buffer(MetaBuffer, false); > + log_newpage_buffer(MetaBuffer, false, false); > GinInitBuffer(RootBuffer, GIN_LEAF); > MarkBufferDirty(RootBuffer); > - log_newpage_buffer(RootBuffer, false); > + log_newpage_buffer(RootBuffer, false, true); > END_CRIT_SECTION(); > Why isn't the metapage flushed to disk? > --- a/src/backend/access/spgist/spginsert.c > +++ b/src/backend/access/spgist/spginsert.c > @@ -173,7 +173,7 @@ spgbuildempty(PG_FUNCTION_ARGS) > (char *) page, true); > if (XLogIsNeeded()) > log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, > - SPGIST_METAPAGE_BLKNO, page, false); > + SPGIST_METAPAGE_BLKNO, page, false, > false); > > /* Likewise for the root page. */ > SpGistInitPage(page, SPGIST_LEAF); > @@ -183,7 +183,7 @@ spgbuildempty(PG_FUNCTION_ARGS) > (char *) page, true); > if (XLogIsNeeded()) > log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM, > - SPGIST_ROOT_BLKNO, page, true); > + SPGIST_ROOT_BLKNO, page, true, false); > I don't think it's correct to not flush in these cases. Yes, the primary does an smgrimmedsync() - but that's not done on the standby. > @@ -9392,9 +9396,28 @@ xlog_redo(XLogReaderState *record) > * when checksums are enabled. There is no difference in > handling > * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different > info > * code just to distinguish them for statistics purposes. > + * > + * XLOG_FPI_FOR_SYNC records are generated when a relation > needs to > + * be sync'ed just after replaying a full page. This is > important > + * to match the master behavior in the case where a page is > written > + * directly to disk without going through shared buffers, > particularly > + * when writing init forks for index relations. > */ How about FPI_FOR_SYNC records indicate that the page immediately needs to be written to disk, not just to shared buffers. This is important if the on-disk state is to be the authoritative, not the state in shared buffers. E.g. because on-disk files may later be copied directly. > diff --git a/src/backend/commands/tablecmds.c > b/src/backend/commands/tablecmds.c > index 0ddde72..eb22a51 100644 > --- a/src/backend/commands/tablecmds.c > +++ b/src/backend/commands/tablecmds.c > @@ -9920,7 +9920,8 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst, > * space. > */ > if (use_wal) > - log_newpage(&dst->smgr_rnode.node, forkNum, blkno, > page, false); > + log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, > + false, false); Shouldn't this flush data if it's an init fork? Currently that's an academic distinction, given there'll never be a page, but it still seems prudent. > extern void InitXLogInsert(void); > diff --git a/src/include/catalog/pg_control.h > b/src/include/catalog/pg_control.h > index ad1eb4b..91445f1 100644 > --- a/src/include/catalog/pg_control.h > +++ b/src/include/catalog/pg_control.h > @@ -73,6 +73,7 @@ typedef struct CheckPoint > #define XLOG_END_OF_RECOVERY 0x90 > #define XLOG_FPI_FOR_HINT 0xA0 > #define XLOG_FPI 0xB0 > +#define XLOG_FPI_FOR_SYNC 0xC0 I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too ambigous for my taste. How about either naming it XLOG_FPI_FLUSH or instead adding actual record data and a 'flags' field in there? I slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are different, XLOG_FPI_FOR_SYNC not so much. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers