data corruption hazard in reorderbuffer.c

Mark Dilger Thu, 15 Jul 2021 13:03:25 -0700

Hackers,

I believe there is a hazard in reorderbuffer.c if a call to write() buffers 
data rather than flushing it to disk, only to fail when flushing the data 
during close().  The relevant code is in ReorderBufferSerializeTXN(), which 
iterates over changes for a transaction, opening transient files as needed for 
the changes to be written, writing the transaction changes to the transient 
files using ReorderBufferSerializeChange(), and closing the files when finished 
using CloseTransientFile(), the return code from which is ignored.  If 
ReorderBufferSerializeChange() were fsync()ing the writes, then ignoring the 
failures on close() would likely be ok, or at least in line with what we do 
elsewhere.  But as far as I can see, no call to sync the file is performed.


I expect a failure here could result in a partially written change in the 
transient file, perhaps preceded or followed by additional complete or partial 
changes.  Perhaps even worse, a failure could result in a change not being 
written at all (rather than partially), potentially with preceding and 
following changes written intact, with no indication that one change is missing.

Upon testing, both of these expectations appear to be true.  Skipping some 
writes while not others easily creates a variety of failures, and for brevity I 
won't post a patch to demonstrate that here.  The following code change causes 
whole rather than partial changes to be written or skipped.  After applying 
this change and running the tests in contrib/test_decoding/, "test toast" shows 
that an UPDATE command has been silently skipped.

diff --git a/src/backend/replication/logical/reorderbuffer.c 
b/src/backend/replication/logical/reorderbuffer.c
index 7378beb684..a6c60b81c9 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -108,6 +108,7 @@
 #include "utils/rel.h"
 #include "utils/relfilenodemap.h"
 
+static bool lose_writes_until_close = false;
 
 /* entry for a hash table we use to map from xid to our transaction state */
 typedef struct ReorderBufferTXNByIdEnt
@@ -3523,6 +3524,8 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, 
ReorderBufferTXN *txn)
        Size            spilled = 0;
        Size            size = txn->size;
 
+       lose_writes_until_close = false;
+
        elog(DEBUG2, "spill %u changes in XID %u to disk",
                 (uint32) txn->nentries_mem, txn->xid);
 
@@ -3552,7 +3555,10 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, 
ReorderBufferTXN *txn)
                        char            path[MAXPGPATH];
 
                        if (fd != -1)
+                       {
                                CloseTransientFile(fd);
+                               lose_writes_until_close = 
!lose_writes_until_close;
+                       }
 
                        XLByteToSeg(change->lsn, curOpenSegNo, 
wal_segment_size);
 
@@ -3600,6 +3606,8 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, 
ReorderBufferTXN *txn)
 
        if (fd != -1)
                CloseTransientFile(fd);
+
+       lose_writes_until_close = false;
 }
 
 /*
@@ -3790,7 +3798,10 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, 
ReorderBufferTXN *txn,
 
        errno = 0;
        pgstat_report_wait_start(WAIT_EVENT_REORDER_BUFFER_WRITE);
-       if (write(fd, rb->outbuf, ondisk->size) != ondisk->size)
+
+       if (lose_writes_until_close)
+               ;       /* silently do nothing with buffer contents */
+       else if (write(fd, rb->outbuf, ondisk->size) != ondisk->size)
        {
                int                     save_errno = errno;


The attached patch catches the failures on close(), but to really fix this 
properly, we should call pg_fsync() before close().

Any thoughts on this?  It seems to add a fair amount of filesystem burden to 
add all the extra fsync activity, though I admit to having not benchmarked that 
yet.  Perhaps doing something like Thomas's work for commit dee663f7843 where 
closing files is delayed so that fewer syncs are needed?  I'm not sure how much 
that would help here, and would like feedback before pursuing anything of that 
sort.

The relevant code exists back as far as the 9_4_STABLE branch, where commit 
b89e151054a from 2014 first appears.

v1-0001-No-longer-ignoring-failures-on-file-close.patch
Description: Binary data


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

data corruption hazard in reorderbuffer.c

Reply via email to