On 7/16/21 1:07 AM, Mark Dilger wrote:


On Jul 15, 2021, at 3:32 PM, Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:

I think it's mostly futile to list all the possible issues this
might have caused - if you skip arbitrary decoded changes, that can
trigger pretty much any bug in reorder buffer. But those bugs can
be triggered by various other issues, of course.

I thought that at first, too, but when I searched for bug reports
with the given error messages, they all had to do with logical
replication or logical decoding.  That seems a bit fishy to me.  If
these bugs were coming from all over the system, why would that be
so?


No, I'm not suggesting those bugs are coming from all over the system. The theory is that there's a bug in logical decoding / reorder buffer, with the same symptoms. So it'd affect systems with logical replication.

It's hard to say what the cause was, but the "logic" bugs are
probably permanent, while the issues triggered by I/O probably
disappear after a restart?

If you mean "logic" bugs like passing an invalid file descriptor to
close(), then yes, those would be permanent, but I don't believe we
have any bugs of that kind.


No, by "logic" bugs I mean cases like failing to update the snapshot, losing track of relfilenode changes, and so on, due to failing to consider some corner case. As opposed to "I/O errors", where the problem is caused by external events.

The consequences of I/O errors are not going to go away after
restart.  On the subscriber side, if logical replication has replayed
less than a full transaction worth of changes without raising an
error, the corruption will be permanent.


True, good point. I was thinking about the "could not map relfilenode" error, which forces the decoding to restart, and on the retry it's unlikely to hit the same I/O error, so it'll succeed. But you're right that if the corrupted data reaches the subscriber and gets committed (e.g. a missing row), the damage is permanent.

That being said, I agree this seems like an issue and we should not
ignore I/O errors.

I agree.

I'd bet other places using transient files (like sorting or hashagg
spilling to disk) have the same issue, although in that case the
impact is likely limited to a single query.

I'm not so convinced.  In most places, the result is ignored for
close()-type operations only when the prior operation failed and
we're closing the handle in preparation for raising an error.  There
are a small number of other places, such as
pg_import_system_collations and perform_base_backup, which ignore the
result from closing a handle, but those are reading from the handle,
not writing to it, so the situation is not comparable.

I believe the oversight in reorderbuffer.c really is a special case.


Hmm, you might be right. I only looked at buffile.c and tuplestore.c briefly yesterday, and I wasn't sure about the error checking. But maybe that works fine, now that I look at it, because we don't really use the files after close().

I wonder if sync before the close is an appropriate solution,
though. It seems rather expensive, and those files are meant to be
"temporary" (i.e. we don't keep them over restart). So maybe we
could ensure consistency in a cheaper way - perhaps tracking
some sort of checksum for each file, or something like that?

I'm open to suggestions of what to do, but thinking of these files as
temporary may be what misleads developers to believe they don't have
to be treated carefully.  The code was first committed in 2014 and as
far as I am aware nobody else has complained about this before.  In
some part that might be because CloseTemporaryFile() is less familiar
than close() to most developers, and they may be assuming that it
contains its own error handling and just don't realize that it
returns an error code just like regular close() does.
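
To make that concrete, here is a minimal sketch of the kind of check I
mean, written against the transient-file wrappers in storage/fd.h as a
stand-in; the helper and variable names are invented for illustration,
and this is not the actual reorderbuffer.c code:

#include "postgres.h"
#include "storage/fd.h"    /* OpenTransientFile(), CloseTransientFile() */

/*
 * Illustration only: close a spill file and report a failure instead of
 * silently discarding the result of closing it.  "path" is whatever name
 * the file was opened under.
 */
static void
close_spill_file(int fd, const char *path)
{
    if (CloseTransientFile(fd) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not close file \"%s\": %m", path)));
}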


I don't think anyone thinks corruption of temporary files would not be an issue, but making them fully persistent (which is what fsync would do) just to ensure a piece of data is not lost seems like overkill to me. And the file may get corrupted even with fsync(), so I'm wondering if there's a way to ensure integrity without the fsync (i.e. more cheaply).

The scheme I was thinking about is a simple checksum for each BufFile: when writing a buffer to disk, update a crc32c checksum, and then check it after reading the file back (and error out if it disagrees).
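
To make that a bit more concrete, here is a rough sketch of what I have in mind, using the existing pg_crc32c macros; the struct and function names are made up for illustration, it's not a proposed patch:

#include "postgres.h"
#include "port/pg_crc32c.h"    /* INIT_CRC32C(), COMP_CRC32C(), ... */

/* Illustration only: a running checksum for one spill file. */
typedef struct SpillChecksum
{
    pg_crc32c   crc;
} SpillChecksum;

static void
spill_checksum_init(SpillChecksum *sc)
{
    INIT_CRC32C(sc->crc);
}

/* Call for every buffer written to (and, later, read back from) the file. */
static void
spill_checksum_update(SpillChecksum *sc, const void *data, size_t len)
{
    COMP_CRC32C(sc->crc, data, len);
}

/*
 * After reading the file back, compare the checksum accumulated while
 * writing with the one accumulated while reading, and error out on a
 * mismatch instead of replaying possibly-corrupted changes.
 */
static void
spill_checksum_verify(SpillChecksum *written, SpillChecksum *read)
{
    FIN_CRC32C(written->crc);
    FIN_CRC32C(read->crc);

    if (!EQ_CRC32C(written->crc, read->crc))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("checksum mismatch in logical decoding spill file")));
}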

The point of the reorder buffer is to sort the changes from a single
transaction so that they can be replayed in order.  The files used
for the sorting are temporary, but there is nothing temporary about
the failure to include all the changes in the files, as that will
have permanent consequences for whoever replays them.  If lucky, the
attempt to replay the changes will abort because they don't make
sense, but I've demonstrated to my own satisfaction that nothing
prevents silent data loss if the failure to write changes happens to
destroy a complete rather than partial change.


Sure, I agree with all of that.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

