We got another report today [1] of what seems to be the same problem we've seen before, namely a failed vacuum truncation leaving corrupt state on disk [2].

Reflecting on that some more, it seems to me that we're never going to get to a solution that everybody finds acceptable without some rather significant restructuring at the buffer-access level. Since looking for a back-patchable solution has yielded no progress in eight years, what if we just accept that we will only fix this in HEAD, and think outside the box about how we could fix it if we're willing to change internal APIs as much as necessary?
After doodling for a while with that in mind, it seems like we might be able to fix it by introducing a new buffer state "truncation pending" that we'd apply to empty-but-dirty buffers that we don't want to write out. Aside from fixing the data corruption issue, this sketch has the significant benefit that we don't need to take AccessExclusiveLock anymore to do a vacuum truncation: it seems sufficient to lock out would-be writers of the table.

VACUUM's truncation logic would go like this (there's a rough code sketch of the whole sequence below):

1. Take ShareLock on the table to lock out writers but not readers. (We might need ShareRowExclusiveLock, not sure.) As with the existing code that takes AccessExclusiveLock, this is a lock upgrade, so it could fail; but that's OK, we just don't truncate.

2. Scan backwards from relation EOF to determine the last page to be deleted (same as existing logic).

3. Truncate the FSM and VM. (This can be done outside the critical section because it's harmless if we fail later; the lost state can be reconstructed, and anyway we know all the forgotten-about pages are empty.)

4. Issue a WAL truncation record, and make sure it's flushed.

5. Begin a critical section, so that any hard ERROR below becomes a PANIC. (We don't really expect any error, but it's not OK for the vacuum process to disappear without having undone any truncation-pending marks.)

6. Re-scan buffers from the first to the last page to be deleted, using a fetch mode that doesn't pull in pages not already present in the buffer pool. (That way, we don't issue any reads during this phase, reducing the risk of unwanted ERROR/PANIC.) As we examine each buffer:
   * If not empty of live tuples, panic :-(
   * If clean, delete the buffer.
   * If dirty, mark it as truncation pending.
   Remember the first and last page numbers that got marked as trunc pending.

7. Issue ftruncate(s), working backwards if the truncation spans multiple segment files. Don't error out on syscall failure, just stop truncating and note the boundary of successful truncation.

8. Re-scan buffers from the first to the last trunc-pending page, again skipping anything not found in the buffer pool. Anything found should be trunc pending, or possibly empty if a reader pulled it in concurrently. Either delete it if it's above what we successfully truncated to, or drop the trunc-pending bit (reverting it to dirty state) if not.

9. If the actual truncation boundary was different from the plan, issue another WAL record saying "oh, we only managed to truncate to here, not there".

10. End the critical section.

11. Release the table ShareLock.

Now, what does the "truncation pending" buffer flag do exactly?

* The buffer manager is not allowed to drop such a page from the pool, nor to attempt to write it out; it's just in limbo till vacuum's critical section completes.

* Readers: assume the page is empty, move on. (A seqscan could actually stop, knowing that all later pages must be empty too.) Or, since the page must be empty of live tuples, readers could just process it normally, but I'd rather they didn't.

* Writers: error, should not be able to see this state.

* Background writer: ignore and move on (must *not* try to write the dirty page, since the truncate might've already happened).

* Checkpointer: I think the checkpointer has to wait for the flag to go away :-(. We can't mark a checkpoint complete if there are dirty trunc-pending pages. This aspect could use more thought, perhaps.
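To make that a bit more concrete, here is a very rough sketch of the vacuum-side sequence in C. To be clear, this is only a sketch under the above assumptions: BM_TRUNC_PENDING, RBM_IF_CACHED, PageIsEmptyOfLiveTuples(), BufferIsDirty(), MarkBufferTruncPending(), ClearBufferTruncPending(), DropEmptyBuffer(), log_heap_truncate(), log_heap_truncate_adjust(), and smgrtruncate_no_error() are all invented names for pieces that don't exist yet; only the lock-manager, critical-section, FSM/VM, and buffer-read calls are existing APIs, and buffer pin/content-lock discipline is glossed over.

/*
 * Rough sketch of the proposed vacuum-side truncation sequence.
 * Hypothetical pieces (new buffer flag, new read mode, new helpers)
 * are marked as such; error handling, pin/lock discipline, and
 * segment-by-segment details are glossed over.
 */
static void
lazy_truncate_heap_sketch(Relation onerel, BlockNumber old_rel_pages,
                          BlockNumber new_rel_pages)
{
    BlockNumber first_pending = InvalidBlockNumber;
    BlockNumber last_pending = InvalidBlockNumber;
    BlockNumber achieved;
    BlockNumber blkno;

    /* 1. Lock out writers but not readers; if we can't, just give up. */
    if (!ConditionalLockRelation(onerel, ShareLock))
        return;

    /* 2. new_rel_pages is assumed to have been determined already by
     * scanning backwards from the old EOF for the last nonempty page. */

    /* 3. Truncate the FSM and VM; harmless if we fail after this point. */
    FreeSpaceMapTruncateRel(onerel, new_rel_pages);
    visibilitymap_truncate(onerel, new_rel_pages);

    /* 4. WAL-log the intended truncation and make sure it's flushed.
     * log_heap_truncate() is a hypothetical helper for the new record. */
    XLogFlush(log_heap_truncate(onerel, new_rel_pages));

    /* 5. From here on, any ERROR is promoted to PANIC. */
    START_CRIT_SECTION();

    /*
     * 6. Walk the to-be-deleted range without reading anything in from
     * disk.  RBM_IF_CACHED is a hypothetical ReadBufferMode that returns
     * InvalidBuffer when the page isn't already in shared buffers.
     */
    for (blkno = new_rel_pages; blkno < old_rel_pages; blkno++)
    {
        Buffer      buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
                                             RBM_IF_CACHED, NULL);

        if (!BufferIsValid(buf))
            continue;
        if (!PageIsEmptyOfLiveTuples(BufferGetPage(buf)))   /* hypothetical */
            elog(PANIC, "page %u still contains live tuples", blkno);
        if (BufferIsDirty(buf))                             /* hypothetical */
        {
            MarkBufferTruncPending(buf);    /* sets hypothetical BM_TRUNC_PENDING */
            if (first_pending == InvalidBlockNumber)
                first_pending = blkno;
            last_pending = blkno;
        }
        else
            DropEmptyBuffer(buf);           /* hypothetical: discard clean buffer */
    }

    /* 7. Do the ftruncate(s), last segment first; don't ERROR on syscall
     * failure, just report how many pages are actually left afterwards
     * (hypothetical helper). */
    achieved = smgrtruncate_no_error(onerel, new_rel_pages);

    /* 8. Resolve the trunc-pending buffers against the achieved boundary. */
    if (first_pending != InvalidBlockNumber)
    {
        for (blkno = first_pending; blkno <= last_pending; blkno++)
        {
            Buffer      buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
                                                 RBM_IF_CACHED, NULL);

            if (!BufferIsValid(buf))
                continue;
            if (blkno >= achieved)
                DropEmptyBuffer(buf);           /* page is gone from disk */
            else
                ClearBufferTruncPending(buf);   /* revert to plain dirty */
        }
    }

    /* 9. If we fell short of the plan, WAL-log where we actually got to. */
    if (achieved != new_rel_pages)
        log_heap_truncate_adjust(onerel, achieved);         /* hypothetical */

    /* 10 and 11. Done. */
    END_CRIT_SECTION();
    UnlockRelation(onerel, ShareLock);
}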
* "Only managed to truncate to here" record: write out empty heap pages to fill the space from original truncation target to actual. This restores the on-disk situation to be equivalent to what it was in master, assuming all the dirty pages eventually got written. It's slightly annoying to write the second truncation record inside the critical section, but I think we need to because if we don't, there's a window where we don't write the second record at all and so the post-replay situation is different from what the master had on disk. Note that if we do crash in the critical section, the post-replay situation will be that we truncate to the original target, which seems fine since nothing else could've affected the table state. (Maybe we should issue the first WAL record inside the critical section too? Not sure.) One issue with not holding AEL is that there are race conditions wherein readers might attempt to fetch pages beyond the file EOF (for example, a seqscan that started before the truncation began would attempt to read up to the old EOF, or a very slow indexscan might try to follow a pointer from a since-deleted index entry). So we would have to change things to regard that as a non-error condition. That might be fine, but it does seem like it's giving up some error detection capability. If anyone's sufficiently worried about that, we could keep the lock level at AEL; but personally I think reducing the lock level is worth enough to be willing to make that compromise. Another area that's possibly worthy of concern is whether a reader could attempt to re-set bits in the FSM or VM for to-be-deleted pages after the truncation of those files. We might need some interlock to prevent that. Or, perhaps, just re-truncate them after the main truncation? Or maybe it doesn't matter if there's bogus data in the maps. Also, I'm not entirely sure whether there's anything in our various replication logic that's dependent on vacuum truncation taking AEL. Offhand I'd expect the reduced use of AEL to be a plus, but maybe I'm missing something. Thoughts? Are there obvious holes in this plan? regards, tom lane [1] https://www.postgresql.org/message-id/28278.1544463...@sss.pgh.pa.us [2] https://www.postgresql.org/message-id/flat/5BBC590AE8DF4ED1A170E4D48F1B53AC@tunaPC