On Fri, Jan 28, 2022 at 11:39 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
> Hmm, if a relation is dropped, we use plain unlink() to delete it (at
> the next checkpoint). Should we use durable_unlink() there, or otherwise
> arrange to fsync() the parent directory?

Hmmmmm. I think the latter might be a good idea, but not because of the
file we unlink after checkpoint. Rationale: On commit, we truncate all
segments, and unlink segments > 0. After checkpoint, we unlink segment 0
(the tombstone preventing relfilenode recycling). So, it might close
some storage leak windows if we did:

1. register_dirty_segment() on truncated segment 0, so that at
checkpoint it is fsynced. That means that if we lose power between the
checkpoint and the unlink(), at least its size is durably zero and not
1GB. I don't know how to completely avoid leaking empty tombstone files
if you lose power in that window. durable_unlink() may narrow the
window but it's still on the wrong side of a checkpoint and won't be
replayed on crash recovery. I hope we can get rid of tombstones
completely as Dilip is attempting in [1].

2. fsync() the containing directory as part of checkpointing, so the
unlinks of non-tombstone segments are made durable at checkpoint.
Otherwise, after checkpoint, you might lose power, come back up and
find the segments are still present in the directory, and worse, not
truncated. AFAICS we have no defences against zombie segments > 0 if
the tombstone is gone, allowing recycling. Zombie segments could appear
to be concatenated to the next user of the relfilenode. That problem
goes away with Dilip's project.

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BfTEcYQvc18aEbrJjPri99A09JZcEXzXjT5h57S%2BAgkw%40mail.gmail.com
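
To make point 2 a bit more concrete, here is a rough standalone sketch
in plain POSIX C of "unlink segments > 0, then fsync the containing
directory once". It is not PostgreSQL code: fsync_dir() and
unlink_segments_durably() are made-up names for illustration, and the
"<base>.<segno>" file naming is only assumed to mirror how segments > 0
are named on disk. In the tree, the directory fsync would presumably go
through the existing fd.c helpers at checkpoint time rather than
anything like this.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a PostgreSQL function): fsync a directory so
 * that earlier unlink()s of its entries are durable after power loss.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
fsync_dir(const char *dir)
{
    int fd;

    assert(dir != NULL);

    fd = open(dir, O_RDONLY);
    if (fd < 0)
        return -1;

    if (fsync(fd) != 0)
    {
        int save_errno = errno;

        close(fd);
        errno = save_errno;
        return -1;
    }

    return close(fd);
}

/*
 * Hypothetical helper: unlink segments 1 .. nsegments-1 of a relation
 * file, leaving segment 0 (the tombstone, named just "<base>") alone,
 * then make all of those unlinks durable with a single directory fsync.
 * Already-missing segments are ignored.
 */
static int
unlink_segments_durably(const char *dir, const char *base, int nsegments)
{
    char path[1024];

    assert(dir != NULL && base != NULL);

    for (int segno = 1; segno < nsegments; segno++)
    {
        int n = snprintf(path, sizeof(path), "%s/%s.%d", dir, base, segno);

        if (n < 0 || (size_t) n >= sizeof(path))
        {
            errno = ENAMETOOLONG;
            return -1;
        }
        if (unlink(path) != 0 && errno != ENOENT)
            return -1;
    }

    /* One fsync covers every directory entry removed above. */
    return fsync_dir(dir);
}
```

The point of doing the fsync once per directory, rather than once per
unlinked file, is that a checkpoint could batch it: whatever was
unlinked in that directory since the last checkpoint becomes durable
together.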