On Tue, May 14, 2019 at 2:19 AM Andres Freund <and...@anarazel.de> wrote: > How would this protect against the issues I mentioned where recovery > starts from an earlier checkpoint and the basebackup could pick up a > random set of version of the different forks? > > I don't like the idea of expanding the use of delayChkpt further for > common operations, if anything we ought to try to reduce it. But I also > don't see how it'd actually fix the issues, so perhaps that's moot.
Based on the discussion so far, I think there are a number of related problems here: 1. A base backup might copy the main fork but not the init fork. I think that's a problem that nobody's thought about before Andres raised it on this thread. I may be wrong. Anyway, if that happens, the start-of-recovery code isn't going to do the right thing, because it won't know that it's dealing with an unlogged relation. 2. Suppose the system reaches the end of heapam_relation_set_new_filenode and then the system crashes. Because of the smgrimmedsync(), and only because of the smgrimmedsync(), the init fork would exist at the start of recovery. However, unlike the heap, some of the index AMs don't actually have a call to smgrimmedsync() in their *buildempty() functions. If I'm not mistaken, the ones that write pages through shared_buffers do not do smgrimmedsync, and the ones that use private buffers do perform smgrimmedsync. Therefore, the ones that write pages through shared_buffers are vulnerable to a crash right after the unlogged table is created. The init fork could fail to survive the crash, and therefore the start-of-recovery code would again fail to know that it's dealign with an unlogged relation. 3. So, why is it like that, anyway? Why do index AMs that write pages via shared_buffers not have smgrimmedsync? The answer is that we did it like that because we were worrying about a different problem - specifically, checkpointing. If I dirty a buffer and write a WAL record for the changes that I made, it is guaranteed that the dirty data will make it to disk before the next checkpoint is written; we've got all sorts of interlocking in BufferSync, SyncOneBuffer, etc. to make sure that happens. But if I create a file in the filesystem and write a WAL record for that change, none of that interlocking does anything. So unless I not only create that file but smgrimmedsync() it before the next checkpoint's redo location is fixed, a crash could make the file disappear. For example, consider RelationCreateStorage. What prevents the following sequence of events? S1: sgmropen() S1: smgrcreate() S1: log_smgrcreate() -- at this point the checkpointer fixes the redo pointer, writes every dirty buffer in the system, and completes a checkpoint S1: commits -- crash On restart, I think we'll could potentially be missing the file created by smgrcreate(), and we won't replay the WAL record generated by log_smgrcreate() either because it's before the checkpoint. I believe that it is one aspect of this third problem that we previously fixed on this thread, but I think we failed to understand how general the issue actually is. It affects _init forks of unlogged tables, but I think it also affects essentially every other use of smgrcreate(), whether it's got anything to do with unlogged tables or not. For an index AM, which has a non-empty initial representation, the consequences are pretty limited, because the transaction can't commit having created only an empty fork. It's got to put some data into either the main fork or the init fork, as the case may be. If it aborts, then we may leave some orphaned files behind, but if we lose one, we just have fewer orphaned files, so nobody cares. And if it commits, then either everything's before the checkpoint (and the file will have been fsync'd because we fsync'd the buffers), or everything's after the checkpoint (so we will replay the WAL), or only the smgrcreate() is before the checkpoint and the rest is after (in which case XLogReadBufferExtended will do the smgrcreate over again for us and we'll be fine). But since the heap uses an empty initial representation, it enjoys no similar protection. So, IIUC, the reason why we were talking about delayCkpt before is because it would allow us to remove the smgrimmedsync() of the unlogged fork while still avoiding problem #3. However it does nothing to protect against problem #1 or #2. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company