Hi, On 2019-05-14 12:50:09 +0900, Michael Paquier wrote: > On Mon, May 13, 2019 at 10:52:21AM -0700, Andres Freund wrote: > > I was wrong here - I thought we WAL logged the main fork creation even > > for unlogged tables. I think it's foolish that we don't, but we don't. > > Why? The main fork is not actually necessary, and the beginning of > recovery would make sure that any existing main fork gets removed and > that the main fork gets recreated by copying it from the init fork, > which is WAL-logged.
Several things: 1) A main fork that's created while a base-backup is in progress might only partially exist on a HS node. We won't necessarily remove it, because the init fork might *not* have been copied. The init fork will be re-created by WAL replay, but that'll be too late for the ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP) call. Which then will a) cause ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to fail, because the target file already exists, as mentioned elsewhere in this thread b) cause trouble when access the unlogged table - unfortunately there are several ways to access unlogged tables on a HS node - so the main fork better either not exist, or be in a valid empty state. Case in point: COPY blarg TO STDOUT; does not throw an error due to unlogged-ness. If we instead properly WAL logged the MAIN fork creation b) wouldn't be a problem. And we could have the UNLOGGED_RELATION_INIT case re-use the hashtable built at UNLOGGED_RELATION_CLEANUP time, and avoid a second scan of the filesystem. As all relations that were created after the LSN crash recovery started at would be valid, we would not need to do a full second scan: All unlogged relations created after that LSN are guaranteed to have a consistent main fork. 2) With a small bit of additional work, the sketch to fix 1) would allow us to allow read access to unlogged relations during HS. We'd just have do the UNLOGGED_RELATION_INIT work at the start of recovery. 3) I'm somewhat certain that there are relfilenode reuse issues with the current approach, again during base backups (as there the checkpoint logic to prevent relfilenode reuse doesn't work). I think you can easily end up with main / vm / fsm / init forks that actually don't belong to the same relation for a base backup. If you don't force delete all other forks at the time the init fork record is replayed, there's no defense against that. There's probably more, but I gotta cook. > So we definitely have to log the init fork, but logging the main fork > is just useless data in the WAL record. Or you have something else in > mind? Well, as I argued somewhere else in this thread, I think this all should just be one WAL record. Something very roughly like: struct xl_smgr_create { RelFileNode rnode; ForkNumber forkNum; bool remove_all_other_forks; bool copy_init_to_main_fork; size_t data_length; char fork_contents[]; }; Presumably with a flags argument instead of separate booleans. Greetings, Andres Freund