Re: [HACKERS] Unlogged tables cleanup

Andres Freund Mon, 13 May 2019 21:28:44 -0700

Hi,

On 2019-05-14 12:50:09 +0900, Michael Paquier wrote:
> On Mon, May 13, 2019 at 10:52:21AM -0700, Andres Freund wrote:
> > I was wrong here - I thought we WAL logged the main fork creation even
> > for unlogged tables. I think it's foolish that we don't, but we don't.
>
> Why?  The main fork is not actually necessary, and the beginning of
> recovery would make sure that any existing main fork gets removed and
> that the main fork gets recreated by copying it from the init fork,
> which is WAL-logged.


Several things:

1) A main fork that's created while a base-backup is in progress might
   only partially exist on a HS node. We won't necessarily remove it,
   because the init fork might *not* have been copied. The init fork
   will be re-created by WAL replay, but that'll be too late for the
   ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP) call.

   Which then will

   a) cause ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to fail, because the
      target file already exists, as mentioned elsewhere in this thread

   b) cause trouble when access the unlogged table - unfortunately there
      are several ways to access unlogged tables on a HS node - so the
      main fork better either not exist, or be in a valid empty
      state. Case in point:
      COPY blarg TO STDOUT;
      does not throw an error due to unlogged-ness.

   If we instead properly WAL logged the MAIN fork creation b) wouldn't
   be a problem.

   And we could have the UNLOGGED_RELATION_INIT case re-use the
   hashtable built at UNLOGGED_RELATION_CLEANUP time, and avoid a second
   scan of the filesystem. As all relations that were created after the
   LSN crash recovery started at would be valid, we would not need to do
   a full second scan: All unlogged relations created after that LSN are
   guaranteed to have a consistent main fork.

2) With a small bit of additional work, the sketch to fix 1) would allow
   us to allow read access to unlogged relations during HS. We'd just
   have do the UNLOGGED_RELATION_INIT work at the start of recovery.

3) I'm somewhat certain that there are relfilenode reuse issues with the
   current approach, again during base backups (as there the checkpoint
   logic to prevent relfilenode reuse doesn't work). I think you can
   easily end up with main / vm / fsm / init forks that actually don't
   belong to the same relation for a base backup. If you don't force
   delete all other forks at the time the init fork record is replayed,
   there's no defense against that.

There's probably more, but I gotta cook.


> So we definitely have to log the init fork, but logging the main fork
> is just useless data in the WAL record.  Or you have something else in
> mind?

Well, as I argued somewhere else in this thread, I think this all should
just be one WAL record.

Something very roughly like:

struct xl_smgr_create
{
        RelFileNode     rnode;
        ForkNumber      forkNum;
        bool            remove_all_other_forks;
        bool            copy_init_to_main_fork;
        size_t          data_length;
        char            fork_contents[];
};

Presumably with a flags argument instead of separate booleans.


Greetings,

Andres Freund

Re: [HACKERS] Unlogged tables cleanup

Reply via email to