Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Anastasia Lubennikova Wed, 11 Sep 2019 07:27:49 -0700

10.09.2019 14:42, Asim R P wrote:

Hi Anastasia
On Thu, Aug 22, 2019 at 9:43 PM Anastasia Lubennikova<a.lubennik...@postgrespro.ru <mailto:a.lubennik...@postgrespro.ru>>wrote:
>
> But during the review, I found a bug in the current implementation.
> New behavior must apply to crash-recovery only, now it applies toarchiveRecovery too.> That can cause a silent loss of a tablespace during regular standbyoperation
> since it never calls CheckRecoveryConsistency().
>
> Steps to reproduce:
> 1) run master and replica
> 2) create dir for tablespace:
> mkdir  /tmp/tblspc1
>
> 3) create tablespace and database on the master:
> create tablespace tblspc1 location '/tmp/tblspc1';
> create database db1 tablespace tblspc1 ;
>
> 4) wait for replica to receive this changes and pause replication:
> select pg_wal_replay_pause();
>
> 5) move replica's tablespace symlink to some empty directory, i.e./tmp/tblspc2
> mkdir  /tmp/tblspc2
> ln -sfn /tmp/tblspc2 postgresql_data_replica/pg_tblspc/16384
>
By changing the tablespace symlink target, we are silently nullifyingeffects of a committed transaction from the standby data directory -the directory structure created by the standby for create tablespacetransaction. This step, therefore, does not look like a valid testcase to me. Can you share a sequence of steps that does not involvechanging data directory manually?

Hi, the whole idea of the test is to reproduce a data loss. For example,if the disk containing this tablespace failed.Probably, simply deleting the directory'postgresql_data_replica/pg_tblspc/16384'would work as well, though I was afraid that it can be caught by someearlier checks and my example won't be so illustrative.

Thank you for the review feedback. I agree with all the points. Letme incorporate them (I plan to pick this work up and drive it tocompletion as Paul got busy with other things).
But before that I'm revisiting another solution upthread, that ofcreating restart points when replaying create/drop database commandsbefore making filesystem changes such as removing a directory. Therestart points should align with checkpoints on master. The concernagainst this solution was creation of restart points will slow downrecovery. I don't think crash recovery is affected by this solutionbecause of the already existing enforcement of checkpoints. WALrecords prior to a create/drop database will not be seen by crashrecovery due to the checkpoint enforced during the command's normalexecution.

I haven't measured the impact of generating extra restart points inprevious solution, so I cannot tell whether concerns upthread arejustified. Still, I enjoy latest design more, since it is clear andsimilar with the code of checking unexpected uninitialized pages. Inprinciple it works. And the issue, I described in previous review can beeasily fixed by several additional checks of InHotStandby macro.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: standby recovery fails (tablespace related) (tentative patch and discussion)

Reply via email to