On 9/23/11 12:05 PM, "Heikki Linnakangas" <heikki.linnakan...@enterprisedb.com> wrote:
>>> It looks to me that pg_clog/0001 exists, but is shorter than recovery
>>> expects. Which shouldn't happen, of course, because the start-backup
>>> checkpoint should flush all the clog that's needed by recovery to disk
>>> before the backup procedure begins to copy them.
>>
>> Yeah. What confuses me though is that we fail while trying to read from
>> the clog. When do we do that during normal (non-standby) recovery?
>
> I believe the OP is setting up a standby. He didn't say if it's a hot
> standby, though. That changes the startup sequence a bit.

I confirm that it is a Hot Standby with Streaming Replication.

>> One other possibility would be that the problem is somehow triggered by
>> vacuum running while the start-backup checkpoint is commencing. That'd
>> explain why the problem goes away with immediate checkpoints, because
>> those make the window much smaller. But I haven't got a concrete theory
>> of what's happening...
>
> Hmm, that's a good theory. The logic in SlruPhysicalReadPage()
> tolerates non-existent files, because the old clog files might've been
> truncated away by a vacuum after the backup started. That's OK, because
> they will be recreated, and later deleted again, during the WAL replay.
> But what if something like this happens:
>
> 0. pg_start_backup().
> 1. rsync starts. It gets a list of all files. It notes that pg_clog/0001
> exists.
> 2. vacuum runs, and deletes pg_clog/0001.
> 3. rsync starts to copy pg_clog/0001. Oops, it's gone. Instead of
> leaving it out altogether, it includes it as an empty file.

I cannot confirm that. I have tried this scenario one more time and
observed how pg_clog/ looked on the master and the standby. Here's an
overview:

1. pg_start_backup('base_backup') - waiting...

2. Checking the size of pg_clog periodically on the master:

-bash-3.2$ ls -l pg_clog
total 8
-rw------- 1 postgres postgres 8192 Sep 23 14:31 0000
-bash-3.2$ ls -l pg_clog
total 8
-rw------- 1 postgres postgres 8192 Sep 23 14:31 0000

3. Somewhere just before step (1) returns and we get into the rsync
phase, the pg_clog size changes:

-bash-3.2$ ls -l pg_clog
total 16
-rw------- 1 postgres postgres 16384 Sep 23 14:33 0000
-bash-3.2$ ls -l pg_clog
total 16
-rw------- 1 postgres postgres 16384 Sep 23 14:33 0000
-bash-3.2$ ls -l pg_clog
total 16
-rw------- 1 postgres postgres 16384 Sep 23 14:33 0000

4. The rsync does transfer the file:

...
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,881 INFO management.script.ScriptPlugin base/16394/16405_fsm
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,881 INFO management.script.ScriptPlugin base/16394/pg_internal.init
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,886 INFO management.script.ScriptPlugin global/
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,886 INFO management.script.ScriptPlugin global/12691
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,886 INFO management.script.ScriptPlugin global/12696
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,886 INFO management.script.ScriptPlugin global/12697
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,886 INFO management.script.ScriptPlugin global/pg_internal.init
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,887 INFO management.script.ScriptPlugin pg_clog/0000
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,887 INFO management.script.ScriptPlugin pg_multixact/offsets/0000
INFO | jvm 1 | 2011/09/23 14:33:43 | 2011-09-23 14:33:43,887 INFO management.script.ScriptPlugin pg_notify/
...
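For context, the management script in that log is just driving rsync
between the two backup calls. A rough plain-shell equivalent of the
procedure in steps 1-4 would be the following (the host name, data
directory, and rsync flags are illustrative, not our exact script):

# Sketch of the base backup procedure; paths and host are assumptions.
psql -U postgres -c "SELECT pg_start_backup('base_backup');"  # step 1: waits for checkpoint
rsync -av --exclude=pg_xlog \
    /var/lib/pgsql/data/ standby:/var/lib/pgsql/data/         # steps 2-4: the race window
psql -U postgres -c "SELECT pg_stop_backup();"

Anything that changes pg_clog between the rsync of that directory and
pg_stop_backup() falls into the window in question.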
But on the standby its size is the old one (thus, it seems, the size
changed after the rsync transfer and before pg_stop_backup() was called):

ls -l pg_clog/
total 8
-rw------- 1 postgres postgres 8192 Sep 23 14:31 0000

5. Hence, the server doesn't start and pg_log complains:

2011-09-23 14:33:45 CEST LOG: streaming replication successfully connected to primary
2011-09-23 14:33:46 CEST FATAL: the database system is starting up
2011-09-23 14:33:46 CEST FATAL: the database system is starting up
2011-09-23 14:33:46 CEST FATAL: the database system is starting up
2011-09-23 14:33:46 CEST FATAL: the database system is starting up
2011-09-23 14:33:46 CEST FATAL: the database system is starting up
2011-09-23 14:33:46 CEST FATAL: the database system is starting up
2011-09-23 14:33:46 CEST FATAL: the database system is starting up
2011-09-23 14:33:46 CEST FATAL: could not access status of transaction 37206
2011-09-23 14:33:46 CEST DETAIL: Could not read from file "pg_clog/0000" at offset 8192: Success.

6. Now, if I do something that, of course, should never be done, and copy
this 0000 file from the master to the standby soon enough (without even
starting/stopping the backup), the standby starts up successfully.

--
Hope this helps,
Linas Virbalas
http://flyingclusters.blogspot.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers