Re: [HACKERS] inconsistent state after crash recovery

Satoshi Nagayasu Sat, 03 Aug 2013 03:25:57 -0700

(2013/08/02 21:19), Robert Haas wrote:

On Fri, Aug 2, 2013 at 8:17 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

Robert Haas <robertmh...@gmail.com> writes:

On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

would you expect crash recovery to notice the disappearance of a file
that was touched nowhere in the replayed actions?

Eh, maybe not.  But should we try harder to detect the unexpected
disappearance of one that is?


We do, don't we?  The replay stuff should complain unless it sees a drop
or truncate covering any unaccounted-for pages.


Hmm.  Yeah.  But the OP seems to think it doesn't work.


Yes. I'm afraid that.

My attached script shows that crash recovery re-creates the lost
table file implicitly, and fills some of those blocks (maybe
lower ones) with zero without any notice. We can easily observe
it by using pg_filedump.

Thus, the table file can lose records, but DBA cannot recognize
it because no message is left in the server log.

I agree that this is not a PostgreSQL bug.

However, DBA still needs to detect this table corruption,
brought by several components which PostgreSQL relys on, to
consider restoring from database backup.

If PostgreSQL can detect and tell something about that, it
would be really helpful for DBA to make some critical decision.
I think PostgreSQL will be able to do that.

Regards,
--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

#!/bin/sh

PGHOME=/tmp/pgsql
PGDATA=/tmp/pgsql/data
PATH=${PGHOME}/bin:${PATH}
export PGHOME PGDATA PATH

createdb testdb

psql -e testdb<<EOF
drop table if exists t1;

create table t1 ( c1 varchar(20) );
checkpoint;

select count(*) from t1;

select relname,relfilenode from pg_class where relname='t1';
select oid,* from pg_database where datname=current_database();

insert into t1 values (trim(to_char(generate_series(1,1000), 
'00000000000000000000')) );

-- 1000 records
select count(*) from t1;
select pg_relation_size('t1');

-- write lower blocks in the table file.
checkpoint;

-- write backup blocks and new records in WAL
insert into t1 values (trim(to_char(generate_series(1001,2000), 
'00000000000000000000')) );

-- 2000 records
select count(*) from t1;
select pg_relation_size('t1');
EOF

tablefile=`psql -A -t testdb<<EOF
select
  ( select setting from pg_settings where name='data_directory' ) || '/'  ||
  ( select pg_relation_filepath(oid) from pg_class where relname='t1' )
;
EOF
`

echo $tablefile

# force a process crash, then recovery after that.
killall -9 postgres

# remove the table file.
rm -rf $tablefile

rm -rf ${PGDATA}/*.pid
rm -rf ${PGDATA}/pg_log/*

pg_ctl -w -D ${PGDATA} start

cat ${PGDATA}/pg_log/*

psql -e testdb<<EOF
-- only a backup block and following wal records can be
-- recovered here. so the table is going to be inconsistent.

-- 1000+ records (inconsistent)
select count(*) from t1;
select pg_relation_size('t1');
EOF

#!/bin/sh

PGHOME=/tmp/pgsql
PGDATA=/tmp/pgsql/data
PATH=${PGHOME}/bin:${PATH}
export PGHOME PGDATA PATH

createdb testdb

psql -e testdb<<EOF
drop table if exists t1;

create table t1 ( c1 varchar(20) );
checkpoint;

select count(*) from t1;

select relname,relfilenode from pg_class where relname='t1';
select oid,* from pg_database where datname=current_database();

insert into t1 values (trim(to_char(generate_series(1,1000), 
'00000000000000000000')) );

select count(*) from t1;
select pg_relation_size('t1');

-- write lower blocks in the table file.
checkpoint;

-- write backup blocks and new records in WAL
insert into t1 values (trim(to_char(generate_series(1001,2000), 
'00000000000000000000')) );

select count(*) from t1;
select pg_relation_size('t1');
EOF

tablefile=`psql -A -t testdb<<EOF
select
  ( select setting from pg_settings where name='data_directory' ) || '/'  || 
  ( select pg_relation_filepath(oid) from pg_class where relname='t1' )
;
EOF
`

echo $tablefile

# force a process crash, then recovery after that.
killall -9 postgres

ls -l $tablefile
pg_filedump -y -S 8192 $tablefile > /tmp/dump_before_drop

# remove the table file.
rm -rf $tablefile

rm -rf ${PGDATA}/*.pid
rm -rf ${PGDATA}/pg_log/*

pg_ctl -w -D ${PGDATA} start -o "-p 5433"

cat ${PGDATA}/pg_log/*

psql -e testdb<<EOF
-- if execute a checkpoint here, the table would be recovered
-- with inconsistent state after crash recovery,
-- becaues only the upper blocks can be recovered.
select count(*) from t1;
select pg_relation_size('t1');
EOF

tablefile=`psql -A -t testdb<<EOF
select
  ( select setting from pg_settings where name='data_directory' ) || '/'  || 
  ( select pg_relation_filepath(oid) from pg_class where relname='t1' )
;
EOF
`
ls -l $tablefile
pg_filedump -y -S 8192 $tablefile > /tmp/dump_after_recovery

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] inconsistent state after crash recovery

Reply via email to