On Sun, May 18, 2008 at 12:17 AM, Vladimir Kosilov <[EMAIL PROTECTED]> wrote:
> I've been having a problem with a pgsql-8.2.5 master/slave warm standby
> replication setup where the master node occasionally generates a WAL file
> smaller than the expected 16MB. pg_standby on the slave gets stuck on such
> short files, and replication halts from that moment on. We have to do
> pg_start_backup / rsync of data to slave / pg_stop_backup and restart the
> slave in order to recover. The database I'm replicating is write-mostly.
>
> This is the process on the slave that is waiting on the log file which is
> smaller in size:
> postgres 14277 11857 0 03:34 ? 00:00:00 sh -c /usr/local/pgsql/bin/pg_standby -d -k 255 -r 7 -s 10 -w 0 -t /usr/local/recovery_control/quit_recovery /usr/local/wal_archive 000000010000034000000020 pg_xlog/RECOVERYXLOG 2>> /usr/local/pgsql/data/standby.log
> postgres 14278 14277 0 03:34 ? 00:00:00 /usr/local/pgsql/bin/pg_standby -d -k 255 -r 7 -s 10 -w 0 -t /usr/local/recovery_control/quit_recovery /usr/local/wal_archive 000000010000034000000020 pg_xlog/RECOVERYXLOG
>
> Here is a sample listing of the wal_archive directory on the slave; note
> that the size of the expected WAL file 000000010000034000000020 is less
> than 16MB:
>
> ...
> -rw------- 1 postgres postgres 16777216 May 17 03:19 000000010000034000000017
> -rw------- 1 postgres postgres 16777216 May 17 03:21 000000010000034000000018
> -rw------- 1 postgres postgres 16777216 May 17 03:22 000000010000034000000019
> -rw------- 1 postgres postgres 16777216 May 17 03:24 00000001000003400000001A
> -rw------- 1 postgres postgres 16777216 May 17 03:26 00000001000003400000001B
> -rw------- 1 postgres postgres 16777216 May 17 03:27 00000001000003400000001C
> -rw------- 1 postgres postgres 16777216 May 17 03:29 00000001000003400000001D
> -rw------- 1 postgres postgres 16777216 May 17 03:30 00000001000003400000001E
> -rw------- 1 postgres postgres 16777216 May 17 03:32 00000001000003400000001F
>
> -rw------- 1 postgres postgres 13746176 May 17 03:34 000000010000034000000020
>
> -rw------- 1 postgres postgres 16777216 May 17 03:35 000000010000034000000021
> -rw------- 1 postgres postgres 16777216 May 17 03:37 000000010000034000000022
> -rw------- 1 postgres postgres 16777216 May 17 03:38 000000010000034000000023
> -rw------- 1 postgres postgres 16777216 May 17 03:40 000000010000034000000024
> -rw------- 1 postgres postgres 16777216 May 17 03:41 000000010000034000000025
> -rw------- 1 postgres postgres 16777216 May 17 03:43 000000010000034000000026
> -rw------- 1 postgres postgres 16777216 May 17 03:45 000000010000034000000027
> ...
>
> Skipping a bit further, I see there is at least one other instance where a
> WAL file is shorter than the normal 16MB:
>
> -rw------- 1 postgres postgres 16777216 May 17 05:42 00000001000003400000006F
> -rw------- 1 postgres postgres 16777216 May 17 05:44 000000010000034000000070
> -rw------- 1 postgres postgres 16777216 May 17 05:46 000000010000034000000071
> -rw------- 1 postgres postgres 16777216 May 17 05:47 000000010000034000000072
>
> -rw------- 1 postgres postgres 16384 May 17 05:50 000000010000034000000073
>
> -rw------- 1 postgres postgres 16777216 May 17 05:51 000000010000034000000074
> -rw------- 1 postgres postgres 16777216 May 17 05:52 000000010000034000000075
> -rw------- 1 postgres postgres 16777216 May 17 05:54 000000010000034000000076
>
> Why would a master node create a WAL file smaller than the normal
> checkpoint_segment size, and how can this be avoided? I need a reliable
> replication mechanism, even at the cost of longer recovery on the standby.
>
> master's postgresql.conf:
> ..
> fsync = on
> wal_sync_method = open_sync
> wal_buffers = 128
> checkpoint_segments = 64
> archive_command = 'test ! -f /usr/local/wal_archive_local/%f && cp %p /usr/local/wal_archive_local/%f'
>
> Archive files are then moved from the master to the standby every other
> minute:
>
> rsync -aq --remove-sent-files /usr/local/wal_archive_local/ slave::wal_archive/
>
> slave's recovery.conf:
> restore_command = '/usr/local/pgsql/bin/pg_standby -d -k 255 -r 7 -s 10 -w 0 -t /usr/local/recovery_control/quit_recovery /usr/local/wal_archive %f %p 2>> /usr/local/pgsql/data/standby.log'
>
> Both servers are identical Dell PE1950 servers with 4 SAS HDs w/hardware
> RAID 1+0, running:
> 2.6.18-8.1.8.el5 #1 SMP Tue Jul 10 06:39:17 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
>
> Thanks much in advance
> V.

Any reason why you are not using the rsync command in the archive_command
in the first place?
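
One possible reading of that question: with a plain cp in archive_command
and a separate rsync run from cron, rsync may pick up a segment while cp is
still writing it, and --remove-sent-files then deletes the source. Whether
that is what happened here is not confirmed. The sketch below illustrates
one way to rule it out, assuming a staging directory on the same filesystem
as the archive directory. The function names, the staging directory, and
the argument order are hypothetical and are not part of PostgreSQL or
pg_standby.

```shell
#!/bin/sh
# Hypothetical helpers (not part of PostgreSQL): a sketch only.

# archive_segment <source-path> <staging-dir> <archive-dir> <file-name>
# Copies a WAL segment so that its final name only appears in the archive
# directory once the copy is complete. Assumes the staging and archive
# directories are on the same filesystem, so that mv is a rename.
archive_segment() {
    src=$1
    stage=$2
    dir=$3
    name=$4

    # Refuse to overwrite an already archived segment, like the
    # 'test ! -f' guard in the original archive_command.
    [ ! -f "$dir/$name" ] || return 1

    # Copy into the staging directory first; the cron'd rsync only looks
    # at the archive directory and never sees this partial file.
    tmp="$stage/$name.tmp.$$"
    cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }

    # Publish the finished file under its final name.
    mv "$tmp" "$dir/$name"
}

# list_short_segments <archive-dir> [expected-size]
# Prints "name size" for every regular file whose size differs from the
# expected segment size (16777216 bytes in the listings above).
list_short_segments() {
    dir=$1
    expected=${2:-16777216}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Arithmetic expansion strips any padding wc may add.
        size=$(( $(wc -c < "$f") ))
        [ "$size" -eq "$expected" ] || printf '%s %s\n' "${f##*/}" "$size"
    done
}
```

If these functions lived in a wrapper script (again hypothetical),
archive_command would call it with %p and %f, and running
list_short_segments /usr/local/wal_archive on the slave would be expected
to flag files such as 000000010000034000000020 before pg_standby waits on
them.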

[EMAIL PROTECTED]
[EMAIL PROTECTED] gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

Mail sent from my BlackLaptop device