For posterity, it appears my issues were https://wiki.postgresql.org/wiki/Nov2013ReplicationIssue.
On Nov 4, 2013, at 3:48 PM, Ben Chobot wrote:

> Anybody? I've tried this again on another streaming replication server, and again had pg_toast errors until I re-basebackup'd it. Does it make sense to try disabling streaming replication for the backup and just use WAL replay from the archive? My slave will be a bit behind, but I'll have drained the clients from it anyway, so that's not a big deal for a temporary thing.
>
> On Nov 1, 2013, at 1:44 PM, Ben Chobot wrote:
>
>> I've got a bunch of independent database clusters, each with a couple of streaming replication slaves. I'm starting to upgrade them to 9.1.10, but on 3 of the 3 I've tried so far, this has somehow resulted in data corruption. I'm hoping it was the upgrade itself that caused the corruption, instead of 9.1.10 just exposing some latent corruption, for two reasons:
>>
>> 1. After doing a basebackup from the master, the same queries that started causing problems on the new 9.1.10 node no longer cause problems.
>> 2. The second 9.1.9 slave of each cluster doesn't have the issues that the newly-upgraded 9.1.10 node did have with the same query.
>>
>> Interestingly, this seems to be affecting only my replication slaves. We snapshot all our clusters at a filesystem level and restore those snapshots into standalone, non-replicating test environments, and those test environments were all upgraded to 9.1.10 without issue.
>>
>> All my servers are running a mostly-stock config of 9.1.9, which has been modified like so:
>>
>> max_connections = 1000 # we don't really go this high, fwiw
>> shared_buffers = 6000MB # 60GB machines
>> wal_level = hot_standby
>> hot_standby = on
>> effective_cache_size = 34278MB
>> log_min_duration_statement = 50
>> log_duration = on
>>
>> They store their data on XFS, with the options noatime,allocsize=1m.
>>
>> As for the errors: on two nodes, a primary key lookup returned multiple rows. Interestingly, this was on the same index for both nodes, which I would find unlikely if it were just random corruption or hardware-related, given that the systems have the same schema but independent data and usage. Using pg_reorg on that table on the master node fixed the issue on the node I tried it on, but exposed another problem:
>>
>> ERROR: unexpected chunk number 3 (expected 1) for toast value 67728890 in pg_toast_2619
>>
>> The pg_toast error occurred on another node too. Sadly, these are production systems and I don't know if the pg_toast table was the same for all three nodes, and it's quite possible that all three nodes would have exhibited the same set of problems if I'd let them run production queries long enough.
>>
>> Anyway, naturally this is making me nervous to do more upgrades, so I'm hoping I can figure out whether it's the upgrade that's killing me, or whether 9.1.10 is exposing underlying problems, or what. Unfortunately I have no idea what to look for.
>>
>> These systems are Ubuntu Precise, and I did recall this from memory:
>>
>> http://www.postgresql.org/message-id/21697.1343141...@sss.pgh.pa.us
>>
>> ... but it appears that the fix for that has already been applied to my postgresql-common package. Looking over the postgres logs before the upgrade, I don't see anything abnormal.
>>
>> Does this behavior sound familiar to anybody else out there, or does anybody have suggestions of what to look for or try?
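
A note on the question in the Nov 4 message about disabling streaming replication for the rebuild and replaying WAL from the archive instead: on 9.1 that choice lives entirely in the standby's recovery.conf. A minimal sketch, assuming the archive is reachable with a plain cp (the path below is a placeholder, not something from this thread):

    # recovery.conf on the standby -- archive replay only, no streaming
    standby_mode = 'on'
    # pull WAL segments from the archive; the path here is a placeholder
    restore_command = 'cp /path/to/wal_archive/%f %p'
    # with primary_conninfo left unset, the standby never opens a streaming
    # connection and applies WAL solely via restore_command

The slave then lags by however long it takes segments to reach the archive, which fits the "a bit behind is fine for a temporary thing" expectation above.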
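
For the "primary key lookup returns multiple rows" symptom, one way to tell whether the duplicates are real heap rows or an artifact of a corrupt index is to force a sequential scan and group on the key. A sketch, with a hypothetical table and key column standing in for the affected ones (they aren't named in the thread):

    -- keep the planner away from the suspect index so the heap is read directly
    SET enable_indexscan = off;
    SET enable_bitmapscan = off;

    -- some_table / id are placeholders for the table behind the affected index
    SELECT id, count(*)
    FROM some_table
    GROUP BY id
    HAVING count(*) > 1;

If this returns rows, the duplicate tuples really exist in the heap; if it returns nothing while the indexed lookup still shows duplicates, the index itself is the damaged part.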
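
On the pg_toast_2619 error: OID 2619 is pg_statistic, so the broken toast value belongs to a statistics row rather than to user data. The sketch below (run as a superuser, since pg_statistic isn't generally readable) forces detoasting of every pg_statistic row and reports the ones that fail; it only locates the damage, and the actual recovery steps are on the wiki page linked at the top.

    DO $$
    DECLARE
        r record;
    BEGIN
        FOR r IN SELECT ctid, starelid, staattnum FROM pg_statistic LOOP
            BEGIN
                -- casting the whole row to text detoasts every column
                PERFORM s::text FROM pg_statistic s WHERE ctid = r.ctid;
            EXCEPTION WHEN others THEN
                RAISE NOTICE 'damaged pg_statistic row: relid=%, attnum=%, ctid=%',
                    r.starelid, r.staattnum, r.ctid;
            END;
        END LOOP;
    END $$;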