Anybody? I've tried this again on another streaming replication slave, and again hit pg_toast errors until I re-ran pg_basebackup against it. Does it make sense to disable streaming replication for the backup and just use WAL replay from the archive? The slave will lag a bit, but I'll have drained the clients from it anyway, so that's not a big deal as a temporary measure.
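To be concrete, this is roughly the recovery.conf I have in mind on the standby for that temporary setup: archive-only replay, no streaming. The restore_command path is just a placeholder for wherever the WAL archive actually lives.

# recovery.conf on the standby -- archive-only replay sketch
# (restore_command path is a placeholder for your WAL archive)
standby_mode = 'on'
restore_command = 'cp /path/to/wal_archive/%f "%p"'
# no primary_conninfo, so the standby never opens a streaming connection
# and only replays WAL that restore_command fetches from the archive

Leaving primary_conninfo out entirely is what disables streaming; everything else stays as it would for a normal standby.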
On Nov 1, 2013, at 1:44 PM, Ben Chobot wrote:

> I've got a bunch of independent database clusters, each with a couple of
> streaming replication slaves. I'm starting to upgrade them to 9.1.10, but on
> 3 of the 3 I've tried so far, this has somehow resulted in data corruption.
> I'm hoping it was the upgrade itself that caused the corruption, rather than
> 9.1.10 merely exposing some latent corruption, for two reasons:
>
> 1. After doing a basebackup from the master, the same queries that started
> causing problems on the new 9.1.10 node no longer cause problems.
> 2. The second 9.1.9 slave of each cluster doesn't have the issues that the
> newly-upgraded 9.1.10 node had with the same query.
>
> Interestingly, this seems to be affecting only my replication slaves. We
> snapshot all our clusters at the filesystem level and restore those snapshots
> into standalone, non-replicating test environments, and those test
> environments were all upgraded to 9.1.10 without issue.
>
> All my servers are running a mostly-stock config of 9.1.9, which has been
> modified like so:
>
> max_connections = 1000           # we don't really go this high, fwiw
> shared_buffers = 6000MB          # 60GB machines
> wal_level = hot_standby
> hot_standby = on
> effective_cache_size = 34278MB
> log_min_duration_statement = 50
> log_duration = on
>
> They store their data on XFS, with the options noatime,allocsize=1m.
>
> As for the errors: on two nodes, a primary key lookup returned multiple
> rows. Interestingly, this was on the same index for both nodes, which I would
> find unlikely if it were just random or hardware-related corruption, given
> that the systems have the same schema but independent data and usage. Running
> pg_reorg on that table on the master node fixed the issue on the node I tried
> it on, but exposed another problem:
>
> ERROR: unexpected chunk number 3 (expected 1) for toast value 67728890 in
> pg_toast_2619
>
> The pg_toast error occurred on another node too. Sadly, these were all
> production systems and I didn't know if the pg_toast table was the same for
> all three nodes, and it's quite possible that all three nodes would have
> exhibited the same set of problems if I'd let them run production queries
> long enough.
>
> Anyway, naturally this is making me nervous about doing more upgrades, so I'm
> hoping I can figure out whether it's the upgrade that's killing me, or whether
> 9.1.10 is exposing underlying problems, or what. Unfortunately I have no idea
> what to look for.
>
> These systems are Ubuntu precise, and this issue did come to mind:
>
> http://www.postgresql.org/message-id/21697.1343141...@sss.pgh.pa.us
>
> ... but it appears that the fix for that has already been applied to my
> postgresql-common package. Looking over the postgres logs from before the
> upgrade, I don't see anything abnormal.
>
> Does this behavior sound familiar to anybody else out there, or does anybody
> have suggestions of what to look for or try?
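For anyone trying to narrow this down: pg_toast_2619 is the TOAST table for pg_statistic, so one way to find which row is damaged is to force detoasting row by row and log whatever blows up. A rough sketch (it assumes the four statistics slots that 9.1 has; adjust the column list if you probe a different table):

-- probe pg_statistic row by row, forcing detoast of the anyarray slots;
-- rows whose TOAST data is broken raise an error, and we log their ctid
DO $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT ctid FROM pg_statistic LOOP
        BEGIN
            PERFORM length(stavalues1::text), length(stavalues2::text),
                    length(stavalues3::text), length(stavalues4::text)
            FROM pg_statistic
            WHERE ctid = r.ctid;
        EXCEPTION WHEN others THEN
            RAISE NOTICE 'damaged row at ctid %: %', r.ctid, SQLERRM;
        END;
    END LOOP;
END;
$$;

Since pg_statistic contents are rebuilt by ANALYZE anyway, the usual workaround I've seen is to delete the rows it flags and re-ANALYZE the affected tables; that obviously doesn't explain how the corruption got there in the first place.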