I've got a bunch of independent database clusters, each with a couple of 
streaming replication slaves. I'm starting to upgrade them to 9.1.10, but on all 
3 of the clusters I've tried so far, this has somehow resulted in data corruption. I'm 
hoping it was the upgrade itself that caused the corruption, instead of 9.1.10 
just exposing some latent corruption, for two reasons:
1. after doing a basebackup from the master, the same queries that started 
causing problems on the new 9.1.10 node no longer cause problems
2. the second slave in each cluster, still on 9.1.9, doesn't exhibit the issues 
that the newly-upgraded 9.1.10 node did with the same queries.

Interestingly, this seems to only be affecting my replication slaves. We 
snapshot all our clusters at a filesystem level and restore those snapshots 
into standalone, non-replicating test environments, and those test environments 
were all upgraded to 9.1.10 without issue. 

All my servers are running a mostly-stock config of 9.1.9, which has been 
modified like so:

max_connections = 1000 # we don't really go this high, fwiw
shared_buffers = 6000MB # 60GB machines
wal_level = hot_standby
hot_standby = on
effective_cache_size = 34278MB
log_min_duration_statement = 50
log_duration = on
 
They store their data on XFS, with the options noatime,allocsize=1m.

As for the errors: on two nodes, a primary key lookup returned multiple rows. 
Interestingly, this was on the same index on both nodes, which I would find 
unlikely if it were just random or hardware-related corruption, given that the 
systems have the same schema but independent data and usage patterns. Running 
pg_reorg on that table on the master fixed the issue on the node I tried it 
on, but exposed another problem:

ERROR: unexpected chunk number 3 (expected 1) for toast value 67728890 in 
pg_toast_2619

The pg_toast error occurred on another node too. Sadly, these are production 
systems, so I didn't check whether the pg_toast table was the same on all three 
nodes, and it's quite possible that all three would have exhibited the same set 
of problems if I'd let them run production queries long enough.
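In case it helps anyone chasing the same thing: the toast table can be tied back 
to its owning relation via its OID. If I'm reading it right, the 2619 in 
pg_toast_2619 should correspond to pg_statistic, which would at least be 
rebuildable via ANALYZE rather than real user data, but please double-check 
against your own catalogs:

```sql
-- The numeric suffix of a pg_toast_NNNN table is the OID of the owning
-- relation, so this should name the table whose toast data is damaged:
SELECT 2619::regclass;

-- Or, going the other way, find which relation points at this toast table:
SELECT relname
FROM pg_class
WHERE reltoastrelid = 'pg_toast.pg_toast_2619'::regclass;
```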

Anyway, naturally this is making me nervous about doing more upgrades, so I'm 
hoping I can figure out whether it's the upgrade that's killing me, or whether 
9.1.10 is exposing underlying problems, or something else. Unfortunately I have 
no idea what to 
look for.

These systems are running Ubuntu precise, and I did recall this thread:

http://www.postgresql.org/message-id/21697.1343141...@sss.pgh.pa.us

... but it appears that the fix for that has already been applied to my 
postgresql-common package. Looking over the postgres logs before the upgrade, I 
don't see anything abnormal.

Does this behavior sound familiar to anybody else out there, or does anybody 
have suggestions of what to look for or try?
