Anybody? I've tried this again on another streaming replication slave, and again hit pg_toast errors until I re-ran pg_basebackup against it. Does it make sense to disable streaming replication for the backup and just use WAL replay from the archive? The slave will lag a bit, but I'll have drained the clients from it anyway, so that's not a big deal as a temporary measure.
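To be concrete, this is roughly the recovery.conf I have in mind on the standby for that temporary setup: archive-only replay, no streaming. The restore_command path is just a placeholder for wherever the WAL archive actually lives.

# recovery.conf on the standby -- archive-only replay sketch
# (restore_command path is a placeholder for your WAL archive)
standby_mode = 'on'
restore_command = 'cp /path/to/wal_archive/%f "%p"'
# no primary_conninfo, so the standby never opens a streaming connection
# and only replays WAL that restore_command fetches from the archive

Leaving primary_conninfo out entirely is what disables streaming; everything else stays as it would for a normal standby.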
On Nov 1, 2013, at 1:44 PM, Ben Chobot wrote:

> I've got a bunch of independent database clusters, each with a couple of
> streaming replication slaves. I'm starting to upgrade them to 9.1.10, but on
> 3 of the 3 I've tried so far, this has somehow resulted in data corruption.
> I'm hoping it was the upgrade itself that caused the corruption, rather than
> 9.1.10 merely exposing some latent corruption, for two reasons:
>
> 1. After doing a basebackup from the master, the same queries that started
> causing problems on the new 9.1.10 node no longer cause problems.
> 2. The second 9.1.9 slave of each cluster doesn't have the issues that the
> newly-upgraded 9.1.10 node had with the same query.
>
> Interestingly, this seems to be affecting only my replication slaves. We
> snapshot all our clusters at the filesystem level and restore those snapshots
> into standalone, non-replicating test environments, and those test
> environments were all upgraded to 9.1.10 without issue.
>
> All my servers are running a mostly-stock config of 9.1.9, which has been
> modified like so:
>
> max_connections = 1000           # we don't really go this high, fwiw
> shared_buffers = 6000MB          # 60GB machines
> wal_level = hot_standby
> hot_standby = on
> effective_cache_size = 34278MB
> log_min_duration_statement = 50
> log_duration = on
>
> They store their data on XFS, with the options noatime,allocsize=1m.
>
> As for the errors: on two nodes, a primary key lookup returned multiple
> rows. Interestingly, this was on the same index for both nodes, which I would
> find unlikely if it were just random or hardware-related corruption, given
> that the systems have the same schema but independent data and usage. Running
> pg_reorg on that table on the master node fixed the issue on the node I tried
> it on, but exposed another problem:
>
> ERROR: unexpected chunk number 3 (expected 1) for toast value 67728890 in
> pg_toast_2619
>
> The pg_toast error occurred on another node too. Sadly, these were all
> production systems and I didn't know if the pg_toast table was the same for
> all three nodes, and it's quite possible that all three nodes would have
> exhibited the same set of problems if I'd let them run production queries
> long enough.
>
> Anyway, naturally this is making me nervous about doing more upgrades, so I'm
> hoping I can figure out whether it's the upgrade that's killing me, or whether
> 9.1.10 is exposing underlying problems, or what. Unfortunately I have no idea
> what to look for.
>
> These systems are Ubuntu precise, and this issue did come to mind:
>
> http://www.postgresql.org/message-id/21697.1343141...@sss.pgh.pa.us
>
> ... but it appears that the fix for that has already been applied to my
> postgresql-common package. Looking over the postgres logs from before the
> upgrade, I don't see anything abnormal.
>
> Does this behavior sound familiar to anybody else out there, or does anybody
> have suggestions of what to look for or try?
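For anyone trying to narrow this down: pg_toast_2619 is the TOAST table for pg_statistic, so one way to find which row is damaged is to force detoasting row by row and log whatever blows up. A rough sketch (it assumes the four statistics slots that 9.1 has; adjust the column list if you probe a different table):

-- probe pg_statistic row by row, forcing detoast of the anyarray slots;
-- rows whose TOAST data is broken raise an error, and we log their ctid
DO $$
DECLARE
    r record;
BEGIN
    FOR r IN SELECT ctid FROM pg_statistic LOOP
        BEGIN
            PERFORM length(stavalues1::text), length(stavalues2::text),
                    length(stavalues3::text), length(stavalues4::text)
            FROM pg_statistic
            WHERE ctid = r.ctid;
        EXCEPTION WHEN others THEN
            RAISE NOTICE 'damaged row at ctid %: %', r.ctid, SQLERRM;
        END;
    END LOOP;
END;
$$;

Since pg_statistic contents are rebuilt by ANALYZE anyway, the usual workaround I've seen is to delete the rows it flags and re-ANALYZE the affected tables; that obviously doesn't explain how the corruption got there in the first place.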