For posterity, it appears my issues were the ones described at 
https://wiki.postgresql.org/wiki/Nov2013ReplicationIssue.

On Nov 4, 2013, at 3:48 PM, Ben Chobot wrote:

> Anybody? I've tried this again on another streaming replication server, and 
> again had pg_toast errors until I re-basebackup'd it. Does it make sense to 
> try disabling streaming replication for the backup and just using WAL replay 
> from the archive? My slave will be a bit behind, but I'll have drained the 
> clients from it anyway, so that's not a big deal for a temporary thing.
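> 
> Concretely, what I have in mind is a recovery.conf on the standby with the 
> streaming settings removed, something like this (untested sketch; the archive 
> path is just a placeholder for wherever our WAL archive actually lives):
> 
> standby_mode = 'on'
> restore_command = 'cp /path/to/wal_archive/%f %p'
> # no primary_conninfo, so this standby only replays archived WAL
> 
> Once it's caught up and checked out, I'd put primary_conninfo back and 
> restart to resume streaming.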
> 
> On Nov 1, 2013, at 1:44 PM, Ben Chobot wrote:
> 
>> I've got a bunch of independent database clusters, each with a couple of 
>> streaming replication slaves. I'm starting to upgrade them to 9.1.10, but on 
>> all 3 of the clusters I've tried so far, this has somehow resulted in data 
>> corruption. 
>> I'm hoping it was the upgrade itself that caused the corruption, instead of 
>> 9.1.10 just exposing some latent corruption, for two reasons:
>> 1. after doing a base backup from the master (see the command sketch after 
>> this list), the same queries that started causing problems on the new 9.1.10 
>> node no longer cause problems
>> 2. the second 9.1.9 slave of each cluster doesn't have the issues that the 
>> newly-upgraded 9.1.10 node had with the same query. 
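>> 
>> (By "base backup" I mean re-seeding the standby's data directory from the 
>> master, roughly along these lines; hostname, user, and path are placeholders, 
>> and in 9.1 the -x flag pulls in the WAL needed to make the copy consistent:
>> 
>> pg_basebackup -h master.example.com -U replication -D /var/lib/postgresql/9.1/main -x -P -v
>> 
>> with the standby shut down and its old data directory cleared out first.)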
>> 
>> Interestingly, this seems to only be affecting my replication slaves. We 
>> snapshot all our clusters at a filesystem level and restore those snapshots 
>> into standalone, non-replicating test environments, and those test 
>> environments were all upgraded to 9.1.10 without issue. 
>> 
>> All my servers are running a mostly-stock config of 9.1.9, which has been 
>> modified like so:
>> 
>> max_connections = 1000 # we don't really go this high, fwiw
>> shared_buffers = 6000MB # 60GB machines
>> wal_level = hot_standby
>> hot_standby = on
>> effective_cache_size = 34278MB
>> log_min_duration_statement = 50
>> log_duration = on
>>  
>> They store their data on XFS, with the options noatime,allocsize=1m.
>> 
>> As for the errors, on two nodes, a primary key lookup resulted in multiple 
>> rows. Interestingly, this was on the same index for both nodes, which I 
>> would find unlikely if it were just random corruption or a hardware problem, 
>> given that the systems have the same schema but independent data and usage. 
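>> 
>> (A quick way to confirm that symptom is a duplicate check along these lines, 
>> with the table and column names as stand-ins for the real ones:
>> 
>> SELECT id, count(*) FROM the_affected_table GROUP BY id HAVING count(*) > 1;
>> 
>> Any row it returns is a value that appears more than once despite the 
>> primary key.)
>> 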
>> Using pg_reorg on that table on the master node fixed the issue on the node 
>> I tried it on, but exposed another problem:
>> 
>> ERROR: unexpected chunk number 3 (expected 1) for toast value 67728890 in 
>> pg_toast_2619
>> 
>> The pg_toast error occurred on another node too. Sadly, these are 
>> production systems and I didn't know if the pg_toast table was the same for 
>> all three nodes, and it's quite possible that all three nodes would have 
>> exhibited the same set of problems if I'd let them run production queries 
>> long enough.
>> 
>> Anyway, naturally this is making me nervous about doing more upgrades, so I'm 
>> hoping I can figure out if it's the upgrade that's killing me, or if 9.1.10 
>> is exposing underlying problems, or what. Unfortunately I have no idea what 
>> to look for.
>> 
>> These systems are Ubuntu Precise, and this thread did come to mind:
>> 
>> http://www.postgresql.org/message-id/21697.1343141...@sss.pgh.pa.us
>> 
>> ... but it appears that the fix for that has already been applied to my 
>> postgresql-common package. Looking over the postgres logs before the 
>> upgrade, I don't see anything abnormal.
>> 
>> Does this behavior sound familiar to anybody else out there, or does anybody 
>> have suggestions of what to look for or try?
> 
