> So if this problem replicated to our standby servers does that indicate
> that the potential problematic fsync occurred during a pg_xlog write?

Pretty much. You have a couple of issues here, and no easy way to approach them. 
Primarily, you got data corruption during a sync operation. This means either 
the OS or the hardware somewhere along the line lied about the write, or the 
write was corrupted and the filesystem log replayed incorrectly upon reboot. 
Once that happens, you can't trust *any* data in your database. Pre-checksum 
PostgreSQL has no way to verify integrity of existing data, and system crashes 
can corrupt quite a bit of data that was only tangentially involved.
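
For what it's worth, if you end up rebuilding the cluster on 9.3 or later, you 
can have initdb enable page checksums so this class of silent corruption at 
least gets caught on read. A rough sketch, with a placeholder data directory:

  # Initialize the replacement cluster with data checksums enabled (9.3+).
  initdb --data-checksums -D /path/to/new/data

  # From then on, reading a mangled page fails loudly with something like
  # "WARNING: page verification failed" instead of silently propagating.

It costs a little CPU on every page read and write, but it turns "quietly 
wrong" into "immediately obvious", which is what you want here.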

What likely happens in these scenarios is that the database starts up 
successfully and then reads some rows in from a corrupted table. By corrupted, 
I mean even a single data page with a mangled pointer. That mangled pointer 
gives the database incorrect information about the state of that data page's 
contents, and the database continues on with that information. That means 
subsequent transaction logs from that point are *also* corrupt, and hence any 
streaming or warm standby replicas are subsequently damaged as well. But 
they'll be damaged differently, because they likely didn't have the initial 
corruption, just the byte changes dictated by the WAL stream.

Unless you know where the initial corruption came from, the system that caused 
it should be quarantined for verification. RAM, disk, CPU, everything should 
pass integrity checks before putting it back into production.
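
In case it's useful, here's roughly what I'd consider the minimum on the 
quarantined box. The tools below are just the usual suspects, so substitute 
whatever your shop standardizes on, and the device names are placeholders:

  # Long SMART self-test on each drive, then review the results.
  smartctl -t long /dev/sda
  smartctl -a /dev/sda

  # Read-only surface scan of the block device.
  badblocks -sv /dev/sda

  # Hammer the RAM; memtester works online, memtest86+ from boot is better.
  memtester 4G 3

If the RAID controller has a battery-backed cache, check its status as well. A 
dead battery combined with write-back caching is a classic way for hardware to 
lie about a completed fsync.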

> Would breaking replication at the time of the crash have prevented
> this from cascading or was it already too late at that point?

Most likely, yes. If, at the time of the crash, you had switched over to one 
of your replicas and made it the new master, it would have given you the 
opportunity to check out the crashed system before it spread the love. Even if 
you don't have a true STONITH model, starting up a potentially 
data-compromised node in an active cluster is a gamble.
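
For reference, the failover itself is cheap on a streaming setup of that 
vintage; the hard part is the discipline of fencing off the crashed node. 
Paths here are placeholders:

  # On the standby being promoted (9.1+):
  pg_ctl promote -D /path/to/data

  # Or, if recovery.conf uses a trigger file instead:
  touch /path/to/trigger_file

  # Then repoint the application or VIP at the new master, and leave
  # PostgreSQL shut down on the crashed node until it has been verified.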

I did something similar once. One of our DRBD nodes crashed and came back up 
and re-attached to the DRBD pair after a quick data discard and replay. I 
continued with some scheduled system maintenance, and performed a node failover 
with no incident. It wasn't until 20 minutes later that the corrupt disk pages 
started making their presence felt, and by then it was too late. Luckily we 
were still verifying, but with our secondaries ruined, we had to restore from 
backup. A 30-minute outage became a 4-hour one.

Afterwards, we put in a new policy that any crash means a DRBD verify at 
minimum, and until the node passes, it is to be considered invalid and 
unusable. If you haven't already, I suggest something similar for your setup. 
Verify a crashed node before using it again, no matter how much pressure you're 
under. It can always get worse.
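
For the curious, the DRBD side of that policy is just the built-in online 
verify. The resource name below is a placeholder, and it assumes a verify-alg 
is set in the resource's net section:

  # Kick off an online verify on one node.
  drbdadm verify r0

  # Watch progress and the out-of-sync (oos) counter.
  cat /proc/drbd

  # If anything came back out of sync, disconnect and reconnect
  # to resynchronize those blocks from the peer.
  drbdadm disconnect r0
  drbdadm connect r0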

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com

