Hi everyone -- Sorry to revisit a dead horse, but I wanted to clear up some misinformation --

On Dec 31, 2007 5:35 PM, Tom Lane <[EMAIL PROTECTED]> wrote:

> "Mason Hale" <[EMAIL PROTECTED]> writes:
> >> This could be the kernel's fault, but I'm wondering whether the
> >> RAID controller is going south.
>
> > To clarify a bit further -- on the production server, the data is written to
> > a 10-disk RAID 1+0, but the pg_xlog directory is symlinked to a separate,
> > dedicated SATA II disk.
>
> > There is a similar setup on the standby server, except that in addition to
> > the RAID for the data, and a separate SATA II disk for the pg_xlog, there is
> > another disk (also SATA II) dedicated for the archive of wal files copied
> > over from the production server.

It turns out that the separate SATA II disk was configured as a single-disk JBOD on the same controller as the 10-disk RAID 1+0.

Since we've seen corruption in both the data directory (on the RAID) and the pg_xlog directory (on the SATA II disk), the RAID controller is one of the few common elements between those two partitions and hence is highly suspect. That may dispel some of the mystery in our situation.

We will be replacing the RAID controller in short order. For what it's worth, it is an Adaptec 31605 with a battery backup module.

> Oh. Maybe it's one of those disks' fault then. Although WAL corruption
> would not lead to corruption of the primary DB as long as there were no
> crash/replay events. Maybe there is more than one issue here, or maybe
> it's the kernel's fault after all.

Given the new information that the RAID controller is managing all the disks in question (after all) -- if the RAID controller is going south, then there would be no need for a crash/replay event for that corruption to make it into the primary DB.

Seems to be pretty damning evidence against the RAID controller, agreed?

Mason