On Fri, Jan 16, 2015 at 5:20 PM, Peter Geoghegan <p...@heroku.com> wrote:
> On Fri, Jan 16, 2015 at 10:33 AM, Merlin Moncure <mmonc...@gmail.com> wrote:
>> ISTM the next step is to bisect the problem down over the weekend in
>> order to narrow the search.  If that doesn't turn up anything
>> productive I'll look into taking other steps.
>
> That might be the quickest way to do it, provided you can isolate the
> bug fairly reliably. It might be a bit tricky to write a shell script
> that assumes a certain amount of time having passed without the bug
> tripping indicates that it doesn't exist, and have that work
> consistently. I'm slightly concerned that you'll hit other bugs that
> have since been fixed, given the large number of possible symptoms
> here.
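
To make the false-negative concern concrete, here is a minimal sketch in
Python (not the script actually in use) of the kind of check that
`git bisect run` could call. The regex, the function names, and the
MIN_SOAK_SECONDS window are all assumptions made up for illustration;
the only fixed part is git's exit-code convention (0 = good, 125 = skip,
any other code from 1 to 127 = bad). The idea is that a clean log only
counts as 'good' if the test ran for the full window; a short clean run
is skipped rather than trusted.

```python
import re
import sys

# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125

# Hypothetical: how long a run must stay clean before we believe it.
MIN_SOAK_SECONDS = 3600

# Hypothetical pattern: any server log line tagged WARNING, ERROR or PANIC.
SUSPECT = re.compile(r"\b(WARNING|ERROR|PANIC):")


def count_suspect_lines(lines):
    """Count log lines that look like warnings or errors."""
    return sum(1 for line in lines if SUSPECT.search(line))


def verdict(suspect_count, ran_full_window):
    """Map a log count to a git-bisect exit code.

    Any suspect line marks the revision bad.  A clean log is only
    'good' if the test ran for the whole soak window; otherwise the
    revision is skipped, since absence of evidence from a short run
    is exactly the false negative that throws bisection off.
    """
    if suspect_count > 0:
        return BAD
    return GOOD if ran_full_window else SKIP


# Usage (hypothetical): python check_log.py <logfile> <elapsed_seconds>
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], errors="replace") as log:
        count = count_suspect_lines(log)
    elapsed = float(sys.argv[2])
    sys.exit(verdict(count, elapsed >= MIN_SOAK_SECONDS))
```

A wrapper that builds the revision, runs the workload, and then calls
this with the log path and elapsed time would be what actually gets
handed to `git bisect run`; that wrapper is not shown here.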

Quick update: not done yet, but I'm making consistent progress, with
several false starts (for example, I had a .conf problem with the new
dynamic shared memory setting and git merrily bisected down to the
introduction of the feature).  I have to triple check everything :(.
The problem is generally reproducible but I get false negatives that
throw off the bisection.  I estimate that early next week I'll have
it narrowed down significantly if not to the exact offending revision.

So far, the 'nasty' damage seems to generally if not always follow a
checksum failure, and in each failure the calculated and expected
checksums are numerically adjacent.  For example:

[cds2 12707 2015-01-22 12:51:11.032 CST 2754]WARNING: page verification failed, calculated checksum 9465 but expected 9477 at character 20
[cds2 21202 2015-01-22 13:10:18.172 CST 3196]WARNING: page verification failed, calculated checksum 61889 but expected 61903 at character 20
[cds2 29153 2015-01-22 14:49:04.831 CST 4803]WARNING: page verification failed, calculated checksum 27311 but expected 27316

I'm not up on the intricacies of our checksum algorithm but this is
making me suspicious that we are looking at an improperly flipped
visibility bit via some obscure problem -- almost certainly with
vacuum playing a role.  This fits the profile of catastrophic damage
that masquerades as numerous other problems.  Or, perhaps, something
is flipping what it thinks is a visibility bit but on the wrong page.

I still haven't categorically ruled out pl/sh yet; that's something to
keep in mind.

In the 'plus' category, aside from flushing out this issue, I've had
zero runtime problems so far aside from the main problem; bisection
(at least on the 'bad' side) has been reliably driven by simply
counting the number of warnings/errors/etc in the log.  That's really
impressive.

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers