On Mon, Apr 1, 2013 at 10:37 AM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Tue, Mar 26, 2013 at 4:23 PM, Jeff Davis <pg...@j-davis.com> wrote:
>
>> Patch attached. Only brief testing done, so I might have missed
>> something. I will look more closely later.
>
> After applying your patch, I could run the stress test described here:
>
> http://archives.postgresql.org/pgsql-hackers/2012-02/msg01227.php
>
> But altered to make use of initdb -k, of course.
>
> Over 10,000 cycles of crash and recovery, I encountered two cases of
> checksum failures after recovery, example:
> ...
>
> Unfortunately I already cleaned up the data directory before noticing the
> problem, so I have nothing to post for forensic analysis. I'll try to
> reproduce the problem.

I've reproduced the problem, this time in block 74 of relation
base/16384/4931589, and a tarball of the data directory is here:

https://docs.google.com/file/d/0Bzqrh1SO9FcELS1majlFcTZsR0k/edit?usp=sharing

(the table is in database jjanes under role jjanes, the binary is commit
9ad27c215362df436f8c)

What I would probably really want is the data as it existed after the
crash but before recovery started, but since the postmaster immediately
starts recovery after the crash, I don't know of a good way to capture
this.

I guess one thing to do would be to extract from the WAL the most recent
FPW for block 74 of relation base/16384/4931589 (assuming it hasn't been
recycled already) and see if it matches what is actually in that block of
that data file, but I don't currently know how to do that.

11500 SELECT 2013-04-01 12:01:56.926 PDT:WARNING: page verification failed, calculated checksum 54570 but expected 34212
11500 SELECT 2013-04-01 12:01:56.926 PDT:ERROR: invalid page in block 74 of relation base/16384/4931589
11500 SELECT 2013-04-01 12:01:56.926 PDT:STATEMENT: select sum(count) from foo

Cheers,

Jeff
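
For the data-file half of the comparison described above, a minimal sketch that reads one block from a relation file and decodes its page header. It does not touch the WAL, so extracting the FPW is still an open step. Assumptions, none of them confirmed in this thread: the default 8192-byte block size, a little-endian build, and a 24-byte page header with pd_checksum at byte offset 8 (the layout used with the checksums patch). The function names are hypothetical helpers, not part of PostgreSQL.

```python
import struct

BLCKSZ = 8192  # assumption: default PostgreSQL block size


def read_block(path, blkno, blcksz=BLCKSZ):
    """Return the raw bytes of block `blkno` from a relation segment file."""
    with open(path, "rb") as f:
        f.seek(blkno * blcksz)
        page = f.read(blcksz)
    if len(page) != blcksz:
        raise ValueError("short read: block %d not fully present" % blkno)
    return page


def parse_page_header(page):
    """Decode the first 24 bytes of a page (assumed little-endian layout)."""
    # pd_lsn is stored as two uint32 halves, followed by six uint16 fields
    # and a uint32 pd_prune_xid.
    (xlogid, xrecoff, checksum, flags, lower, upper,
     special, pagesize_version, prune_xid) = struct.unpack_from(
        "<IIHHHHHHI", page, 0)
    return {
        "pd_lsn": "%X/%X" % (xlogid, xrecoff),
        "pd_checksum": checksum,
        "pd_flags": flags,
        "pd_lower": lower,
        "pd_upper": upper,
        "pd_special": special,
        "pd_pagesize_version": pagesize_version,
        "pd_prune_xid": prune_xid,
    }
```

Run from the data directory, `parse_page_header(read_block("base/16384/4931589", 74))` would show the stored pd_checksum and pd_lsn for the failing block. The pd_lsn gives a rough idea of which WAL record last touched the page, which may help narrow down where to look for the matching full-page image.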