Re: [HACKERS] Write Ahead Logging for Hash Indexes

Jeff Janes Wed, 21 Sep 2016 20:22:05 -0700

On Tue, Sep 20, 2016 at 10:27 PM, Amit Kapila <[email protected]>
wrote:


> On Tue, Sep 20, 2016 at 10:24 PM, Jeff Janes <[email protected]> wrote:
> > On Thu, Sep 15, 2016 at 11:42 PM, Amit Kapila <[email protected]>
> > wrote:
> >>
> >>
> >> Okay, Thanks for pointing out the same.  I have fixed it.  Apart from
> >> that, I have changed _hash_alloc_buckets() to initialize the page
> >> instead of making it completely Zero because of problems discussed in
> >> another related thread [1].  I have also updated README.
> >>
> >
> > With v7 of the concurrent hash patch, v4 of the write ahead log patch, and
> > the latest relcache patch (I don't know how important that is to reproducing
> > this, I suspect it is not), I once got this error:
> >
> >
> > 38422  00000 2016-09-19 16:25:50.055 PDT:LOG:  database system was interrupted; last known up at 2016-09-19 16:25:49 PDT
> > 38422  00000 2016-09-19 16:25:50.057 PDT:LOG:  database system was not properly shut down; automatic recovery in progress
> > 38422  00000 2016-09-19 16:25:50.057 PDT:LOG:  redo starts at 3F/2200DE90
> > 38422  01000 2016-09-19 16:25:50.061 PDT:WARNING:  page verification failed, calculated checksum 65067 but expected 21260
> > 38422  01000 2016-09-19 16:25:50.061 PDT:CONTEXT:  xlog redo at 3F/22053B50 for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> > 38422  XX001 2016-09-19 16:25:50.071 PDT:FATAL:  invalid page in block 9 of relation base/16384/17334
> > 38422  XX001 2016-09-19 16:25:50.071 PDT:CONTEXT:  xlog redo at 3F/22053B50 for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
> >
> >
> > The original page with the invalid checksum is:
> >
>
> I think this is an example of the torn page problem, which seems to be
> happening because of the below code in your test.
>
> !         if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) {
> !           nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
> !           ereport(FATAL,
> !                   (errcode(ERRCODE_DISK_FULL),
> !                    errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
> !                           blocknum,
> !                           relpath(reln->smgr_rnode, forknum),
> !                           nbytes, BLCKSZ),
> !                    errhint("JJ is screwing with the database.")));
> !         } else {
> !           nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
> !         }
>
> If you are running the above test with JJ_torn_page disabled, then it
> is a different matter and we need to investigate it, but I assume you
> are running with it enabled.
>
> I think this could happen if the actual change to the page is in the 2/3
> of the page that you are not writing in the above code.  The checksum in
> the page header, which is written as part of the partial page write (the
> first 1/3 of the page), would have accounted for the change you made,
> whereas after restart, when the page is read again to apply redo, the
> checksum calculation won't include the change made in the other 2/3.
>

Correct.  But any torn page write must be covered by the restoration of a
full page image during replay, shouldn't it?  And that restoration should
happen blindly, without first reading in the old page and verifying the
checksum.  Failure to restore the page from a FPI would be a bug.  (That
was the purpose for which I wrote this testing harness in the first place,
to verify that the restoration of FPI happens correctly; although most of
the bugs it happens to uncover have been unrelated to that.)
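
To make the argument above concrete, here is a minimal toy sketch in C of a
torn write followed by blind restoration from a full page image.  Everything
in it is an assumption for illustration: the toy_* names are hypothetical,
the checksum is a trivial position-weighted sum rather than PostgreSQL's
page checksum, and the "disk" is just a byte array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192

/*
 * Toy checksum over every byte after the first two, which hold the stored
 * checksum itself.  This is NOT PostgreSQL's checksum algorithm.
 */
static uint16_t toy_checksum(const uint8_t *page)
{
    uint32_t sum = 0;

    for (size_t i = 2; i < TOY_BLCKSZ; i++)
        sum += (uint32_t) (i + 1) * page[i];
    return (uint16_t) (sum & 0xFFFF);
}

/* Stamp the page with its current checksum, as done just before a write. */
static void toy_set_checksum(uint8_t *page)
{
    uint16_t c = toy_checksum(page);

    memcpy(page, &c, sizeof c);
}

/* Compare the stored checksum with a freshly computed one. */
static bool toy_verify(const uint8_t *page)
{
    uint16_t stored;

    memcpy(&stored, page, sizeof stored);
    return stored == toy_checksum(page);
}

/* Torn write: only the first nbytes of the new page reach the "disk". */
static void toy_torn_write(uint8_t *disk, const uint8_t *newpage, size_t nbytes)
{
    assert(nbytes <= TOY_BLCKSZ);
    memcpy(disk, newpage, nbytes);
}

/* Redo from a full page image: overwrite without reading or verifying. */
static void toy_restore_fpi(uint8_t *disk, const uint8_t *fpi)
{
    memcpy(disk, fpi, TOY_BLCKSZ);
}
```

If the changed byte lies beyond the first TOY_BLCKSZ/3 bytes, the torn page
carries the new checksum over old contents, so toy_verify() fails on it.
toy_restore_fpi() never looks at the old contents, so the page verifies
again afterwards.  That is the behaviour I would expect from replay.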



>
> Today, Ashutosh shared the logs of his test run, where he showed a
> similar problem for a HEAP page.  I think this could happen, though
> rarely, for any page with the above kind of tests.
>

I think Ashutosh's examples are of warnings, not errors.  I think the
warnings occur when replay needs to read in the block (for reasons I don't
understand yet) but then doesn't care whether it passes the checksum or not,
because it will just be blown away by the replay anyway.


> Does this explanation explain the reason for the problem you are seeing?
>

If it can't survive artificial torn page writes, then it probably can't
survive real ones either.  So I am pretty sure it is a bug of some sort.
Perhaps the bug is that it is generating an ERROR when it should just be a
WARNING?


>
> >
> > If I ignore the checksum failure and re-start the system, the page gets
> > restored to be a bitmap page.
> >
>
> Okay, but have you ensured in some way that redo is applied to bitmap page?
>


I haven't done that yet.  I can't start the system without destroying the
evidence, and I haven't figured out yet how to import a specific block from
a shut-down server into a bytea of a running server, in order to inspect it
using pageinspect.
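
One possible approach, sketched here as an assumption rather than something
I have tried: copy the block out of the relation file while the server is
down and print it in bytea hex format.  The function name is hypothetical,
and the sketch assumes the default 8192-byte block size and that the block
lives in the relation's first segment file (so block 9 of base/16384/17334
would start at byte offset 9 * 8192).

```c
#include <assert.h>
#include <stdio.h>

#define ASSUMED_BLCKSZ 8192     /* assumption: default block size */

/*
 * Read block blkno from the file at path and write it to out as "\x"
 * followed by hex digits, which is the bytea hex input format.
 * Returns 0 on success, -1 if the file or the block cannot be read.
 */
static int dump_block_as_bytea(const char *path, long blkno, FILE *out)
{
    unsigned char buf[ASSUMED_BLCKSZ];
    FILE       *in;

    assert(blkno >= 0);
    in = fopen(path, "rb");
    if (in == NULL)
        return -1;
    if (fseek(in, blkno * ASSUMED_BLCKSZ, SEEK_SET) != 0 ||
        fread(buf, 1, sizeof buf, in) != sizeof buf)
    {
        fclose(in);
        return -1;
    }
    fclose(in);

    fputs("\\x", out);
    for (size_t i = 0; i < sizeof buf; i++)
        fprintf(out, "%02x", buf[i]);
    fputc('\n', out);
    return 0;
}
```

The output could then be pasted as a bytea literal on a running server and
handed to a pageinspect function that accepts bytea, such as page_header().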

> Today, while thinking on this problem, I realized that currently in the
> patch we are using REGBUF_NO_IMAGE for the bitmap page, for one of the
> problems reported by you [1].  That change will fix the problem
> reported by you, but it will expose bitmap pages to torn-page
> hazards.  I think the right fix there is to make pd_lower equal to
> pd_upper for the bitmap page, so that full page writes don't exclude the
> data in the bitmap page.
>

I'm afraid that is over my head.  I can study it until it makes sense, but
it will take me a while.
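
For anyone else following along, here is a toy sketch in C of what the quoted
paragraph seems to describe.  A full page image of a standard-layout page
omits the "hole" between pd_lower and pd_upper, and the hole is zero-filled
on restore.  All names and sizes below are hypothetical, and this is not the
actual WAL code, only an illustration of why data sitting inside the hole
would be lost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192

/* A page image with the hole removed, plus the bounds needed to restore it. */
typedef struct ToyImage
{
    uint16_t    lower;              /* start of the hole */
    uint16_t    upper;              /* end of the hole */
    size_t      len;                /* bytes actually stored in data */
    uint8_t     data[TOY_BLCKSZ];
} ToyImage;

/* Take an image of the page, skipping the bytes in [lower, upper). */
static void toy_take_image(const uint8_t *page, uint16_t lower, uint16_t upper,
                           ToyImage *img)
{
    assert(lower <= upper && upper <= TOY_BLCKSZ);
    img->lower = lower;
    img->upper = upper;
    memcpy(img->data, page, lower);
    memcpy(img->data + lower, page + upper, TOY_BLCKSZ - upper);
    img->len = (size_t) lower + (TOY_BLCKSZ - upper);
}

/* Rebuild the page from the image; the hole comes back as zeros. */
static void toy_restore_image(const ToyImage *img, uint8_t *page)
{
    memcpy(page, img->data, img->lower);
    memset(page + img->lower, 0, (size_t) (img->upper - img->lower));
    memcpy(page + img->upper, img->data + img->lower,
           TOY_BLCKSZ - img->upper);
}
```

If the bitmap bytes sit between lower and upper, the restored page has zeros
where the bitmap was.  With lower equal to upper the hole is empty, so the
whole page, bitmap included, survives the round trip.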

Cheers,

Jeff


> [1] - https://www.postgresql.org/message-id/CAA4eK1KJOfVvFUmi6dcX9Y2-0PFHkomDzGuyoC%3DaD3Qj9WPpFA%40mail.gmail.com
>
>
