Hi,

On 2025-03-28 08:54:42 -0400, Andres Freund wrote:
> On 2025-03-27 20:22:23 -0700, Noah Misch wrote:
> > On Thu, Mar 27, 2025 at 04:58:11PM -0400, Andres Freund wrote:
> > > - We need to register a local callback for shared buffer reads, which 
> > > don't
> > >   need them today . That's a small bit of added overhead. It's a shame to 
> > > do
> > >   so for counters that approximately never get incremented.
> >
> > Fair concern.  An idea is to let the complete_shared callback change the
> > callback list associated with the IO, so it could change
> > PGAIO_HCB_SHARED_BUFFER_READV to PGAIO_HCB_SHARED_BUFFER_READV_SLOW.  The
> > latter would differ from the former only in having the extra local callback.
> > Could that help?  I think the only overhead is using more PGAIO_HCB numbers.
>
> I think changing the callback could work - I'll do some measurements in a
> coffee or two, but I suspect the overhead is not worth being too worried about
> for now.  There's a different aspect that worries me slightly more, see
> further down.
> ...
> Unfortunately pgstat_prepare_report_checksum_failure() has to do a lookup in a
> local hashtable. That's more expensive than an indirect function call
> (i.e. the added local callback). I hope^Wsuspect it'll still be fine, and if
> not we can apply a mini-cache for the current database, which is surely the
> only thing that ever matters for performance.

I tried it and at ~30GB/s of read IO, with checksums disabled, I can't see a
difference of either having the unnecessary complete_local callback or having
the lookup in pgstat_prepare_report_checksum_failure(). In a profile there are
a few hits inside pgstat_get_entry_ref(), but not enough to matter.

Hence I think this isn't worth worrying about, at least for now. I think we
have far bigger fish to fry at this point than such a small performance
difference.

I've adjusted the comment above TRACE_POSTGRESQL_BUFFER_READ_DONE() to not
mention the overhead. I'm still inclined to think that it's better to call it
in the shared completion callback.


I also fixed support and added tests for ignore_checksum_failure, that also
needs to be determined at the start of the IO, not in the completion.  Once
more there were no tests, of course.


I spent the last 6 hours on the stupid error/warning messages around this,
somewhat ridiculous.

The number of combinations is annoyingly large. It's e.g. plausible to use
ignore_checksum_failure=on and zero_damaged_pages=on at the same time for
recovery. The same buffer could both be ignored *and* zeroed. Or somebody
could use ignore_checksum_failure=on but then still encounter a page that is
invalid.

But I finally got to a point where the code ends up readable, without undue
duplication.  It would, leaving some nasty hack aside, require a
errhint_internal() - but I can't imagine a reason against introducing that,
given we have it for the errmsg and errhint.

Here's the relevant code:

        /*
         * Treat a read that had both zeroed buffers *and* ignored checksums as 
a
         * special case, it's too irregular to be emitted the same way as the 
other
         * cases.
         */
        if (zeroed_any && ignored_any)
        {
                Assert(zeroed_any && ignored_any);
                Assert(nblocks > 1);    /* same block can't be both zeroed and 
ignored */
                Assert(result.status != PGAIO_RS_ERROR);
                affected_count = zeroed_or_error_count;

                ereport(elevel,
                                errcode(ERRCODE_DATA_CORRUPTED),
                                errmsg("zeroing %u pages and ignoring %u 
checksum failures among blocks %u..%u of relation %s",
                                           affected_count, checkfail_count, 
first, last, rpath.str),
                                affected_count > 1 ?
                                errdetail("Block %u held first zeroed page.",
                                                  first + first_off) : 0,
                                errhint("See server log for details about the 
other %u invalid blocks.",
                                                affected_count + 
checkfail_count - 1));
                return;
        }

        /*
         * The other messages are highly repetitive. To avoid duplicating a long
         * and complicated ereport(), gather the translated format strings
         * separately and then do one common ereport.
         */
        if (result.status == PGAIO_RS_ERROR)
        {
                Assert(!zeroed_any);    /* can't have invalid pages when 
zeroing them */
                affected_count = zeroed_or_error_count;
                msg_one = _("invalid page in block %u of relation %s");
                msg_mult = _("%u invalid pages among blocks %u..%u of relation 
%s");
                det_mult = _("Block %u held first invalid page.");
                hint_mult = _("See server log for the other %u invalid 
blocks.");
        }
        else if (zeroed_any && !ignored_any)
        {
                affected_count = zeroed_or_error_count;
                msg_one = _("invalid page in block %u of relation %s; zeroing 
out page");
                msg_mult = _("zeroing out %u invalid pages among blocks %u..%u 
of relation %s");
                det_mult = _("Block %u held first zeroed page.");
                hint_mult = _("See server log for the other %u zeroed blocks.");
        }
        else if (!zeroed_any && ignored_any)
        {
                affected_count = checkfail_count;
                msg_one = _("ignoring checksum failure in block %u of relation 
%s");
                msg_mult = _("ignoring %u checksum failures among blocks %u..%u 
of relation %s");
                det_mult = _("Block %u held first ignored page.");
                hint_mult = _("See server log for the other %u ignored 
blocks.");
        }
        else
                pg_unreachable();

        ereport(elevel,
                        errcode(ERRCODE_DATA_CORRUPTED),
                        affected_count == 1 ?
                        errmsg_internal(msg_one, first + first_off, rpath.str) :
                        errmsg_internal(msg_mult, affected_count, first, last, 
rpath.str),
                        affected_count > 1 ? errdetail_internal(det_mult, first 
+ first_off) : 0,
                        affected_count > 1 ? errhint_internal(hint_mult, 
affected_count - 1) : 0);

Does that approach make sense?

What do you think about using
  "zeroing invalid page in block %u of relation %s"
instead of
  "invalid page in block %u of relation %s; zeroing out page"

I thought about instead translating "ignoring", "ignored", "zeroing",
"zeroed", etc separately, but I have doubts about how well that would actually
translate.

Greetings,

Andres Freund


Reply via email to