On Tue, Mar 29, 2022 at 12:34 PM Stephen Frost wrote:
> Anyhow, this whole thread has struck me as a good reason to polish those
> patches off and add on top of them an extended checksum ability, first,
> independent of TDE, and remove the dependency of those patches from the
> TDE effort and inst
Greetings,
* Robert Haas (robertmh...@gmail.com) wrote:
> On Fri, Mar 25, 2022 at 10:34 AM Tom Lane wrote:
> > I dunno. Compatibility and speed concerns aside, that seems like an awful
> > lot of bits to be expending on every page compared to the value.
>
> I dunno either, but over on the TDE t
At Thu, 24 Mar 2022 15:33:29 -0400, Robert Haas wrote
in
> On Thu, Mar 17, 2022 at 9:21 PM Kyotaro Horiguchi
> wrote:
> > All versions pass check world.
>
> Thanks, committed.
(I was overwhelmed by the flood of discussion that followed...)
Anyway, thanks for picking this up and committing it!
regar
On Fri, Mar 25, 2022 at 10:34:49AM -0400, Tom Lane wrote:
> Robert Haas writes:
>> On Fri, Mar 25, 2022 at 10:02 AM Tom Lane wrote:
>>> Adding another 16 bits won't get you to that, sadly. Yeah, it *might*
>>> extend the MTTF to more than the project's likely lifespan, but that
>>> doesn't mean
Dagfinn Ilmari Mannsåker writes:
> LGTM, but it would be good to include $! in the die messages.
Roger, will do.
regards, tom lane
Tom Lane writes:
> Robert Haas writes:
>> ... It's not
>> like a 16-bit checksum was state-of-the-art even when we introduced
>> it. We just did it because we had 2 bytes that we could repurpose
>> relatively painlessly, and not any larger number. And that's still the
>> case today, so at least
Andres Freund writes:
> The same code also exists in src/bin/pg_basebackup/t/010_pg_basebackup.pl,
> which presumably has the same collision risks.
Oooh, I missed that.
> Perhaps we should put a
> function into Cluster.pm and use it from both?
+1, I'll make it so.
regards, tom lane
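A minimal sketch of what such a shared Cluster.pm helper could look like (the sub name and the flip-the-stored-checksum approach are assumptions here, not necessarily what was committed). Instead of overwriting page contents and hoping the damaged page doesn't checksum to the same value, it flips the stored checksum itself, which pg_checksum_page() excludes from the computation, so the mismatch is guaranteed rather than merely 65535-in-65536 likely:
use strict;
use warnings;
# Hypothetical sketch only -- name and approach are assumptions.
sub corrupt_stored_checksum
{
	my ($self, $relpath, $pageno) = @_;
	my $pagesize = 8192;
	my $path     = $self->data_dir . '/' . $relpath;
	open my $fh, '+<', $path or die "could not open $path: $!";
	binmode $fh;
	# pd_checksum is the 2-byte field at offset 8 of the page header.
	my $offset = $pageno * $pagesize + 8;
	sysseek($fh, $offset, 0) or die "sysseek failed on $path: $!";
	sysread($fh, my $stored, 2) == 2 or die "sysread failed on $path: $!";
	my ($csum) = unpack('v', $stored);
	sysseek($fh, $offset, 0) or die "sysseek failed on $path: $!";
	syswrite($fh, pack('v', $csum ^ 0xFFFF)) == 2
	  or die "syswrite failed on $path: $!";
	close $fh or die "close failed on $path: $!";
	return;
}
Both 002_actions.pl and 010_pg_basebackup.pl could then call something like $node->corrupt_stored_checksum($relpath, 0) instead of carrying their own copies.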
Hi,
On 2022-03-25 11:50:48 -0400, Tom Lane wrote:
> Robert Haas writes:
> > ... It's not
> > like a 16-bit checksum was state-of-the-art even when we introduced
> > it. We just did it because we had 2 bytes that we could repurpose
> > relatively painlessly, and not any larger number. And that's s
Robert Haas writes:
> ... It's not
> like a 16-bit checksum was state-of-the-art even when we introduced
> it. We just did it because we had 2 bytes that we could repurpose
> relatively painlessly, and not any larger number. And that's still the
> case today, so at least in the short term we will
On Fri, Mar 25, 2022 at 10:34 AM Tom Lane wrote:
> I dunno. Compatibility and speed concerns aside, that seems like an awful
> lot of bits to be expending on every page compared to the value.
I dunno either, but over on the TDE thread people seemed quite willing
to expend like 16-32 *bytes* for
Robert Haas writes:
> On Fri, Mar 25, 2022 at 10:02 AM Tom Lane wrote:
>> Adding another 16 bits won't get you to that, sadly. Yeah, it *might*
>> extend the MTTF to more than the project's likely lifespan, but that
>> doesn't mean we couldn't get unlucky next week.
> I suspect that the number
On Fri, Mar 25, 2022 at 10:02 AM Tom Lane wrote:
> Robert Haas writes:
> > On Fri, Mar 25, 2022 at 9:49 AM Tom Lane wrote:
> >> That'll just reduce the probability of failure, not eliminate it.
>
> > I mean, if the expected time to the first failure on even 1 machine
> > exceeds the time until t
On Fri, Mar 25, 2022 at 2:07 AM Andres Freund wrote:
> We really ought to find a way to get to wider checksums :/
Eh, let's just use longer names for the buildfarm animals and call it good. :-)
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas writes:
> On Fri, Mar 25, 2022 at 9:49 AM Tom Lane wrote:
>> That'll just reduce the probability of failure, not eliminate it.
> I mean, if the expected time to the first failure on even 1 machine
> exceeds the time until the heat death of the universe by 10 orders of
> magnitude, it
On Fri, Mar 25, 2022 at 9:49 AM Tom Lane wrote:
> That'll just reduce the probability of failure, not eliminate it.
I mean, if the expected time to the first failure on even 1 machine
exceeds the time until the heat death of the universe by 10 orders of
magnitude, it's probably good enough.
--
Andres Freund writes:
> On 2022-03-25 01:38:45 -0400, Tom Lane wrote:
>> AFAICS, this strategy of whacking a predetermined chunk of the page with
>> a predetermined value is going to fail 1-out-of-64K times.
> Yea. I suspect that the way the modifications and checksumming are done are
> actually
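To put that 1-out-of-64K figure in buildfarm terms, a back-of-the-envelope calculation (the image counts below are made-up assumptions): each distinct page image the test happens to produce is an independent 1-in-65536 chance of a checksum collision, and an animal whose configuration produces such an image fails on every run, not occasionally.
use strict;
use warnings;
my $p = 1 / 65536;    # chance a given damaged page image still checksums OK
for my $n (100, 1_000, 10_000)
{
	printf "P(at least one colliding image among %6d): %5.2f%%\n",
	  $n, 100 * (1 - (1 - $p)**$n);
}
# Prints roughly 0.15%, 1.51% and 14.15% -- small, but nowhere near
# "heat death of the universe" small.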
Hi,
On 2022-03-25 01:38:45 -0400, Tom Lane wrote:
> Andres Freund writes:
> > Not sure what to do here... I guess we can just change the value we
> > overwrite
> > the page with and hope to not hit this again? But that feels deeply deeply
> > unsatisfying.
>
> AFAICS, this strategy of whacking
Andres Freund writes:
> Ah, and that's finally also the explanation why I couldn't reproduce the
> failure in a different directory, with an otherwise identically configured
> PG: The length of the path to the tablespace influences the size of the
> XLOG_TBLSPC_CREATE record.
Ohhh ... yeah
Hi,
On 2022-03-25 01:23:00 -0400, Tom Lane wrote:
> Andres Freund writes:
> > I do see that the LSN that ends up on the page is the same across a few runs
> > of the test on serinus. Which presumably differs between different
> > animals. Surprised that it's this predictable - but I guess the run
Hi,
On 2022-03-24 21:54:38 -0700, Andres Freund wrote:
> I do see that the LSN that ends up on the page is the same across a few runs
> of the test on serinus. Which presumably differs between different
> animals. Surprised that it's this predictable - but I guess the run is short
> enough that th
Andres Freund writes:
> I do see that the LSN that ends up on the page is the same across a few runs
> of the test on serinus. Which presumably differs between different
> animals. Surprised that it's this predictable - but I guess the run is short
> enough that there's no variation due to autovac
Hi,
On 2022-03-25 00:08:20 -0400, Tom Lane wrote:
> Andres Freund writes:
> > The only thing I can really conclude here is that we apparently end up with
> > the same checksum for exactly the modifications we are doing? Just on those
> > two damn instances? Reliably?
>
> IIRC, the table's OID or
Andres Freund writes:
> The only thing I can really conclude here is that we apparently end up with
> the same checksum for exactly the modifications we are doing? Just on those
> two damn instances? Reliably?
IIRC, the table's OID or relfilenode enters into the checksum.
Could it be that assigni
Hi,
On 2022-03-24 19:43:02 -0700, Andres Freund wrote:
> Just to be sure I'm going to clean out serinus' ccache dir and rerun. I'll
> leave dragonet's alone for now.
Turns out they had the same dir. But it didn't help.
I haven't yet figured out why, but I now *am* able to reproduce the problem i
On Fri, Mar 25, 2022 at 3:35 PM Andres Freund wrote:
> So I'm not sure how much can be gleaned from raw "failure counts" without
> taking the number of runs into account as well?
Ah, right, it does indeed hold the record for most runs in 3 months,
and taking runs into account its "stats" failure
Hi,
On 2022-03-24 19:20:10 -0700, Andres Freund wrote:
> I forced a run while writing the other email, with keep_error_whatnot, and I
> just saw it failing... Looking whether there's anything interesting to glean.
Unfortunately the test drops the table and it doesn't report the filepath of
the fa
Hi,
On 2022-03-25 15:23:24 +1300, Thomas Munro wrote:
> One random thing I've noticed about serinus is that it seems to drop
> UDP packets more than others, but dragonet apparently doesn't:
Serinus is built with optimization. Which I guess could lead to other backends
reporting stats more quickly
On Fri, Mar 25, 2022 at 3:14 PM Andres Freund wrote:
> On 2022-03-24 21:22:38 -0400, Tom Lane wrote:
> > serinus is 0-for-3 since then, and dragonet 0-for-4, so we can be pretty
> > confident that the failure is repeatable for them.
>
> That's weird. They run on the same host, but otherwise they h
Hi,
On 2022-03-24 21:59:08 -0400, Tom Lane wrote:
> Another thing that seems quite baffling, but is becoming clearer by
> the hour, is that only serinus and dragonet are seeing this failure.
> How is that? They're not very similarly configured --- one is gcc,
> one clang, and one uses jit and one
Hi,
On 2022-03-24 21:22:38 -0400, Tom Lane wrote:
> serinus is 0-for-3 since then, and dragonet 0-for-4, so we can be pretty
> confident that the failure is repeatable for them.
That's weird. They run on the same host, but otherwise they have very little
in common. There's plenty of other animals ru
I wrote:
> ... So that leaves 7dac61402e, which did this to
> the test script that's failing:
> use strict;
> use warnings;
> -use Config;
> use PostgreSQL::Test::Cluster;
> use PostgreSQL::Test::Utils;
> Discuss.
Another thing that seems quite baffling, but is becoming clearer by
the hour,
Hi,
On 2022-03-24 20:39:27 -0400, Robert Haas wrote:
> But that leaves me even more confused. How can a change to only the server
> code cause a client utility to fail to detect corruption that is being
> created by Perl while the server is stopped?
I guess it could somehow cause the first page t
Robert Haas writes:
> I hate to say "no" because the evidence suggests that the answer might
> be "yes" -- but it definitely isn't intending to change anything about
> the shutdown sequence. It just introduces a mechanism to backends to
> force the checkpointer to delay writing the checkpoint reco
On Thu, Mar 24, 2022 at 8:45 PM Tom Lane wrote:
> Hmm, I'd supposed that the failing test cases were new as of 412ad7a55.
> Now I see they're not, which indeed puts quite a different spin on
> things. Your thought about maybe the server isn't shut down yet is
> interesting --- did 412ad7a55 touch
Robert Haas writes:
> And ... right after hitting send, I see that the recovery check
> failures are under separate troubleshooting and thus probably
> unrelated.
Yeah, we've been chasing those for months.
> But that leaves me even more confused. How can a change to
> only the server code cause
On Thu, Mar 24, 2022 at 8:37 PM Robert Haas wrote:
> Any ideas?
And ... right after hitting send, I see that the recovery check
failures are under separate troubleshooting and thus probably
unrelated. But that leaves me even more confused. How can a change to
only the server code cause a client u
On Thu, Mar 24, 2022 at 6:04 PM Tom Lane wrote:
> Robert Haas writes:
> > Thanks, committed.
>
> Some of the buildfarm is seeing failures in the pg_checksums test.
Hmm. So the tests seem to be failing because 002_actions.pl stops the
database cluster, runs pg_checksums (which passes), writes som
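For reference, a simplified sketch of that test's shape (the table name, offset, and test messages are assumptions, condensed from the flow described above rather than the literal contents of 002_actions.pl):
use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

my $node = PostgreSQL::Test::Cluster->new('node_checksum');
$node->init(extra => ['--data-checksums']);
$node->start;
$node->safe_psql('postgres',
	'CREATE TABLE t AS SELECT generate_series(1, 100) AS a');
my $relpath = $node->safe_psql('postgres', "SELECT pg_relation_filepath('t')");
$node->stop;

# On the clean, cleanly-shut-down cluster the check must pass.
command_ok([ 'pg_checksums', '--check', '-D', $node->data_dir ],
	'pg_checksums passes on clean cluster');

# Overwrite the start of the first page with a fixed value while the server
# is down.  If the damaged page happens to checksum to the stored value
# (1-in-65536), the next step fails spuriously -- the failure mode being
# discussed here.
open my $fh, '+<', $node->data_dir . '/' . $relpath or die "open failed: $!";
binmode $fh;
syswrite($fh, "\0\0\0\0\0\0\0\0") == 8 or die "syswrite failed: $!";
close $fh or die "close failed: $!";

command_fails([ 'pg_checksums', '--check', '-D', $node->data_dir ],
	'pg_checksums detects the corruption');
done_testing();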
Robert Haas writes:
> Thanks, committed.
Some of the buildfarm is seeing failures in the pg_checksums test.
regards, tom lane
On Thu, Mar 17, 2022 at 9:21 PM Kyotaro Horiguchi
wrote:
> Finally, no two of the branches from 10 to 14 accept the same patch.
>
> As a cross-version check, I compared all combinations of the patches
> for two adjacent versions and confirmed that no hunks are lost.
>
> All versions pass check world.
At Wed, 16 Mar 2022 10:14:56 -0400, Robert Haas wrote
in
> Hmm. I think the last two instances of "buffers" in this comment
> should actually say "blocks".
Ok. I replaced them with "blocks" and it looks nicer. Thanks!
> > I'll try that, if you are already working on it, please inform me. (It
>
On Wed, Mar 16, 2022 at 1:14 AM Kyotaro Horiguchi
wrote:
> storage.c:
+* Make sure that a concurrent checkpoint can't complete while truncation
+* is in progress.
> +*
> +* The truncation operation might drop buffers that the checkpoint
> +* otherwise
At Tue, 15 Mar 2022 12:44:49 -0400, Robert Haas wrote
in
> On Wed, Jan 26, 2022 at 3:25 AM Kyotaro Horiguchi
> wrote:
> > The attached is the fixed version and it surely works with the repro.
>
> Hi,
>
> I spent the morning working on this patch and came up with the
> attached version. I wrot
On Wed, Jan 26, 2022 at 3:25 AM Kyotaro Horiguchi
wrote:
> The attached is the fixed version and it surely works with the repro.
Hi,
I spent the morning working on this patch and came up with the
attached version. I wrote substantial comments in RelationTruncate(),
where I tried to make it more
At Mon, 24 Jan 2022 23:33:20 +0300, Daniel Shelepanov
wrote in
> Hi. This is my first attempt to review a patch so feel free to tell me
> if I missed something.
Welcome!
> As of today's state of REL_14_STABLE
> (ef9706bbc8ce917a366e4640df8c603c9605817a), the problem is
> reproducible using the
On 27.09.2021 11:30, Kyotaro Horiguchi wrote:
Thank you for the comments! (Sorry for the late response.)
At Tue, 10 Aug 2021 14:14:05 -0400, Robert Haas wrote in
On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
wrote:
The patch assumed that CHKPT_START/COMPLETE barrier are exclusively
use
Thank you for the comments! (Sorry for the late response.)
At Tue, 10 Aug 2021 14:14:05 -0400, Robert Haas wrote
in
> On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
> wrote:
> > The patch assumed that CHKPT_START/COMPLETE barrier are exclusively
> > used each other, but MarkBufferDirtyHint
Thanks for looking at this, Robert and Tom.
At Fri, 24 Sep 2021 16:22:28 -0400, Tom Lane wrote in
> Robert Haas writes:
> > On Fri, Sep 24, 2021 at 3:42 PM Tom Lane wrote:
> >> I think the basic idea is about right, but I'm not happy with the
> >> three-way delayChkpt business; that seems too cute
Robert Haas writes:
> On Fri, Sep 24, 2021 at 3:42 PM Tom Lane wrote:
>> I think the basic idea is about right, but I'm not happy with the
>> three-way delayChkpt business; that seems too cute by three-quarters.
> Nobody, but the version of the patch that I was looking at uses a
> separate bit f
On Fri, Sep 24, 2021 at 3:42 PM Tom Lane wrote:
> Robert Haas writes:
> > I like this patch.
>
> I think the basic idea is about right, but I'm not happy with the
> three-way delayChkpt business; that seems too cute by three-quarters.
> I think two independent boolean flags, one saying "I'm preve
Robert Haas writes:
> I like this patch.
I think the basic idea is about right, but I'm not happy with the
three-way delayChkpt business; that seems too cute by three-quarters.
I think two independent boolean flags, one saying "I'm preventing
checkpoint start" and one saying "I'm preventing check
On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
wrote:
> The patch assumed that the CHKPT_START/COMPLETE barriers are used mutually
> exclusively, but MarkBufferDirtyHint, which delays checkpoint start, is
> called in RelationTruncate while delaying checkpoint completion.
> That is not a strange nor
At Thu, 4 Mar 2021 22:37:23 +0500, Ibrar Ahmed wrote in
> The regression test is failing for this patch; do you mind looking at that and sending
> the updated patch?
>
> https://api.cirrus-ci.com/v1/task/6313174510075904/logs/test.log
>
> ...
> t/006_logical_decoding.pl ok
> t/007_sync_rep.pl
On Wed, Jan 6, 2021 at 1:33 PM Kyotaro Horiguchi
wrote:
> At Mon, 17 Aug 2020 11:22:15 -0700, Andres Freund
> wrote in
> > Hi,
> >
> > On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:
> > > On 14/04/2020 22:04, Teja Mupparti wrote:
> > > > Thanks Kyotaro and Masahiko for the feedback. I t
At Mon, 17 Aug 2020 11:22:15 -0700, Andres Freund wrote in
> Hi,
>
> On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:
> > On 14/04/2020 22:04, Teja Mupparti wrote:
> > > Thanks Kyotaro and Masahiko for the feedback. I think there is a
> > > consensus on the critical-section around truncat
On 06.11.2020 14:40, Masahiko Sawada wrote:
So I agree to
proceed with the patch that adds a critical section, independently of
fixing the other related things discussed in this thread. If Teja doesn't
seem to be working on this, I'll write the patch.
Regards,
--
Masahiko Sawada
EnterpriseDB: https://www.ent
On Tue, Aug 18, 2020 at 3:22 AM Andres Freund wrote:
>
> Hi,
>
> On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:
> > On 14/04/2020 22:04, Teja Mupparti wrote:
> > > Thanks Kyotaro and Masahiko for the feedback. I think there is a
> > > consensus on the critical-section around truncate,
> >
Status update for a commitfest entry.
I see quite a few unanswered questions in the thread since the last patch
version was sent, so I'm moving it to "Waiting on Author".
The new status of this patch is: Waiting on Author
Hi,
On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:
> On 14/04/2020 22:04, Teja Mupparti wrote:
> > Thanks Kyotaro and Masahiko for the feedback. I think there is a
> > consensus on the critical-section around truncate,
>
> +1
I'm inclined to think that we should do that independent of t
On 14/04/2020 22:04, Teja Mupparti wrote:
Thanks Kyotaro and Masahiko for the feedback. I think there is a
consensus on the critical-section around truncate,
+1
but I just want to emphasize the need for reversing the order of
dropping the buffers and the truncation.
Repro details (when
On Wed, 15 Apr 2020 at 04:04, Teja Mupparti wrote:
>
> Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on
> the critical-section around truncate, but I just want to emphasize the need
> for reversing the order of dropping the buffers and the truncation.
>
> Repro
, 2020 7:35 PM
To: masahiko.saw...@2ndquadrant.com
Cc: and...@anarazel.de ; tejesw...@hotmail.com
; pgsql-hack...@postgresql.org
; hexexp...@comcast.net
Subject: Re: Corruption during WAL replay
At Mon, 13 Apr 2020 18:53:26 +0900, Masahiko Sawada
wrote in
> On Mon, 13 Apr 2020 at 17:40, And
At Mon, 13 Apr 2020 18:53:26 +0900, Masahiko Sawada
wrote in
> On Mon, 13 Apr 2020 at 17:40, Andres Freund wrote:
> >
> > Hi,
> >
> > On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:
> > > On Sat, 11 Apr 2020 at 09:00, Teja Mupparti wrote:
> > > /*
> > > * We WAL-log the truncation before
On Mon, 13 Apr 2020 at 17:40, Andres Freund wrote:
>
> Hi,
>
> On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:
> > On Sat, 11 Apr 2020 at 09:00, Teja Mupparti wrote:
> > >
> > > Thanks Andres and Kyotaro for the quick review. I have fixed the typos
> > > and also included the critical sect
Hi,
On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:
> On Sat, 11 Apr 2020 at 09:00, Teja Mupparti wrote:
> >
> > Thanks Andres and Kyotaro for the quick review. I have fixed the typos and
> > also included the critical section (emulated it with try-catch block since
> > palloc()s are caus
On Sat, 11 Apr 2020 at 09:00, Teja Mupparti wrote:
>
> Thanks Andres and Kyotaro for the quick review. I have fixed the typos and
> also included the critical section (emulated it with try-catch block since
> palloc()s are causing issues in the truncate code). This time I used git
> format-pat
Hi,
On 2020-04-10 20:49:05 -0400, Alvaro Herrera wrote:
> On 2020-Mar-30, Andres Freund wrote:
>
> > If we are really concerned with truncation failing - I don't know why we
> > would be, we accept that we have to be able to modify files etc to stay
> > up - we can add a pre-check ensuring that p
On 2020-Mar-30, Andres Freund wrote:
> If we are really concerned with truncation failing - I don't know why we
> would be, we accept that we have to be able to modify files etc to stay
> up - we can add a pre-check ensuring that permissions are set up
> appropriately to allow us to truncate.
I r
: Andres Freund
Sent: Monday, March 30, 2020 4:31 PM
To: Kyotaro Horiguchi
Cc: tejesw...@hotmail.com ; pgsql-hack...@postgresql.org
; hexexp...@comcast.net
Subject: Re: Corruption during WAL replay
Hi,
On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:
> At Mon, 23 Mar 2020 20:56:59 +0
At Mon, 30 Mar 2020 16:31:59 -0700, Andres Freund wrote in
> Hi,
>
> On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:
> > At Mon, 23 Mar 2020 20:56:59 +, Teja Mupparti
> > wrote in
> > > The original bug reporting-email and the relevant discussion is here
> > ...
> > > The crux of t
Hi,
On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:
> At Mon, 23 Mar 2020 20:56:59 +, Teja Mupparti
> wrote in
> > The original bug reporting-email and the relevant discussion is here
> ...
> > The crux of the fix is, in the current code, engine drops the buffer and
> > then truncat
Thanks for working on this.
At Mon, 23 Mar 2020 20:56:59 +, Teja Mupparti wrote
in
> This is my *first* attempt to submit a Postgres patch, so please let me know
> if I missed anything about the process or the format of the patch
Welcome! The format looks fine to me. It would be better if it had a
commit mess