On Tue, Aug 25, 2020 at 1:43 AM Tom Lane wrote:
> I wrote:
> > For our archives' sake: today I got seemingly-automated mail informing me
> > that this patch has been merged into the 4.19-stable, 5.4-stable,
> > 5.7-stable, and 5.8-stable kernel branches; but not 4.4-stable,
> > 4.9-stable, or 4.14-stable, because it failed to apply.
I wrote:
> For our archives' sake: today I got seemingly-automated mail informing me
> that this patch has been merged into the 4.19-stable, 5.4-stable,
> 5.7-stable, and 5.8-stable kernel branches; but not 4.4-stable,
> 4.9-stable, or 4.14-stable, because it failed to apply.
And this morning's ma
Thomas Munro writes:
> On Tue, Jul 28, 2020 at 3:27 PM Tom Lane wrote:
>> Anyway, I guess the interesting question for us is how long it
>> will take for this fix to propagate into real-world systems.
>> I don't have much of a clue about the Linux kernel workflow,
>> anybody want to venture a guess?
On Tue, Jul 28, 2020 at 3:27 PM Tom Lane wrote:
> Anyway, I guess the interesting question for us is how long it
> will take for this fix to propagate into real-world systems.
> I don't have much of a clue about the Linux kernel workflow,
> anybody want to venture a guess?
Me neither. It just hi
Thomas Munro writes:
> Hehe, the dodgy looking magic numbers *were* wrong:
> - * The kernel signal delivery code writes up to about 1.5kB
> + * The kernel signal delivery code writes a bit over 4KB
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200724092528.1578671-2-...@ellerman.id.a
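To make the quoted hunk a little more concrete: the heuristic in question is roughly of the following shape (a greatly simplified sketch, not the actual arch/powerpc/mm/fault.c code, and the window constant here is illustrative). The fault handler only grows the stack automatically when the faulting address lies within a fixed window below the user stack pointer, and a signal frame of "a bit over 4KB" could land outside the old 2kB window.

#include <stdbool.h>

/*
 * Window below the user stack pointer that is still treated as
 * legitimate stack growth.  The old value was 2048 bytes; the proposed
 * fix enlarges it so that a signal frame of a bit over 4kB always fits
 * (the exact constant used here is illustrative, not the kernel's).
 */
#define STACK_GROWTH_WINDOW (4096 + 128)

static bool
allow_stack_expansion(unsigned long fault_addr, unsigned long user_sp)
{
    /*
     * An access only slightly below the stack pointer is assumed to be a
     * new stack frame (or a signal frame) being built, so the stack VMA
     * may be expanded; anything further below is treated as a stray
     * pointer and the process gets SIGSEGV instead.
     */
    return fault_addr + STACK_GROWTH_WINDOW >= user_sp;
}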
On Wed, Dec 11, 2019 at 3:22 PM Thomas Munro wrote:
> On Tue, Oct 15, 2019 at 4:50 AM Tom Lane wrote:
> > > Filed at
> > > https://bugzilla.kernel.org/show_bug.cgi?id=205183
>
> For the curious-and-not-subscribed, there's now a kernel patch
> proposed for this. We guessed pretty close, but the p
On Tue, Oct 15, 2019 at 4:50 AM Tom Lane wrote:
> > Filed at
> > https://bugzilla.kernel.org/show_bug.cgi?id=205183
For the curious-and-not-subscribed, there's now a kernel patch
proposed for this. We guessed pretty close, but the problem wasn't
those dodgy looking magic numbers, it was that the
I want to give some conclusion to our occurrence of this, which I now think was
neither an instance of nor indicative of any bug. Summary: postgres was being
killed with kill -9 by a deployment script after it "timed out". Thus no log messages.
I initially experienced this while testing migration of a customer
I wrote:
> Filed at
> https://bugzilla.kernel.org/show_bug.cgi?id=205183
> We'll see what happens ...
Further to this --- I went back and looked at the outlier events
where we saw an infinite_recurse failure on a non-Linux-PPC64
platform. There were only three:
mereswine| ARMv7|
Hi,
On 2019-10-13 13:44:59 +1300, Thomas Munro wrote:
> On Sun, Oct 13, 2019 at 1:06 PM Tom Lane wrote:
> > I don't think any further proof is required that this is
> > a kernel bug. Where would be a good place to file it?
>
> linuxppc-...@lists.ozlabs.org might be the right place.
>
> https://lists.ozlabs.org/listinfo/linuxppc-dev
Andres Freund writes:
> Probably requires reproducing on a pretty recent kernel first, to have a
> decent chance of being investigated...
How recent do you think it needs to be? The machine I was testing on
yesterday is under a year old:
uname -m = ppc64le
uname -r = 4.18.19-100.fc27.ppc64le
un
Filed at
https://bugzilla.kernel.org/show_bug.cgi?id=205183
We'll see what happens ...
regards, tom lane
Andres Freund writes:
> On 2019-10-13 10:29:45 -0400, Tom Lane wrote:
>> How recent do you think it needs to be?
> My experience reporting kernel bugs is that the latest released version,
> or even just the tip of the git tree, is your best bet :/.
Considering that we're going to point them at c
Hi,
On 2019-10-13 10:29:45 -0400, Tom Lane wrote:
> Andres Freund writes:
> > Probably requires reproducing on a pretty recent kernel first, to have a
> > decent chance of being investigated...
>
> How recent do you think it needs to be? The machine I was testing on
> yesterday is under a year
On Sun, Oct 13, 2019 at 1:06 PM Tom Lane wrote:
> I don't think any further proof is required that this is
> a kernel bug. Where would be a good place to file it?
linuxppc-...@lists.ozlabs.org might be the right place.
https://lists.ozlabs.org/listinfo/linuxppc-dev
I wrote:
> In short, my current belief is that Linux PPC64 fails when trying
> to deliver a signal if there's right around 2KB of stack remaining,
> even though it should be able to expand the stack and press on.
I figured I should try to remove some variables from the equation
by demonstrating th
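A minimal standalone sketch of that kind of test (illustrative only, not the program actually used in this thread; the safety margin, timer interval, and frame size are made-up values): recurse to within a fixed margin of RLIMIT_STACK while an interval timer keeps delivering signals. On an affected ppc64 kernel the process can be killed with SIGSEGV during signal delivery even though it never exceeds its stack limit.

#include <signal.h>
#include <sys/resource.h>
#include <sys/time.h>

#define SAFETY_MARGIN (256 * 1024)      /* stay well clear of the limit */

static char *stack_base;                /* address near the top of the stack */
static long  stack_limit;               /* soft RLIMIT_STACK, in bytes */

static void
handler(int signo)
{
    (void) signo;                       /* we only care that a signal frame gets pushed */
}

static long
recurse(void)
{
    volatile char pad[1024];            /* burn ~1kB of stack per frame */
    int     i;

    for (i = 0; i < (int) sizeof(pad); i++)
        pad[i] = (char) i;              /* keep the compiler from eliding the frame */
    if ((long) (stack_base - (const char *) pad) < stack_limit - SAFETY_MARGIN)
        return recurse() + pad[0];      /* not a tail call, so frames stay live */
    return pad[0];
}

int
main(void)
{
    struct rlimit    rl;
    struct sigaction sa;
    struct itimerval it;
    char             base;

    stack_base = &base;
    getrlimit(RLIMIT_STACK, &rl);       /* assumes a finite "ulimit -s" */
    stack_limit = (long) rl.rlim_cur;

    sa.sa_handler = handler;            /* install a do-nothing handler */
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGALRM, &sa, NULL);

    it.it_interval.tv_sec = 0;          /* fire SIGALRM every millisecond so */
    it.it_interval.tv_usec = 1000;      /* signal frames land at random stack depths */
    it.it_value.tv_sec = 0;
    it.it_value.tv_usec = 1000;
    setitimer(ITIMER_REAL, &it, NULL);

    for (;;)
        recurse();                      /* should loop forever on a fixed kernel */
}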
I've now also been able to reproduce the "infinite_recurse" segfault
on wobbegong's host (or, since I was using a gcc build, I guess I
should say vulpes' host). The first-order result is that it's the
same problem with the kernel not giving us as much stack space as
we expect: there's only 1179648
I wrote:
> It's not very clear how those things would lead to an intermittent
> failure though. In the case of the postmaster crashes, we now see
> that timing of signal receipts is relevant. For infinite_recurse,
> maybe it only fails if an sinval interrupt happens at the wrong time?
> (This the
On Sat, Oct 12, 2019 at 9:40 AM Tom Lane wrote:
> Andres Freund writes:
> > On 2019-10-11 14:56:41 -0400, Tom Lane wrote:
> >> ... So it's really hard to explain
> >> that as anything except a kernel bug: sometimes, the kernel
> >> doesn't give us as much stack as it promised it would. And the
>
Andres Freund writes:
> On 2019-10-11 14:56:41 -0400, Tom Lane wrote:
>> ... So it's really hard to explain
>> that as anything except a kernel bug: sometimes, the kernel
>> doesn't give us as much stack as it promised it would. And the
>> machine is not loaded enough for there to be any rational
Hi,
On 2019-10-11 14:56:41 -0400, Tom Lane wrote:
> I still don't have a good explanation for why this only seems to
> happen in the pg_upgrade test sequence. However, I did notice
> something very interesting: the postmaster crashes after consuming
> only about 1MB of stack space. This is despi
On Sat, Oct 12, 2019 at 08:41:12AM +1300, Thomas Munro wrote:
> On Sat, Oct 12, 2019 at 7:56 AM Tom Lane wrote:
> > This matches up with the intermittent infinite_recurse failures
> > we've been seeing in the buildfarm. Those are happening across
> > a range of systems, but they're (almost) all L
Thomas Munro writes:
> Yeah, I don't know anything about this stuff, but I was also beginning
> to wonder if something is busted in the arch-specific fault.c code
> that checks if stack expansion is valid[1], in a way that fails with a
> rapidly growing stack, well timed incoming signals, and perh
On Sat, Oct 12, 2019 at 7:56 AM Tom Lane wrote:
> This matches up with the intermittent infinite_recurse failures
> we've been seeing in the buildfarm. Those are happening across
> a range of systems, but they're (almost) all Linux-based ppc64,
> suggesting that there's a longstanding arch-specif
Andrew Dunstan writes:
> On 10/11/19 11:45 AM, Tom Lane wrote:
>> FWIW, I'm not excited about that as a permanent solution. It requires
>> root privilege, and it affects the whole machine not only the buildfarm,
>> and making it persist across reboots is even more invasive.
> OK, but I'm not kee
I wrote:
> What we've apparently got here is that signals were received
> so fast that the postmaster ran out of stack space. I remember
> Andres complaining about this as a theoretical threat, but I
> hadn't seen it in the wild before.
> I haven't finished investigating though, as there are some
On 10/11/19 11:45 AM, Tom Lane wrote:
> Andrew Dunstan writes:
>>> At least on F29 I have set /proc/sys/kernel/core_pattern and it works.
> FWIW, I'm not excited about that as a permanent solution. It requires
> root privilege, and it affects the whole machine not only the buildfarm,
> and maki
Andrew Dunstan writes:
>> At least on F29 I have set /proc/sys/kernel/core_pattern and it works.
FWIW, I'm not excited about that as a permanent solution. It requires
root privilege, and it affects the whole machine not only the buildfarm,
and making it persist across reboots is even more invasi
On 10/10/19 6:01 PM, Andrew Dunstan wrote:
> On 10/10/19 5:34 PM, Tom Lane wrote:
>> I wrote:
>>>>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>>>>> that would fork the postmaster, wait for the postmaster to exit, and then
>>>>> report the exit status.
>>> [ pushed at 6a5084eed ]
On 10/10/19 5:34 PM, Tom Lane wrote:
> I wrote:
>>>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>>>> that would fork the postmaster, wait for the postmaster to exit, and then
>>>> report the exit status.
>> [ pushed at 6a5084eed ]
>> Given wobbegong's recent failure rate
On Thu, Oct 10, 2019 at 05:34:51PM -0400, Tom Lane wrote:
> A nearer-term solution would be to reproduce this manually and
> dig into the core. Mark, are you in a position to give somebody
> ssh access to wobbegong's host, or another similarly-configured VM?
>
> (While at it, it'd be nice to inve
I wrote:
>>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>>> that would fork the postmaster, wait for the postmaster to exit, and then
>>> report the exit status.
> [ pushed at 6a5084eed ]
> Given wobbegong's recent failure rate, I don't think we'll have to wait
> long.
I
On Tue, Jul 23, 2019 at 7:29 PM Tom Lane wrote:
> Parallel workers aren't ever allowed to write, in the current
> implementation, so it's not real obvious why they'd have any
> WAL log files open at all.
Parallel workers are not forbidden to write WAL, nor are they
forbidden to modify blocks. The
Thomas Munro writes:
> On Wed, Aug 7, 2019 at 4:29 PM Tom Lane wrote:
>> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
>> that would fork the postmaster, wait for the postmaster to exit, and then
>> report the exit status. Where to report it *to* seems like the hard part,
On 07/08/2019 17:45, Tom Lane wrote:
> Heikki Linnakangas writes:
>> On 07/08/2019 16:57, Tom Lane wrote:
>>> Also, if you're using systemd or something else that thinks it
>>> ought to interfere with where cores get dropped, that could be
>>> a problem.
>> I think they should just go to a file called "core",
Heikki Linnakangas writes:
> On 07/08/2019 16:57, Tom Lane wrote:
>> Also, if you're using systemd or something else that thinks it
>> ought to interfere with where cores get dropped, that could be
>> a problem.
> I think they should just go to a file called "core", I don't think I've
> changed
On 07/08/2019 16:57, Tom Lane wrote:
> Heikki Linnakangas writes:
>> On 07/08/2019 02:57, Thomas Munro wrote:
>>> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane wrote:
>>>> So I think I've got to take back the assertion that we've got
>>>> some lurking generic problem. This pattern looks way more
>>>> like a platform-sp
Heikki Linnakangas writes:
> On 07/08/2019 02:57, Thomas Munro wrote:
>> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane wrote:
>>> So I think I've got to take back the assertion that we've got
>>> some lurking generic problem. This pattern looks way more
>>> like a platform-specific issue. Overaggres
On 07/08/2019 02:57, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 5:15 PM Tom Lane wrote:
>> So I think I've got to take back the assertion that we've got
>> some lurking generic problem. This pattern looks way more
>> like a platform-specific issue. Overaggressive OOM killer
>> would fit the facts on vul
On Wed, Aug 7, 2019 at 5:07 PM Tom Lane wrote:
> Thomas Munro writes:
> > Another question is whether the build farm should be setting the Linux
> > oom score adjust thing.
>
> AFAIK you can't do that without being root.
Rats, yeah you need CAP_SYS_RESOURCE or root to lower it.
--
Thomas Munro
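For reference, the adjustment itself is just a write to /proc/self/oom_score_adj; a sketch of the shape it takes is below (illustrative only, with a made-up helper name). Raising the score needs no privilege, but lowering it below the inherited value is what requires root or CAP_SYS_RESOURCE, which is why a non-root buildfarm animal can't protect the postmaster this way.

#include <stdio.h>

/* Try to set this process's OOM badness adjustment (-1000 .. 1000).
 * Raising it needs no privilege; lowering it below the value inherited
 * from the parent requires root or CAP_SYS_RESOURCE. */
static int
set_oom_score_adj(int value)
{
    FILE   *f = fopen("/proc/self/oom_score_adj", "w");
    int     ok;

    if (f == NULL)
        return -1;                      /* not Linux, or /proc not mounted */
    ok = (fprintf(f, "%d\n", value) >= 0);
    if (fclose(f) != 0)                 /* stdio flushes here, so a permission
                                         * error typically surfaces at fclose */
        ok = 0;
    return ok ? 0 : -1;
}

int
main(void)
{
    if (set_oom_score_adj(-1000) != 0)  /* -1000 exempts the process entirely */
        perror("lowering oom_score_adj");
    return 0;
}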
Thomas Munro writes:
> Another question is whether the build farm should be setting the Linux
> oom score adjust thing.
AFAIK you can't do that without being root.
regards, tom lane
On Wed, Aug 7, 2019 at 4:29 PM Tom Lane wrote:
> Thomas Munro writes:
> > I wondered if the build farm should try to report OOM kill -9 or other
> > signal activity affecting the postmaster.
>
> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
> that would fork the postmaster,
Thomas Munro writes:
> I wondered if the build farm should try to report OOM kill -9 or other
> signal activity affecting the postmaster.
Yeah, I've been wondering whether pg_ctl could fork off a subprocess
that would fork the postmaster, wait for the postmaster to exit, and then
report the exit
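A rough sketch of the shape such a watchdog could take (purely illustrative, not pg_ctl's actual code): an intermediate process forks and execs the postmaster, waits for it, and reports whether it exited normally or was killed by a signal.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    pid_t   pid;
    int     status;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s postmaster-path [args...]\n", argv[0]);
        return 1;
    }

    pid = fork();
    if (pid < 0)
    {
        perror("fork");
        return 1;
    }
    if (pid == 0)
    {
        execv(argv[1], argv + 1);       /* child becomes the postmaster */
        perror("execv");
        _exit(127);
    }

    /* parent: do nothing but wait, then report how the child exited */
    if (waitpid(pid, &status, 0) < 0)
    {
        perror("waitpid");
        return 1;
    }
    if (WIFSIGNALED(status))
        fprintf(stderr, "postmaster (pid %d) was killed by signal %d\n",
                (int) pid, WTERMSIG(status));
    else
        fprintf(stderr, "postmaster (pid %d) exited with status %d\n",
                (int) pid, WEXITSTATUS(status));
    return 0;
}

As noted elsewhere in the thread, the hard part is where to report this *to*, since pg_ctl itself has normally exited long before the postmaster does.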
On Wed, Jul 24, 2019 at 5:15 PM Tom Lane wrote:
> Thomas Munro writes:
> > On Wed, Jul 24, 2019 at 10:11 AM Tom Lane wrote:
> > Do you have an example to hand? Is this
> > failure always happening on Linux?
>
> I dug around a bit further, and while my recollection of a lot of
> "postmaster exit
Thomas Munro writes:
> On Wed, Jul 24, 2019 at 10:11 AM Tom Lane wrote:
>> In any case, the evidence from the buildfarm is pretty clear that
>> there is *some* connection. We've seen a lot of recent failures
>> involving "postmaster exited during a parallel transaction", while
>> the number of p
On Wed, Jul 24, 2019 at 11:32:30AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby wrote:
> > I ought to have remembered that it *was* in fact out of space this AM when
> > this
> > core was dumped (due to having not touched it since scheduling transition to
> > this V
On 2019-Jul-23, Justin Pryzby wrote:
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.
I'm not sure that this proves much, since I expect temporary files to be
deleted on failure; by the time you run 'df' the
On Wed, Jul 24, 2019 at 10:11 AM Tom Lane wrote:
> Thomas Munro writes:
> > *I suspect that the only thing implicating parallelism in this failure
> > is that parallel leaders happen to print out that message if the
> > postmaster dies while they are waiting for workers; most other places
> > (pr
On Wed, Jul 24, 2019 at 11:04 AM Justin Pryzby wrote:
> I ought to have remembered that it *was* in fact out of space this AM when
> this
> core was dumped (due to having not touched it since scheduling transition to
> this VM last week).
>
> I want to say I'm almost certain it wasn't ENOSPC in o
Justin Pryzby writes:
> I want to say I'm almost certain it wasn't ENOSPC in other cases, since,
> failing to find log output, I ran df right after the failure.
The fact that you're not finding log output matching what was reported
to the client seems to me to be a mighty strong indication that t
On Wed, Jul 24, 2019 at 10:46:42AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby wrote:
> > On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby
> > > wrote:
> > > > #2 0x0085ddff in errfinish (du
On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby wrote:
> On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby wrote:
> > > #2 0x0085ddff in errfinish (dummy=) at
> > > elog.c:555
> > > edata =
> >
> > If you have that co
On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby wrote:
> > #2 0x0085ddff in errfinish (dummy=) at
> > elog.c:555
> > edata =
>
> If you have that core, it might be interesting to go to frame 2 and
> print *edata or e
On Wed, Jul 24, 2019 at 10:03 AM Thomas Munro wrote:
> > edata =
> If you have that core, it might be interesting to go to frame 2 and
> print *edata or edata->saved_errno. ...
Rats. We already saw that it's optimised out so unless we can find
that somewhere else in a variable that's p
Thomas Munro writes:
> *I suspect that the only thing implicating parallelism in this failure
> is that parallel leaders happen to print out that message if the
> postmaster dies while they are waiting for workers; most other places
> (probably every other backend in your cluster) just quietly exi
On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby wrote:
> #2 0x0085ddff in errfinish (dummy=<optimized out>) at
> elog.c:555
> edata = <optimized out>
> elevel = 22
> oldcontext = 0x27e15d0
> econtext = 0x0
> __func__ = "errfinish"
> #3 0x006f7e94 in CheckPointReplication
On Wed, Jul 24, 2019 at 4:27 AM Justin Pryzby wrote:
> < 2019-07-23 10:33:51.552 CDT postgres >FATAL: postmaster exited during a
> parallel transaction
> < 2019-07-23 10:33:51.552 CDT postgres >STATEMENT: CREATE UNIQUE INDEX
> unused0_huawei_umts_nodeb_locell_201907_unique_idx ON
> child.unus
On Tue, Jul 23, 2019 at 01:28:47PM -0400, Tom Lane wrote:
> ... you'd think an OOM kill would show up in the kernel log.
> (Not necessarily in dmesg, though. Did you try syslog?)
Nothing in /var/log/messages (nor dmesg ring).
I enabled abrtd while trying to reproduce it last week. Since you ask
Justin Pryzby writes:
> Does anyone have a stress test for parallel workers ?
> On a customer's new VM, I got this several times while (trying to) migrate
> their DB:
> < 2019-07-23 10:33:51.552 CDT postgres >FATAL: postmaster exited during a
> parallel transaction
We've been seeing this irre