Jeff Janes writes:
> So my test harness is an inexplicably effective show-case for the
> vulnerability, but it is not the reason the vulnerability should be
> fixed.
I spent a bit of time looking into this. In principle the postmaster
could be fixed to repeat the SIGQUIT signal every second or s
Jeff Janes writes:
> On Wed, May 23, 2012 at 2:21 PM, Tom Lane wrote:
>> However, I remain unsatisfied with this idea as an explanation for the
>> behavior you're seeing. In the first place, that race condition window
>> ought not be wide enough to allow failure probabilities as high as 10%.
>>
On Wed, May 23, 2012 at 2:21 PM, Tom Lane wrote:
> I wrote:
>> Jeff Janes writes:
>>> But what happens if the SIGQUIT is blocked before the system(3) is
>>> invoked? Does the ignore take precedence over the block, or does the
>>> block take precedence over the ignore, and so the signal is still
I wrote:
> Jeff Janes writes:
>> But what happens if the SIGQUIT is blocked before the system(3) is
>> invoked? Does the ignore take precedence over the block, or does the
>> block take precedence over the ignore, and so the signal is still
>> waiting once the block is reversed after the system(3
Jeff Janes writes:
> On Wed, May 23, 2012 at 1:10 PM, Tom Lane wrote:
>> On my machine, man system(3) saith:
>>
>> system() ignores the SIGINT and SIGQUIT signals, and blocks the
>> SIGCHLD signal, while waiting for the command to terminate. If this
>> might cause the application to
On Wed, May 23, 2012 at 1:10 PM, Tom Lane wrote:
> Jeff Janes writes:
>> It looks to me like the SIGQUIT from the postmaster is simply getting
>> lost. And from what little I understand of signal handling, this is a
>> known race with system(3). The archive_command, child of archiver,
>> exits
I wrote:
> On my machine, man system(3) saith:
> system() ignores the SIGINT and SIGQUIT signals, and blocks the
> SIGCHLD signal, while waiting for the command to terminate. If this
> might cause the application to miss a signal that would have killed
> it, the application sh
Jeff Janes writes:
> It looks to me like the SIGQUIT from the postmaster is simply getting
> lost. And from what little I understand of signal handling, this is a
> known race with system(3). The archive_command, child of archiver,
> exits before it can receive the signal sent to the entire arch
On Mon, May 21, 2012 at 9:22 AM, Fujii Masao wrote:
> On Sat, May 19, 2012 at 1:23 AM, Jeff Janes wrote:
>> I've been testing the crash recovery of REL9_2_BETA1, using the same
>> method I posted in the "Scaling XLog insertion" thread. I have the
>> checkpointer occasionally throw a FATAL error,
On Thu, May 24, 2012 at 1:26 AM, Tom Lane wrote:
> Peter Eisentraut writes:
>> On mån, 2012-05-21 at 13:14 -0400, Tom Lane wrote:
>>> ... wait, scratch that. AFAICS, that commit was totally useless,
>>> because BlockSig should always already contain SIGQUIT.
>
>> No, because PostgresMain() delet
Peter Eisentraut writes:
> On mån, 2012-05-21 at 13:14 -0400, Tom Lane wrote:
>> ... wait, scratch that. AFAICS, that commit was totally useless,
>> because BlockSig should always already contain SIGQUIT.
> No, because PostgresMain() deletes it from BlockSig.
Ah. So potentially we have an iss
On mån, 2012-05-21 at 13:14 -0400, Tom Lane wrote:
> > ... but having said that, I see Peter's commit
> > d6de43099ac0bddb4b1da40088487616da892164 only touched postgres.c's
> > quickdie(), and not all the *other* background processes with
> > identical coding. That seems a clear oversight, so I
On mån, 2012-05-21 at 12:52 -0400, Tom Lane wrote:
> I see Peter's commit d6de43099ac0bddb4b1da40088487616da892164 only
> touched postgres.c's quickdie(), and not all the *other* background
> processes with identical coding. That seems a clear oversight, so I
> will go fix it.
None[*] of the othe
I wrote:
> ... but having said that, I see Peter's commit
> d6de43099ac0bddb4b1da40088487616da892164 only touched postgres.c's
> quickdie(), and not all the *other* background processes with identical
> coding. That seems a clear oversight, so I will go fix it. Doesn't
> explain why the archiver
I wrote:
> Fujii Masao writes:
>> You might have gotten the following problem which was discussed before.
>> This problem was fixed in SIGQUIT signal handler of a backend, but ISTM
>> not that of an archiver.
>> http://archives.postgresql.org/pgsql-admin/2009-11/msg00088.php
> pgarch.c's SIGQUIT
Fujii Masao writes:
> You might have gotten the following problem which was discussed before.
> This problem was fixed in SIGQUIT signal handler of a backend, but ISTM
> not that of an archiver.
> http://archives.postgresql.org/pgsql-admin/2009-11/msg00088.php
pgarch.c's SIGQUIT handler just does
Jeff Janes writes:
> ... sometimes the automatic recovery never initiates. It looks
> like the postmaster is waiting for the archiver to exit before it
> starts recovery, and the archiver is waiting for something, I don't
> really know what.
Can you try poking into the archiver's state with gdb?
On Sat, May 19, 2012 at 1:23 AM, Jeff Janes wrote:
> I've been testing the crash recovery of REL9_2_BETA1, using the same
> method I posted in the "Scaling XLog insertion" thread. I have the
> checkpointer occasionally throw a FATAL error,
We should also fix this problem? If yes, could you show
I've been testing the crash recovery of REL9_2_BETA1, using the same
method I posted in the "Scaling XLog insertion" thread. I have the
checkpointer occasionally throw a FATAL error, which causes the
postmaster to take down all of the other processes (DETAIL: The
postmaster has commanded this ser
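
A side note on the question raised in the thread ("Does the ignore take precedence over the block?"): POSIX specifies that setting a signal's action to SIG_IGN while that signal is pending discards the pending instance, whether or not the signal is blocked. The sketch below is only an illustration of that rule, not PostgreSQL code and not a reproduction of the archiver hang. The function name `pending_after_ignore` is made up for this example, and it assumes a POSIX system and that it is called from the main thread of the Python interpreter.

```python
import signal
import threading


def pending_after_ignore(signum=signal.SIGQUIT):
    """Block signum, send it to this thread, then switch it to SIG_IGN.

    Returns (pending_before, pending_after): whether signum was pending
    just before and just after the disposition was set to "ignore".
    Hypothetical helper for illustration only.
    """
    old_handler = signal.getsignal(signum)
    if old_handler is None:
        # Handler was not installed from Python; fall back to the default.
        old_handler = signal.SIG_DFL

    # Block the signal so that sending it leaves it pending, not delivered.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signum})
    try:
        # Direct the signal at the calling thread, which has it blocked.
        signal.pthread_kill(threading.get_ident(), signum)
        before = signum in signal.sigpending()

        # Ignoring a pending signal discards it under POSIX rules.
        signal.signal(signum, signal.SIG_IGN)
        after = signum in signal.sigpending()
    finally:
        # Unblock while the signal is still ignored, then restore the handler,
        # so a leftover pending signal could never hit the default action.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        signal.signal(signum, old_handler)
    return before, after


if __name__ == "__main__":
    print(pending_after_ignore())
```

On a system that follows the POSIX rule, the signal should be reported as pending before the switch to SIG_IGN and as no longer pending afterwards, so nothing is left "waiting once the block is reversed".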