On Wed, May 23, 2012 at 2:21 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I wrote:
>> Jeff Janes <jeff.ja...@gmail.com> writes:
>>> But what happens if the SIGQUIT is blocked before the system(3) is
>>> invoked?  Does the ignore take precedence over the block, or does the
>>> block take precedence over the ignore, and so the signal is still
>>> waiting once the block is reversed after the system(3) is over?  I
>>> could write a test program to see, but that wouldn't be very good
>>> evidence of the portability.
>
>> AFAICT from the POSIX spec for system(3), that would be a bug in
>> system().
>
> Actually, on further thought, it seems like there is *necessarily* a
> race condition in this.  There must be some interval where the child
> process has already exited but the waiting parent hasn't de-ignored the
> signals.

Yup.  The POSIX man page doesn't say that this is guaranteed to be
atomic, so I assume no such guarantee is even attempted.  For that
matter, the man page for system(3) doesn't even say how one goes about
signaling the child: unless the signal is sent to the negation of the
group leader's PID (that is, to the whole process group), the child
won't receive it.  (Postgres handles the group-leader thing correctly;
I just point it out to show that the man page is not intended to be
complete.)

> However, I remain unsatisfied with this idea as an explanation for the
> behavior you're seeing.  In the first place, that race condition window
> ought not be wide enough to allow failure probabilities as high as 10%.
> In the second place, that code has been like that for a long while,
> so this theory absolutely does not explain why you're seeing a
> materially higher probability of failure in HEAD than 9.1.  There is
> something else going on.

After a while trying to bisect the behavior, I decided it was a mug's
game.  Both arms of the race (the firing of archive_command and the
engineered crash) are triggered indirectly by the same event, the start
of a checkpoint.  Small changes in the code can lead to small changes in
the timing, which make drastic changes in how likely it is that the two
arms collide exactly at the vulnerability.

So my test harness is an inexplicably effective showcase for the
vulnerability, but it is not the reason the vulnerability should be
fixed.

By the way, my archive_command is very fast, as all it does is echo the
date into a log file.  I want postgres to think it is in archive mode,
but for this purpose I don't want to actually deal with having an
archive.

Cheers,

Jeff
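
For reference, a test program of the kind mentioned in the quoted question (block versus ignore) could look roughly like the sketch below. It is written in Python rather than C for brevity, and the function name is made up for illustration. It blocks SIGQUIT, sets it to be ignored (as system(3) does while it waits for the child), raises it, and then checks whether it is left pending. As far as I can tell POSIX leaves that outcome unspecified, so a result from one platform says nothing about another, which is exactly the portability caveat raised above.

```python
import os
import signal


def ignored_signal_stays_pending(sig=signal.SIGQUIT):
    """Report whether `sig`, raised while blocked AND ignored, is left pending.

    Hypothetical helper for illustration only. Must run in the main thread,
    because signal.signal() requires it.
    """
    # Block the signal first, as a caller might do before invoking system(3).
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {sig})
    # Then ignore it, as system(3) does for SIGINT and SIGQUIT while waiting.
    old_handler = signal.signal(sig, signal.SIG_IGN)
    try:
        # Send the signal to our own process while it is blocked and ignored.
        os.kill(os.getpid(), sig)
        # True: the block "won" and the signal is queued.
        # False: the ignore "won" and the signal was discarded on generation.
        return sig in signal.sigpending()
    finally:
        # Setting SIG_IGN on a pending signal discards it, so nothing can
        # fire once the old disposition and mask are restored.
        signal.signal(sig, signal.SIG_IGN)
        if old_handler is not None:
            signal.signal(sig, old_handler)
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


if __name__ == "__main__":
    print("signal left pending:", ignored_signal_stays_pending())
```

If this prints True on a given system, a SIGQUIT that arrives during such a window would still be waiting when the caller later unblocks it. The cleanup step deliberately throws the signal away so that the sketch itself never dies from the SIGQUIT it sent.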