(2018/11/08 10:50), Thomas Munro wrote:
On Wed, Nov 7, 2018 at 11:03 PM Etsuro Fujita
<fujita.ets...@lab.ntt.co.jp> wrote:
(2018/11/07 9:22), Kyotaro HORIGUCHI wrote:
At Tue, 06 Nov 2018 21:07:37 +0900, Etsuro Fujita <fujita.ets...@lab.ntt.co.jp>
wrote in <5be18409.2070...@lab.ntt.co.jp>
(2018/11/06 12:53), Kyotaro HORIGUCHI wrote:
even if it comes from another pipe, we can assume that the
SIGPIPE immediately stopped the program before returning any
garbage to us.
Sorry, I don't understand this.
Mmm. It looks confusing, sorry. In other words:
We don't care about the reason for the program's ending if we received
enough data from the program and we actively (= before receiving
EOF) stopped reading. On the other hand, if SIGPIPE (on another pipe)
or something fatal happened in the program before we finished reading,
we receive EOF just after that, and we are interested in the cause
of the program's death.
# Does the above make sense?
Yeah, thanks for the explanation!
+1
I take back what I said earlier about false positives from other
pipes. I think it's only traditional Unix programs designed for use
in pipelines and naive programs that let SIGPIPE terminate the
process. The accepted answer here gives a good way to think about
it:
https://stackoverflow.com/questions/8369506/why-does-sigpipe-exist
Thanks for the information!
A program sophisticated enough to be writing to other pipes is no
longer in that category and should be setting up signal dispositions
itself, so I agree that we should enable the default disposition and
ignore WTERMSIG(exit_code) == SIGPIPE, as proposed. That is pretty
close to the intended purpose of that signal AFAICS.
Great!
In the sense of "we don't care about the reason", the negligible reasons
are not necessarily restricted to SIGPIPE; even SIGSEGV could
theoretically be ignored safely. "Theoretically" here means it is
another issue whether we rely on the output of a program that
caused a SEGV (or died for any reason other than the SIGPIPE that we caused).
For the SIGSEGV case, I think it would be better not to rely on
the output data, because the program might have generated that data
incorrectly/unexpectedly.
+1
I don't think we should ignore termination by signals other than
SIGPIPE: that could hide serious problems from users. I want to know
if my program is crashing with SIGBUS, SIGTERM, SIGFPE etc, even if it
happens after we read enough data; there is a major problem that a
human needs to investigate!
I think so too.
Finally, in the attached PoC, I set cstate->eof_reached on failure
of NextCopyFromRawFields; then, if eof_reached, we don't ignore
SIGPIPE. For a program like
"puts("test1"); fflush(stdout); kill(getpid(), SIGPIPE);", I got
the following behavior. (I got an error without the fflush(), but
that is inevitable.)
=# select * from ft2 limit 1;
a
-------
test1
=# select * from ft2 limit 2;
ERROR: program "/home/horiguti/work/exprog" failed
DETAIL: child process was terminated by signal 13: Broken pipe
For the original case:
=# select * from ft1 limit 1;
a
------
test
=# select * from ft1 limit 2;
a
------
test
(1 row)
I haven't confirmed whether it is complete.
Sorry, I don't understand this fully, but the reason to add the
eof_reached stuff is to avoid ignoring the SIGPIPE failure in normal
COPY FROM PROGRAM cases?
Yes, if we have received EOF, the program ended before we
finished reading from the pipe, so any error should be
reported. Otherwise we ignore at least SIGPIPE. In the
attached patch, all kinds of errors are ignored when the pipe is
closed from the reading end by the backend.
OK, I understand your idea. Thanks for the patch!
+1
As a result, it doesn't report an error for SELECT * FROM ft2
LIMIT 1 on "main(void){puts("test1"); return 1;}".
=# select * from ft limit 1;
a
-------
test1
(1 row)
limit 2 reports the error.
=# select * from ft limit 2;
ERROR: program "/home/horiguti/work/exprog" failed
DETAIL: child process exited with exit code 1
I think this would be contrary to users' expectations: if the SELECT
command works for LIMIT 1, they would expect it to work for LIMIT 2 as
well. So I think it would be better to error out for LIMIT 1 as
well, as-is.
I think it's correct that LIMIT 1 gives no error but LIMIT 2 gives an
error. For LIMIT 1, we got all the rows we wanted, and then we closed
the pipe. If we got a non-zero non-signal exit code, or a signal exit
code and it was SIGPIPE (not any other signal!), then we should
consider that to be expected.
Maybe I'm missing something, but a non-zero non-signal exit code means
that there was something wrong with the called program, so I think a
human had better investigate that as well, though it would probably be a
minor problem. Too restrictive?
I tried to think of a scenario where the already-received output is
truly invalidated by a later error that happens after we close the
pipe. It could be something involving a program that uses something
optimistic like serializable snapshot isolation that can later decide
that whatever it told you earlier is not valid after all. Suppose the
program is clever enough to expect EPIPE and not consider that to be
an error, but wants to tell us about serialization failure with a
non-zero exit code. To handle that, you'd need a way to provide an
option to file_fdw to tell it not to ignore non-zero exit codes after
close. This seems so exotic and contrived that it's not worth
bothering with for now, but could always be added.
Interesting! I agree that such an option could add more flexibility for
handling the non-zero-exit-code case.
BTW just for curiosity:
perl -e 'for (my $i=0; $i< 1000000; $i++) { print "$i\n"; }' | head -5
Exit code: terminated by SIGPIPE, like seq
Good to know!
ruby -e 'for i in 1..1000000 do puts i; end' | head -5
Exit code: 1, like Python
Sad. Anyway, thanks a lot for these experiments in addition to the
previous ones!
On Wed, Nov 7, 2018 at 4:44 PM Etsuro Fujita
<fujita.ets...@lab.ntt.co.jp> wrote:
(2018/11/06 19:50), Thomas Munro wrote:
On my FreeBSD system, I compared the output of procstat -i (= show
signal disposition) for two "sleep 60" processes, one invoked from the
shell and the other from COPY ... FROM PROGRAM. The differences were:
PIPE, TTIN, TTOU and USR2. For the first and last of those, the
default action would be to terminate the process, but the COPY PROGRAM
child ignored them; for TTIN and TTOU, the default action would be to
stop the process, but again they are ignored. Why do bgwriter.c,
startup.c, ... set SIGTTIN and SIGTTOU back to SIG_DFL, but not
regular backends?
So, should we revert SIGUSR2 to default processing as well?
I don't think it matters in practice, but it might be nice to restore
that just for consistency.
Agreed.
I'm not sure what to think about the TTIN,
TTOU stuff; I don't understand job control well right now but I don't
think it really applies to programs run by a PostgreSQL backend, so if
we restore those it'd probably again be only for consistency. Then
again, there may be a reason someone decided to ignore those in the
postmaster + regular backends but not the various auxiliary processes.
Anyone?
I don't have any idea about that.
Best regards,
Etsuro Fujita