On 2017-09-19 18:06:29 -0700, Andres Freund wrote: > On 2017-09-19 16:46:58 -0400, Tom Lane wrote: > > Have we forgotten an fflush() or something? > > After hacking a fix for my previous theory, I started adding strace into > the mix, to verify this. Takes longer to reproduce, but after filtering > to -e trace=network, I got this: > > socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3 > connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 > ENOENT (No such file or directory) > socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3 > connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 > ENOENT (No such file or directory) > socket(AF_UNIX, SOCK_STREAM, 0) = 3 > connect(3, {sa_family=AF_UNIX, sun_path="/tmp/EDkYotgk3u/.s.PGSQL.57230"}, > 110) = 0 > getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 > getsockname(3, {sa_family=AF_UNIX}, [128->2]) = 0 > sendto(3, "\0\0\0O\0\3\0\0user\0andres\0database\0pos"..., 79, MSG_NOSIGNAL, > NULL, 0) = 79 > recvfrom(3, "R\0\0\0\10\0\0\0\0S\0\0\0,application_name\0t"..., 16384, 0, > NULL, NULL) = 340 > sendto(3, "Q\0\0\0\37SELECT $$psql-connected$$;\0", 32, MSG_NOSIGNAL, NULL, > 0) = 32 > recvfrom(3, > "T\0\0\0!\0\1?column?\0\0\0\0\0\0\0\0\0\0\31\377\377\377\377\377\377"..., > 16384, 0, NULL, NULL) = 79 > sendto(3, "Q\0\0\0\33SELECT pg_sleep(3600);\0", 28, MSG_NOSIGNAL, NULL, 0) = > 28 > recvfrom(3, 0x555817dae2a0, 16384, 0, NULL, NULL) = -1 ECONNRESET (Connection > reset by peer) > +++ exited with 2 +++ > > So indeed, we got a connreset before receiving the proper error > message. > > The corresponding server log (debug3): > 2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 730 > 2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 716 > 2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 715 > 2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 717 > 2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 718 > 2017-09-20 00:57:00.573 UTC [713] DEBUG: sending SIGQUIT to process 719 > 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl WARNING: > terminating connection because of crash of another server process > 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DETAIL: The > postmaster has commanded this server process to roll back the current t > ransaction and exit, because another server process exited abnormally and > possibly corrupted shared memory. > 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl HINT: In a moment > you should be able to reconnect to the database and repeat your c > ommand. > 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG: > shmem_exit(-1): 0 before_shmem_exit callbacks to make > 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG: > shmem_exit(-1): 0 on_shmem_exit callbacks to make > 2017-09-20 00:57:00.573 UTC [720] DEBUG: shmem_exit(-1): 0 before_shmem_exit > callbacks to make > 2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG: > proc_exit(-1): 0 callbacks to make > ... > 2017-09-20 00:57:00.577 UTC [713] DEBUG: server process (PID 730) exited > with exit code 2 > 2017-09-20 00:57:00.577 UTC [713] DETAIL: Failed process was running: SELECT > pg_sleep(3600); > 2017-09-20 00:57:00.577 UTC [713] LOG: all server processes terminated; > reinitializing > > So the server indeed was killed by SIGQUIT, not an escalation to > SIGKILL. And it output stuff to the server log, and didn't complain > about communication to the client... Odd.
Hah! The reason for that is that socket_flush tries to avoid doing stuff recursively: static int socket_flush(void) { int res; /* No-op if reentrant call */ if (PqCommBusy) return 0; ... (detected by putting an elog(COMMERROR) there) adding an abort shows the following backtrace: #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 #1 0x00007f089de4e42a in __GI_abort () at abort.c:89 #2 0x000056473218a3f6 in socket_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1408 #3 0x000056473246e4ec in send_message_to_frontend (edata=0x5647329e34e0 <errordata>) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:3319 #4 0x000056473246ad02 in EmitErrorReport () at /home/andres/src/postgresql/src/backend/utils/error/elog.c:1483 #5 0x00005647324680af in errfinish (dummy=0) at /home/andres/src/postgresql/src/backend/utils/error/elog.c:483 #6 0x00005647322fb340 in quickdie (postgres_signal_arg=3) at /home/andres/src/postgresql/src/backend/tcop/postgres.c:2608 #7 <signal handler called> #8 0x00007f089fbb65bd in __libc_send (fd=7, buf=0x564734a22c80, n=79, flags=0) at ../sysdeps/unix/sysv/linux/x86_64/send.c:26 #9 0x000056473217c874 in secure_raw_write (port=0x564734a1c730, ptr=0x564734a22c80, len=79) at /home/andres/src/postgresql/src/backend/libpq/be-secure.c:310 #10 0x000056473217c6fc in secure_write (port=0x564734a1c730, ptr=0x564734a22c80, len=79) at /home/andres/src/postgresql/src/backend/libpq/be-secure.c:256 #11 0x000056473218a47c in internal_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1437 #12 0x000056473218a40c in socket_flush () at /home/andres/src/postgresql/src/backend/libpq/pqcomm.c:1413 #13 0x00005647322f693d in ReadyForQuery (dest=DestRemote) at /home/andres/src/postgresql/src/backend/tcop/dest.c:259 #14 0x00005647322fd078 in PostgresMain (argc=1, argv=0x564734a24db8, dbname=0x564734a24c98 "postgres", username=0x5647349fbda8 "andres") at /home/andres/src/postgresql/src/backend/tcop/postgres.c:4047 #15 0x000056473225b54e in BackendRun (port=0x564734a1c730) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4364 #16 0x000056473225ac5e in BackendStartup (port=0x564734a1c730) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4036 #17 0x0000564732257007 in ServerLoop () at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1756 #18 0x0000564732256596 in PostmasterMain (argc=3, argv=0x5647349f9b70) at /home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1364 #19 0x0000564732190be8 in main (argc=3, argv=0x5647349f9b70) at /home/andres/src/postgresql/src/backend/main/main.c:228 I think this is actually not the $monitor backend but $killme, but I think it shows how this is possible. I don't think there's an easy way out of this - making all low-level client communcation properly reentrant sounds hard. What I would like to do is to add an log message somewhere around send_message_to_frontend() noticing that the message to the client could not be sent, to avoid future confusion like this. Does anybody have a better idea? Given this fact pattern, I'll allow the case without a received error message in the recovery test. Objections? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers