wie...@porcupine.org (Wietse Venema) wrote:
>
> No, Postfix read_wait() uses poll() unconditionally. If Solaris
> has an edge-triggered implementation (i.e. no notification when
> poll() is called after the connection is already closed), that
> sucks.
>
> Presumably there are other ways a way to determine that write will
> not block on Solaris, or to find out that the peer has closed the
> connection without doing I/O on it. Your call. Solaris is no longer
> a primary test platform for me.

I thought I would poke at it with a stick and see what would happen.

First attempt: I changed peekfd() to detect ECONNRESET and, if so, call
close(fd). That way the later poll() would definitely fail.
Unfortunately, it fails rather hard:

Dec 17 13:22:53 smtp01.unix postfix/smtpd[5474]: [ID 197553 mail.info] connect from l4.unix[172.20.12.2]
Dec 17 13:22:53 smtp01.unix postfix/smtpd[5474]: [ID 947731 mail.warning] warning: Early connection lost detected: 12
Dec 17 13:22:53 smtp01.unix postfix/smtpd[5474]: [ID 197553 mail.info] disconnect from l4.unix[172.20.12.2]
Dec 17 13:22:53 smtp01.unix postfix/smtpd[5474]: [ID 947731 mail.crit] fatal: poll: Connection reset by peer
Dec 17 13:22:54 smtp01.unix postfix/master[5459]: [ID 947731 mail.warning] warning: process /usr/libexec/postfix/smtpd pid 5471 exit status 1

I would guess that is because poll checks for POLLNVAL, and if so calls
msg_fatal(). I can't think of any other, softer, way to mark the fd so
that a future poll() fails. (Set it read-only? Or mark it in some
Postfix internal structure to signal failure? I do not know Postfix
well enough.)

So then I tried to fix up write_wait.c to detect the problem before we
call poll():

    if ((peekfd(fd) < 0) && (errno == ECONNRESET)) {
        msg_warn("write_wait() connection reset %d", fd);
        return 0;
    }

    pollfd.fd = fd;
    pollfd.events = POLLOUT;
    for (;;) {
        switch (poll(&pollfd, 1, timeout < 0 ?
                     WAIT_FOR_EVENT : timeout * 1000)) {

To my surprise, I see this behaviour on Solaris:

13724:  0.0000 ioctl(11, FIONREAD, 0x08047614)  Err#131 ECONNRESET
13724:  0.0000 time()  =
13724:  0.0000 ioctl(11, FIONREAD, 0x08047804)  = 0

Only the first call to FIONREAD gets the error?! From then on, it just
signals 0 bytes. (Presumably this is what you meant by edge-triggered.)

So, let's detect it with a zero-length read():

    char buffer[1];
    if ((read(fd, buffer, 0) < 0) && (errno == ECONNRESET)) {
        msg_warn("write_wait() connection reset %d", fd);
        return 0;
    }

Alas, we get the same situation!

Only the first read() gets the error; after that, read() returns 0.

However, I found that this works:

    char buffer[1];
    if ((write(fd, buffer, 0) < 0) && (errno == EPIPE)) {
        msg_warn("write_wait() connection reset %d", fd);
        return 0;
    }

    pollfd.fd = fd;
    pollfd.events = POLLOUT;
    for (;;) {
        switch (poll(&pollfd, 1, timeout < 0 ?
                     WAIT_FOR_EVENT : timeout * 1000)) {

Which truss shows as:

22179:  0.0000 read(11, " Q U I T\r\n", 4096)  = 6
22179:  0.0001 ioctl(11, FIONREAD, 0x08047614)  Err#131 ECONNRESET
22179:  0.0000 write(11, 0x0804782F, 0)  Err#32 EPIPE
22179:  0.0001 Received signal #13, SIGPIPE [ignored]
22179:  0.0000 write(11, " 2 2 1 2 . 0 . 0 B y".., 15)  Err#32 EPIPE
22179:  0.0000 Received signal #13, SIGPIPE [ignored]
22179:  0.0001 close(11)  = 0

It is interesting to note that errno changed to EPIPE when calling
write() instead of read(). Is a write( ,0) always non-blocking? That
probably needs answering.

Running with the L4 health checks for 20 minutes and:

    # ps -edaf| grep smtpd | wc -l
    2

Everything "appears to work", but I have not put it in production.

I do wish there was a nicer way to detect the problem than to use a
zero-write. Possibly signal the problem in peekfd(), and act on it in
write_wait().

-- 
Jorgen Lundman       | <lund...@lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500 (cell)
Japan                | +81 (0)3 -3375-1767 (home)
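
P.S. On the last idea above (signal the problem in peekfd(), act on it
in write_wait()): since Solaris reports the reset only once, whoever
sees it first would have to remember it. Below is a rough sketch of
such a latch. None of this is Postfix code: reset_note(), reset_seen(),
reset_clear() and RESET_MAX_FD are names made up for illustration, and
the fixed-size bitmap is an assumption to keep the example short.

```c
#include <limits.h>

/*
 * Hypothetical helper, NOT part of Postfix: remembers descriptors on
 * which a one-shot ECONNRESET has already been observed.
 */
#define RESET_MAX_FD 1024       /* assumed upper bound for this sketch */

static unsigned char reset_bits[(RESET_MAX_FD + CHAR_BIT - 1) / CHAR_BIT];

static int reset_in_range(int fd)
{
    return (fd >= 0 && fd < RESET_MAX_FD);
}

/* Record the reset where it is first seen, e.g. in peekfd(). */
void    reset_note(int fd)
{
    if (reset_in_range(fd))
        reset_bits[fd / CHAR_BIT] |= (unsigned char) (1u << (fd % CHAR_BIT));
}

/* Query the mark before polling, e.g. at the top of write_wait(). */
int     reset_seen(int fd)
{
    if (!reset_in_range(fd))
        return (0);
    return ((reset_bits[fd / CHAR_BIT] >> (fd % CHAR_BIT)) & 1);
}

/* Forget the mark when fd is closed; descriptor numbers get reused. */
void    reset_clear(int fd)
{
    if (reset_in_range(fd))
        reset_bits[fd / CHAR_BIT] &= (unsigned char) ~(1u << (fd % CHAR_BIT));
}
```

peekfd() would call reset_note(fd) when it gets ECONNRESET, and
write_wait() would test reset_seen(fd) before poll() and return early,
in the same place as the zero-write test. The mark has to be cleared
wherever the descriptor is closed, otherwise the next connection that
gets the same fd number would look dead. Whether this fits Postfix's
structure better than the zero-write I cannot say.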