On Mon, Oct 19, 2015 at 06:45:32PM -0700, Eric Dumazet wrote:
> On Tue, 2015-10-20 at 02:12 +0100, Alan Burlison wrote:
>
> > Another problem is that if I call close() on a Linux socket that's in
> > accept() the accept call just sits there until there's an incoming
> > connection, which succeeds even though the socket is supposed to be
> > closed, but then an immediately following accept() on the same socket
> > fails.
>
> This is exactly what the comment I pasted documents.
>
> On Linux, doing close(listener) on one thread does _not_ wake up other
> threads doing accept(listener).
>
> So I guess allowing shutdown(listener) was a way to somehow propagate
> some info to the threads stuck in accept().
>
> This is a VFS issue, and a long-standing one.
>
> Think of all the cases like dup() and fd-passing games; the close(fd)
> being able to signal out-of-band info is racy.
>
> close() is literally removing one refcount on a file.
> Expecting it to do some kind of magical cleanup of a socket is not
> reasonable/practical.
>
> On a multi-threaded program, each thread doing an accept() increased
> the refcount on the file.

Refcount is an implementation detail, of course.  However, in any Unix
I know of, there are two separate notions - a descriptor losing its
connection to an opened file (be it from close(), exit(), execve(),
dup2(), etc.) and an opened file getting closed.  The latter cannot
happen while there are descriptors connected to the file in question,
of course.  However, that is not the only thing that might prevent an
opened file from getting closed - e.g. sending an SCM_RIGHTS datagram
with an attached descriptor connected to the opened file in question
*at* *the* *moment* *of* *sendmsg(2)* will carry said opened file until
it is successfully received or discarded (in the former case the
recipient will get a new descriptor referring to that opened file, of
course).  Having the original descriptor closed right after sendmsg(2)
does *not* do anything to the opened file.  On any Unix that implements
descriptor-passing.

There's going to be a notion of "last close"; that's what this refcount
is about, and _that_ is more than an implementation detail.
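To make the descriptor-passing point concrete, here is a minimal sketch
(mine, not from the thread; the helper name send_fd is illustrative and
error handling is trimmed) of sending a descriptor over an AF_UNIX
socket and closing the original right after sendmsg(2):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

static int send_fd(int via, int fd)
{
	char dummy = 'x';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {	/* ensures proper cmsg alignment */
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	if (sendmsg(via, &msg, 0) < 0)
		return -1;
	/*
	 * This closes the descriptor, not the opened file: the
	 * in-flight datagram keeps a reference to the opened file
	 * until it is received or discarded.
	 */
	close(fd);
	return 0;
}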
The real question is what kind of semantics one would want in the
following situations:

1)	// fd is a socket
	fcntl(fd, F_SETFD, FD_CLOEXEC);
	fork();
	in parent: accept(fd);
	in child: execve()

2)	// fd is a socket, 1 is /dev/null
	fork();
	in parent: accept(fd);
	in child: dup2(1, fd);

3)	// fd is a socket
	fd2 = dup(fd);
	in thread A: accept(fd);
	in thread B: close(fd);

4)	// fd is a socket, 1 is /dev/null
	fd2 = dup(fd);
	in thread A: accept(fd);
	in thread B: dup2(1, fd);

5)	// fd is a socket, 1 is /dev/null
	fd2 = dup(fd);
	in thread A: accept(fd);
	in thread B: close(fd2);

6)	// fd is a socket
	in thread A: accept(fd);
	in thread B: close(fd);

In other words, is it destruction of
	* any descriptor referring to this socket [utterly insane for
	  obvious reasons],
	* the last descriptor referring to this socket (modulo descriptor
	  passing, etc.) [a bitch to implement, unless we treat a syscall
	  in progress as keeping the opened file open], or
	* _the_ descriptor used to issue accept(2) [a bitch to implement,
	  with a lot of fun races in an already race-prone area]?

An additional question is whether it's
	* just a magical behaviour of close(2) [ugly], or
	* something that happens when a descriptor gets dissociated from
	  an opened file [obviously more consistent]?

BTW, for real fun, consider this:

7)	// fd is a socket
	fd2 = dup(fd);
	in thread A: accept(fd);
	in thread B: accept(fd);
	in thread C: accept(fd2);
	in thread D: close(fd);

Which threads (if any) should get hit where it hurts?

I honestly don't know what Solaris does; AFAICS, FreeBSD behaves like
Linux these days.  NetBSD plays really weird games in their fd_close();
what they are trying to achieve is at least sane - in (7) they'd hit A
and B with EBADF and C would restart and continue waiting, in (3,4,6)
A gets EBADF, and in (1,2,5) accept() is unaffected.  The problem is
that their solution is racy - they have a separate refcount on the
_descriptor_, plus a file method (->fo_restart) for triggering an
equivalent of a signal interrupting anything that might be blocked on
that sucker, with syscall restart (and subsequent EBADF on the attempt
to refetch the sucker).  It is racy if we reopen, or if we are doing
dup2() in the first place - these restarts might get the CPU just after
we return from dup2() and pick the *new* descriptor just fine.

It might be possible to fix their approach (having the

	if (__predict_false(ff->ff_file == NULL)) {
		/*
		 * Another user of the file is already closing, and is
		 * waiting for other users of the file to drain.  Release
		 * our reference, and wake up the closer.
		 */
		atomic_dec_uint(&ff->ff_refcnt);
		cv_broadcast(&ff->ff_closing);

path in fd_close() mark the thread as "don't bother restarting, just
bugger off" might be workable), but... it's still pretty costly.  They
pay with memory footprint (at least 32 bits per descriptor, and that's
leaving aside the fun issues with what to wait on), and the only thing
that might be saving them from cacheline ping-pong from hell is that
their struct fdfile is really fat - there's a lot more than just an
extra u32 in there.

I have no idea what semantics Solaris has in that area, or how racy
their descriptor-table handling is.  And no, I'm not going to RTFS
their kernel, CDDL being what it is.  I *do* know that Linux and all
the *BSD kernels had pretty severe races in that area.  Quite a few of
those, and a lot more painful than the one RTFS(NetBSD) seems to have
caught just now.  So I would seriously recommend that the folks who are
free to RTFS(Solaris) review that area.  Carefully.  There tend to be
dragons.

_IF_ somebody can come up with clean semantics and a tolerable approach
to implementing it, I'd be glad to see that.  What we do is "a syscall
in progress keeps the file it operates upon open, no matter what
happens to descriptors".  AFAICS, what NetBSD tries to implement is
also reasonably clean wrt semantics ("detaching an opened file from a
descriptor that is being operated upon by some syscalls triggers
restart or failure of all syscalls operating on the opened file in
question, and waits for them to bugger off"), but their implementation
appears to be both racy and far too heavyweight, with no obvious
solutions to the latter.

Come to think of it, restart-based solutions have an obvious problem -
if we were talking about a restart due to a signal, the userland code
could (and would have to) block those signals, just to avoid this kind
of issue with the wrong descriptor being picked on restart.  But there
is no way to block _that_, so if you have two descriptors referring to
the same socket and 4 threads doing

	A: sleeps in accept(fd1)
	B: sleeps in accept(fd2)
	C: close(fd1)
	D: (with all precautions re signals taken by the whole thing)
	   dup2(fd3, fd2)

you can end up with C coming first, kicking A and B (as operating on
that socket), with A legitimately failing and B going into a restart.
And losing the CPU to D, which does that dup2(), so when B regains the
CPU it's operating on a socket it never intended to touch.  So this
approach seems to be broken, no matter what...
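Coming back to the behaviour Alan described at the top of the thread, a
minimal sketch (assuming Linux semantics; the port number is arbitrary
and error checking is elided for brevity) showing that a thread blocked
in accept() is not woken by close() from a sibling thread, while
shutdown() does wake it with an error:

#include <errno.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int listener;

static void *acceptor(void *arg)
{
	int fd = accept(listener, NULL, NULL);

	printf("accept() returned %d (%s)\n",
	       fd, fd < 0 ? strerror(errno) : "success");
	return NULL;
}

int main(void)
{
	struct sockaddr_in sin = {
		.sin_family = AF_INET,
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
		.sin_port = htons(7777),	/* arbitrary free port */
	};
	pthread_t t;

	listener = socket(AF_INET, SOCK_STREAM, 0);
	bind(listener, (struct sockaddr *)&sin, sizeof(sin));
	listen(listener, 5);

	pthread_create(&t, NULL, acceptor, NULL);
	sleep(1);			/* let the acceptor block */

	/*
	 * close(listener) here would leave the acceptor stuck until a
	 * connection arrives; shutdown() wakes it up and the blocked
	 * accept() fails (with EINVAL on Linux).
	 */
	shutdown(listener, SHUT_RDWR);
	pthread_join(t, NULL);
	close(listener);
	return 0;
}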