On Thu, Sep 24, 2015 at 05:29:34PM -0700, Steven Schlansker wrote:
> Hello linux-kernel,
>
> I write to you on behalf of many developers at my company, who are having
> trouble with their applications endlessly locking up inside of libc code,
> with no hope of recovery.
>
> Currently it affects our Mono and Node processes mostly, and the symptoms
> are the same: user code invokes getaddrinfo, and libc attempts to determine
> whether IPv4 or IPv6 is appropriate by using the RTM_GETADDR netlink
> message.  The write into the netlink socket succeeds, and it immediately
> reads back the results ... and waits forever.  The read never returns.
> The stack looks like this:
>
> #0  0x00007fd7d8d214ad in recvmsg () at ../sysdeps/unix/syscall-template.S:81
> #1  0x00007fd7d8d3e44d in make_request (fd=fd@entry=13, pid=1)
>     at ../sysdeps/unix/sysv/linux/check_pf.c:177
> #2  0x00007fd7d8d3e9a4 in __check_pf (seen_ipv4=seen_ipv4@entry=0x7fd7d37fdd00,
>     seen_ipv6=seen_ipv6@entry=0x7fd7d37fdd10, in6ai=in6ai@entry=0x7fd7d37fdd40,
>     in6ailen=in6ailen@entry=0x7fd7d37fdd50)
>     at ../sysdeps/unix/sysv/linux/check_pf.c:341
> #3  0x00007fd7d8cf64e1 in __GI_getaddrinfo (name=0x31216e0
>     "mesos-slave4-prod-uswest2.otsql.opentable.com", service=0x0,
>     hints=0x31216b0, pai=0x31f09e8) at ../sysdeps/posix/getaddrinfo.c:2355
> #4  0x0000000000e101c8 in uv__getaddrinfo_work (w=0x31f09a0)
>     at ../deps/uv/src/unix/getaddrinfo.c:102
> #5  0x0000000000e09179 in worker (arg=<optimized out>)
>     at ../deps/uv/src/threadpool.c:91
> #6  0x0000000000e16eb1 in uv__thread_start (arg=<optimized out>)
>     at ../deps/uv/src/unix/thread.c:49
> #7  0x00007fd7d8ff3182 in start_thread (arg=0x7fd7d37fe700) at pthread_create.c:312
> #8  0x00007fd7d8d2047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>
> (libuv is part of Node and makes DNS lookups "asynchronous" by running them
> on a thread pool in the background.)
>
> The applications will run for hours or days successfully, until eventually
> hanging with no apparent pattern or cause.  And once this hang happens it
> hangs badly, because check_pf is holding a lock during the problematic
> recvmsg call.
>
> I raised this issue on the libc-help mailing list, but I'm hoping that lkml
> will have a higher number of people familiar with netlink who may be better
> able to offer advice.  The original thread is here:
> https://sourceware.org/ml/libc-help/2015-09/msg00014.html
>
> Looking at the getaddrinfo / check_pf source code:
> https://fossies.org/dox/glibc-2.22/sysdeps_2unix_2sysv_2linux_2check__pf_8c_source.html
>
> 146   if (TEMP_FAILURE_RETRY (__sendto (fd, (void *) &req, sizeof (req), 0,
> 147                                     (struct sockaddr *) &nladdr,
> 148                                     sizeof (nladdr))) < 0)
> 149     goto out_fail;
> 150
> 151   bool done = false;
> 152
> 153   bool seen_ipv4 = false;
> 154   bool seen_ipv6 = false;
> 155
> 156   do
> 157     {
> 158       struct msghdr msg =
> 159         {
> 160           (void *) &nladdr, sizeof (nladdr),
> 161           &iov, 1,
> 162           NULL, 0,
> 163           0
> 164         };
> 165
> 166       ssize_t read_len = TEMP_FAILURE_RETRY (__recvmsg (fd, &msg, 0));
> 167       if (read_len <= 0)
> 168         goto out_fail;
> 169
> 170       if (msg.msg_flags & MSG_TRUNC)
> 171         goto out_fail;
> 172
>
> I notice that there is a possibility that if messages are dropped on either
> the send or the receive side, this code may hang forever.  The netlink(7)
> man page makes me slightly worried:
>
> > Netlink is not a reliable protocol.  It tries its best to deliver a
> > message to its destination(s), but may drop messages when an
> > out-of-memory condition or other error occurs.
> >
> > However, reliable transmissions from kernel to user are impossible in any
> > case.  The kernel can't send a netlink message if the socket buffer is
> > full: the message will be dropped and the kernel and the user-space
> > process will no longer have the same view of kernel state.  It is up to
> > the application to detect when this happens (via the ENOBUFS error
> > returned by recvmsg(2)) and resynchronize.
>
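For what it's worth, a self-contained sketch of the kind of resynchronization
the man page is describing might look like the following: redo the whole
RTM_GETADDR dump (here simply on a fresh socket) whenever recvmsg reports
ENOBUFS, rather than waiting in recvmsg until NLMSG_DONE shows up.  The
buffer size, the fixed sequence number, and the restart-on-a-new-socket
policy are arbitrary choices for illustration; this is not what glibc does.

/* Sketch only: an RTM_GETADDR dump that restarts itself if the kernel
   reports dropped replies with ENOBUFS, instead of blocking in recvmsg
   forever.  Error handling is kept to a minimum. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/if_addr.h>

int main (void)
{
  bool done = false;

  while (!done)   /* after an ENOBUFS overrun, start over on a fresh socket */
    {
      int fd = socket (AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
      if (fd < 0)
        { perror ("socket"); return 1; }

      struct sockaddr_nl nladdr = { .nl_family = AF_NETLINK };

      struct { struct nlmsghdr nlh; struct rtgenmsg g; } req;
      memset (&req, 0, sizeof (req));
      req.nlh.nlmsg_len = sizeof (req);
      req.nlh.nlmsg_type = RTM_GETADDR;
      req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
      req.nlh.nlmsg_seq = 1;                 /* arbitrary for this sketch */
      req.g.rtgen_family = AF_UNSPEC;

      if (sendto (fd, &req, sizeof (req), 0,
                  (struct sockaddr *) &nladdr, sizeof (nladdr)) < 0)
        { perror ("sendto"); close (fd); return 1; }

      while (!done)
        {
          char buf[8192];                    /* arbitrary reply buffer size */
          struct iovec iov = { buf, sizeof (buf) };
          struct msghdr msg = { .msg_name = &nladdr, .msg_namelen = sizeof (nladdr),
                                .msg_iov = &iov, .msg_iovlen = 1 };

          ssize_t len = recvmsg (fd, &msg, 0);
          if (len < 0 && errno == ENOBUFS)
            break;                           /* replies were dropped: redo the dump */
          if (len <= 0 || (msg.msg_flags & MSG_TRUNC))
            { fprintf (stderr, "recvmsg failed or truncated\n"); close (fd); return 1; }

          for (struct nlmsghdr *nlh = (struct nlmsghdr *) buf;
               NLMSG_OK (nlh, len);
               nlh = NLMSG_NEXT (nlh, len))
            {
              if (nlh->nlmsg_type == NLMSG_DONE)
                done = true;                 /* dump completed without an overrun */
              else if (nlh->nlmsg_type == RTM_NEWADDR)
                printf ("address, family %d\n",
                        ((struct ifaddrmsg *) NLMSG_DATA (nlh))->ifa_family);
            }
        }

      close (fd);
    }

  return 0;
}

That only helps when the kernel actually reports the overrun, of course; it
doesn't explain a recvmsg that never returns at all, which is what the
backtrace above shows.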
> I have taken the glibc code and created a simple(r) program to attempt to
> reproduce this issue.  I inserted some simple polling between the sendto
> and recvmsg calls to make the failure case more evident:
>
>     struct pollfd pfd;
>     pfd.fd = fd;
>     pfd.events = POLLIN;
>     pfd.revents = 0;
>
>     int pollresult = poll(&pfd, 1, 1000);
>     if (pollresult < 0) {
>         perror("glibc: check_pf: poll");
>         abort();
>     } else if (pollresult == 0 || (pfd.revents & POLLIN) == 0) {
>         fprintf(stderr, "[%ld] glibc: check_pf: netlink socket read timeout\n",
>                 gettid());
>         abort();
>     }
>
> I have placed the full source code and strace output here:
> https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473
>
> The process quickly spawns hundreds of threads, which sit in a loop
> attempting this RTM_GETADDR message exchange.
>
> The code may be compiled with "gcc -o pf_dump -pthread pf_dump.c"
>
> An example invocation that quickly fails:
>
> root@24bf2e440b5e:/# strace -ff -o pfd ./pf_dump
> [3700] exit success
> glibc: check_pf: netlink socket read timeout
> Aborted (core dumped)
>
> Interestingly, this seems to be very easy to reproduce using pthreads, but
> much less common with fork() or clone()d threads.  I'm not sure if this is
> just an artifact of how I am testing or an actual clue, but I figured I'd
> mention it.
>
> I have tested this program on vanilla kernels 4.0.4 and 4.2.1 -- the 4.0.4
> version reliably crashes, but I am having trouble reproducing on 4.2.1.
>
> So usually I would upgrade to 4.2.1 and be happy, except we ran into
> serious problems with 4.1.2 and are now a little shy about upgrading:
>
> https://bugzilla.xamarin.com/show_bug.cgi?id=29212
>
> So my questions from here are:
>
> * Is this glibc code correct?
> * What are the situations where a recvmsg from a netlink socket can hang
>   as it does here?
> * Is the potential "fix" in 4.2.1 due to any particular commit?  I checked
>   the changelogs and nothing caught my eye.
>
> We'll be testing out 4.2.1 more thoroughly over the coming days but I am
> hoping someone here can shed some light on our problem.
>
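If you end up carrying a workaround in your own netlink code rather than
waiting on a glibc or kernel fix, another option besides the poll above is a
receive timeout on the netlink socket, so a lost reply surfaces as
EAGAIN/EWOULDBLOCK instead of blocking in recvmsg indefinitely.  A minimal
sketch, with an arbitrary five-second timeout and a made-up helper name:

/* Sketch: give the netlink socket a receive timeout so a blocking recvmsg
   fails with EAGAIN/EWOULDBLOCK after a while instead of hanging forever.
   fd is assumed to be the AF_NETLINK socket used for the RTM_GETADDR dump. */
#include <sys/socket.h>
#include <sys/time.h>

static int set_netlink_recv_timeout (int fd)
{
  struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };   /* arbitrary 5s */
  return setsockopt (fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof (tv));
}

Either way this only bounds the wait; it doesn't answer why the reply never
arrives in the first place.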
http://comments.gmane.org/gmane.linux.network/363085
might explain your problem.  I thought this was resolved in 4.1, but it looks
like the problem still persists there.  At least I have reports from my
workplace that 4.1.6 and 4.1.7 are still affected.  I don't know if there
have been any relevant changes in 4.2.

Copying Herbert and Eric for additional input.

Guenter