** Description changed: + Libpcap goes into an infinite spin, using 100% of CPU when the interface it is watching is brought down. + The issue was fixed upstream as part of version 1.1.0 (now 1.1.1), which will be in Maverick as soon as Debian imports begin. + The attached patch, while developed independently, is effectively identical to the fix implemented upstream. + + TEST CASE: + - run tcpdump on a network interface (ex: sudo tcpdump -i eth0) + - bring down the interface (ex: sudo ifconfig eth0 down) + - watch as tcpdump uses 100% cpu + + Original Report: Binary package hint: libpcap0.8 Programs using libpcap0.8, like tcpdump, show the behavior that they consume 100% CPU power instead of error-handling / termination. It's a problem in libpcap. I traced it down and present a fix here. I wrote to ubuntu-devel-disc...@lists.ubuntu.com (https://lists.ubuntu.com/archives/ubuntu-devel-discuss/2009-August/009194.html) and the maintainer, Romain Francoise, on Aug 7. But there was no response. This fix goes belong with my report https://bugs.launchpad.net/ubuntu/+source/tcpdump/+bug/369288 (concerning tcpdump), which I have closed because I found it's a libpcap problem. Perhaps it may be needed to check why it occured: has libpcap recently changed in the poll() handling - or did the linux kernel counterpart change? test screnario: r...@tomate:/tmp# openvpn --dev tun & r...@tomate:/tmp# ifconfig tun0 up & r...@tomate:/tmp# tcpdump -ni tun0 & - strace -tt -p 12238 >/tmp/a 2>&1 10:14:39.250055 poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout) 10:14:40.251505 poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout) 10:14:41.252910 poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout) 10:14:42.254355 poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout) 10:14:43.255764 poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout) 10:14:44.257102 poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout) !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! here i terminated openvpn; dev tun0 closes !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 10:14:45.258137 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.392988 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.393089 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.393182 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.393257 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.393361 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.393436 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.393520 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.393621 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.393694 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.393851 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.393924 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.394020 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.394093 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.394165 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:45.394260 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) [..] 10:14:58.921514 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) 10:14:58.921558 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) - tho...@tomate:~$ grep ^10:14:45 /tmp/a|wc - 12508 112572 1088196 + 12508 112572 1088196 tho...@tomate:~$ grep ^10:14:46 /tmp/a|wc - 22874 205866 1990038 + 22874 205866 1990038 tho...@tomate:~$ grep ^10:14:47 /tmp/a|wc - 22488 202392 1956456 + 22488 202392 1956456 tho...@tomate:~$ grep ^10:14:48 /tmp/a|wc - 22853 205677 1988211 + 22853 205677 1988211 tho...@tomate:~$ grep ^10:14:49 /tmp/a|wc - 22924 206316 1994388 + 22924 206316 1994388 tho...@tomate:~$ grep ^10:14:50 /tmp/a|wc - 22975 206775 1998825 + 22975 206775 1998825 tho...@tomate:~$ grep ^10:14:51 /tmp/a|wc - 22130 199170 1925310 + 22130 199170 1925310 tho...@tomate:~$ grep ^10:14:52 /tmp/a|wc - 22778 205002 1981686 + 22778 205002 1981686 tho...@tomate:~$ grep ^10:14:53 /tmp/a|wc - 23061 207549 2006307 + 23061 207549 2006307 tho...@tomate:~$ grep ^10:14:54 /tmp/a|wc - 22982 206838 1999434 + 22982 206838 1999434 tho...@tomate:~$ grep ^10:14:55 /tmp/a|wc - 22937 206433 1995519 + 22937 206433 1995519 tho...@tomate:~$ grep ^10:14:56 /tmp/a|wc - 22906 206154 1992822 + 22906 206154 1992822 tho...@tomate:~$ grep ^10:14:57 /tmp/a|wc - 22742 204678 1978554 + 22742 204678 1978554 tho...@tomate:~$ grep ^10:14:58 /tmp/a|wc - 21115 190033 1836989 + 21115 190033 1836989 ----> much the same :( ..and aways POLLERR: tho...@tomate:~$ grep -i poll /tmp/a |grep 10.14:5|grep -v POLLERR - tho...@tomate:~$ + tho...@tomate:~$ It eats 100% CPU load forever - tcpdump bug? tcpdump compiled with debug symbols (recompiled from deb-src). gdb attached: Program received signal SIGINT, Interrupt. 0x00007f8a4de4145f in poll () from /lib/libc.so.6 (gdb) bt #0 0x00007f8a4de4145f in poll () from /lib/libc.so.6 #1 0x00007f8a4e0deba4 in ?? () from /usr/lib/libpcap.so.0.8 #2 0x00007f8a4e0e2dd9 in pcap_loop () from /usr/lib/libpcap.so.0.8 #3 0x000000000044b7ed in ?? () #4 0x00007f8a4dd835a6 in __libc_start_main () from /lib/libc.so.6 #5 0x0000000000402c09 in ?? () #6 0x00007fff568b3068 in ?? () -> tcpdump is not responsible. It's a libpcap issue. libpcap recompiled with debug symbols from deb-src: Program received signal SIGINT, Interrupt. 0x00007ff5817b545f in poll () from /lib/libc.so.6 (gdb) bt #0 0x00007ff5817b545f in poll () from /lib/libc.so.6 - #1 0x00007ff581a547fd in pcap_read_linux_mmap (handle=0x1263860, - max_packets=-1, callback=0x44c030 <pcap_dump_cl...@plt+300128>, - user=0x7fff8a22f7c0 "P\206C") at ./pcap-linux.c:2194 - #2 0x00007ff581a5884c in pcap_loop (p=0x1263860, cnt=-1, - callback=0x44c030 <pcap_dump_cl...@plt+300128>, - user=0x7fff8a22f7c0 "P\206C") at ./pcap.c:302 + #1 0x00007ff581a547fd in pcap_read_linux_mmap (handle=0x1263860, + max_packets=-1, callback=0x44c030 <pcap_dump_cl...@plt+300128>, + user=0x7fff8a22f7c0 "P\206C") at ./pcap-linux.c:2194 + #2 0x00007ff581a5884c in pcap_loop (p=0x1263860, cnt=-1, + callback=0x44c030 <pcap_dump_cl...@plt+300128>, + user=0x7fff8a22f7c0 "P\206C") at ./pcap.c:302 #3 0x000000000044b7ed in ?? () #4 0x00007ff5816f75a6 in __libc_start_main () from /lib/libc.so.6 - -> ok, here we go. ./pcap-linux.c:2194 it's this do-while block eating all my cpu power: - do { - /* poll() requires a negative timeout to wait forever */ - ret = poll(&pollinfo, 1, (handle->md.timeout > 0)? - handle->md.timeout: -1); - if ((ret < 0) && (errno != EINTR)) { - snprintf(handle->errbuf, PCAP_ERRBUF_SIZE, - "can't poll on packet socket fd %d: %d-_%s", - handle->fd, errno, _pcap_strerror(errno)); - return -1; - } - /* check for break loop condition on interrupted _syscall*/ - if (handle->break_loop) { - handle->break_loop = 0; - return -2; - } - } while (ret < 0); - + do { + /* poll() requires a negative timeout to wait forever */ + ret = poll(&pollinfo, 1, (handle->md.timeout > 0)? + handle->md.timeout: -1); + if ((ret < 0) && (errno != EINTR)) { + snprintf(handle->errbuf, PCAP_ERRBUF_SIZE, + "can't poll on packet socket fd %d: %d-_%s", + handle->fd, errno, _pcap_strerror(errno)); + return -1; + } + /* check for break loop condition on interrupted _syscall*/ + if (handle->break_loop) { + handle->break_loop = 0; + return -2; + } + } while (ret < 0); Return value of poll() is checked. ret is > -1 (it's 1 due to strace, see above) and errno is not set because it's not a ret = -1 condition. Interrupted syscall condition is not true (handle->break_loop). revents=POLLERR is not checked. [Imho, at least POLLERR, POLLUP and POLLNVAL are possible cases to handle; unsure with POLLRDHUP] - Why it occured? Either your code has changed. Or perhaps the kernel poll() behavior on this error event has changed (returns not -1 anymore; and now likes to set bits on the "revents"). My kernel is - Linux tomate 2.6.28-14-generic #47-Ubuntu SMP Sat Jul 25 01:19:55 UTC 2009 x86_64 GNU/Linu) + Linux tomate 2.6.28-14-generic #47-Ubuntu SMP Sat Jul 25 01:19:55 UTC 2009 x86_64 GNU/Linu) My suggestion for a fix (after you may have verified with the kernel dev team) is this: - normal events are: - poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLIN}]) - write(1, "11:12:36.440319 IP x.x.x.x"..., 11911:12:36.440319 IP x.x.x.x.655 > x.x.x.x.36321: . ack 865 win 2595 <nop,nop,timestamp 1373567254 21184905>) = 119 - error event is: - poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) - - return-value in the error case is not -1, not 0, not < -1. + normal events are: + poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLIN}]) + write(1, "11:12:36.440319 IP x.x.x.x"..., 11911:12:36.440319 IP x.x.x.x.655 > x.x.x.x.36321: . ack 865 win 2595 <nop,nop,timestamp 1373567254 21184905>) = 119 + error event is: + poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) + + return-value in the error case is not -1, not 0, not < -1. man page of poll says: - On success, a positive number is returned; this is the number of struc‐ - tures which have non-zero revents fields (in other words, those - descriptors with events or errors reported). A value of 0 indicates - that the call timed out and no file descriptors were ready. On error, - -1 is returned, and errno is set appropriately. + On success, a positive number is returned; this is the number of struc‐ + tures which have non-zero revents fields (in other words, those + descriptors with events or errors reported). A value of 0 indicates + that the call timed out and no file descriptors were ready. On error, + -1 is returned, and errno is set appropriately. -> thus on error events, the return value is > 0 (!) - --- pcap-linux-orig.c 2009-08-07 11:35:56.518551462 +0200 +++ pcap-linux.c 2009-08-07 11:30:10.426556000 +0200 @@ -2199,7 +2199,20 @@ - handle->fd, errno, pcap_strerror(errno)); - return -1; - } + handle->fd, errno, pcap_strerror(errno)); + return -1; + } - /* check for break loop condition on interrupted syscall*/ + + /* poll() sometimes returns > 0 even in error cases + * and sets error-bits like POLLERR, POLLHUP, .. + * There are also "good" bits are possible, like POLLPRI. + * TODO: what's with POLLRDHUP since kernel 2.6.17? + */ + if (ret > 0 && pollinfo.revents & (POLLERR | POLLHUP | POLLNVAL)) { + snprintf(handle->errbuf, PCAP_ERRBUF_SIZE, - + "critical revent in poll on packet socket fd %d: 0x%4.4x", + + "critical revent in poll on packet socket fd %d: 0x%4.4x", + handle->fd, pollinfo.revents); + return -1; + } + + /* check for break loop condition on interrupted _syscall*/ - if (handle->break_loop) { - handle->break_loop = 0; - return -2; - + if (handle->break_loop) { + handle->break_loop = 0; + return -2; r...@tomate:/tmp# fg 1 openvpn --dev tun ^CFri Aug 7 11:47:06 2009 event_wait : Interrupted system call (code=4) tcpdump: pcap_loop: critical revent in poll on packet socket fd 3: 0x0008 0 packets captured 0 packets received by filter 0 packets dropped by kernel Fri Aug 7 11:47:06 2009 SIGINT[hard,] received, process exiting Looks good. Btw, same result for just setting "ifconfig tun0 down". Let's strace it: poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout) poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout) poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout) poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout) poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}]) write(2, "tcpdump: pcap_loop: critical revent in poll on packet socket fd 3: 0x0008\n"..., 74) = 74 getsockopt(3, SOL_PACKET, PACKET_STATISTICS, "\0\0\0\0\0\0\0\0"..., [8]) = 0 write(2, "0 packets captured"..., 18) = 18 write(2, "\n"..., 1) = 1 write(2, "0 packets received by filter"..., 28) = 28 write(2, "\n"..., 1) = 1 write(2, "0 packets dropped by kernel\n"..., 28) = 28 setsockopt(3, SOL_PACKET, PACKET_RX_RING, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16) = -1 EBUSY (Device or resource busy) munmap(0x7f27b3f1b000, 2146304) = 0 close(3) = 0 exit_group(1) = ? Looks fine. [Hmm. Is the last setsockopt(3, SOL_PACKET, PACKET_RX_RING, ..) on the dead file descriptor really needed?...] - Btw, I run ubuntu jaunty amd64. + Btw, I run ubuntu jaunty amd64. tcpdump is Version: 3.9.8-4ubuntu2 libpcap0.8 is Version: 1.0.0-1 - Regards, - - Thomas Osterried + - Thomas Osterried
-- libpcap issue: tcpdump consumes 100% cpu power after interface shutdown https://bugs.launchpad.net/bugs/414744 You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs