[Bug 414744] Re: libpcap issue: tcpdump consumes 100% cpu power after interface shutdown

eapache Sat, 01 May 2010 13:57:05 -0700

** Description changed:

+ Libpcap goes into an infinite spin, using 100% of CPU when the interface it 
is watching is brought down.
+ The issue was fixed upstream as part of version 1.1.0 (now 1.1.1), which will 
be in Maverick as soon as Debian imports begin.
+ The attached patch, while developed independently, is effectively identical 
to the fix implemented upstream.
+ 
+ TEST CASE:
+ - run tcpdump on a network interface (ex: sudo tcpdump -i eth0)
+ - bring down the interface (ex: sudo ifconfig eth0 down)
+ - watch as tcpdump uses 100% cpu
+ 
+ Original Report:
  Binary package hint: libpcap0.8
  
  Programs using libpcap0.8, like tcpdump, show the behavior that they consume 
100% CPU power instead of error-handling / termination.
  It's a problem in libpcap. I traced it down and present a fix here.
  I wrote to ubuntu-devel-disc...@lists.ubuntu.com 
(https://lists.ubuntu.com/archives/ubuntu-devel-discuss/2009-August/009194.html)
 and the maintainer, Romain Francoise, on Aug 7. But there was no response.
  
  This fix goes belong with my report 
https://bugs.launchpad.net/ubuntu/+source/tcpdump/+bug/369288 (concerning 
tcpdump), which I have closed because I found it's a libpcap problem.
  Perhaps it may be needed to check why it occured: has libpcap recently 
changed in the poll() handling - or did the linux kernel counterpart change?
  
  test screnario:
  r...@tomate:/tmp# openvpn --dev tun &
  r...@tomate:/tmp# ifconfig tun0 up &
  r...@tomate:/tmp# tcpdump -ni tun0 &
- 
  
  strace -tt -p 12238 >/tmp/a 2>&1
  
  10:14:39.250055 poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout)
  10:14:40.251505 poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout)
  10:14:41.252910 poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout)
  10:14:42.254355 poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout)
  10:14:43.255764 poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout)
  10:14:44.257102 poll([{fd=3, events=POLLIN}], 1, 1000) = 0 (Timeout)
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  here i terminated openvpn; dev tun0 closes
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  10:14:45.258137 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.392988 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.393089 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.393182 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.393257 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.393361 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.393436 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.393520 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.393621 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.393694 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.393851 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.393924 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.394020 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.394093 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.394165 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:45.394260 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  [..]
  10:14:58.921514 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  10:14:58.921558 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, 
revents=POLLERR}])
  
- 
  tho...@tomate:~$ grep ^10:14:45 /tmp/a|wc
-   12508  112572 1088196
+   12508  112572 1088196
  tho...@tomate:~$ grep ^10:14:46 /tmp/a|wc
-   22874  205866 1990038
+   22874  205866 1990038
  tho...@tomate:~$ grep ^10:14:47 /tmp/a|wc
-   22488  202392 1956456
+   22488  202392 1956456
  tho...@tomate:~$ grep ^10:14:48 /tmp/a|wc
-   22853  205677 1988211
+   22853  205677 1988211
  tho...@tomate:~$ grep ^10:14:49 /tmp/a|wc
-   22924  206316 1994388
+   22924  206316 1994388
  tho...@tomate:~$ grep ^10:14:50 /tmp/a|wc
-   22975  206775 1998825
+   22975  206775 1998825
  tho...@tomate:~$ grep ^10:14:51 /tmp/a|wc
-   22130  199170 1925310
+   22130  199170 1925310
  tho...@tomate:~$ grep ^10:14:52 /tmp/a|wc
-   22778  205002 1981686
+   22778  205002 1981686
  tho...@tomate:~$ grep ^10:14:53 /tmp/a|wc
-   23061  207549 2006307
+   23061  207549 2006307
  tho...@tomate:~$ grep ^10:14:54 /tmp/a|wc
-   22982  206838 1999434
+   22982  206838 1999434
  tho...@tomate:~$ grep ^10:14:55 /tmp/a|wc
-   22937  206433 1995519
+   22937  206433 1995519
  tho...@tomate:~$ grep ^10:14:56 /tmp/a|wc
-   22906  206154 1992822
+   22906  206154 1992822
  tho...@tomate:~$ grep ^10:14:57 /tmp/a|wc
-   22742  204678 1978554
+   22742  204678 1978554
  tho...@tomate:~$ grep ^10:14:58 /tmp/a|wc
-   21115  190033 1836989
+   21115  190033 1836989
  
  ----> much the same :(
  
  ..and aways POLLERR:
  
  tho...@tomate:~$ grep -i poll /tmp/a  |grep 10.14:5|grep -v POLLERR
- tho...@tomate:~$ 
+ tho...@tomate:~$
  
  It eats 100% CPU load forever
- 
  
  tcpdump bug?
  
  tcpdump compiled with debug symbols (recompiled from deb-src). gdb
  attached:
  
  Program received signal SIGINT, Interrupt.
  0x00007f8a4de4145f in poll () from /lib/libc.so.6
  (gdb) bt
  #0  0x00007f8a4de4145f in poll () from /lib/libc.so.6
  #1  0x00007f8a4e0deba4 in ?? () from /usr/lib/libpcap.so.0.8
  #2  0x00007f8a4e0e2dd9 in pcap_loop () from /usr/lib/libpcap.so.0.8
  #3  0x000000000044b7ed in ?? ()
  #4  0x00007f8a4dd835a6 in __libc_start_main () from /lib/libc.so.6
  #5  0x0000000000402c09 in ?? ()
  #6  0x00007fff568b3068 in ?? ()
  
  -> tcpdump is not responsible. It's a libpcap issue.
  
  libpcap recompiled with debug symbols from deb-src:
  
  Program received signal SIGINT, Interrupt.
  0x00007ff5817b545f in poll () from /lib/libc.so.6
  (gdb) bt
  #0  0x00007ff5817b545f in poll () from /lib/libc.so.6
- #1  0x00007ff581a547fd in pcap_read_linux_mmap (handle=0x1263860, 
-     max_packets=-1, callback=0x44c030 <pcap_dump_cl...@plt+300128>, 
-     user=0x7fff8a22f7c0 "P\206C") at ./pcap-linux.c:2194
- #2  0x00007ff581a5884c in pcap_loop (p=0x1263860, cnt=-1, 
-     callback=0x44c030 <pcap_dump_cl...@plt+300128>, 
-     user=0x7fff8a22f7c0 "P\206C") at ./pcap.c:302
+ #1  0x00007ff581a547fd in pcap_read_linux_mmap (handle=0x1263860,
+     max_packets=-1, callback=0x44c030 <pcap_dump_cl...@plt+300128>,
+     user=0x7fff8a22f7c0 "P\206C") at ./pcap-linux.c:2194
+ #2  0x00007ff581a5884c in pcap_loop (p=0x1263860, cnt=-1,
+     callback=0x44c030 <pcap_dump_cl...@plt+300128>,
+     user=0x7fff8a22f7c0 "P\206C") at ./pcap.c:302
  #3  0x000000000044b7ed in ?? ()
  #4  0x00007ff5816f75a6 in __libc_start_main () from /lib/libc.so.6
  
- 
  -> ok, here we go.
  
  ./pcap-linux.c:2194
  it's this do-while block eating all my cpu power:
  
-                 do {
-                         /* poll() requires a negative timeout to wait forever 
*/
-                         ret = poll(&pollinfo, 1, (handle->md.timeout > 0)?
-                                                 handle->md.timeout: -1);
-                         if ((ret < 0) && (errno != EINTR)) {
-                                 snprintf(handle->errbuf, PCAP_ERRBUF_SIZE,
-                                          "can't poll on packet socket fd %d: 
%d-_%s",
-                                         handle->fd, errno,                    
  _pcap_strerror(errno));
-                                 return -1;
-                         }
-                         /* check for break loop condition on interrupted      
  _syscall*/
-                         if (handle->break_loop) {
-                                 handle->break_loop = 0;
-                                 return -2;
-                         }
-                 } while (ret < 0);
- 
+                 do {
+                         /* poll() requires a negative timeout to wait forever 
*/
+                         ret = poll(&pollinfo, 1, (handle->md.timeout > 0)?
+                                                 handle->md.timeout: -1);
+                         if ((ret < 0) && (errno != EINTR)) {
+                                 snprintf(handle->errbuf, PCAP_ERRBUF_SIZE,
+                                          "can't poll on packet socket fd %d: 
%d-_%s",
+                                         handle->fd, errno,                    
  _pcap_strerror(errno));
+                                 return -1;
+                         }
+                         /* check for break loop condition on interrupted      
  _syscall*/
+                         if (handle->break_loop) {
+                                 handle->break_loop = 0;
+                                 return -2;
+                         }
+                 } while (ret < 0);
  
  Return value of poll() is checked. ret is > -1 (it's 1 due to strace,
  see above) and errno is not set because it's not a ret = -1 condition.
  Interrupted syscall condition is not true (handle->break_loop).
  revents=POLLERR is not checked. [Imho, at least POLLERR, POLLUP and POLLNVAL
  are possible cases to handle; unsure with POLLRDHUP]
  
- 
  Why it occured? Either your code has changed. Or perhaps the kernel
  poll() behavior on this error event has changed (returns not -1
  anymore; and now likes to set bits on the "revents").
  
  My kernel is
-   Linux tomate 2.6.28-14-generic #47-Ubuntu SMP Sat Jul 25 01:19:55 UTC 2009 
x86_64 GNU/Linu)
+   Linux tomate 2.6.28-14-generic #47-Ubuntu SMP Sat Jul 25 01:19:55 UTC 2009 
x86_64 GNU/Linu)
  
  My suggestion for a fix (after you may have verified with the kernel dev team)
  is this:
-   normal events are:
-     poll([{fd=3, events=POLLIN}], 1, 1000)  = 1 ([{fd=3, revents=POLLIN}])
-   write(1, "11:12:36.440319 IP x.x.x.x"..., 11911:12:36.440319 IP x.x.x.x.655 
> x.x.x.x.36321: . ack 865 win 2595 <nop,nop,timestamp 1373567254 21184905>) = 
119
-   error event is:
-     poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}])
- 
- return-value in the error case is not -1, not 0, not < -1. 
+   normal events are:
+     poll([{fd=3, events=POLLIN}], 1, 1000)  = 1 ([{fd=3, revents=POLLIN}])
+   write(1, "11:12:36.440319 IP x.x.x.x"..., 11911:12:36.440319 IP x.x.x.x.655 
> x.x.x.x.36321: . ack 865 win 2595 <nop,nop,timestamp 1373567254 21184905>) = 
119
+   error event is:
+     poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLERR}])
+ 
+ return-value in the error case is not -1, not 0, not < -1.
  man page of poll says:
-        On success, a positive number is returned; this is the number of struc‐
-        tures  which  have  non-zero  revents  fields  (in  other  words, those
-        descriptors with events or errors reported).  A value  of  0  indicates
-        that  the call timed out and no file descriptors were ready.  On error,
-        -1 is returned, and errno is set appropriately.
+        On success, a positive number is returned; this is the number of struc‐
+        tures  which  have  non-zero  revents  fields  (in  other  words, those
+        descriptors with events or errors reported).  A value  of  0  indicates
+        that  the call timed out and no file descriptors were ready.  On error,
+        -1 is returned, and errno is set appropriately.
  
  -> thus on error events, the return value is > 0 (!)
- 
  
  --- pcap-linux-orig.c   2009-08-07 11:35:56.518551462 +0200
  +++ pcap-linux.c        2009-08-07 11:30:10.426556000 +0200
  @@ -2199,7 +2199,20 @@
-                                         handle->fd, errno, 
pcap_strerror(errno));
-                                 return -1;
-                         }
+                                         handle->fd, errno, 
pcap_strerror(errno));
+                                 return -1;
+                         }
  -                       /* check for break loop condition on interrupted 
syscall*/
  +
  +                       /* poll() sometimes returns > 0 even in error cases
  +                        * and sets error-bits like POLLERR, POLLHUP, ..
  +                        * There are also "good" bits are possible, like 
POLLPRI.
  +                        * TODO: what's with POLLRDHUP since kernel 2.6.17?
  +                        */
  +                       if (ret > 0 && pollinfo.revents & (POLLERR | POLLHUP 
| POLLNVAL)) {
  +                               snprintf(handle->errbuf, PCAP_ERRBUF_SIZE,
- +                                       "critical revent in poll on packet 
socket fd %d: 0x%4.4x", 
+ +                                       "critical revent in poll on packet 
socket fd %d: 0x%4.4x",
  +                                       handle->fd, pollinfo.revents);
  +                                       return -1;
  +                       }
  +
  +                       /* check for break loop condition on interrupted      
  _syscall*/
-                         if (handle->break_loop) {
-                                 handle->break_loop = 0;
-                                 return -2;
- 
+                         if (handle->break_loop) {
+                                 handle->break_loop = 0;
+                                 return -2;
  
  r...@tomate:/tmp# fg 1
  openvpn --dev tun
  ^CFri Aug  7 11:47:06 2009 event_wait : Interrupted system call (code=4)
  tcpdump: pcap_loop: critical revent in poll on packet socket fd 3: 0x0008
  0 packets captured
  0 packets received by filter
  0 packets dropped by kernel
  Fri Aug  7 11:47:06 2009 SIGINT[hard,] received, process exiting
  
  Looks good. Btw, same result for just setting "ifconfig tun0 down".
  Let's strace it:
  
  poll([{fd=3, events=POLLIN}], 1, 1000)  = 0 (Timeout)
  poll([{fd=3, events=POLLIN}], 1, 1000)  = 0 (Timeout)
  poll([{fd=3, events=POLLIN}], 1, 1000)  = 0 (Timeout)
  poll([{fd=3, events=POLLIN}], 1, 1000)  = 0 (Timeout)
  poll([{fd=3, events=POLLIN}], 1, 1000)  = 1 ([{fd=3, revents=POLLERR}])
  write(2, "tcpdump: pcap_loop: critical revent in poll on packet socket fd 3: 
0x0008\n"..., 74) = 74
  getsockopt(3, SOL_PACKET, PACKET_STATISTICS, "\0\0\0\0\0\0\0\0"..., [8]) = 0
  write(2, "0 packets captured"..., 18)   = 18
  write(2, "\n"..., 1)                    = 1
  write(2, "0 packets received by filter"..., 28) = 28
  write(2, "\n"..., 1)                    = 1
  write(2, "0 packets dropped by kernel\n"..., 28) = 28
  setsockopt(3, SOL_PACKET, PACKET_RX_RING, 
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16) = -1 EBUSY (Device or resource busy)
  munmap(0x7f27b3f1b000, 2146304)         = 0
  close(3)                                = 0
  exit_group(1)                           = ?
  
  Looks fine.
  [Hmm. Is the last setsockopt(3, SOL_PACKET, PACKET_RX_RING, ..) on the dead
  file descriptor really needed?...]
  
- Btw, I run ubuntu jaunty amd64. 
+ Btw, I run ubuntu jaunty amd64.
  tcpdump is Version: 3.9.8-4ubuntu2
  libpcap0.8 is Version: 1.0.0-1
  
- 
  Regards,
-         - Thomas Osterried
+         - Thomas Osterried


-- 
libpcap issue: tcpdump consumes 100% cpu power after interface shutdown
https://bugs.launchpad.net/bugs/414744
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 414744] Re: libpcap issue: tcpdump consumes 100% cpu power after interface shutdown

Reply via email to