I'm trying to troubleshoot an intermittent Apache performance problem, and I've narrowed it down using to what appears to be a brief whole-system hang that last from 0.5 - 3 seconds. They occur every few minutes.
I took the rather extreme step of doing "ktrace -t cnisuwt -i -d -p 1" and then I waited for the hang. This is what I got: 54937 httpd 1251302859.375313 CALL shutdown(0x3,<invalid=1>) 54937 httpd 1251302859.375333 RET shutdown 0 54937 httpd 1251302859.375348 CALL select(0x4,0xbfbfe92c,0,0,0xbfbfe9ac) 54937 httpd 1251302859.375363 CSW stop kernel 54937 httpd 1251302859.376402 CSW resume kernel 54937 httpd 1251302859.376439 RET select 1 54937 httpd 1251302859.376453 CALL read(0x3,0xbfbfe9b4,0x200) 54937 httpd 1251302859.376470 GIO fd 3 read 0 bytes 54937 httpd 1251302859.376482 RET read 0 54937 httpd 1251302859.376495 CALL close(0x3) 54937 httpd 1251302859.376511 RET close 0 54937 httpd 1251302859.376525 CALL sigaction(SIGUSR1,0xbfbfebb0,0xbfbfeb98) 54937 httpd 1251302859.376538 RET sigaction 0 54937 httpd 1251302859.376552 CALL munmap(0x282ff000,0x11) 54937 httpd 1251302859.376607 RET munmap 0 54937 httpd 1251302859.376633 CALL accept(0x11,0xbfbfebf0,0xbfbfec10) 54937 httpd 1251302859.376649 CSW stop kernel 796 svscan 1251302859.481064 CSW resume kernel 54937 httpd 1251302859.489374 CSW resume kernel 54937 httpd 1251302859.489391 STRU struct sockaddr { AF_INET, 172.17.0.143:61610 } 98229 httpd 1251302859.601850 CSW resume kernel 46517 httpd 1251302859.601900 CSW resume kernel 98202 httpd 1251302859.611661 CSW resume kernel 837 nrpe2 1251302859.622681 CSW resume kernel 54454 httpd 1251302859.655422 CSW resume kernel 54454 httpd 1251302859.655443 STRU struct sockaddr { AF_INET, 172.17.0.131:59011 } 7182 httpd 1251302859.722381 CSW resume kernel 98178 httpd 1251302859.722438 CSW resume kernel 858 gmond 1251302859.794996 CSW resume kernel 858 gmond 1251302859.794998 GIO fd 5 wrote 0 bytes 770 ntpd 1251302860.076501 CSW resume kernel 98346 httpd 1251302860.086261 CSW resume kernel 65277 httpd 1251302860.086300 CSW resume kernel 98514 httpd 1251302860.106849 CSW resume kernel 7191 httpd 1251302860.106894 CSW resume kernel 796 svscan 1251302861.403335 RET nanosleep 0 796 svscan 1251302861.403370 CALL wait4(0xffffffff,0xbfbfee18,WNOHANG,0) 796 svscan 1251302861.403405 RET wait4 0 54454 httpd 1251302861.403481 RET accept 3 98229 httpd 1251302861.403532 RET select 0 796 svscan 1251302861.403553 CALL stat(0x804a3bb,0xbfbfed6c) 858 gmond 1251302861.403601 GIO fd 5 read 20 bytes 54454 httpd 1251302861.403619 CSW stop user 46517 httpd 1251302861.403647 RET select 0 858 gmond 1251302861.403674 RET kevent 1 858 gmond 1251302861.403710 CALL socket(PF_INET,SOCK_DGRAM,IPPROTO_IP) 98202 httpd 1251302861.403714 RET select 0 858 gmond 1251302861.403752 RET socket 9 837 nrpe2 1251302861.403756 RET select 0 There is a gap between 1251302860.106894 and 1251302861.403335 of over one second, and the "effective gap" starts around 1251302859.376649 and thus lasts for about two seconds. This machine runs Apache and during this sample it was being hit every 0.1 seconds with a test request for a simple static file (in addition to production traffic). It is a 2-processor machine that is 85-95% idle; there's nothing in userspace that runs that long without yielding. According to systat, it handles 5000+ syscalls every second. But according to ktrace, nothing happens at all during the hang. This matches user experience. (The static file request, which usually completes in <0.01s suddenly takes 2 seconds as observed from the remote machine issuing the requests.) Here's the relevant snip from the httpd process handling that static file at the time of the hang: 54937 httpd 1251302859.376633 CALL accept(0x11,0xbfbfebf0,0xbfbfec10) 54937 httpd 1251302859.376649 CSW stop kernel 54937 httpd 1251302859.489374 CSW resume kernel 54937 httpd 1251302859.489391 STRU struct sockaddr { AF_INET, 172.17.0.143:61610 } 54937 httpd 1251302861.403862 RET accept 3 It's stuck in accept, but does *not* get context-switched away from during the delay. (The earlier context switch corresponds to the 0.1 seconds between requests; there is an Apache instance configured to handle just the test requests with one child process; that process has nothing else to do or block on.) I'll include some other processes below. I think it's weird that all these processes get context-switched-into before/during the hang, and I wonder if it's a clue. The kernel is obviously still running, since it wakes these processes up, but nothing is happening. That and the fact that it happens on multiple machines (though we've only tested this one) leads me away from suspecting the hardware. Are we out of a kernel resource or is something getting garbage collected / stuck / temporarily locked inside the kernel somewhere? This is a production server, so our debugging options are very limited. We also can't recreate the problem on a test server. So KTR and/or the kernel debugger are probably out. There's nothing in syslog or dmesg or on the console. Could anyone suggest away to gather more information about this, given our constraints? More samples: Here's an Apache "parent" process; its hang is in select(). The immediately following select() call demonstrates what the middle one should also have done: 98178 httpd 1251302858.703733 CALL select(0,0,0,0,0xbfbfed3c) 98178 httpd 1251302858.703789 CSW stop kernel 98178 httpd 1251302859.722438 CSW resume kernel 98178 httpd 1251302861.403908 RET select 0 98178 httpd 1251302861.404118 CALL gettimeofday(0xbfbfec58,0) 98178 httpd 1251302861.404453 RET gettimeofday 0 98178 httpd 1251302861.404478 CALL wait4(0xffffffff,0xbfbfed80,WNOHANG,0) 98178 httpd 1251302861.405288 RET wait4 0 98178 httpd 1251302861.405576 CALL select(0,0,0,0,0xbfbfed3c) 98178 httpd 1251302861.405603 CSW stop kernel 98178 httpd 1251302862.417549 CSW resume kernel 98178 httpd 1251302862.417581 RET select 0 Same behavior from nrpe2: 837 nrpe2 1251302859.107587 CALL select(0x5,0xbfbfe16c,0,0xbfbfe16c,0xbfbfe20c) 837 nrpe2 1251302859.107607 CSW stop kernel 837 nrpe2 1251302859.622681 CSW resume kernel 837 nrpe2 1251302861.403756 RET select 0 Here's a production Apache child process. Also hangs in accept(), but the delay does happen inside the context switch: 55062 httpd 1251302859.044661 CALL accept(0x12,0xbfbfec00,0xbfbfec20) 55062 httpd 1251302859.044681 CSW stop kernel 55062 httpd 1251302861.792647 CSW resume kernel 55062 httpd 1251302861.792662 STRU struct sockaddr { AF_INET, 172.17.0.131:59093 } 55062 httpd 1251302861.792880 RET accept 3 gmond hangs in kevent() on some kind of I/O: 858 gmond 1251302859.233805 CALL kevent(0x5,0,0,0x2833d960,0x2,0xbfbfe344) 858 gmond 1251302859.233820 CSW stop kernel 858 gmond 1251302859.794996 CSW resume kernel 858 gmond 1251302859.794998 GIO fd 5 wrote 0 bytes 858 gmond 1251302861.403601 GIO fd 5 read 20 bytes 858 gmond 1251302861.403674 RET kevent 1 svscan hangs in nanosleep(). It sleeps for 5 seconds, not 7. 796 svscan 1251302854.429189 CALL lseek(0x3,0,SEEK_SET,0) 796 svscan 1251302854.429269 RET lseek 0 796 svscan 1251302854.429346 CALL close(0x3) 796 svscan 1251302854.429709 RET close 0 796 svscan 1251302854.429894 CALL nanosleep(0xbfbfee18,0xbfbfee10) 796 svscan 1251302854.430033 CSW stop kernel 796 svscan 1251302859.481064 CSW resume kernel 796 svscan 1251302861.403335 RET nanosleep 0 796 svscan 1251302861.403370 CALL wait4(0xffffffff,0xbfbfee18,WNOHANG,0) 796 svscan 1251302861.403405 RET wait4 0 796 svscan 1251302861.403553 CALL stat(0x804a3bb,0xbfbfed6c) 796 svscan 1251302861.404027 NAMI "." ntpd shows the same behavior, except it wakes up by alarm. Again, the normal behavior (no gap between CSW and RET) is shown right after: 770 ntpd 1251302859.078305 CALL select(0x3b,0xbfbfed7c,0,0,0) 770 ntpd 1251302859.078312 CSW stop kernel 770 ntpd 1251302860.076501 CSW resume kernel 770 ntpd 1251302861.403952 RET select -1 errno 4 Interrupted system call 770 ntpd 1251302861.404168 PSIG SIGALRM caught handler=0x806ad90 mask=0x0 code=0x0 770 ntpd 1251302861.404191 CALL sigreturn(0xbfbfea50) 770 ntpd 1251302861.404553 RET sigreturn JUSTRETURN 770 ntpd 1251302861.404890 CALL select(0x3b,0xbfbfed7c,0,0,0) 770 ntpd 1251302861.405955 CSW stop kernel 770 ntpd 1251302862.077114 CSW resume kernel 770 ntpd 1251302862.077171 RET select -1 errno 4 Interrupted system call There were other processes running, mainly more httpds, but I went through each and every one of them and I don't see anything that I haven't already shown an example of. Thanks very much for any advice! _______________________________________________ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"