I wrote: > el...@sentor.se wrote: > > > > No one have any thoughts about this? > > > > Its happening sporadically on several FreeBSD 10 machines of mine, > > while > > all of the FreeBSD 9-machines work just fine. > > > > If the problem isn't fixed, people won't be able to upgrade to and > > run > > snort on FreeBSD 10. > > > > log: > > > > > I'm starting snort (as root). > > > > > > <<<lots of startup logs for pid 22646>>> > > > Oct 15 08:46:59 snort[22646]: Initializing daemon mode > > > Oct 15 08:46:59 snort[22648]: Daemon initialized, signaled parent > > > pid: 22646 > > > Oct 15 08:46:59 snort[22648]: Reload thread starting... > > > Oct 15 08:46:59 snort[22648]: Reload thread started, thread > > > 0x8146e8800 (22648) > > > End of log. > > > > > > Error! Nothing more happens with the snort process! > > > Normally it should continue and log the following lines as well: > > > > > > > > > snort[nnn]: Decoding Ethernet > > > snort[nnn]: Checking PID path... > > > snort[nnn]: PID path stat checked out ok, PID path set to > > > /var/run/ > > > snort[nnn]: Writing PID "7627" to file "/var/run//snort_mon0.pid" > > > snort[nnn]: Chroot directory = /usr/foobar/log > > > snort[nnn]: Set gid to 100 > > > snort[nnn]: Set uid to 100 > > > snort[nnn]: > > > snort[nnn]: --== Initialization Complete ==-- > > > snort[nnn]: Commencing packet processing (pid=nnn) > > > > > > > > > > > > > > > > > > When looking at this half-started snort process with 'ps', it > > > looks > > > like > > > this: > > > > > > ps faxulwwj 22648 > > > USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME > > > COMMAND UID PPID CPU PRI NI MWCHAN PGID SID JOBC > > > root 22648 51.8 1.1 488552 179344 - Rs 8:46AM 53:06.52 > > > /usr/local/bin/s 0 1 0 88 0 - 22648 22648 0 > > > > > > > > > The process is still owned by root, so just as the missing log > > > lines are > > > saying, it has not yet performed any change of uid/gid. > > > > > > > > > > > > > > > So there seem to be two questions. > > > > > > Q1) > > > What happens between "Reload thread started, thread 0x8146e8800 > > > (22648)" and > > > "Decoding Ethernet"? > > > Apparently something goes wrong here on FreeBSD 10.0. > > > (this problem does not always occur, sometimes snort start just > > > fine) > > > > > > Q2) > > > When the process has frozen in this half-started state, it can't > > > be > > > killed > > > even with a -9. Why? > > > > > > > > > > > > > > > John-Mark asked me for some debugging info. Here it is: > > > > > > I now run 'kill 22648' on the above semi-started process: > > > > > > USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME > > > COMMAND UID > > > PPID CPU PRI NI MWCHAN PGID SID JOBC > > > old root 22648 51.8 1.1 488552 179344 - Rs 8:46AM 53:06.52 > > > /usr/local/bin/s 0 1 0 88 0 - 22648 22648 0 > > > new root 22648 52.3 1.1 488552 179344 - Rs 8:46AM 53:36.48 > > > /usr/local/bin/s 0 1 0 52 0 - 22648 22648 0 > > > > > > No change. > > > > > > > > > > > > kill -9 22648 > > > > > > USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME > > > COMMAND UID > > > PPID CPU PRI NI MWCHAN PGID SID JOBC > > > old root 22648 51.8 1.1 488552 179344 - Rs 8:46AM 53:06.52 > > > /usr/local/bin/s 0 1 0 88 0 - 22648 22648 0 > > > new root 22648 37.7 1.1 488552 179344 - Ts 8:46AM 53:50.87 > > > /usr/local/bin/s 0 1 0 52 0 - 22648 22648 0 > > > > > > Less CPU-usage and STAT changed to "Ts". > > > > > > > > > > > > > > > kill -CONT 22648 > > > > > > USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME > > > COMMAND UID > > > PPID CPU PRI NI MWCHAN PGID SID JOBC > > > old root 22648 51.8 1.1 488552 179344 - Rs 8:46AM 53:06.52 > > > /usr/local/bin/s 0 1 0 88 0 - 22648 22648 0 > > > new root 22648 0.0 1.1 488552 179344 - Ts 8:46AM 53:50.88 > > > /usr/local/bin/s 0 1 0 52 0 - 22648 22648 0 > > > > > > No change except cpu is down to 0. > > > > > > > > > I now start 'kgdb' > > > info threads > > > I found two threads for snort, doing a bt for both of them: > > > 372 Thread 100602 (PID=22648: snort) sched_switch > > > (td=0xfffff802c061f490, > > > newtd=<value optimized out>, flags=<value optimized out>) at > > > /usr/src/sys/kern/sched_ule.c:1962 > > > 371 Thread 100598 (PID=22648: snort) sched_switch > > > (td=0xfffff80221857000, > > > newtd=<value optimized out>, flags=<value optimized out>) at > > > /usr/src/sys/kern/sched_ule.c:1962 > > > thread 372 > > > [Switching to thread 372 (Thread 100602)]#0 sched_switch > > > (td=0xfffff802c061f490, newtd=<value optimized out>, flags=<value > > > optimized > > > out>) at /usr/src/sys/kern/sched_ule.c:1962 > > > 1962 in /usr/src/sys/kern/sched_ule.c > > > bt > > > #0 sched_switch (td=0xfffff802c061f490, newtd=<value optimized > > > out>, > > > flags=<value optimized out>) at > > > /usr/src/sys/kern/sched_ule.c:1962 > > > #1 0xffffffff808b8c1e in mi_switch (flags=266, newtd=0x0) at > > > /usr/src/sys/kern/kern_synch.c:494 > > > #2 0xffffffff808c04b0 in thread_suspend_switch > > > (td=0xfffff802c061f490) at > > > /usr/src/sys/kern/kern_thread.c:883 > > > #3 0xffffffff808c0276 in thread_single (mode=1) at > > > /usr/src/sys/kern/kern_thread.c:713 > > > #4 0xffffffff8087c1bb in exit1 (td=0xfffff802c061f490, rv=9) at > > > /usr/src/sys/kern/kern_exit.c:180 > > > #5 0xffffffff808b2faf in sigexit (td=<value optimized out>, > > > sig=<value > > > optimized out>) at /usr/src/sys/kern/kern_sig.c:2935 > > > #6 0xffffffff808b3669 in postsig (sig=<value optimized out>) at > > > /usr/src/sys/kern/kern_sig.c:2822 > > > #7 0xffffffff808f6f57 in ast (framep=<value optimized out>) at > > > /usr/src/sys/kern/subr_trap.c:271 > > > #8 0xffffffff80c75870 in Xfast_syscall () at > > > /usr/src/sys/amd64/amd64/exception.S:416 > > > #9 0x0000000801d6f19a in ?? () > > > Previous frame inner to this frame (corrupt stack?) > > > > > > > > > thread 371 > > > [Switching to thread 371 (Thread 100598)]#0 sched_switch > > > (td=0xfffff80221857000, newtd=<value optimized out>, flags=<value > > > optimized > > > out>) at /usr/src/sys/kern/sched_ule.c:1962 > > > 1962 in /usr/src/sys/kern/sched_ule.c > > > bt > > > #0 sched_switch (td=0xfffff80221857000, newtd=<value optimized > > > out>, > > > flags=<value optimized out>) at > > > /usr/src/sys/kern/sched_ule.c:1962 > > > #1 0xffffffff808b8c1e in mi_switch (flags=260, newtd=0x0) at > > > /usr/src/sys/kern/kern_synch.c:494 > > > #2 0xffffffff808f2e3a in sleepq_wait (wchan=0x0, pri=0) at > > > /usr/src/sys/kern/subr_sleepqueue.c:620 > > > #3 0xffffffff80864aad in _cv_wait (cvp=0xffffffff8147a500, > > > lock=0xffffffff8147a480) at /usr/src/sys/kern/kern_condvar.c:139 > > > #4 0xffffffff808fb05f in vmem_xalloc (vm=0xffffffff8147a480, > > > size0=<value > > > optimized out>, align=<value optimized out>, phase=0, > > > nocross=<value > > > optimized out>, minaddr=0, maxaddr=18446735286768857088, > > > flags=8194, > > > addrp=<value optimized out>) at > > > /usr/src/sys/kern/subr_vmem.c:1196 > > > #5 0xffffffff808fae6b in vmem_alloc (vm=0x0, size=0, > > > flags=<value > > > optimized > > > out>, addrp=0xfffffe0466e1d6e8) at > > > /usr/src/sys/kern/subr_vmem.c:1082 > This looks vaguely similar to what I get when I run the system out of > boundary tags. (I say "vaguely similar" because you haven't run out > of > boundary tags, but you may have vmem_xalloc() failing.) > > When these functions are called with M_NOWAIT, they return failure > and > a higher level call in the allocation call stack retries it. Then it > loops in the kernel in "R" state, which is why it isn't killable. > For my case, I believe it happens when the kernel address space gets > too > fragmented. Also, alc@ recently fixed a problem for low kernel > memory cases. I've cc'd him in case he may be able to help? > > Btw Alan, I was never able to reproduce the M_WAITOK case although my > kernel was older than the patch you discussed. I can fairly easily > reproduce the M_NOWAIT case, but haven't tried with a recent kernel > yet. > > I have no idea if increasing vm.kmem_size_max might help? > > rick > Btw, the patch I was alluding to is r272071 in head (MFC'd to stable/10 as r272221). It is a one-line change to sys/vm/vm_pageout.c. I have no idea if this patch can be applied safely to 10.0, but it might be worth trying the patch to see if it helps with this.
rick > > > #6 0xffffffff80b0fa58 in kmem_malloc (vmem=0xffffffff8147a480, > > > size=2139729920, flags=2) at /usr/src/sys/vm/vm_kern.c:314 > > > #7 0xffffffff80b08dfb in uma_large_malloc (size=<value optimized > > > out>, > > > wait=2) at /usr/src/sys/vm/uma_core.c:1006 > > > #8 0xffffffff80898cf3 in malloc (size=2139729920, > > > mtp=0xffffffff813a0450, > > > flags=0) at /usr/src/sys/kern/kern_malloc.c:520 > > > #9 0xffffffff8096307b in bpf_buffer_ioctl_sblen > > > (d=0xfffff80159ea9000, > > > i=<value optimized out>) at /usr/src/sys/net/bpf_buffer.c:183 > > > #10 0xffffffff80960a3c in bpfioctl (dev=0x0, cmd=<value optimized > > > out>, > > > addr=0xfffff801fbd06b40 "", flags=0, td=0xfffff80221857000) at > > > /usr/src/sys/net/bpf.c:408 > > > #11 0xffffffff807ac1df in devfs_ioctl_f (fp=0xfffff8002b3d9d20, > > > com=3221504614, data=0xfffff801fbd06b40, cred=<value optimized > > > out>, > > > td=0xfffff80221857000) at /usr/src/sys/fs/devfs/devfs_vnops.c:757 > > > #12 0xffffffff808fdfae in kern_ioctl (td=0xfffff80221857000, > > > fd=<value > > > optimized out>, com=0) at file.h:319 > > > #13 0xffffffff808fdd2f in sys_ioctl (td=0xfffff80221857000, > > > uap=0xfffffe0466e1da40) at /usr/src/sys/kern/sys_generic.c:702 > > > #14 0xffffffff80c8f117 in amd64_syscall (td=0xfffff80221857000, > > > traced=0) at > > > subr_syscall.c:134 > > > #15 0xffffffff80c7580b in Xfast_syscall () at > > > /usr/src/sys/amd64/amd64/exception.S:391 > > > #16 0x0000000801d8f08a in ?? () > > > Previous frame inner to this frame (corrupt stack?) > > > > > > > > > Let me know if I can debug this any further. > > > > > > /Elof > > > > > > > > > > > > On Thu, 9 Oct 2014, John-Mark Gurney wrote: > > > > > >> el...@sentor.se wrote this message on Wed, Oct 08, 2014 at 13:30 > > >> +0200: > > >>> > > >>> I guess this is a bug report for FreeBSD 10.0. > > >>> > > >>> > > >>> > > >>> Sometimes I can't kill my snort process on FreeBSD 10.0. > > >>> It won't die, even with kill -9. > > >>> > > >>> I'm not talking about a zombie process. Snort is a process that > > >>> should > > >>> die normally. > > >>> I've run snort on over 100 nodes since FreeBSD v6.x and I've > > >>> never seen > > >>> this behavior until now in FreeBSD 10.0. > > >>> > > >>> > > >>> Example: > > >>> > > >>> #ps faxuw > > >>> USER PID %CPU %MEM VSZ RSS TT STAT STARTED > > >>> TIME > > >>> COMMAND > > >>> root 49222 53.4 2.2 492648 183012 - Rs 11:46AM > > >>> 7:05.59 > > >>> /usr/local/bin/snort -q -D -c snort.conf > > >>> root 47937 0.0 2.2 488552 182864 - Ts 10:56AM > > >>> 29:35.98 > > >>> /usr/local/bin/snort -q -D -c snort.conf > > >> > > >> What is the MWCHAN? add l to the ps command... > > >> > > >>> The pid 47937 has been killed (repeatedly) with -9. > > >>> Its status is "Ts" meaning it is Stopped. > > >> > > >> have you tried to kill -CONT <pid> to resume it? > > >> > > >>> But it won't actually die and disappear. The only way to get > > >>> rid > > >>> of it > > >>> seem to be to reboot the machine. :-( > > >>> > > >>> (pid 49222 is the new process that was started after 47937 was > > >>> killed) > > >>> > > >>> > > >>> The problem doesn't happen all the time and I haven't found any > > >>> patterns > > >>> as to when it does. :-( > > >>> If I restart snort once every day, it fails to die > > >>> approximately > > >>> 2-4 times > > >>> per month. > > >>> Even though the problem doesn't happen on every kill, it is a > > >>> definately a > > >>> recurring event. > > >> > > >> Can you run kgdb on the machine? (yes, it works on a live > > >> machine), use > > >> info threads to find the thread id, and then use thread > > >> <threadid> > > >> to > > >> switch to it, and run bt to get a back trace... > > >> > > >>> I began to see it on a heavily loaded 10GE sensor, so I thought > > >>> it could > > >>> have something to do with the ix driver, or the heavy load. > > >>> But now another FreeBSD 10.0-sensor had the exact same problem, > > >>> and this > > >>> sensor don't have any 10GE NICs. In fact, this sensor has been > > >>> running > > >>> just fine with both FreeBSD 9.1 and 9.3 for the past years. > > >>> Snort > > >>> has > > >>> always terminated correctly! After I reinstalled this machine > > >>> with FreeBSD > > >>> 10.0 last friday, snort has then terminated correctly every day > > >>> until > > >>> today, when it failed with the above pid 47937. (this sensor > > >>> use > > >>> the 'em' > > >>> driver, not 'ixgbe') > > >>> > > >>> I'm running snort with the same configuration, settings, > > >>> version, > > >>> daq, > > >>> libs, etc on 10.0 as I do on 9.3. > > >>> None of the 9.3 sensors have this problem, so it has to be > > >>> something new > > >>> in FreeBSD 10.0. > > >> > > >> -- > > >> John-Mark Gurney Voice: +1 415 225 5579 > > >> > > >> "All that I will do, has been done, All that I have, has > > >> not." > > >> > > > _______________________________________________ > > > freebsd-net@freebsd.org mailing list > > > http://lists.freebsd.org/mailman/listinfo/freebsd-net > > > To unsubscribe, send any mail to > > > "freebsd-net-unsubscr...@freebsd.org" > > > > > _______________________________________________ > > freebsd-net@freebsd.org mailing list > > http://lists.freebsd.org/mailman/listinfo/freebsd-net > > To unsubscribe, send any mail to > > "freebsd-net-unsubscr...@freebsd.org" > > > _______________________________________________ > freebsd-net@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-net > To unsubscribe, send any mail to > "freebsd-net-unsubscr...@freebsd.org" > _______________________________________________ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"