... yes, enable WITNESS already and see if you can find LORs. :) (Sheesh, that's what it's for! :)
Adrian

On 11 January 2012 08:47, Garrett Cooper <yaneg...@gmail.com> wrote:
> On Wed, Jan 11, 2012 at 6:33 AM, Ivan Voras <ivo...@freebsd.org> wrote:
>> On 11 January 2012 14:06, John Baldwin <j...@freebsd.org> wrote:
>>> On Wednesday, January 11, 2012 6:21:18 am Ivan Voras wrote:
>>>> The lang/python27 port can optionally be built with the support for
>>>> POSIX semaphores - i.e. sem(4). This option is labeled as experimental
>>>> so it may be that the code is simply incorrect. I've tried it and get
>>>> frequent hangs with the python process in the "usem" state. The kernel
>>>> stack is as follows and looks reasonable:
>>>>
>>>> # procstat -kk 19008
>>>> PID TID COMM TDNAME KSTACK
>>>>
>>>> 19008 101605 python - mi_switch+0x174
>>>> sleepq_catch_signals+0x2f4 sleepq_wait_sig+0x16 _sleep+0x269
>>>> do_sem_wait+0xa19 __umtx_op_sem_wait+0x51 amd64_syscall+0x450
>>>> Xfast_syscall+0xf7
>>>>
>>>> The process doesn't react to SIGINT or SIGTERM but fortunately reacts to
>>>> SIGKILL.
>>>>
>>>> This could be an error in Python code but OTOH this code is not
>>>> FreeBSD-specific so it's unlikely.
>>>
>>> This is using the new umtx-based semaphore code that David Xu wrote. He is
>>> probably the best person to ask (cc'd).
>>>
>>
>> Ok, I've encountered the problem repeatedly while building databases/tdb:
>> it uses Python in the build process (but maybe it needs something else in
>> parallel to provoke the problem).
>
> Glad to see that iXsystems isn't the only one ([1] -- please add a "me
> too" to the PR). The problem is that we do FreeNAS nightlies and they
> frequently get stuck building tdb (10%~20% of the time) and it sticks
> when doing interactive builds as well. The issue appears to be
> exacerbated when we have more builds running in parallel on the same
> machine. I've also run into the same issue compiling talloc because it
> uses the same waf infrastructure as tdb, which was designed to "speed
> things up by forcing builds to be parallelized" (It builds
> kern.smp.ncpus jobs instead of -j 1). Furthermore, it seems to occur
> regardless of whether or not we have the WITH_SEM enabled in python or
> not (build.ix's copy of python doesn't have it enabled, but
> streetfighter.ix, my system bayonetta, etc do).
>
> I haven't actually enabled WITNESS or the deadlock resolver and
> checked for LORs / deadlocks, but that might be an alternate avenue to
> pursue in debugging the issue; my gut is that the issue exists within
> the code that handles the subprocessing stuff and/or the GIL stuff in
> the python interpreter and that the race condition between a command
> actually finishing and not is relatively small (in most cases) and in
> most cases python's code wins and continues on as usual. It could also
> be some non-threadsafe code trying to run in parallel touching things
> that it shouldn't in the python interpreter. It would also be
> interesting to see what python3k brings to the table, but using that
> would be introducing some extra unknowns into the equation.
>
> It can be reproduced by running continuous builds of talloc or tdb.
>
> Thanks!
> -Garrett
>
> 1. http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/163489
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
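
For anyone trying the "continuous builds" reproduction mentioned in the thread, here is a rough sketch of a watchdog loop that counts how often a command fails to finish in time. It is not from the thread: the function name count_hangs, the timeout values, and the placeholder command are all assumptions for illustration, and the real talloc/tdb build command has to be substituted by hand. It assumes Python 3 (for the subprocess timeout support), which is not the python27 interpreter under discussion, so it only wraps the build from the outside.

```python
import subprocess
import sys


def count_hangs(cmd, iterations, timeout_s):
    """Run cmd repeatedly; return how many runs exceeded timeout_s seconds.

    Hypothetical helper for illustration. On timeout, subprocess.run()
    kills the direct child (SIGKILL on POSIX) and then raises
    TimeoutExpired. Grandchildren of the command are not killed here.
    """
    hangs = 0
    for _ in range(iterations):
        try:
            subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs


if __name__ == "__main__":
    # Placeholder command: replace with the actual build invocation.
    placeholder_cmd = [sys.executable, "-c", "pass"]
    stuck = count_hangs(placeholder_cmd, iterations=3, timeout_s=30.0)
    print("runs that timed out:", stuck)
```

Running this with a larger iteration count while other builds are active on the same machine would give a hang rate to compare against the 10%~20% figure quoted above. A timeout only shows that a run got stuck; it does not say where, so procstat -kk on the stuck process (before the timeout fires) and a WITNESS kernel are still needed for that.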