I have been having some trouble tracing the source of a CPU stall with Open MPI
on Gentoo.
My code is very simple: each process does a Monte Carlo run, saves some data to
disk, and sends back a single MPI_DOUBLE to node zero, which picks the best
value from all the computations (including the one it did itself).
For some reason, this can cause CPUs to "stall" (see the dmesg output below).
The stall actually causes the system to crash and reboot, which seems pretty
crazy.
My best guess is that some of the nodes greater than zero have "MPI_Send"s out,
but node zero is not finished with its own computation yet, and so has not yet
posted an MPI_Recv. Perhaps they choke while waiting? The crash happens when I
make the Monte Carlo runs long, so that the variance in finish times is larger.
However, the behavior seems a bit extreme, and I am wondering if something more
subtle is going on. My sysadmin was trying to fix something on the machine the
last time it crashed, and it trashed the kernel! So I am also in the sysadmin
doghouse.
Any help or advice greatly appreciated! Is it likely to be an MPI_Send/MPI_Recv
problem, or is there something else going on?
[ 1273.079260] INFO: rcu_sched detected stalls on CPUs/tasks: { 12 13} (detected by 17, t=60002 jiffies)
[ 1273.079272] Pid: 2626, comm: cluster Not tainted 3.6.11-gentoo #10
[ 1273.079275] Call Trace:
[ 1273.079277] <IRQ> [<ffffffff81099b87>] rcu_check_callbacks+0x5a7/0x600
[ 1273.079294] [<ffffffff8103fae3>] update_process_times+0x43/0x80
[ 1273.079298] [<ffffffff8106d796>] tick_sched_timer+0x76/0xc0
[ 1273.079303] [<ffffffff8105329e>] __run_hrtimer.isra.33+0x4e/0x100
[ 1273.079306] [<ffffffff81053adb>] hrtimer_interrupt+0xeb/0x220
[ 1273.079311] [<ffffffff8101fd94>] smp_apic_timer_interrupt+0x64/0xa0
[ 1273.079316] [<ffffffff81515f07>] apic_timer_interrupt+0x67/0x70
[ 1273.079317] <EOI>
Simon
Research Fellow
Santa Fe Institute
http://santafe.edu/~simon