Dear Barry and Jeff,
using Open MPI we are experiencing behaviour much like what Barry
reported.
Let me introduce the context:
we are running RHEL4 U4 on 2-way AMD Opteron dual-core nodes.
Each node is equipped with 16 GB of RAM plus 4 GB of swap.
Open MPI is 1.2.2.
Sometimes, for jobs that run for many hours (1 - 2 days), mpirun
triggers a "kernel memory crisis". This is an excerpt of what we are
seeing in syslog:
Jan 7 10:14:18 node203e0 kernel: mpirun: page allocation failure. order:5, mode:0xd0
Jan 7 10:14:18 node203e0 kernel:
Jan 7 10:14:18 node203e0 kernel: Call Trace:<ffffffff8015e0a8>{__alloc_pages+768} <ffffffff8015e141>{__get_free_pages+11}
Jan 7 10:14:18 node203e0 kernel: <ffffffff8016127c>{kmem_getpages+36} <ffffffff802d0d2e>{tcp_sendmsg+0}
Jan 7 10:14:18 node203e0 kernel: <ffffffff802d0d2e>{tcp_sendmsg+0} <ffffffff80161a11>{cache_alloc_refill+609}
Jan 7 10:14:18 node203e0 kernel: <ffffffff801616df>{__kmalloc+123} <ffffffff802aadd4>{alloc_skb+65}
Jan 7 10:14:18 node203e0 kernel: <ffffffff802d0e99>{tcp_sendmsg+363} <ffffffff802a7143>{sock_sendmsg+271}
Jan 7 10:14:18 node203e0 kernel: <ffffffff8015bfc1>{__generic_file_aio_write_nolock+731}
Jan 7 10:14:18 node203e0 kernel: <ffffffff8015c329>{generic_file_aio_write+126} <ffffffff80135752>{autoremove_wake_function+0}
Jan 7 10:14:18 node203e0 kernel: <ffffffffa0254448>{:nfs:nfs_file_write+195} <ffffffff802a767f>{sock_readv_writev+122}
Jan 7 10:14:18 node203e0 kernel: <ffffffff802a7704>{sock_writev+61} <ffffffff80179a89>{do_readv_writev+421}
Jan 7 10:14:18 node203e0 kernel: <ffffffff80135752>{autoremove_wake_function+0} <ffffffff8018b393>{poll_freewait+64}
Jan 7 10:14:18 node203e0 kernel: <ffffffff801932d4>{dnotify_parent+34} <ffffffff80179c5b>{sys_writev+69}
Jan 7 10:14:18 node203e0 kernel: <ffffffff8011026a>{system_call+126}
Jan 7 10:14:18 node203e0 kernel: Mem-info:
Jan 7 10:14:18 node203e0 kernel: Node 1 DMA per-cpu: empty
Jan 7 10:14:18 node203e0 kernel: Node 1 Normal per-cpu:
Jan 7 10:14:18 node203e0 kernel: cpu 0 hot: low 32, high 96, batch 16
Jan 7 10:14:18 node203e0 kernel: cpu 0 cold: low 0, high 32, batch 16
Jan 7 10:14:18 node203e0 kernel: cpu 1 hot: low 32, high 96, batch 16
Jan 7 10:14:18 node203e0 kernel: cpu 1 cold: low 0, high 32, batch 16
Jan 7 10:14:19 node203e0 kernel: cpu 2 hot: low 32, high 96, batch 16
Jan 7 10:14:19 node203e0 kernel: cpu 2 cold: low 0, high 32, batch 16
Jan 7 10:14:19 node203e0 kernel: cpu 3 hot: low 32, high 96, batch 16
Jan 7 10:14:19 node203e0 kernel: cpu 3 cold: low 0, high 32, batch 16
Jan 7 10:14:19 node203e0 kernel: Node 1 HighMem per-cpu: empty
Jan 7 10:14:19 node203e0 kernel: Node 0 DMA per-cpu:
Jan 7 10:14:19 node203e0 kernel: cpu 0 hot: low 2, high 6, batch 1
Jan 7 10:14:19 node203e0 kernel: cpu 0 cold: low 0, high 2, batch 1
Jan 7 10:14:19 node203e0 kernel: cpu 1 hot: low 2, high 6, batch 1
Jan 7 10:14:19 node203e0 kernel: cpu 1 cold: low 0, high 2, batch 1
Jan 7 10:14:19 node203e0 kernel: cpu 2 hot: low 2, high 6, batch 1
:
The "crisis" may lead to an "mpirun hang", sometimes.
It seems that mpirun uses aggressively "socket calls", but we are not
sure about the motivation of such behaviour. Maybe there are a set of
synergistic causes, nevertheless when the kernel reports such kind of
"fault" the only implied process is mpirun ...., all the times.
marco
On Fri, 2008-01-18 at 22:13 -0500, Barry Rountree wrote:
> On Fri, Jan 18, 2008 at 08:33:10PM -0500, Jeff Squyres wrote:
> > Barry --
> >
> > Could you check what apps are still running when it hangs? I.e., I
> > assume that all the uptime's are dead; are all the orted's dead on the
> > remote nodes? (orted = our helper process that is launched on the
> > remote nodes to exert process control, funnel I/O back and forth to
> > mpirun, etc.)
> >
> > If any of the orted's are still running, can you connect to them with
> > gdb and get a backtrace to see where they are hung?
> >
> > Likewise, can you connect to mpirun with gdb and get a backtrace of
> > where it's hung?
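For reference, this is roughly how we would collect those backtraces;
it assumes gdb and pgrep are installed on the nodes, and node203e0 is
just an example host of ours:

  # on a remote node, see whether an orted survived
  ssh node203e0 ps -ef | grep orted

  # attach to a surviving orted (or to mpirun on the launch node)
  # and dump backtraces for every thread
  gdb -p $(pgrep -n orted)
  (gdb) thread apply all bt
  (gdb) detach
  (gdb) quit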
> >
> > Ralph, the main ORTE developer, is pretty sure that it's stuck in the
> > IO flushing routines that are executed at the end of time (look for
> > function names like iof_flush or similar). We thought we had fixed
> > all of those on the 1.2 branch, but perhaps there's some other weird
> > race condition happening that doesn't happen on our test machines...
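For anyone who wants to look at those routines, a grep over the source
tree should be enough; the directory name below is just our unpacked
1.2.4 tarball, and the exact function names may differ as Jeff says:

  # locate the I/O forwarding flush routines mentioned above
  grep -rn "iof_flush" openmpi-1.2.4/orte/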
>
> I'm happy to help. I've got a paper submission deadline on Tuesday, so
> it might not be until midweek.
>
> Thanks for the reply,
>
> Barry
>
> >
> >
> >
> > On Jan 13, 2008, at 10:17 AM, Barry Rountree wrote:
> >
> > > On Sun, Jan 13, 2008 at 09:54:47AM -0500, Barry Rountree wrote:
> > > > Hello,
> > > >
> > > > The following command
> > > >
> > > > mpirun -np 2 -hostfile ~/hostfile uptime
> > > >
> > > > will occasionally hang after completing.  The expected output
> > > > appears on the screen, but mpirun needs a SIGKILL to return to
> > > > the console.
> > > >
> > > > This has been verified with OpenMPI v1.2.4 compiled with both
> > > > icc 9.1 20061101 (aka 9.1.045) and gcc 4.1.0 20060304 (aka Red
> > > > Hat 4.1.0-3).  I have also tried earlier versions of OpenMPI and
> > > > found the same bug (1.1.2 and 1.2.2).
> > > >
> > > > Using -verbose didn't provide any additional output.  I'm happy
> > > > to help track down whatever is causing this.
> > >
> > > A couple more data points:
> > >
> > > mpirun -np 4 -hostfile ~/hostfile --no-daemonize uptime
> > >
> > > hung twice over 100 runs.  Without the --no-daemonize, the command
> > > hung 16 times over 100 runs.  (This is using the version compiled
> > > with icc.)
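One way to gather such statistics is a loop along these lines (only a
sketch: the 60-second limit is arbitrary and `timeout` needs a
reasonably recent GNU coreutils):

  # count runs that do not exit cleanly within 60 seconds
  hung=0
  for i in $(seq 1 100); do
      timeout -s KILL 60 mpirun -np 4 -hostfile ~/hostfile uptime \
          > /dev/null || hung=$((hung + 1))
  done
  echo "runs killed after hanging: $hung"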
> > >
> > > Barry
> > >
> > > >
> > > > Many thanks,
> > > >
> > > > Barry Rountree
> > > > Ph.D. Candidate, Computer Science
> > > > University of Georgia
> > > >
> >
> >
> > --
> > Jeff Squyres
> > Cisco Systems
> >
>
--
-----------------------------------------------------------------
Marco Sbrighi m.sbri...@cineca.it
HPC Group
CINECA Interuniversity Computing Centre
via Magnanelli, 6/3
40033 Casalecchio di Reno (Bo) ITALY
tel. 051 6171516