[OMPI users] ompi-checkpoint problem on shared storage

2011-09-23 Thread Dave Schulz

Hi Everyone.

I've been trying to figure out an issue with ompi-checkpoint/blcr.  The 
symptoms seem to be related to what filesystem the 
snapc_base_global_snapshot_dir is located on.


I wrote a simple MPI program in which rank 0 sends to rank 1, 1 sends to 2, 
and so on, with the highest rank sending back to 0.  It then waits one 
second and repeats.
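
For reference, below is a minimal sketch of what such a ring-passing loop 
looks like (the real mpiloop source isn't included here, so take this as an 
approximation that may differ in details such as the message contents):

/* Minimal ring-passing sketch, approximating the "mpiloop" test program
 * described above.  Assumes at least 2 ranks; it loops until the job is
 * terminated (e.g. by ompi-checkpoint --term), so MPI_Finalize is never
 * reached. */
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    for (;;) {
        if (rank == 0) {
            /* Rank 0 starts the ring, then waits for the token to return. */
            MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            token++;   /* token has gone all the way around */
        } else {
            /* Everyone else relays the token to the next rank. */
            MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        }
        sleep(1);   /* wait one second, then repeat */
    }

    MPI_Finalize();   /* never reached */
    return 0;
}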


I'm using openmpi-1.4.3, and when I run "ompi-checkpoint --term 
<PID of mpirun>" with the snapshot directory on the shared filesystems, 
ompi-checkpoint returns a checkpoint reference and the worker processes go 
away, but the mpirun remains and is stuck (it dies right away if I run kill 
on it, so it is responding to SIGTERM).  If I attach strace to the mpirun 
process, it loops on the following forever:


poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, 
events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, 
events=POLLIN}], 6, 1000) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, 
events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, 
events=POLLIN}], 6, 1000) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, 
events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, 
events=POLLIN}], 6, 1000) = 0 (Timeout)


I'm running with:
mpirun -machinefile machines -am ft-enable-cr ./mpiloop
the "machines" file simply has the local hostname listed a few times.  
I've tried 2 and 8.  I can try up to 24 as this node is a pretty big one 
if it's deemed useful.  Also, there's 256Gb of RAM.  And it's Opteron 6 
core, 4 socket if that helps.



I initially installed this on a CentOS 5.6 test system with only local 
hard disks and standard NFS, and everything worked as expected.  When I 
moved over to the production system, things started breaking.  The 
filesystem is the major software difference: the shared filesystems there 
are Ibrix, and that is where the above symptoms appeared.


I haven't even moved on to multi-node MPI runs, since I can't get this to 
work for any number of processes on the local machine unless I set the 
checkpoint directory to /tmp, which is on a local XFS hard disk.  If I put 
the checkpoints on any shared directory, things fail.


I've tried a number of *_verbose MCA parameters, and none of them issue 
any messages at the point of checkpoint; further messages only appear when 
I give up and run kill `pidof mpirun`.


openmpi is compiled with:
./configure --prefix=/global/software/openmpi-blcr 
--with-blcr=/global/software/blcr 
--with-blcr-libdir=/global/software/blcr/lib/ --with-ft=cr 
--enable-ft-thread --enable-mpi-threads --with-openib --with-tm


BLCR is configured only with a prefix to put it in /global/software/blcr; 
otherwise it's vanilla.  Both are compiled with the default gcc.


One final note: occasionally it does succeed and terminate, but it seems 
completely random.


What I'm wondering is whether anyone else has seen symptoms like this -- 
especially the mpirun not quitting after a checkpoint with --term even 
though the worker processes do.


Also, does ompi/ompi-checkpoint rely on some unusual filesystem semantic 
that our shared filesystem may not support?


Thanks for any insights you may have.

-Dave



Re: [OMPI users] ompi-checkpoint problem on shared storage

2011-09-27 Thread Dave Schulz

Thanks Josh,

Just yesterday I stumbled upon another interesting detail about this 
issue.  While reconfiguring things, I accidentally ran as root, and the 
checkpointing all succeeded.  I'm not sure, though, how to find which file 
things are hanging up on.  I've compared straces as root and as a regular 
user and can't see anything out of the ordinary that only root can read 
or, where necessary, write.  My /tmp is 777 with the sticky bit set, and my 
checkpoint directory (currently ~) is writable by the user.  The only files 
that appear in the strace that root can read but users can't are these:


open("/sys/devices/system/cpu/cpu1/online", O_RDONLY) = 3

Root receives a file descriptor, as shown above; users get this:

open("/sys/devices/system/cpu/cpu2/online", O_RDONLY) = -1 EACCES 
(Permission denied)


That is what would be expected, since the permissions on the "online" 
files are 600.  I don't know whether this is actually a problem.  On a 
"sacrificial" node I tried changing the permissions to 644, but it didn't 
help, so I don't really think this is related to the cause; I'm simply 
listing the difference I found between the open syscalls when running as 
root vs. as a user.


Finally, I installed a vanilla CentOS 5.6 machine on the same hardware, 
with Open MPI 1.4.3, BLCR 0.8.3, Mellanox OFED 1.5.3, and a vanilla NFS 
server -- and I see the same symptoms there.


So, contrary to what I previously thought, this really isn't caused by the 
particular type of network filesystem.  Still, things work perfectly when 
the checkpoint directory is /tmp, and the application and its input can be 
read from the NFS server without problems.


I also noted that when I run with lots of verbosity, I get messages like 
this:


Snapshot Ref.:   0 ompi_global_snapshot_9333.ckpt
--
mpirun noticed that process rank 0 with PID 9338 on node b15 exited on 
signal 0 (Unknown signal 0).

--
2 total processes killed (some possibly by mpirun during cleanup)
[b15:09333] sess_dir_finalize: job session dir not empty - leaving
[b15:09333] snapc:full: module_finalize()
[b15:09333] snapc:full: close()
[b15:09333] mca: base: close: component full closed
[b15:09333] mca: base: close: unloading component full
[b15:09333] filem:rsh: close()
[b15:09333] mca: base: close: component rsh closed
[b15:09333] mca: base: close: unloading component rsh
[b15:09333] sess_dir_finalize: proc session dir not empty - leaving
orterun: exiting with status 0

Note in particular the sess_dir_finalize lines: it doesn't seem to be 
cleaning up after itself.  It leaves these in /tmp for every run (even 
successfully checkpointed and terminated runs):


prw-r----- 1 root root 0 Sep 27 15:21 opal_cr_prog_read.9584
prw-r----- 1 root root 0 Sep 27 15:21 opal_cr_prog_read.9585
prw-r----- 1 root root 0 Sep 27 15:21 opal_cr_prog_write.9584
prw-r----- 1 root root 0 Sep 27 15:21 opal_cr_prog_write.9585

I believe those are named pipes.  I don't understand why they aren't 
cleaned up after the checkpoint completes; since the job may be restarted 
on a different batch of nodes, these will accumulate over time.
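
For what it's worth, a small stand-alone check along these lines confirms 
they are FIFOs (this is just an illustration, not part of Open MPI; the 
/tmp location and the opal_cr_prog_ prefix are taken from the listing 
above):

/* Diagnostic sketch only: list leftover opal_cr_prog_* entries in /tmp
 * and report whether each one is a FIFO (named pipe). */
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

int main(void)
{
    const char *dir = "/tmp";
    DIR *d = opendir(dir);
    struct dirent *ent;
    char path[4096];
    struct stat st;

    if (!d) {
        perror("opendir");
        return 1;
    }
    while ((ent = readdir(d)) != NULL) {
        /* Only look at names matching the prefix seen in the ls output. */
        if (strncmp(ent->d_name, "opal_cr_prog_", 13) != 0)
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, ent->d_name);
        if (stat(path, &st) == 0)
            printf("%s: %s\n", path,
                   S_ISFIFO(st.st_mode) ? "FIFO (named pipe)" : "not a FIFO");
    }
    closedir(d);
    return 0;
}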


Any ideas?

Thanks
-Dave


On 09/23/2011 01:24 PM, Josh Hursey wrote:

It sounds like there is a race happening in the shutdown of the
processes. I wonder if the app is shutting down in a way that mpirun
does not quite like.

I have not tested the C/R functionality in the 1.4 series in a long
time. Can you give it a try with the 1.5 series, and see if there is
any variation? You might also try the trunk, but I have not tested it
recently enough to know if things are still working correctly or not
(have others?).

I'll file a ticket so we do not lose track of the bug. Hopefully we
will get to it soon.
   https://svn.open-mpi.org/trac/ompi/ticket/2872

Thanks,
Josh
