I believe this is now fixed in the trunk. I was able to reproduce with
the current trunk and committed a fix a few minutes ago in r19601. So
the fix should be in tonight's tarball (or you can grab it from SVN).
I've made a request to have the patch applied to v1.3, but that may
take a day or so to complete.
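If you would rather build from SVN yourself, something like the
following should do it (the trunk URL is from memory, so please
double-check it against the Open MPI web site; the build is the usual
autogen / configure / make afterwards):

  # Assumed Open MPI SVN trunk location:
  shell$ svn co http://svn.open-mpi.org/svn/ompi/trunk ompi-trunk
  shell$ cd ompi-trunk
  shell$ svn up -r 19601    # make sure the fix from r19601 is included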
Let me know if this fix eliminates your SIGPIPE issues.
Thanks for the bug report :)
Cheers,
Josh
On Sep 17, 2008, at 11:55 PM, Matthias Hovestadt wrote:
Hi Josh!
First of all, thanks a lot for replying. :-)
When executing this checkpoint command, the running application
aborts immediately, even though I did not specify the "--term" option:
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 14050 on node grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).
--------------------------------------------------------------------------
ccs@grid-demo-1:~$
Interesting. This looks like a bug with the restart mechanism in
Open MPI. This was working fine, but something must have changed in
the trunk to break it.
Do you perhaps know an SVN revision number of OMPI that
is known to be working? If this issue is a regression,
I would be glad to use the source from an older but
working SVN state...
A useful piece of debugging information for me would be a stack
trace from the failed process. You should be able to get this from
a core file it left, or by setting the following MCA variable
in $HOME/.openmpi/mca-params.conf:
opal_cr_debug_sigpipe=1
This will cause the Open MPI app to wait in a sleep loop when it
detects a Broken Pipe signal. Then you should be able to attach a
debugger and retrieve a stack trace.
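For example, once the process is spinning in that sleep loop you can
attach to the PID printed in the sigpipe_debug message and dump the
stacks (any debugger will do; gdb shown here as a sketch):

  shell$ gdb -p <PID printed in the sigpipe_debug message>
  (gdb) thread apply all bt
  (gdb) detach
  (gdb) quit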
I created this file:
ccs@grid-demo-1:~$ cat .openmpi/mca-params.conf
opal_cr_debug_sigpipe=1
ccs@grid-demo-1:~$
Then I restarted the application from a checkpointed state
and tried to checkpoint this restarted application. Unfortunately
the restarted application still terminates, despite this
parameter. However, the output changed slightly:
worker fetch area available 1
[grid-demo-1.cit.tu-berlin.de:26220] opal_cr: sigpipe_debug: Debug SIGPIPE [13]: PID (26220)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 26248 on node grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)
ccs@grid-demo-1:~$
There is now this additional "opal_cr: sigpipe_debug" line, so
it apparently does read the .openmpi/mca-params.conf file.
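(As a cross-check, I assume ompi_info can also show the current value
of this parameter, e.g. something like:

  ccs@grid-demo-1:~$ ompi_info --all | grep opal_cr_debug_sigpipe

but the sigpipe_debug line above already seems to confirm it.)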
I also tried to get a core file by setting "ulimit -c 50000", so
that ulimit -a gives me the following output:
ccs@grid-demo-1:~$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 20
file size (blocks, -f) unlimited
pending signals (-i) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) unlimited
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
ccs@grid-demo-1:~$
Unfortunately, no core file is generated, so I do not know
how to give you the requested stack trace.
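(For what it's worth, the ulimit -a output above still reports a core
file size of 0, so I suppose the limit has to be raised in the same
shell that launches mpirun before any core file can appear. A minimal
sketch, assuming bash and a hypothetical launch command:

  ccs@grid-demo-1:~$ ulimit -c unlimited
  ccs@grid-demo-1:~$ ulimit -c          # should no longer print 0
  ccs@grid-demo-1:~$ mpirun -np 2 ./my_app
)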
Are there perhaps other debug parameters I could use?
Best,
Matthias