Tamer,

I'm confident that this particular problem is now fixed in the trunk (r18276). If you are interested in the details of the bug and how it was fixed, the commit message is fairly detailed:

https://svn.open-mpi.org/trac/ompi/changeset/18276

Let me know if this patch fixes things. Like I said, I'm confident that it does, but there are always more bugs :)

Thanks again for the bug report.

Cheers,
Josh

On Apr 24, 2008, at 11:02 AM, Josh Hursey wrote:
Tamer,

Another user contacted me off-list yesterday with a similar problem on the current trunk. I have been able to reproduce this and am currently trying to debug it again. It seems to occur more often with builds without the checkpoint thread (--disable-ft-thread). It appears to be a race in our connection wireup, which is why it does not always occur.

Thank you for your patience as I try to track this down. I'll let you know as soon as I have a fix.

Cheers,
Josh

On Apr 24, 2008, at 10:50 AM, Tamer wrote:

Josh,

Thank you for your help. I was able to do the following with r18241:

start the parallel job
checkpoint and restart
checkpoint and restart
checkpoint, but failed to restart, with the following message:

ompi-restart ompi_global_snapshot_23800.ckpt
[dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection to lifeline [[45699,0],0] lost
[dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection to lifeline [[45699,0],0] lost
[dhcp-119-202:23650] *** Process received signal ***
[dhcp-119-202:23650] Signal: Segmentation fault (11)
[dhcp-119-202:23650] Signal code: Address not mapped (1)
[dhcp-119-202:23650] Failing at address: 0x3e0f50
[dhcp-119-202:23650] [ 0] [0x110440]
[dhcp-119-202:23650] [ 1] /lib/libc.so.6(__libc_start_main+0x107) [0xc5df97]
[dhcp-119-202:23650] [ 2] ./ares-openmpi-r18241 [0x81703b1]
[dhcp-119-202:23650] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 23857 on node dhcp-119-202.caltech.edu exited on signal 11 (Segmentation fault).

So this time the process went further than before. I tested on a different platform (a 64-bit machine with Fedora Core 7), and there Open MPI checkpoints and restarts as many times as I want without any problems. This means the issue above must be platform dependent, and I must be missing some option when building the code.

Cheers,
Tamer

On Apr 22, 2008, at 5:52 PM, Josh Hursey wrote:

Tamer,

This should now be fixed in r18241. Though I was able to replicate this bug, it only occurred sporadically for me. It seemed to be caused by some socket descriptor caching that was not properly cleaned up by the restart procedure. My testing indicates that this bug is now fixed, but since it is difficult to reproduce, definitely let me know if you see it happen again.

With the current trunk you may see the following error message:

--------------------------------------
[odin001][[7448,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--------------------------------------

This is not caused by the checkpoint/restart code, but by some recent changes to our TCP component. We are working on fixing this; I just wanted to give you a heads-up in case you see this error. As far as I can tell it does not interfere with the checkpoint/restart functionality.

Let me know if this fixes your problem.

Cheers,
Josh

On Apr 22, 2008, at 9:16 AM, Josh Hursey wrote:

Tamer,

Just wanted to update you on my progress. I am able to reproduce something similar to this problem and am currently working on a solution. I'll let you know when it is available, probably in the next day or two. Thank you for the bug report.

Cheers,
Josh
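For reference, the checkpoint/restart cycle Tamer describes above amounts to roughly the following commands. This is a minimal sketch only, using the flags quoted elsewhere in this thread; the PIDs and snapshot names are placeholders, not values from the actual runs.

  # Sketch of the cycle; <PID> stands for the PID of the mpirun (or, on
  # older revisions, orterun) process of the currently running job, and
  # the snapshot names are placeholders.
  mpirun -np 2 -am ft-enable-cr ./ares-openmpi -c -f madonna-13760   # start the parallel job
  ompi-checkpoint <PID>                                              # first checkpoint
  ompi-restart ompi_global_snapshot_<PID>.ckpt                       # first restart
  ompi-checkpoint <PID-of-restarted-job>                             # checkpoint the restarted job
  ompi-restart ompi_global_snapshot_<PID-of-restarted-job>.ckpt      # second restart (the step that failed above)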
On Apr 18, 2008, at 1:11 PM, Tamer wrote:

Hi Josh:

I am running on Linux Fedora Core 7, kernel 2.6.23.15-80.fc7. The machine is dual-core with shared memory, so it's not even a cluster.

I downloaded r18208 and built it with the following options:

./configure --prefix=/usr/local/openmpi-with-checkpointing-r18208 --with-ft=cr --with-blcr=/usr/local/blcr

When I run mpirun I pass the following command:

mpirun -np 2 -am ft-enable-cr ./ares-openmpi -c -f madonna-13760

I was able to checkpoint and restart successfully, and was able to checkpoint the restarted job (mpirun showed up with ps -efa | grep mpirun under r18208), but was unable to restart again; here is the error message:

ompi-restart ompi_global_snapshot_23865.ckpt
[dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity: Connection to lifeline [[45670,0],0] lost
[dhcp-119-202.caltech.edu:23845] [[45670,1],0]-[[45670,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23845] [[45670,1],0] routed:unity: Connection to lifeline [[45670,0],0] lost
[dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity: Connection to lifeline [[45670,0],0] lost
[dhcp-119-202.caltech.edu:23845] [[45670,1],0]-[[45670,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23845] [[45670,1],0] routed:unity: Connection to lifeline [[45670,0],0] lost
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 24012 on node dhcp-119-202.caltech.edu exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).

Thank you in advance for your help.

Tamer

On Apr 18, 2008, at 7:07 AM, Josh Hursey wrote:

This problem has come up in the past and may have been fixed since r14519. Can you update to r18208 and see if the error still occurs?

A few other questions will help me try to reproduce the problem. Can you tell me more about the configuration of the system you are running on (number of machines, whether there is a resource manager)? How did you configure Open MPI, and what command line options are you passing to 'mpirun'?

-- Josh

On Apr 18, 2008, at 9:36 AM, Tamer wrote:

Thanks Josh, I tried what you suggested with my existing r14519, and I was able to checkpoint the restarted job but was never able to restart it.
I looked up the PID for 'orterun' and checkpointed the restarted job, but when I try to restart from that point I get the following error:

ompi-restart ompi_global_snapshot_7704.ckpt
[dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity: Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity: Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity: Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0] mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity: Connection to lifeline [[61851,0],0] lost
--------------------------------------------------------------------------
orterun has exited due to process rank 1 with PID 7737 on node dhcp-119-202.caltech.edu exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by orterun (as reported here).

Do I have to run the ompi-clean command after the first checkpoint and before restarting the checkpointed job so I can checkpoint it again, or is there something I am missing in this version entirely, so that I would have to go to r18208?

Thank you in advance for your help.

Tamer

On Apr 18, 2008, at 6:03 AM, Josh Hursey wrote:

When you use 'ompi-restart' to restart a job, it fork/execs a completely new job using the restarted processes for the ranks. However, instead of calling the 'mpirun' process, ompi-restart currently calls 'orterun'. These two programs are exactly the same (mpirun is a symbolic link to orterun), so if you look for the PID of 'orterun', that can be used to checkpoint the process.

However, it is confusing that Open MPI makes this switch, so in r18208 I committed a fix that uses the 'mpirun' binary name instead of the 'orterun' binary name. This fits the typical use case of checkpoint/restart in Open MPI, in which users expect to find the 'mpirun' process on restart instead of the lesser-known 'orterun' process.

Sorry for the confusion.

Josh

On Apr 18, 2008, at 1:14 AM, Tamer wrote:

Dear all,

I installed the developer's version r14519 and was able to get it running. I successfully checkpointed a parallel job and restarted it. My question is: how can I checkpoint the restarted job? The problem is that once the original job is terminated and restarted later on, the mpirun process does not exist anymore (ps -efa | grep mpirun), and hence I do not know which PID I should use when I run ompi-checkpoint on the restarted job.

Any help would be greatly appreciated.
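Based on Josh's explanation above, checkpointing a restarted job looks roughly like the sketch below. It assumes the restarted job is still running; the snapshot name and PID are placeholders, and the process name to look for is 'orterun' before r18208 and 'mpirun' from r18208 onward.

  ompi-restart ompi_global_snapshot_1234.ckpt    # restart from an earlier snapshot (placeholder name); leave it running
  ps -efa | grep orterun                         # from another terminal; use 'grep mpirun' with r18208 or later
  ompi-checkpoint <PID>                          # checkpoint the restarted job using the PID found above
  ompi-restart ompi_global_snapshot_<PID>.ckpt   # the new snapshot can then be restarted in turn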
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users