[OMPI users] How to restart a job twice

2008-04-18 Thread Tamer
Dear all, I installed the developer's version r14519 and was able to
get it running. I successfully checkpointed a parallel job and
restarted it. My question is: how can I checkpoint the restarted job?
The problem is that once the original job is terminated and restarted
later on, the mpirun process no longer exists (ps -efa | grep mpirun
finds nothing), so I do not know which PID to give ompi-checkpoint for
the restarted job. Any help would be greatly appreciated.
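
For reference, my first checkpoint/restart cycle looks roughly like the
following (the application name, PID, and snapshot name are only
illustrative):

  mpirun -np 2 -am ft-enable-cr ./my_app          # start the job with checkpoint/restart support
  ps -efa | grep mpirun                           # note the PID of mpirun
  ompi-checkpoint <PID of mpirun>                 # take a checkpoint of the running job
  ompi-restart ompi_global_snapshot_<PID>.ckpt    # restart from the resulting snapshot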




Re: [OMPI users] How to restart a job twice

2008-04-18 Thread Tamer
Thanks Josh, I tried what you suggested with my existing r14519, and I  
was able to checkpoint the restarted job but was never able to restart  
it. I looked up the PID for 'orterun' and checkpointed the restarted  
job but when I try to restart from that point I get the following error:


ompi-restart ompi_global_snapshot_7704.ckpt
[dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0]  
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity:  
Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07292] [[61851,1],1]-[[61851,0],0]  
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07292] [[61851,1],1] routed:unity:  
Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0]  
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity:  
Connection to lifeline [[61851,0],0] lost
[dhcp-119-202.caltech.edu:07291] [[61851,1],0]-[[61851,0],0]  
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:07291] [[61851,1],0] routed:unity:  
Connection to lifeline [[61851,0],0] lost

--
orterun has exited due to process rank 1 with PID 7737 on
node dhcp-119-202.caltech.edu exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by orterun (as reported here).

Do I have to run the ompi-clean command after the first checkpoint
and before restarting the checkpointed job so that I can checkpoint it
again, or is there something I am missing entirely in this version,
so that I would have to move to r18208? Thank you in advance for your help.


Tamer

On Apr 18, 2008, at 6:03 AM, Josh Hursey wrote:


When you use 'ompi-restart' to restart a job, it fork/execs a
completely new job using the restarted processes for the ranks.
However, instead of calling the 'mpirun' process, ompi-restart currently
calls 'orterun'. These two programs are exactly the same (mpirun is a
symbolic link to orterun). So if you look for the PID of 'orterun',
that can be used to checkpoint the process.
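
In practice, checkpointing the restarted job looks roughly like the
following (the PID and snapshot name are illustrative):

  ps -efa | grep orterun                          # the restarted job shows up as 'orterun', not 'mpirun'
  ompi-checkpoint <PID of orterun>                # checkpoint the restarted job
  ompi-restart ompi_global_snapshot_<PID>.ckpt    # restart from the new snapshot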

However it is confusing that Open MPI makes this switch. So I
committed in r18208 a fix for this that uses the 'mpirun' binary name
instead of the 'orterun' binary name. This fits with the typical use
case of checkpoint/restart in Open MPI in which users expect to find
the 'mpirun' process on restart instead of the lesser known 'orterun'
process.

Sorry for the confusion.

Josh





Re: [OMPI users] How to restart a job twice

2008-04-18 Thread Tamer

Hi Josh:

I am running on Linux Fedora Core 7, kernel 2.6.23.15-80.fc7.

The machine is dual-core with shared memory, so it's not even a cluster.

I downloaded r18208 and built it with the following options:

./configure --prefix=/usr/local/openmpi-with-checkpointing-r18208 --with-ft=cr --with-blcr=/usr/local/blcr


When I run mpirun, I pass the following command:

mpirun -np 2 -am ft-enable-cr ./ares-openmpi -c -f madonna-13760

I was able to checkpoint and restart successfully, and was able to
checkpoint the restarted job (mpirun showed up with ps -efa | grep
mpirun under r18208) but was unable to restart again; here is the error
message:


ompi-restart ompi_global_snapshot_23865.ckpt
[dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0]  
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity:  
Connection to lifeline [[45670,0],0] lost
[dhcp-119-202.caltech.edu:23845] [[45670,1],0]-[[45670,0],0]  
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23845] [[45670,1],0] routed:unity:  
Connection to lifeline [[45670,0],0] lost
[dhcp-119-202.caltech.edu:23846] [[45670,1],1]-[[45670,0],0]  
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23846] [[45670,1],1] routed:unity:  
Connection to lifeline [[45670,0],0] lost
[dhcp-119-202.caltech.edu:23845] [[45670,1],0]-[[45670,0],0]  
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23845] [[45670,1],0] routed:unity:  
Connection to lifeline [[45670,0],0] lost

--
mpirun has exited due to process rank 1 with PID 24012 on
node dhcp-119-202.caltech.edu exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

Thank you in advance for your help.

Tamer


On Apr 18, 2008, at 7:07 AM, Josh Hursey wrote:


This problem has come up in the past and may have been fixed since
r14519. Can you update to r18208 and see if the error still occurs?

A few other questions that will help me try to reproduce the problem.
Can you tell me more about the configuration of the system you are
running on (number of machines, if there is a resource manager)? How
did you configure Open MPI and what command line options are you
passing to 'mpirun'?

-- Josh


Re: [OMPI users] How to restart a job twice

2008-04-24 Thread Tamer
Josh, Thank you for your help. I was able to do the following with  
r18241:


start the parallel job
checkpoint and restart
checkpoint and restart
checkpoint but failed to restart with the following message:

ompi-restart ompi_global_snapshot_23800.ckpt
[dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0]  
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection  
to lifeline [[45699,0],0] lost
[dhcp-119-202.caltech.edu:23650] [[45699,1],1]-[[45699,0],0]  
mca_oob_tcp_msg_send_handler: writev failed: Broken pipe (32)
[dhcp-119-202.caltech.edu:23650] [[45699,1],1] routed:tree: Connection  
to lifeline [[45699,0],0] lost

[dhcp-119-202:23650] *** Process received signal ***
[dhcp-119-202:23650] Signal: Segmentation fault (11)
[dhcp-119-202:23650] Signal code: Address not mapped (1)
[dhcp-119-202:23650] Failing at address: 0x3e0f50
[dhcp-119-202:23650] [ 0] [0x110440]
[dhcp-119-202:23650] [ 1] /lib/libc.so.6(__libc_start_main+0x107)  
[0xc5df97]

[dhcp-119-202:23650] [ 2] ./ares-openmpi-r18241 [0x81703b1]
[dhcp-119-202:23650] *** End of error message ***
--
mpirun noticed that process rank 1 with PID 23857 on node  
dhcp-119-202.caltech.edu exited on signal 11 (Segmentation fault).



So, this time the process went further than before. I tested on a
different platform (a 64-bit machine with Fedora Core 7) and Open MPI
checkpoints and restarts as many times as I want without any
problems. This means the issue above must be platform-dependent and
I must be missing some option when building the code.


Cheers,
Tamer


On Apr 22, 2008, at 5:52 PM, Josh Hursey wrote:


Tamer,

This should now be fixed in r18241.

Though I was able to replicate this bug, it only occurred
sporadically for me. It seemed to be caused by some socket descriptor
caching that was not properly cleaned up by the restart procedure.

My testing suggests that this bug is now fixed, but since it is
difficult to reproduce, definitely let me know if you see it happen
again.


With the current trunk you may see the following error message:
--
[odin001][[7448,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
--
This is not caused by the checkpoint/restart code, but by some recent
changes to our TCP component. We are working on fixing this, but I
just wanted to give you a heads up in case you see this error. As far
as I can tell it does not interfere with the checkpoint/restart
functionality.

Let me know if this fixes your problem.

Cheers,
Josh


On Apr 22, 2008, at 9:16 AM, Josh Hursey wrote:


Tamer,

Just wanted to update you on my progress. I am able to reproduce
something similar to this problem. I am currently working on a
solution to it. I'll let you know when it is available, probably in
the next day or two.

Thank you for the bug report.

Cheers,
Josh


Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-05-13 Thread Tamer
Hi Josh: I am currently using Open MPI r18291. When I run a 12-task
job on 3 quad-core nodes, I am able to checkpoint and restart several
times at the beginning of the run; however, after a few hours, when I
try to checkpoint, the code just hangs: it won't checkpoint and won't
give me an error message. Has this problem been reported before?
All the required executables and libraries are in my path.


Thanks,
Tamer


On Apr 29, 2008, at 1:37 PM, Sharon Brunett wrote:


Thanks, I'll try the version you recommend below!

Josh Hursey wrote:

Your previous email indicated that you were using r18241. I committed
in r18276 a patch that should fix this problem. Let me know if you
still see it after that update.

Cheers,
Josh

On Apr 29, 2008, at 3:18 PM, Sharon Brunett wrote:


Josh,
I'm also having trouble using ompi-restart on a snapshot made from a run
which was previously checkpointed. In other words, restarting a
previously restarted run!

(a) start the run
mpirun -np 16 -am ft-enable-cr ./a.out

 <--- do an ompi-checkpoint on the mpirun pid from (a) from another terminal --->

(b) restart the checkpointed run

ompi-restart ompi_global_snapshot_30086.ckpt

  <--- do an ompi-checkpoint on the mpirun pid from (b) from another terminal --->

(c) restart the checkpointed run
  ompi-restart ompi_global_snapshot_30120.ckpt

--
mpirun noticed that process rank 12 with PID 30480 on node shc005 exited
on signal 13 (Broken pipe).
--
-bash-2.05b$

I can restart the previous (30086) ckpt but not the latest one made from
a restarted run.

Any insights would be appreciated.

thanks,
Sharon



Josh Hursey wrote:

Sharon,

This is, unfortunately, to be expected at the moment for this type of
application. Extremely communication intensive applications will most
likely cause the implementation of the current coordination algorithm
to slow down significantly. This is because on a checkpoint Open MPI
does a peerwise check on the description of (possibly) each message to
make sure there are no messages in flight. So for a huge number of
messages this could take a long time.

This is a performance problem with the current implementation of the
algorithm that we use in Open MPI. I've been meaning to go back and
improve this, but it has not been critical to do so since applications
that perform in this manner are outliers in HPC. The coordination
algorithm I'm using is based on the algorithm used by LAM/MPI, but
implemented at a higher level. There are a number of improvements that
I can explore in the checkpoint/restart framework in Open MPI.

If this is critical for you I might be able to take a look at it, but
I can't say when. :(

-- Josh

On Apr 29, 2008, at 1:07 PM, Sharon Brunett wrote:


Josh Hursey wrote:

On Apr 29, 2008, at 12:55 AM, Sharon Brunett wrote:

I'm finding that using ompi-checkpoint on an application which is
very cpu bound takes a very, very long time. For example, trying to
checkpoint a 4 or 8 way Pallas MPI Benchmark application can take
more than an hour. The problem is not where I'm dumping checkpoints
(I've tried local and an nfs mount with plenty of space, and cpu
intensive apps checkpoint quickly).

I'm using BLCR_VERSION=0.6.5 and openmpi-1.3a1r18241.

Is this condition common, and if so, are there possibly mca parameters
which could help?
It depends on how you configured Open MPI with checkpoint/restart.
There are two modes of operation: no threads, and with a checkpoint
thread. They are described a bit more in the Checkpoint/Restart Fault
Tolerance User's Guide on the wiki:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

By default we compile without the checkpoint thread. The restriction
here is that all processes must be in the MPI library in order to make
progress on the global checkpoint. For CPU intensive applications this
may cause quite a delay in the time to start, and subsequently finish,
a checkpoint. I'm guessing that this is what you are seeing.

If you configure with the checkpoint thread (add '--enable-mpi-threads
--enable-ft-thread' to ./configure) then Open MPI will create a thread
that runs with each application process. This thread is fairly
lightweight and will make sure that a checkpoint progresses even when
the process is not in the Open MPI library.

Try enabling the checkpoint thread and see if that helps improve the
checkpoint time.
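
For example, building on the configure line used earlier in this thread,
a build with the checkpoint thread enabled would look roughly like this
(the install prefix and BLCR path are site-specific):

  ./configure --prefix=/usr/local/openmpi-with-checkpointing \
      --with-ft=cr --with-blcr=/usr/local/blcr \
      --enable-mpi-threads --enable-ft-thread
  make all install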

Josh,
First... please pardon the blunder in my earlier mail. Comms-bound apps
are the ones taking a while to checkpoint, not cpu-bound. In any case, I
tried configuring with the above two configure options, but still no luck
on improving checkpointing times or gaining completion on larger MPI
task runs being checkpointed.

It looks like the checkpointing is just hanging.