I just pushed in some new timing code for the CRCP Coord component in
r18439.
https://svn.open-mpi.org/trac/ompi/changeset/1843
This should allow you to see the checkpoint progress through the
coordination protocol, and provide some rough timing on the different
parts of the algorithm.
Tamer,
How much communication does your application tend to do? As reported
below, if there is a lot of communication between checkpoints, it may
take a while to checkpoint the application, since the current
implementation of the coordination algorithm checks every message at
checkpoint
Hi Josh: I am currently using Open MPI r18291. When I run a 12-task
job on 3 quad-core nodes, I am able to checkpoint and restart several
times at the beginning of the run. However, after a few hours, when I
try to checkpoint, the code just hangs: it won't checkpoint and
won't give
Thanks, I'll try the version you recommend below!
Josh Hursey wrote:
Your previous email indicated that you were using r18241. I committed
in r18276 a patch that should fix this problem. Let me know if you
still see it after that update.
Cheers,
Josh
On Apr 29, 2008, at 3:18 PM, Sharon Brun
Your previous email indicated that you were using r18241. I committed
in r18276 a patch that should fix this problem. Let me know if you
still see it after that update.
Cheers,
Josh
On Apr 29, 2008, at 3:18 PM, Sharon Brunett wrote:
Josh,
I'm also having trouble using ompi-restart on a snap
Josh,
I'm also having trouble using ompi-restart on a snapshot made from a run
which was previously checkpointed. In other words, restarting a
previously restarted run!
(a) start the run
mpirun -np 16 -am ft-enable-cr ./a.out
<---do an ompi-checkpoint on the mpirun pid from (a) from another
Josh,
Thanks for the quick response. I'll test against some key applications
we would like to use blcr checkpointing/restarting against. Perhaps if
we're lucky and careful, we'll be able to get some near term use out of
what we have installed.
Sharon
Josh Hursey wrote:
Sharon,
This is, unfo
Sharon,
This is, unfortunately, to be expected at the moment for this type of
application. Extremely communication intensive applications will most
likely cause the implementation of the current coordination algorithm
to slow down significantly. This is because on a checkpoint Open MPI
do
Josh Hursey wrote:
On Apr 29, 2008, at 12:55 AM, Sharon Brunett wrote:
I'm finding that using ompi-checkpoint on an application which is
very CPU-bound takes a very long time. For example, trying to
checkpoint a 4- or 8-way Pallas MPI Benchmark application can take
more than an hour.
Josh,
Thanks for your inputs.
Yes, I'm able to restart properly, apart from the hostfile issues. The
problems were with the permissions on
/var/run/nscd/passwd
The hostfile issues have now also been resolved; the problem was the
interaction with Maui/Torque's hostfile and getting a proper hostfile
On Apr 29, 2008, at 12:55 AM, Sharon Brunett wrote:
I'm finding that using ompi-checkpoint on an application which is
very CPU-bound takes a very long time. For example, trying to
checkpoint a 4- or 8-way Pallas MPI Benchmark application can take
more than an hour. The problem is not w
I'm finding that using ompi-checkpoint on an application which is very CPU-bound takes a very long time. For example, trying to checkpoint a 4- or 8-way Pallas MPI Benchmark application can take more than an hour. The problem is not where I'm dumping checkpoints (I've tried local and an nfs mou
On Apr 25, 2008, at 6:12 PM, Sharon Brunett wrote:
Josh,
I'm responding to some outstanding questions about the environment I'm
trying to ompi-restart in.
My answers to your questions are sprinkled below, and include a few
more questions based on attempts I've made to get a multi-node
restart wo
Josh,
I'm responding to some outstanding questions about the environment I'm
trying to ompi-restart in.
My answers to your questions are sprinkled below, and include a few more
questions based on attempts I've made to get a multi-node restart working.
thanks,
Sharon
Sharon Brunett wrote:
Josh Hursey
Josh Hursey wrote:
On Apr 23, 2008, at 4:04 PM, Sharon Brunett wrote:
Hello,
I'm using openmpi-1.3a1r18241 on a 2-node configuration and having
trouble with ompi-restart. I can successfully ompi-checkpoint
and ompi-restart a 1-way MPI code.
When I try a 2 way job running across 2 nod
On Apr 23, 2008, at 4:04 PM, Sharon Brunett wrote:
Hello,
I'm using openmpi-1.3a1r18241 on a 2-node configuration and having
trouble with ompi-restart. I can successfully ompi-checkpoint
and ompi-restart a 1-way MPI code.
When I try a 2 way job running across 2 nodes, I get
bash-2.0