Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-05-14 Thread Josh Hursey
I just pushed in some new timing code for the CRCP Coord component in r18439. https://svn.open-mpi.org/trac/ompi/changeset/1843 This should allow you to see the checkpoint progress through the coordination protocol, and provide some rough timing on the different parts of the algorithm. T

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-05-14 Thread Josh Hursey
Tamer, How much communication does your application tend to do? As reported below, if there is a lot of communication between checkpoints then it may take a while to checkpoint the application since the current implementation of the coordination algorithm checks every message at checkpoint

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-05-13 Thread Tamer
Hi Josh: I am currently using openmpi r18291, and when I run a 12 task job on 3 quad core nodes I am able to checkpoint and restart several times at the beginning of the run; however, after a few hours, when I try to checkpoint, the code just hangs and it just won't checkpoint and won't give

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-29 Thread Sharon Brunett
Thanks, I'll try the version you recommend below! Josh Hursey wrote: Your previous email indicated that you were using r18241. I committed in r18276 a patch that should fix this problem. Let me know if you still see it after that update. Cheers, Josh On Apr 29, 2008, at 3:18 PM, Sharon Brun

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-29 Thread Josh Hursey
Your previous email indicated that you were using r18241. I committed in r18276 a patch that should fix this problem. Let me know if you still see it after that update. Cheers, Josh On Apr 29, 2008, at 3:18 PM, Sharon Brunett wrote: Josh, I'm also having trouble using ompi-restart on a snap

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-29 Thread Sharon Brunett
Josh, I'm also having trouble using ompi-restart on a snapshot made from a run which was previously checkpointed. In other words, restarting a previously restarted run! (a) start the run mpirun -np 16 -am ft-enable-cr ./a.out <---do an ompi-checkpoint on the mpirun pid from (a) from another

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-29 Thread Sharon Brunett
Josh, Thanks for the quick response. I'll test against some key applications we would like to use blcr checkpointing/restarting against. Perhaps if we're lucky and careful, we'll be able to get some near term use out of what we have installed. Sharon Josh Hursey wrote: Sharon, This is, unfo

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-29 Thread Josh Hursey
Sharon, This is, unfortunately, to be expected at the moment for this type of application. Extremely communication intensive applications will most likely cause the implementation of the current coordination algorithm to slow down significantly. This is because on a checkpoint Open MPI do

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-29 Thread Sharon Brunett
Josh Hursey wrote: On Apr 29, 2008, at 12:55 AM, Sharon Brunett wrote: I'm finding that using ompi-checkpoint on an application which is very cpu bound takes a very very long time. For example, trying to checkpoint a 4 or 8 way Pallas MPI Benchmark application can take more than an hour.

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-29 Thread Sharon Brunett
Josh, Thanks for your inputs. Yes, I'm able to restart properly outside the hostfile issues. The problems were with the permissions on /var/run/nscd/passwd. The hostfile issues have now also been resolved...the problem was interactions with maui/torque's hostfile and getting a proper hostfile

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-29 Thread Josh Hursey
On Apr 29, 2008, at 12:55 AM, Sharon Brunett wrote: I'm finding that using ompi-checkpoint on an application which is very cpu bound takes a very very long time. For example, trying to checkpoint a 4 or 8 way Pallas MPI Benchmark application can take more than an hour. The problem is not w

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-29 Thread Sharon Brunett
I'm finding that using ompi-checkpoint on an application which is very cpu bound takes a very very long time. For example, trying to checkpoint a 4 or 8 way Pallas MPI Benchmark application can take more than an hour. The problem is not where I'm dumping checkpoints (I've tried local and an nfs mou

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-28 Thread Josh Hursey
On Apr 25, 2008, at 6:12 PM, Sharon Brunett wrote: Josh, I'm responding to some outstanding questions about the environment I'm trying to ompi-restart in. My answers to your questions are sprinkled below, and include a few more questions based on attempts I've made to get a multi-node restart wo

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-25 Thread Sharon Brunett
Josh, I'm responding to some outstanding questions about the environment I'm trying to ompi-restart in. My answers to your questions are sprinkled below, and include a few more questions based on attempts I've made to get a multi-node restart working. thanks, Sharon Sharon Brunett wrote: Josh Hursey

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-23 Thread Sharon Brunett
Josh Hursey wrote: On Apr 23, 2008, at 4:04 PM, Sharon Brunett wrote: Hello, I'm using openmpi-1.3a1r18241 on a 2 node configuration and having troubles with the ompi-restart. I can successfully ompi-checkpoint and ompi-restart a 1 way mpi code. When I try a 2 way job running across 2 nod

Re: [OMPI users] openmpi-1.3a1r18241 ompi-restart issue

2008-04-23 Thread Josh Hursey
On Apr 23, 2008, at 4:04 PM, Sharon Brunett wrote: Hello, I'm using openmpi-1.3a1r18241 on a 2 node configuration and having troubles with the ompi-restart. I can successfully ompi-checkpoint and ompi-restart a 1 way mpi code. When I try a 2 way job running across 2 nodes, I get bash-2.0