[OMPI users] ompi-restart using different nodes

2009-12-02 Thread Jonathan Ferland

Hi,

I am trying to use BLCR checkpointing with Open MPI. I can currently run 
my application using a hostfile, checkpoint the run, and then restart 
the application using the same hostfile. What I would like to do is 
restart the application with a different hostfile, but this leads to 
a segfault with Open MPI 1.3.3.


Is it possible to restart the application using a different hostfile? We 
are using PBS to create the hostfile, so each new restart might land on 
different nodes. If it is possible, how can we do it? If not, do you plan 
to include this in a future release?


thanks



Re: [OMPI users] ompi-restart using different nodes

2009-12-02 Thread Jonathan Ferland

Hi Josh,

In case it helps, I am running 1.3.3, configured as follows:
../configure --enable-ft-thread --with-ft=cr --enable-mpi-threads 
--with-blcr=... --with-blcr-libdir=... --disable-openib-rdmacm --prefix=


I ran my application like this :
mpirun -am ft-enable-cr --hostfile host -np 2 ./a.out

where host contains:
node1
node2

With this hostfile, checkpoint and restart work:
ompi-restart -hostfile host ompi_global_snapshot_ckpt

But if I then change the hostfile to this (just swapping the nodes):
node2
node1

then the restart crashes.
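
The hostfile swap described above can be scripted, so that the restart hostfile is guaranteed to list exactly the same nodes in the opposite order. This is only a sketch: the helper name reverse_hostfile and the file name host.swapped are made up for illustration and are not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helper (not an Open MPI tool): write the lines of the
# hostfile given as $1 to the file given as $2, in reverse order.
reverse_hostfile() {
    # Buffer every line, then print from last to first (portable, no `tac`).
    awk '{ line[NR] = $0 } END { for (i = NR; i >= 1; i--) print line[i] }' "$1" > "$2"
}
```

After "reverse_hostfile host host.swapped", the restart from the test above would become "ompi-restart -hostfile host.swapped ompi_global_snapshot_ckpt".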

thanks

Josh Hursey wrote:
Though I do not test this scenario (using hostfiles) very often, it 
used to work. The ompi-restart command takes a --hostfile (or 
--machinefile) argument that is passed directly to the mpirun command. 
I wonder if something broke recently with this handoff. I can 
certainly checkpoint with one set of nodes/allocation and restart with 
another, but most/all of my testing occurs in a SLURM environment, so 
no need for an explicit hostfile.


I'll take a look to see if I can reproduce, but probably will not be 
until next week.


-- Josh

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users







Re: [OMPI users] ompi-restart using different nodes

2009-12-08 Thread Jonathan Ferland
I ran the same test with 1.3.4 and still see the same issue. I also 
tried the tm interface instead of specifying a hostfile, with the same 
result.


thanks,

Jonathan


--
Jonathan Ferland, analyste en calcul scientifique
RQCHP (Réseau québécois de calcul de haute performance)

bureau S-252, pavillon Roger-Gaudry, Université de Montréal
téléphone   : 514 343-6111 poste 8852
télécopieur : 514 343-2155
--



Re: [OMPI users] ompi-restart using different nodes

2009-12-09 Thread Jonathan Ferland

Hi Josh,

Thanks for helping. That solved the problem!!!

cheers,

Jonathan

Josh Hursey wrote:
So I tried to reproduce this problem today, and everything worked fine 
for me using the trunk. I haven't tested v1.3/v1.4 yet.


I tried checkpointing with one hostfile then restarting with each of 
the following:

 - No hostfile
 - a hostfile with completely different machines
 - a hostfile with the same machines in the opposite order


I suspect that the problem is not with Open MPI, but your system 
interacting with BLCR. Usually when people cannot restart on a 
different node they have problems with the 'prelink' feature on Linux. 
BLCR has a FAQ item on this:

  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink

So if this is your problem then you will probably not be able to 
checkpoint a single process (non-MPI) application on one node and 
restart on another. Sorry I didn't mention it before, must have 
slipped my mind.
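
As a quick first check on each node, something like the following could report whether prelinking appears to be switched on. It is a sketch under assumptions: the function name prelink_enabled is made up, and the default path /etc/sysconfig/prelink with a PRELINKING=yes setting is the Red Hat style layout, which may not match your distribution.

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) if the prelink config file enables prelinking.
# ASSUMPTION: Red Hat style config path by default; pass another path as $1.
prelink_enabled() {
    conf="${1:-/etc/sysconfig/prelink}"
    # No readable config file: treat prelinking as not enabled.
    [ -r "$conf" ] || return 1
    # Match an uncommented PRELINKING=yes line, with or without quotes.
    grep -Eq "^[[:space:]]*PRELINKING=[\"']?yes" "$conf"
}
```

For example, run "prelink_enabled && echo prelink is enabled" on both the checkpoint node and the restart node. If it reports enabled, the FAQ item above describes how to deal with it.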


If this turns out to not be the problem, let me know and I'll take 
another look. Also send me any error messages that are displayed.


-- Josh

