[OMPI users] Is there an interrupt-based receiving mode implemented in OPENMPI?

2009-06-18 Thread Hsing-bung Chen

Hi,
Is there an interrupt-based receiving mode implemented in Open MPI?
How do I enable it when I build Open MPI?
Thanks.

HB


Re: [OMPI users] Is there an interrupt-based receiving mode implemented in OPENMPI?

2009-06-18 Thread Jeff Squyres

On Jun 18, 2009, at 11:34 AM, Hsing-bung Chen wrote:


Is there an interrupt-based receiving mode implemented in Open MPI?
How do I enable it when I build Open MPI?




It depends on what you mean by "interrupt-based receiving mode" -- OMPI
currently polls for progress because that's typically how you get the
lowest latency. We don't currently have any blocking mode for receives,
although it has been on the to-do list for a while (at a fairly low
priority).
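
If lower CPU usage while waiting matters more than latency, one workaround
(just a sketch -- not an OMPI feature, and the sleep interval here is
arbitrary) is to poll with MPI_Iprobe and back off between probes:

#include <mpi.h>
#include <unistd.h>

/* Sketch: emulate a low-CPU "blocking" receive by probing and sleeping.
 * Trades some extra latency for not spinning a core while idle. */
static void lazy_recv(void *buf, int count, MPI_Datatype type, int src,
                      int tag, MPI_Comm comm, MPI_Status *status)
{
    int flag = 0;
    while (!flag) {
        MPI_Iprobe(src, tag, comm, &flag, status);
        if (!flag)
            usleep(1000);      /* wait ~1 ms before probing again */
    }
    MPI_Recv(buf, count, type, src, tag, comm, status);
}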


--
Jeff Squyres
Cisco Systems



[OMPI users] mpirun fails on the host

2009-06-18 Thread Honest Guvnor
OpenMPI 1.2.7, Ethernet, CentOS 5.3 i386, fresh install on host and nodes.

Despite ssh and pdsh working, mpirun hangs when launching a program
from the host to a node:

[cluster@hankel ~]$ ssh n06 hostname
n06
[cluster@hankel ~]$ pdsh -w n06 hostname
n06: n06
[cluster@hankel ~]$ mpirun -np 1 --host n06 hostname
[HANGS]

However, mpirun works fine in reverse:

[cluster@n06 ~]$ mpirun -np 1 --host hankel date
Thu Jun 18 22:53:27 CEST 2009

and from node to node. Paths to bin and lib seem OK:

[cluster@hankel ~]$ printenv PATH
/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lib/openmpi/1.2.7-gcc/bin:/home/cluster/bin
[cluster@hankel ~]$ printenv LD_LIBRARY_PATH
:/usr/lib/openmpi/1.2.7-gcc/lib
[cluster@hankel ~]$ ssh n06 printenv PATH
/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lib/openmpi/1.2.7-gcc/bin
[cluster@hankel ~]$ ssh n06 printenv LD_LIBRARY_PATH
:/usr/lib/openmpi/1.2.7-gcc/lib

We are new to Open MPI but checked a few MCA parameters and turned on a
diagnostic flag or two without coming up with much. The nodes do not
have access to the host's external network, and we half-convinced
ourselves this was the problem because of mentions of it in the output
from the -d flag, but:

[cluster@hankel ~]$ mpirun --mca btl tcp,self --mca btl_tcp_if_exclude
lo,eth0 --mca oob_tcp_if_exclude lo,eth0 -np 1 --host n06 hostname
[STILL HANGS]

where eth0 is the external network.

Suggestions on how we can get Open MPI to report what has failed, or on
where to poke and prod further, would be gratefully received.


Re: [OMPI users] vfs_write returned -14

2009-06-18 Thread Kritiraj Sajadah

Hello Josh,
   Thank you again for your response. I tried checkpointing a simple C 
program using BLCR and got the same error, i.e.:

- vfs_write returned -14
- file_header: write returned -14
Checkpoint failed: Bad address
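
(For completeness, the BLCR-only test was along these lines -- the
program name is just a placeholder, and cr_run, cr_checkpoint and
cr_restart are the standard BLCR utilities:)

 raj> cr_run ./hello_blcr &
 raj> cr_checkpoint --save-all $!     # writes context.<pid> in the cwd
 raj> cr_restart context.<pid>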


This is how I installed everything and how I run MPI programs for checkpointing:

1) configure and install blcr

tar zxf blcr-.tar.gz
cd blcr-
mkdir builddir
cd builddir

../configure --prefix=/usr/local/ --enable-debug=yes --enable-libcr-tracing=yes 
--enable-kernel-tracing=yes --enable-testsuite=yes --enable-all-static=yes 
--enable-static=yes

make
make install

2) configure and install openmpi

./configure --prefix=/usr/local/ --enable-picky --enable-debug 
--enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace 
--enable-binaries --enable-trace --enable-static=yes --enable-debug 
--with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr 
--enable-ft-thread --with-blcr=/usr/local/ --with-blcr-libdir=/usr/local/lib 
--enable-mpi-threads=yes

make all install

3)  Compile and run mpi program as follows:

 raj> mpicc helloworld.c -o helloworld
 raj> mpirun -am ft-enable-cr helloworld

4) To checkpoint the running program,

 raj>  ompi-checkpoint [any option] pid 
 for example:   ompi-checkpoint -v 11527

5) To restart from your checkpoint, locate the checkpoint file and type the 
following from the command line:

  raj> ompi-restart ompi_global_snapshot_.ckpt
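
(As a sanity check that the checkpoint/restart support actually made it
into the Open MPI build, I believe something like the following should
list the blcr component, though I have not double-checked the exact
output:)

 raj> ompi_info | grep -i crs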


However, I then did another test with BLCR:

I tried checkpointing my C application from the /tmp directory instead of my 
$HOME directory, and it checkpointed fine.

So it looks like the problem is with my $HOME directory.

I have "drwx" permissions on my $HOME directory, which seems fine to me.

Then I tried it with Open MPI. However, with Open MPI the checkpoint file 
automatically gets saved in the $HOME directory.

Is there a way to have the file saved in a different location? I see that 
LAM/MPI has a command-line option for this:

$ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out

Is there a similar option for Open MPI?
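
(I am guessing there might be an MCA parameter for this, perhaps something
along the lines of:

$ mpirun -np 2 -am ft-enable-cr -mca snapc_base_global_snapshot_dir /somewhere/else a.out

but I have not been able to confirm the parameter name.)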

Thanks a lot

regards,

Raj

--- On Wed, 6/17/09, Josh Hursey  wrote:

> From: Josh Hursey 
> Subject: Re: [OMPI users] vfs_write returned -14
> To: "Open MPI Users" 
> Date: Wednesday, June 17, 2009, 1:42 AM
> Did you try checkpointing a non-MPI application with BLCR on the
> cluster? If that does not work then I would suspect that BLCR is not
> working properly on the system.
> 
> However, if a non-MPI application can be checkpointed and restarted
> correctly on this machine, then it may be something odd with the Open
> MPI installation or runtime environment. To help debug here I would
> need to know how Open MPI was configured and how the application was
> run on the machine (command line arguments, environment variables, ...).
> 
> I should note that for the program you sent it is important that you
> compile Open MPI with the Fault Tolerance Thread enabled to ensure a
> timely checkpoint. Otherwise the checkpoint will be delayed until the
> MPI program enters the MPI_Finalize function.
> 
> Let me know what you find out.
> 
> Josh
> 
> On Jun 16, 2009, at 5:08 PM, Kritiraj Sajadah wrote:
> 
> >
> > Hi Josh,
> >
> > Thanks for the email. I have installed BLCR 0.8.1 and Open MPI 1.3
> > on my laptop with Ubuntu 8.04 on it. It works fine.
> >
> > I have now tried the installation on the cluster (on one machine for
> > now) at my university (the administrator installed it). I am not
> > sure if he followed the steps I gave him.
> >
> > I am checkpointing a simple MPI application which looks as follows:
> >
> > #include <stdio.h>
> > #include <mpi.h>
> >
> > int main(int argc, char **argv)
> > {
> > int rank,size;
> > MPI_Init(&argc, &argv);
> > MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > MPI_Comm_size(MPI_COMM_WORLD, &size);
> > printf("I am processor no %d of a total of %d procs \n", rank, size);
> > system("sleep 30");
> > printf("I am processor no %d of a total of %d procs \n", rank, size);
> > system("sleep 30");
> > printf("I am processor no %d of a total of %d procs \n", rank, size);
> > system("sleep 30");
> > printf("bye \n");
> > MPI_Finalize();
> > return 0;
> > }
> >
> > Do you think it is better to reinstall BLCR?
> >
> >
> > Thanks
> >
> > Raj
> > --- On Tue, 6/16/09, Josh Hursey 
> wrote:
> >
> >> From: Josh Hursey 
> >> Subject: Re: [OMPI users] vfs_write returned -14
> >> To: "Open MPI Users" 
> >> Date: Tuesday, June 16, 2009, 6:42 PM
> >>
> >> These are errors from BLCR. It may be a problem with your BLCR
> >> installation and/or your application. Are you able to
> >> checkpoint/restart a non-MPI application with BLCR on these
> >> machines?
> >>
> >> What kind of MPI application are you trying to checkpoint? Some of
> >> the MPI interfaces are not fully supported at the moment (outlined
> >> in the FT User Document that I mentioned in a previous email).
> >>
> >> -- Josh
> >>
> >> On Jun 

Re: [OMPI users] mpirun fails on the host

2009-06-18 Thread Ralph Castain
Add --debug-devel to your command line and you'll get a bunch of diagnostic
info. Did you configure with --enable-debug? If so, additional debug output
can be obtained - I can let you know how to get it, if necessary.
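
For example, something along these lines (using the n06 host from your
output; --debug-daemons is another mpirun option that often helps with
launch hangs):

mpirun --debug-devel --debug-daemons -np 1 --host n06 hostname
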
Ralph


On Thu, Jun 18, 2009 at 3:49 PM, Honest Guvnor
wrote:

> OpenMPI 1.2.7, Ethernet, CentOS 5.3 i386, fresh install on host and nodes.
>
> Despite ssh and pdsh working, mpirun hangs when launching a program
> from the host to a node:
>
> [cluster@hankel ~]$ ssh n06 hostname
> n06
> [cluster@hankel ~]$ pdsh -w n06 hostname
> n06: n06
> [cluster@hankel ~]$ mpirun -np 1 --host n06 hostname
> [HANGS]
>
> However, mpirun works fine in reverse:
>
> [cluster@n06 ~]$ mpirun -np 1 --host hankel date
> Thu Jun 18 22:53:27 CEST 2009
>
> and from node to node. Paths to bin and lib seem OK:
>
> [cluster@hankel ~]$ printenv PATH
>
> /usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lib/openmpi/1.2.7-gcc/bin:/home/cluster/bin
> [cluster@hankel ~]$ printenv LD_LIBRARY_PATH
> :/usr/lib/openmpi/1.2.7-gcc/lib
> [cluster@hankel ~]$ ssh n06 printenv PATH
>
> /usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lib/openmpi/1.2.7-gcc/bin
> [cluster@hankel ~]$ ssh n06 printenv LD_LIBRARY_PATH
> :/usr/lib/openmpi/1.2.7-gcc/lib
>
> We are new to Open MPI but checked a few MCA parameters and turned on a
> diagnostic flag or two without coming up with much. The nodes do not
> have access to the host's external network, and we half-convinced
> ourselves this was the problem because of mentions of it in the output
> from the -d flag, but:
>
> [cluster@hankel ~]$ mpirun --mca btl tcp,self --mca btl_tcp_if_exclude
> lo,eth0 --mca oob_tcp_if_exclude lo,eth0 -np 1 --host n06 hostname
> [STILL HANGS]
>
> where eth0 is the external network.
>
> Suggestions on how we can get Open MPI to report what has failed, or on
> where to poke and prod further, would be gratefully received.