[OMPI users] problems with OpenMPI-1.0.1 on SunOS 5.9; problems on heterogeneous cluster

2006-03-10 Thread Ravi Manumachu

Hi,

I am facing problems running OpenMPI-1.0.1 on a heterogeneous cluster.

I have a Linux machine and a SunOS machine in this cluster.

linux$ uname -a
Linux pg1cluster01 2.6.8-1.521smp #1 SMP Mon Aug 16 09:25:06 EDT 2004
i686 i686 i386 GNU/Linux

OpenMPI-1.0.1 is installed using

./configure --prefix=...
make all install

sunos$ uname -a
SunOS csultra01 5.9 Generic_112233-10 sun4u sparc SUNW,Ultra-5_10

OpenMPI-1.0.1 is installed using

./configure --prefix=...
make all install


I use ssh. Both nodes are accessible without password prompts.

I use the following simple application:


#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rc, me;
    char pname[MPI_MAX_PROCESSOR_NAME];
    int plen;

    MPI_Init(&argc, &argv);

    rc = MPI_Comm_rank(MPI_COMM_WORLD, &me);
    if (rc != MPI_SUCCESS)
    {
        return rc;
    }

    MPI_Get_processor_name(pname, &plen);

    printf("%s:Hello world from %d\n", pname, me);

    MPI_Finalize();

    return 0;
}


It is compiled as follows:

linux$ mpicc -o mpiinit_linux mpiinit.c
sunos$ mpicc -o mpiinit_sunos mpiinit.c

My hosts file is

hosts.txt
---------
pg1cluster01 slots=2
csultra01 slots=1

My app file is

mpiinit_appfile
---------------
-np 2 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_linux
-np 1 /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos

$ mpirun --hostfile hosts.txt --app mpiinit_appfile
ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
fatal: relocation error: file
/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
symbol nanosleep: referenced symbol not found
ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
fatal: relocation error: file
/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
symbol nanosleep: referenced symbol not found

I have fixed this by compiling with "-lrt" option to the linker.

sunos$ mpicc -o mpiinit_sunos mpiinit.c -lrt
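
For reference, a minimal standalone check of that symbol (a sketch I put
together, assuming Solaris 9 provides nanosleep via librt rather than libc;
the file name is made up):

#include <time.h>

int main(void)
{
    /* sleep for one millisecond; on Solaris 9 nanosleep comes from librt */
    struct timespec ts = { 0, 1000000 };
    nanosleep(&ts, NULL);
    return 0;
}

/* sunos$ cc -o naptest naptest.c        (expected to fail to link without librt) */
/* sunos$ cc -o naptest naptest.c -lrt   (expected to link cleanly) */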

However, when I run this again, I get the following error:

$ mpirun --hostfile hosts.txt --app mpiinit_appfile
[pg1cluster01:19858] ERROR: A daemon on node csultra01 failed to start
as expected.
[pg1cluster01:19858] ERROR: There may be more information available from
[pg1cluster01:19858] ERROR: the remote shell (see above).
[pg1cluster01:19858] ERROR: The daemon exited unexpectedly with status 255.
2 processes killed (possibly by Open MPI)

Sometimes I get this error instead:

$ mpirun --hostfile hosts.txt --app mpiinit_appfile
[csultra01:06256] mca_common_sm_mmap_init: ftruncate failed with errno=28
[csultra01:06256] mca_mpool_sm_init: unable to create shared memory mapping
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned value -2 instead of OMPI_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

Please let me know the resolution of this problem. Please let me know if
you need more details.

Regards,
Ravi.



Re: [OMPI users] Open MPI and MultiRail InfiniBand

2006-03-10 Thread Troy Telford


On Mar 9, 2006, at 9:18 PM, Brian Barrett wrote:


On Mar 9, 2006, at 6:41 PM, Troy Telford wrote:


I've got a machine that has the following config:

Each node has two InfiniBand ports:
  * The first port is on fabric 'a' with switches for 'a'
  * The second port is on fabric 'b' with separate switches for 'b'
  * The two fabrics are not shared ('a' and 'b' can't communicate
with one
another)

I believe that Open MPI is perfectly capable of striping over both
fabric 'a' and 'b', and IIRC, this is the default behavior.

Does Open MPI handle the case where Open MPI puts all of its traffic on
the first IB port (ie. fabric 'a'), and leaves the second IB port (ie.
fabric 'b') free for other uses (I'll use NFS as a humorous example).

If so, is there any magic required to configure it thusly?


With mvapi, we don't have the functionality in place for the user to
specify which HCA port is used.  The user can say that at most N HCA
ports should be used through the btl_mvapi_max_btls MCA parameter.
So in your case, if you ran Open MPI with:

   mpirun -mca btl_mvapi_max_btls 1 -np X ./foobar

Only the first active port would be used for mvapi communication.
I'm not sure if this is enough for your needs or not.


So long as the second active port isn't touched by Open MPI, it  
sounds just fine.


One thing, though -- You mention mvapi, which IIRC is the 1st  
Generation IB stack.  Is there similar functionality with the openib  
btl (for the 2nd generation IB stack)?


Thanks!

Troy Telford



Re: [OMPI users] Run failure on Solaris Opteron with Sun Studio 11

2006-03-10 Thread Jeff Squyres

On Mar 9, 2006, at 12:18 PM, Pierre Valiron wrote:

- However, compiling the mpi.f90 takes over 35 *minutes* with -O1.
This seems a bit excessive...  I tried removing any -O option and
things are just as slow. Is this behaviour related to Open MPI or
to some quirk of the Studio 11 compiler?


You're not the first person to ask about this, so I've added the  
reasons why to the FAQ:


http://www.open-mpi.org/faq/?category=building#f90-bindings-slow-compile
http://www.open-mpi.org/faq/?category=mpi-apps#f90-mpi-slow-compiles

Brian is already working with you on the rest of the issues; I just  
thought I'd pipe in with the F90 stuff since I was one of the guys  
who did the F90 work in Open MPI.


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/





Re: [OMPI users] [Fwd: MPI_SEND blocks when crossing node boundary]

2006-03-10 Thread Cezary Sliwa

Jeff Squyres wrote:

Please note that I replied to your original post:

 http://www.open-mpi.org/community/lists/users/2006/02/0712.php

Was that not sufficient?  If not, please provide more details on what  
you are attempting to do and what is occurring.  Thanks.


  
I have a simple program in which the rank 0 task dispatches compute 
tasks to other processes. It works fine on one 4-way SMP machine, but 
when I try to run it on two nodes, the processes on the other machine 
seem to spin in a loop inside MPI_SEND (a message is not delivered).


Cezary Sliwa



On Mar 7, 2006, at 2:36 PM, Cezary Sliwa wrote:

  

Hello again,

The problem is that MPI_SEND blocks forever (the message is still  
not delivered after many hours).


Cezary Sliwa


From: Cezary Sliwa 
Date: February 22, 2006 10:07:04 AM EST
To: us...@open-mpi.org
Subject: MPI_SEND blocks when crossing node boundary



My program runs fine with openmpi-1.0.1 when run from the command  
line (5 processes with empty host file), but when I schedule it  
with qsub to run on 2 nodes it blocks on MPI_SEND


(gdb) info stack
#0  0x0034db30c441 in __libc_sigaction () from /lib64/tls/libpthread.so.0

#1  0x00573002 in opal_evsignal_recalc ()
#2  0x00582a3c in poll_dispatch ()
#3  0x005729f2 in opal_event_loop ()
#4  0x00577e68 in opal_progress ()
#5  0x004eed4a in mca_pml_ob1_send ()
#6  0x0049abdd in PMPI_Send ()
#7  0x00499dc0 in pmpi_send__ ()
#8  0x0042d5d8 in MAIN__ () at main.f:90
#9  0x005877de in main (argc=Variable "argc" is not available.
)








  




Re: [OMPI users] [Fwd: MPI_SEND blocks when crossing node boundary]

2006-03-10 Thread Cezary Sliwa

Cezary Sliwa wrote:

Jeff Squyres wrote:
  

Please note that I replied to your original post:

 http://www.open-mpi.org/community/lists/users/2006/02/0712.php

Was that not sufficient?  If not, please provide more details on what  
you are attempting to do and what is occurring.  Thanks.


  

I have a simple program in which the rank 0 task dispatches compute 
tasks to other processes. It works fine on one 4-way SMP machine, but 
when I try to run it on two nodes, the processes on the other machine 
seem to spin in a loop inside MPI_SEND (a message is not delivered).


  

And this even though a matching MPI_IRECV has been posted in the rank 0 task.


Cezary Sliwa
  



  program ng

  implicit none

  external Nsum, Gsum

  double precision Nsum, Gsum, integrate, ee, resn, resg, n, g, s1

  double precision H, theta, phi, BG, thetaM, phiM, kBT, EF
  common / intparms / H, theta, phi, BG, thetaM, phiM, kBT, EF

  double precision ialpha, hbar, c, e, kB, hartree_eV, hartree_J,
 $ au_T, au_angstrom, T

  parameter ( ialpha = 137.0359991d0, hbar = 1, c = ialpha, e = 1,
 $ kB = 3.1668153d-6,
 $ hartree_eV = 27.2113845d0, hartree_J = 4.35974417d-18,
 $ au_T = 2.35051742d5, au_angstrom = 0.5291772108d0,
 $ T = 10d0 )

  double precision a, b

  double precision pi

  parameter ( pi = 3.141592653589793d0 )


  include 'mpif.h'

  integer ierr, rank, status(MPI_STATUS_SIZE)

  integer cmd(2)
  double precision buf(1), bufx, bufy
  integer nd

  integer size, whatfun
  common / commparms / size, whatfun


  call mpi_init(ierr)

  kBT = kB*T

  H = 2.0d0*c/au_T
  theta = 0d0
  phi = 0d0

  BG = -0.03d0/hartree_eV
  thetaM = 0d0
  phiM = 0d0

  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)

  if(rank .ne. 0) then

 do while(.true.)
call mpi_recv(cmd, 2, MPI_INTEGER, 0, 1, MPI_COMM_WORLD,
 $   status, ierr)

print *, rank, ' received ', cmd(1)

select case(cmd(1))

case(1)
   goto 300

case(2)
   call mpi_recv(buf, 1, MPI_DOUBLE_PRECISION, 0, 2,
 $  MPI_COMM_WORLD, status, ierr)
   EF = buf(1)

case(3)
   call mpi_recv(bufx, 1, MPI_DOUBLE_PRECISION, 0, 3,
 $  MPI_COMM_WORLD, status, ierr)

   select case(cmd(2))
   case(1)
  bufy = Nsum(bufx)
   case(2)
  bufy = Gsum(bufx)
   case default
  write(*, *) '***   unknown cmd(2)   ***'
  stop
   end select

   call mpi_send(bufy, 1, MPI_DOUBLE_PRECISION, 0, 3,
 $  MPI_COMM_WORLD, ierr)

case default
   write(*, *) '***   unknown cmd(1)   ***'
   stop

end select

 end do
300  continue

  else

 call mpi_comm_size(MPI_COMM_WORLD, size, ierr)

 open(12, status='OLD', FILE='in_ef.txt')
 open(13, status='UNKNOWN', FILE='out.txt')

 do while(.true.)

read(12, *, end=100) ee

EF = ee/hartree_eV

do nd = 1, size-1
   cmd(1) = 2
   buf = EF
   call mpi_send(cmd, 2, MPI_INTEGER, nd, 1, MPI_COMM_WORLD,
 $  ierr)
   call mpi_send(buf, 1, MPI_DOUBLE_PRECISION, nd, 2,
 $  MPI_COMM_WORLD, ierr)
end do


a = -0.25d0
b = 0.25d0

whatfun = 1
resn = integrate(Nsum, a, b)

whatfun = 2
resg = integrate(Gsum, a, b)

s1 = e*H/(hbar*c) / (2*pi)**2 / (au_angstrom * 1d-8)**3

n = s1 * resn
g = (hartree_J * 1d7) * kBT * s1 * resg

write(13,*) EF, ee, resn, n, resg, g

 end do

 100 close(12)
 close(13)

 cmd(1) = 1
 do nd = 1, size-1
call mpi_send(cmd, 2, MPI_INTEGER, nd, 1, MPI_COMM_WORLD,
 $   ierr)
 end do

  end if

  call mpi_finalize(ierr)

  end program


  subroutine int_fun(n, x, y)

  implicit none

  integer n
  double precision x(n), y(n)

  include 'mpif.h'

  integer ierr, status(MPI_STATUS_SIZE)

  integer cmd(2)
  logical flag

  integer size, whatfun
  common / commparms / size, whatfun

  integer node_status(size-1), requests(size-1), i, nd, pending


  do nd = 1, size-1
 node_status(nd) = 0
  end do

  pending = 0

  i = 0
  do while(i .lt. n .or. pending .ne. 0)

 if(i .ge. n) goto 600

 do nd = 1, size-1
if(node_status(nd) .eq. 0) goto 500
 end do
 goto 600

 500 i = i + 1

 print *, 'sending task ', i, ' to ', nd

 cmd(1) = 3
 cmd(2) = whatfun
 

Re: [OMPI users] [Fwd: MPI_SEND blocks when crossing node boundary]

2006-03-10 Thread Jeff Squyres

On Mar 10, 2006, at 6:01 AM, Cezary Sliwa wrote:


 http://www.open-mpi.org/community/lists/users/2006/02/0712.php


I have a simple program in which the rank 0 task dispatches compute
tasks to other processes. It works fine on one 4-way SMP machine, but
when I try to run it on two nodes, the processes on the other machine
seem to spin in a loop inside MPI_SEND (a message is not delivered).


You still haven't answered whether your application does any of the  
things that I mentioned in my first post.  :-)  Have you examined the  
code to ensure that your application does not rely on buffering?   
This kind of thing can easily show up as blocking in some situations  
and not blocking in others (such as on-node vs. off-node communication).
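
To make that concrete, here is a minimal sketch (illustrative only, not
taken from the program in question) of a send pattern that relies on
buffering, together with a safe rewrite; run it with exactly two ranks:

#include <mpi.h>

int main(int argc, char **argv)
{
    /* Large enough that most transports will not eagerly buffer it. */
    enum { N = 1 << 20 };
    static double sendbuf[N], recvbuf[N];
    int rank, peer;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;   /* assumes exactly two ranks */

    /* Unsafe: if both ranks do this, each MPI_Send can block waiting for
     * a receive that is never posted.  It may still "work" on one node
     * (shared memory, small messages) and hang off-node (TCP).
     *
     *   MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
     *   MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
     *            MPI_STATUS_IGNORE);
     */

    /* Safe: the library pairs the send with the receive for us. */
    MPI_Sendrecv(sendbuf, N, MPI_DOUBLE, peer, 0,
                 recvbuf, N, MPI_DOUBLE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}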


If it does not, can you send the information requested by the  
"Getting Help" section of the Open MPI web site?  This will give us  
more details that will hopefully enable us to resolve your problem:


http://www.open-mpi.org/community/help/

One additional question: are you using TCP as your communications  
network, and if so, do either of the nodes that you are running on  
have more than one TCP NIC?  We recently fixed a bug for situations  
where at least one node is on multiple TCP networks, not all of which
were shared by the nodes where the peer MPI processes were running.   
If this situation describes your network setup (e.g., a cluster where  
the head node has a public and a private network, and where the  
cluster nodes only have a private network -- and your MPI process was  
running on the head node and a compute node), can you try upgrading  
to the latest 1.0.2 release candidate tarball:


http://www.open-mpi.org/software/ompi/v1.0/

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/





Re: [OMPI users] [Fwd: MPI_SEND blocks when crossing node boundary]

2006-03-10 Thread Cezary Sliwa

Jeff Squyres wrote:
One additional question: are you using TCP as your communications  
network, and if so, do either of the nodes that you are running on  
have more than one TCP NIC?  We recently fixed a bug for situations  
  


Yes, precisely.
where at least one node is on multiple TCP networks, not all of which
were shared by the nodes where the peer MPI processes were running.   
If this situation describes your network setup (e.g., a cluster where  
the head node has a public and a private network, and where the  
cluster nodes only have a private network -- and your MPI process was  
running on the head node and a compute node), can you try upgrading  
to the latest 1.0.2 release candidate tarball:


 http://www.open-mpi.org/software/ompi/v1.0/

  

Thank you, I will try.

Cezary Sliwa





Re: [OMPI users] Open MPI and MultiRail InfiniBand

2006-03-10 Thread Brian Barrett

On Mar 10, 2006, at 2:24 AM, Troy Telford wrote:


On Mar 9, 2006, at 9:18 PM, Brian Barrett wrote:


On Mar 9, 2006, at 6:41 PM, Troy Telford wrote:


I've got a machine that has the following config:

Each node has two InfiniBand ports:
  * The first port is on fabric 'a' with switches for 'a'
  * The second port is on fabric 'b' with separate switches for 'b'
  * The two fabrics are not shared ('a' and 'b' can't communicate
with one
another)

I believe that Open MPI is perfectly capable of striping over both
fabric 'a' and 'b', and IIRC, this is the default behavior.

Does Open MPI handle the case where Open MPI puts all of its traffic on
the first IB port (ie. fabric 'a'), and leaves the second IB port (ie.
fabric 'b') free for other uses (I'll use NFS as a humorous example).


If so, is there any magic required to configure it thusly?


With mvapi, we don't have the functionality in place for the user to
specify which HCA port is used.  The user can say that at most N HCA
ports should be used through the btl_mvapi_max_btls MCA parameter.
So in your case, if you ran Open MPI with:

   mpirun -mca btl_mvapi_max_btls 1 -np X ./foobar

Only the first active port would be used for mvapi communication.
I'm not sure if this is enough for your needs or not.


So long as the second active port isn't touched by Open MPI, it
sounds just fine.

One thing, though -- You mention mvapi, which IIRC is the 1st
Generation IB stack.  Is there similar functionality with the openib
btl (for the 2nd generation IB stack)?


It looks like we never added similar logic to the Open IB transport.   
I'll pass your request on to the developer of our Open IB transport.  
Given our timeframe for releasing Open MPI 1.0.2, it's doubtful any  
change will make that release.  But it should definitely be possible  
to add such functionality in a future release.


Brian


Re: [OMPI users] problems with OpenMPI-1.0.1 on SunOS 5.9; problems on heterogeneous cluster

2006-03-10 Thread Brian Barrett

On Mar 10, 2006, at 12:09 AM, Ravi Manumachu wrote:


I am facing problems running OpenMPI-1.0.1 on a heterogeneous cluster.

I have a Linux machine and a SunOS machine in this cluster.

linux$ uname -a
Linux pg1cluster01 2.6.8-1.521smp #1 SMP Mon Aug 16 09:25:06 EDT 2004
i686 i686 i386 GNU/Linux

sunos$ uname -a
SunOS csultra01 5.9 Generic_112233-10 sun4u sparc SUNW,Ultra-5_10


Unfortunately, this will not work with Open MPI at present.  Open MPI  
1.0.x does not have any support for running across platforms with  
different endianness.  Open MPI 1.1.x has much better support for  
such situations, but is far from complete, as the MPI datatype engine  
does not properly fix up endian issues.  We're working on the issue,  
but can not give a timetable for completion.
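
A small illustrative sketch (not from the original report) of why this
matters: the same 32-bit value is stored with opposite byte order on the
two machines involved, so raw message bytes cannot simply be copied from
one to the other without conversion in the datatype engine.

#include <stdio.h>

int main(void)
{
    unsigned int v = 0x01020304;
    unsigned char *b = (unsigned char *) &v;

    /* Prints "04 03 02 01" on the i686 box (little-endian) and
     * "01 02 03 04" on the UltraSPARC box (big-endian). */
    printf("%02x %02x %02x %02x\n", b[0], b[1], b[2], b[3]);
    return 0;
}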


Also note that (while not a problem here) Open MPI also does not  
support running in a mixed 32 bit / 64 bit environment.  All  
processes must be 32 or 64 bit, but not a mix.



$ mpirun --hostfile hosts.txt --app mpiinit_appfile
ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
fatal: relocation error: file
/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
symbol nanosleep: referenced symbol not found
ld.so.1: /home/cs/manredd/OpenMPI/openmpi-1.0.1/MPITESTS/mpiinit_sunos:
fatal: relocation error: file
/home/cs/manredd/OpenMPI/openmpi-1.0.1/OpenMPI-SunOS-5.9/lib/libmca_common_sm.so.0:
symbol nanosleep: referenced symbol not found

I have fixed this by compiling with "-lrt" option to the linker.


You shouldn't have to do this...  Could you send me the config.log
file generated by configure for Open MPI, the installed
$prefix/lib/libmpi.la file, and the output of mpicc -showme?



sunos$ mpicc -o mpiinit_sunos mpiinit.c -lrt

However when I run this again, I get the error:

$ mpirun --hostfile hosts.txt --app mpiinit_appfile
[pg1cluster01:19858] ERROR: A daemon on node csultra01 failed to start
as expected.
[pg1cluster01:19858] ERROR: There may be more information available from
[pg1cluster01:19858] ERROR: the remote shell (see above).
[pg1cluster01:19858] ERROR: The daemon exited unexpectedly with status 255.
2 processes killed (possibly by Open MPI)


Both of these are quite unexpected.  It looks like there is something  
wrong with your Solaris build.  Can you run on *just* the Solaris  
machine?  We only have limited resources for testing on Solaris, but  
have not run into this issue before.  What happens if you run mpirun  
on just the Solaris machine with the -d option to mpirun?



Sometimes I get the error.

$ mpirun --hostfile hosts.txt --app mpiinit_appfile
[csultra01:06256] mca_common_sm_mmap_init: ftruncate failed with errno=28
[csultra01:06256] mca_mpool_sm_init: unable to create shared memory mapping
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned value -2 instead of OMPI_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)


This looks like you got far enough along that you ran into our  
endianness issues, so this is about the best case you can hope for in  
your configuration.  The ftruncate error worries me, however.  But I  
think this is another symptom of something wrong with your Sun Sparc  
build.


Brian

--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Myrinet on linux cluster

2006-03-10 Thread Brian Barrett


On Mar 9, 2006, at 11:37 PM, Tom Rosmond wrote:

Attached are output files from a build with the adjustments you  
suggested.


setenv FC pgf90
setenv F77 pgf90
setenv CCPFLAGS -I/usr/include/gm

./configure  --prefix=/users/rosmond/ompi  --with-gm

The results are the same.


Yes, I figured the failure would still be there.  Sorry to make you  
do the extra work, but I needed a build without the extra issues so  
that I could try to get a clearer picture of what is going on.   
Unfortunately, it looks like libtool (the GNU project to build  
portable libraries) is doing something I didn't expect and causing  
issues.  I'm passing this on to a friend of Open MPI who works on the  
Libtool project and is extremely good at figuring these issues out.   
I'll relay back what he recommends, but it might not be until Monday.


P.S.  I understand that the mpi2 option is just a dummy.  I use it
because I am porting a code from an SGI Origin, which has full mpi2
one-sided support.  This option makes it unnecessary to add my own
dummy MPI2 routines to my source.  My code has both MPI1 and MPI2
message passing options, so it's one of the reasons I like OPENMPI
over MPICH.


Ok, I get a little nervous when I see that option, because it doesn't  
do what most people expect ;).  As long as you're fine with any call  
to the one-sided functions invoking MPI error handlers, there should  
be no problem.  The good news is that Open MPI 1.1 will have complete  
one-sided support.
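
To be explicit about which calls are involved, here is a minimal MPI-2
one-sided sketch (illustrative only, not Tom's code; run with at least two
ranks).  Under a full MPI-2 implementation such as the SGI one it performs
the transfer; under Open MPI 1.0.x with --enable-mpi2-one-sided these
window/put calls simply invoke the MPI error handler.

#include <mpi.h>

int main(int argc, char **argv)
{
    double local[16] = { 0 }, value = 42.0;
    int rank;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose a local buffer as an RMA window. */
    MPI_Win_create(local, sizeof(local), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)
        /* Write one double into the first slot of rank 1's window. */
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}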


Brian



Brian Barrett wrote:

On Mar 9, 2006, at 2:51 PM, Tom Rosmond wrote:
I am trying to install OPENMPI on a Linux cluster with 22 dual
Opteron nodes and a Myrinet interconnect. I am having trouble
with the build with the GM libraries. I configured with:
./configure --prefix-/users/rosmond/ompi --with-gm=/usr/lib64
--enable-mpi2-one-sided

Can you try configuring with --with-gm (no argument) and send the
output from configure and make again? The --with-gm flag takes as
an argument the installation prefix, not the library prefix. So in
this case, it would be --with-gm=/usr, which is kind of pointless,
as that's a default search location anyway. Open MPI's configure
script should automatically look in /usr/lib64. In fact, it looks
like configure looked there and found the right libgm, but
something went amuck later in the process. Also, you really don't
want to configure with the --enable-mpi2-one-sided flag. It will
not do anything useful and will likely cause very bad things to
happen. Open MPI 1.0.x does not have any MPI-2 onesided support.
Open MPI 1.1 should have a complete implementation of the onesided
chapter.

and the environmental variables: setenv FC pgf90 setenv F77 pgf90
setenv CCPFLAGS /usr/include/gm ! (note this non-standard location)

I assume you mean CPPFLAGS=-I/usr/include/gm, which shouldn't
cause any problems.

The configure seemed to go OK, but the make failed. As you see at
the end of the make output, it doesn't like the format of
libgm.so. It looks to me that it is using a path (/usr/lib/.)
to 32 bit libraries, rather than 64 bit (/usr/lib64/). Is
this correct? What's the solution?

I'm not sure at this point, but I need a build without the
incorrect flag to be able to determine what went wrong. We've
built Open MPI with 64 bit builds of GM before, so I'm surprised
there were any problems... Thanks, Brian






--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Myrinet on linux cluster

2006-03-10 Thread Brian Barrett

On Mar 10, 2006, at 8:35 AM, Brian Barrett wrote:


On Mar 9, 2006, at 11:37 PM, Tom Rosmond wrote:

Attached are output files from a build with the adjustments you  
suggested.


setenv FC pgf90
setenv F77 pgf90
setenv CCPFLAGS -I/usr/include/gm

./configure  --prefix=/users/rosmond/ompi  --with-gm

The results are the same.


Yes, I figured the failure would still be there.  Sorry to make you  
do the extra work, but I needed a build without the extra issues so  
that I could try to get a clearer picture of what is going on.   
Unfortunately, it looks like libtool (the GNU project to build  
portable libraries) is doing something I didn't expect and causing  
issues.  I'm passing this on to a friend of Open MPI who works on  
the Libtool project and is extremely good at figuring these issues  
out.  I'll relay back what he recommends, but it might not be until  
Monday.


The Libtool expert was wondering if you could send the contents of  
the files /usr/lib/libgm.la and /usr/lib64/libgm.la.  They should  
both be (fairly short) text files.


Also, as a possible work-around, he suggests compiling from the top  
level like normal (just "make" or "make all") until the failure,  
changing directories into ompi/mca/btl/gm (where the failure  
occurred) and running "make LDFLAGS=-L/usr/lib64", then changing  
directories back to the top level of the Open MPI source code and  
running make (without the extra LDFLAGS option) again.  Let me know  
if that works.


Thanks,

Brian


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] Run failure on Solaris Opteron with Sun Studio 11

2006-03-10 Thread Brian Barrett

On Mar 9, 2006, at 12:18 PM, Pierre Valiron wrote:


- 'mpirun --help' no longer crashes.


Improvement :)


- standard output seems messy:

a) 'mpirun -np 4 pwd' randomly returns 1 or 2 lines, never 4. The
same behaviour occurs if the output is redirected to a file.


b) When running some simple "demo" fortran code, the standard  
output is buffered within open-mpi and all results are issued at  
the end. No intermediates are showed.


Ok, I know what the issue here is.  We don't properly support ptys on  
Solaris, so the Fortran code is going into page buffering mode  
causing all kinds of issues.  I think the same problem may be  
responsible for the issues with the race condition for short lived  
programs.  I'm working on a fix for this issue, but it might take a  
bit of time.



- running a slightly more elaborate program fails:

a) compile behaves differently with mpif77 and mpif90.

While mpif90 compiles and builds "silently", mpif77 is talkative:

valiron@icare ~/BENCHES > mpif77 -xtarget=opteron -xarch=amd64 -o all all.f
NOTICE: Invoking /opt/Studio11/SUNWspro/bin/f90 -f77 -ftrap=%none -I/users/valiron/lib/openmpi-1.1a1r9224/include -xtarget=opteron -xarch=amd64 -o all all.f -L/users/valiron/lib/openmpi-1.1a1r9224/lib -lmpi -lorte -lopal -lsocket -lnsl -lrt -lm -lthread -ldl

all.f:
   rw_sched:
MAIN all:
   lam_alltoall:
   my_alltoall1:
   my_alltoall2:
   my_alltoall3:
   my_alltoall4:
   check_buf:
   alltoall_sched_ori:
   alltoall_sched_new:


b) whether the code was compiled with mpif77 or mpif90, execution
fails:


valiron@icare ~/BENCHES > mpirun -np 2 all
Signal:11 info.si_errno:0(Error 0) si_code:1(SEGV_MAPERR)
Failing at addr:40
*** End of error message ***
Signal:11 info.si_errno:0(Error 0) si_code:1(SEGV_MAPERR)
Failing at addr:40
*** End of error message ***

Compiling with -g adds no more information.


Doh, that probably shouldn't be happening.  I'll try to investigate  
further once I have the pty issues sorted out.


Brian


--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/




Re: [OMPI users] [Fwd: MPI_SEND blocks when crossing node boundary]

2006-03-10 Thread Cezary Sliwa

Jeff Squyres wrote:
One additional question: are you using TCP as your communications  
network, and if so, do either of the nodes that you are running on  
have more than one TCP NIC?  We recently fixed a bug for situations  
where at least one node is on multiple TCP networks, not all of which
were shared by the nodes where the peer MPI processes were running.   
If this situation describes your network setup (e.g., a cluster where  
the head node has a public and a private network, and where the  
cluster nodes only have a private network -- and your MPI process was  
running on the head node and a compute node), can you try upgrading  
to the latest 1.0.2 release candidate tarball:


 http://www.open-mpi.org/software/ompi/v1.0/

  

$ mpiexec -machinefile ../bhost -np 9 ./ng
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
[0] func:/opt/openmpi/1.0.2a9/lib/libopal.so.0 [0x2c062d0c]
[1] func:/lib64/tls/libpthread.so.0 [0x3b8d60c320]
[2] func:/opt/openmpi/1.0.2a9/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_proc_remove+0xb5) [0x2e6e4c65]
[3] func:/opt/openmpi/1.0.2a9/lib/openmpi/mca_btl_tcp.so [0x2e6e2b09]
[4] func:/opt/openmpi/1.0.2a9/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_add_procs+0x157) [0x2e6dfdd7]
[5] func:/opt/openmpi/1.0.2a9/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x231) [0x2e3cd1e1]
[6] func:/opt/openmpi/1.0.2a9/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0x94) [0x2e1b1f44]
[7] func:/opt/openmpi/1.0.2a9/lib/libmpi.so.0(ompi_mpi_init+0x3af) [0x2bdd2d7f]
[8] func:/opt/openmpi/1.0.2a9/lib/libmpi.so.0(MPI_Init+0x93) [0x2bdbeb33]
[9] func:/opt/openmpi/1.0.2a9/lib/libmpi.so.0(MPI_INIT+0x28) [0x2bdce948]
[10] func:./ng(MAIN__+0x38) [0x4022a8]
[11] func:./ng(main+0xe) [0x4126ce]
[12] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3b8cb1c4bb]
[13] func:./ng [0x4021da]
*** End of error message ***

Bye,
Czarek




Re: [OMPI users] Myrinet on linux cluster

2006-03-10 Thread Tom Rosmond

Attached are the two library files you requested, also the output from
ompi_info.

I tried the work-around procedure you suggested, and it worked. I had to
also use it in 'ompi/mca/mpool/gm' and 'ompi/mca/ptl/gm', but I got a
successful make. Then, on a hunch, I went back and added

setenv LDFLAGS -L/usr/lib64

to my environment, did a 'make clean', reran configure (with the MPI2
support), and did another 'make all install'. It worked. The ompi_info
output is attached. I see 'gm' entries in the list, so I assume things
are as expected. I now must have my sysadmin guy transport the
installation to the compute nodes, but I hope that will be routine.

Thanks for the help



Brian Barrett wrote:


On Mar 10, 2006, at 8:35 AM, Brian Barrett wrote:

 


On Mar 9, 2006, at 11:37 PM, Tom Rosmond wrote:

   

Attached are output files from a build with the adjustments you  
suggested.


setenv FC pgf90
setenv F77 pgf90
setenv CCPFLAGS -I/usr/include/gm

./configure  --prefix=/users/rosmond/ompi  --with-gm

The results are the same.
 

Yes, I figured the failure would still be there.  Sorry to make you  
do the extra work, but I needed a build without the extra issues so  
that I could try to get a clearer picture of what is going on.   
Unfortunately, it looks like libtool (the GNU project to build  
portable libraries) is doing something I didn't expect and causing  
issues.  I'm passing this on to a friend of Open MPI who works on  
the Libtool project and is extremely good at figuring these issues  
out.  I'll relay back what he recommends, but it might not be until  
Monday.
   



The Libtool expert was wondering if you could send the contents of  
the files /usr/lib/libgm.la and /usr/lib64/libgm.la.  They should  
both be (fairly short) text files.


Also, as a possible work-around, he suggests compiling from the top  
level like normal (just "make" or "make all") until the failure,  
changing directories into ompi/mca/btl/gm (where the failure  
occurred) and running "make LDFLAGS=-L/usr/lib64", then changing  
directories back to the top level of the Open MPI source code and  
running make (without the extra LDFLAGS option) again.  Let me know  
if that works.


Thanks,

Brian


 

# libgm.la - a libtool library file
# Generated by ltmain.sh - GNU libtool 1.4.2a (1.922.2.100 2002/06/26 07:25:14)
#
# Please DO NOT delete this file!
# It is necessary for linking the library.

# The name that we can dlopen(3).
dlname='libgm.so.0'

# Names of this library.
library_names='libgm.so.0.0.0 libgm.so.0 libgm.so'

# The name of the static archive.
old_library='libgm.a'

# Libraries that this one depends upon.
dependency_libs=''

# Version information for libgm.
current=0
age=0
revision=0

# Is this an already installed library?
installed=yes

# Files to dlopen/dlpreopen
dlopen=''
dlpreopen=''

# Directory that this library needs to be installed in:
libdir='/opt/gm/lib'
# libgm.la - a libtool library file
# Generated by ltmain.sh - GNU libtool 1.4.2a (1.922.2.100 2002/06/26 07:25:14)
#
# Please DO NOT delete this file!
# It is necessary for linking the library.

# The name that we can dlopen(3).
dlname='libgm.so.0'

# Names of this library.
library_names='libgm.so.0.0.0 libgm.so.0 libgm.so'

# The name of the static archive.
old_library='libgm.a'

# Libraries that this one depends upon.
dependency_libs=''

# Version information for libgm.
current=0
age=0
revision=0

# Is this an already installed library?
installed=yes

# Files to dlopen/dlpreopen
dlopen=''
dlpreopen=''

# Directory that this library needs to be installed in:
libdir='/opt/gm/lib64'
Open MPI: 1.0.1r8453
   Open MPI SVN revision: r8453
Open RTE: 1.0.1r8453
   Open RTE SVN revision: r8453
OPAL: 1.0.1r8453
   OPAL SVN revision: r8453
  Prefix: /users/rosmond/ompi
 Configured architecture: x86_64-unknown-linux-gnu
   Configured by: rosmond
   Configured on: Fri Mar 10 09:55:13 PST 2006
  Configure host: cluster0
Built by: rosmond
Built on: Fri Mar 10 10:11:17 PST 2006
  Built host: cluster0
  C bindings: yes
C++ bindings: yes
  Fortran77 bindings: yes (all)
  Fortran90 bindings: yes
  C compiler: gcc
 C compiler absolute: /usr/bin/gcc
C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
  Fortran77 compiler: pgf90
  Fortran77 compiler abs: /usr/pgi/linux86-64/6.1/bin/pgf90
  Fortran90 compiler: pgf90
  Fortran90 compiler abs: /usr/pgi/linux86-64/6.1/bin/pgf90
 C profiling: yes
   C++ profiling: yes
 Fortran77 profiling: yes
 Fortran90 profiling: yes
  C++ exceptions: no
  Thread support: posix (mpi: no, progress: no)
  Internal debug support: no
 MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
 libltdl support: 1
  MCA memory: malloc_h