Re: [OMPI users] submitted job stops

2008-04-10 Thread Reuti

Hi,

On 09.04.2008 at 22:17, Danesh Daroui wrote:

Mark Kosmowski wrote:

Danesh:

Have you tried "mpirun -np 4 --hostfile hosts hostname" to verify  
that

ompi is working?



When I run "mpirun -np 4 --hostfile hosts hostname" the same thing happens
and it just hangs. Could that be a clue?


Can you remote access from each node to each other node?


Yes, all nodes can access each other via SSH and can log in
without being prompted for a password.


If any node has more than 1 network device, are you using the ompi
options to specify which device to use?



Each node has one network interface which works properly.
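(For reference, when a node does have several interfaces, the one Open MPI should use can be selected with the btl_tcp_if_include MCA parameter; a sketch, assuming the interface is called eth0:

mpirun --mca btl_tcp_if_include eth0 -np 4 --hostfile hosts ./hw
)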


do you have any firewall on the machines, blocking certain ports?

-- Reuti



Regards,

Danesh



Good luck,

Mark



Message: 5
Date: Wed, 9 Apr 2008 14:15:34 +0200 (CEST)
From: "danes...@bredband.net" 
Subject: [OMPI users] Ang: Re:  submitted job stops


Actually my program is a very simple "Hello World" MPI program which
just prints the rank of each processor and then terminates. When I run
my program on a single-processor machine with e.g. 4 processes
(oversubscribing) it shows:

Hello world from processor with rank 0
Hello world from processor with rank 3
Hello world from processor with rank 1
Hello world from processor with rank 2

but when I use my remote machines everything just stops when
I run the program.

No I do not use any queuing system. I simply run it like this:

mpirun -np 4 --hostfile hosts ./hw

and then it just stops until I terminate it manually. As I said,
I monitored all machines (master + 2 slaves) and found out that
on all machines the "orted" daemon starts when I run the program, but
after a few seconds the daemon terminates. What can be the reason?

Thanks,

Danesh





Original message
From: re...@staff.uni-marburg.de
Date: 09-04-2008 13:26
To: "Open MPI Users"
Subject: Re: [OMPI users] submitted job stops

Hi,

On 08.04.2008 at 21:58, Danesh Daroui wrote:

I had posted a message about my problem and tried all the suggested
solutions, but the problem is not solved. The problem is that
I have installed Open MPI on three machines (1 master + 2 slaves).
When I submit a job to the master I can see that the
"orted" daemon is launched on all machines (by running "top" on all
machines), but all "orted" daemons terminate after a
few seconds and nothing happens. First I thought it could be
because the remote machines cannot launch "orted", but
now I am sure that it can be run on all machines without
problem. Any suggestion?

The question is more: is your MPI program running successfully,
or is there simply no output from mpiexec/mpirun? And by "submit",
do you mean you use a queuing system?

-- Reuti
















[OMPI users] problems with hostfile when doing MPMD

2008-04-10 Thread jody
Hi
In my network I have some 32-bit machines and some 64-bit machines.
With --host I successfully call my application:
  mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest :
-np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
(MPITest64 has the same code as MPITest, but was compiled on the 64-bit machine)

But when I use hostfiles:
  mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest :
-np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
all 6 processes are started on the 64-bit machine aim-fanta4.

hosts32:
   aim-plankton slots=3
hosts64
  aim-fanta4 slots

Is this a bug or a feature?  ;)

Jody


Re: [OMPI users] problems with hostfile when doing MPMD

2008-04-10 Thread jody
Hi
Using a more realistic application than a simple "Hello, world",
even the --host version doesn't work correctly.
Called this way:

mpirun -np 3 --host aim-plankton ./QHGLauncher
--read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt

the application starts but seems to hang after a while.

Running the application in gdb:

mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher
--read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
-x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg
-o bruzlopf -n 12
--seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim

I can see that the processes on aim-fanta4 have indeed gotten stuck
after a few initial outputs,
and the processes on aim-plankton all show the message:

[aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113

If I only use aim-plankton alone or aim-fanta4 alone, everything runs
as expected.

BTW: I'm using Open MPI 1.2.2

Thanks
  Jody
On Thu, Apr 10, 2008 at 12:40 PM, jody  wrote:
> HI
>  In my network i have some 32 bit machines and some 64 bit machines.
>  With --host i successfully call my application:
>   mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest :
>  -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
>  (MPITest64 has the same code as MPITest, but was compiled on the 64 bit 
> machine)
>
>  But when i use hostfiles:
>   mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest :
>  -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
>  all 6 processes are started on the 64 bit machine aim-fanta4.
>
>  hosts32:
>aim-plankton slots=3
>  hosts64
>   aim-fanta4 slots
>
>  Is this a bug or a feature?  ;)
>
>  Jody
>


Re: [OMPI users] problems with hostfile when doing MPMD

2008-04-10 Thread jody
i narrowed it down:
The majority of processes get stuck in MPI_Barrier.
My Test application looks like this:

#include <stdio.h>
#include <unistd.h>
#include "mpi.h"

int main(int iArgC, char *apArgV[]) {
    int iResult = 0;
    int iRank1;
    int iNum1;

    char sName[256];
    gethostname(sName, 255);

    MPI_Init(&iArgC, &apArgV);

    MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
    MPI_Comm_size(MPI_COMM_WORLD, &iNum1);

    printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);

    MPI_Finalize();

    return iResult;
}


If i make this call:
mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY
./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY
./run_gdb.sh ./MPITest64

(run_gdb.sh is a script which starts gdb in an xterm for each process)
Process 0 (on aim-plankton) passes the barrier and gets stuck in PMPI_Finalize,
all other processes get stuck in PMPI_Barrier,
Process 1 (on aim-plankton) displays the message
   
[aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
Process 2 (on aim-plankton) displays the same message twice.
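(For reference, such a run_gdb.sh wrapper can be as small as the following sketch; this is an assumption, not the poster's actual script:

#!/bin/sh
# open one xterm per MPI rank and run the given program under gdb
exec xterm -e gdb --args "$@"
)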

Any ideas?

  Thanks Jody

On Thu, Apr 10, 2008 at 1:05 PM, jody  wrote:
> Hi
>  Using a more realistic application than a simple "Hello, world"
>  even the --host version doesn't work correctly
>  Called this way
>
>  mpirun -np 3 --host aim-plankton ./QHGLauncher
>  --read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
>  ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt
>
>  the application starts but seems to hang after a while.
>
>  Running the application in gdb:
>
>  mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher
>  --read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
>  -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg
>  -o bruzlopf -n 12
>  --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim
>
>  i can see that the processes on aim-fanta4 have indeed gotten stuck
>  after a few initial outputs,
>  and the processes on aim-plankton all have a messsage:
>
>  
> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>  connect() failed with errno=113
>
>  If i opnly use aim-plankton alone or aim-fanta4 alone everythiung runs
>  as expected.
>
>  BTW: i'm, using open MPI 1.2.2
>
>  Thanks
>   Jody
>
>
> On Thu, Apr 10, 2008 at 12:40 PM, jody  wrote:
>  > HI
>  >  In my network i have some 32 bit machines and some 64 bit machines.
>  >  With --host i successfully call my application:
>  >   mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest :
>  >  -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
>  >  (MPITest64 has the same code as MPITest, but was compiled on the 64 bit 
> machine)
>  >
>  >  But when i use hostfiles:
>  >   mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest :
>  >  -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
>  >  all 6 processes are started on the 64 bit machine aim-fanta4.
>  >
>  >  hosts32:
>  >aim-plankton slots=3
>  >  hosts64
>  >   aim-fanta4 slots
>  >
>  >  Is this a bug or a feature?  ;)
>  >
>  >  Jody
>  >
>


Re: [OMPI users] problems with hostfile when doing MPMD

2008-04-10 Thread Rolf Vandevaart


This worked for me, although I am not sure how extensive our 32/64 
interoperability support is.  I tested on Solaris using the TCP 
interconnect and a 1.2.5 version of Open MPI.  Also, we configure with 
the --enable-heterogeneous flag, which may make a difference here.  Note that 
this did not work for me over the sm btl.
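A configure invocation along those lines might look like the following sketch (the prefix is just a placeholder):

  ./configure --prefix=/opt/openmpi-1.2.5 --enable-heterogeneous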


By the way, can you run a simple /bin/hostname across the two nodes?


 burl-ct-v20z-4 61 => /opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o simple.32
 burl-ct-v20z-4 62 => /opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o simple.64
 burl-ct-v20z-4 63 => /opt/SUNWhpc/HPC7.1/bin/mpirun -gmca btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3 simple.32 : -host burl-ct-v20z-5 -np 3 simple.64

[burl-ct-v20z-4]I am #0/6 before the barrier
[burl-ct-v20z-5]I am #3/6 before the barrier
[burl-ct-v20z-5]I am #4/6 before the barrier
[burl-ct-v20z-4]I am #1/6 before the barrier
[burl-ct-v20z-4]I am #2/6 before the barrier
[burl-ct-v20z-5]I am #5/6 before the barrier
[burl-ct-v20z-5]I am #3/6 after the barrier
[burl-ct-v20z-4]I am #1/6 after the barrier
[burl-ct-v20z-5]I am #5/6 after the barrier
[burl-ct-v20z-5]I am #4/6 after the barrier
[burl-ct-v20z-4]I am #2/6 after the barrier
[burl-ct-v20z-4]I am #0/6 after the barrier
 burl-ct-v20z-4 64 => /opt/SUNWhpc/HPC7.1/bin/mpirun -V
mpirun (Open MPI) 1.2.5r16572


Report bugs to http://www.open-mpi.org/community/help/
 burl-ct-v20z-4 65 =>


jody wrote:

i narrowed it down:
The majority of processes get stuck in MPI_Barrier.
My Test application looks like this:

#include <stdio.h>
#include <unistd.h>
#include "mpi.h"

int main(int iArgC, char *apArgV[]) {
int iResult = 0;
int iRank1;
int iNum1;

char sName[256];
gethostname(sName, 255);

MPI_Init(&iArgC, &apArgV);

MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
MPI_Comm_size(MPI_COMM_WORLD, &iNum1);

printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
MPI_Barrier(MPI_COMM_WORLD);
printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);

MPI_Finalize();

return iResult;
}


If i make this call:
mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY
./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY
./run_gdb.sh ./MPITest64

(run_gdb.sh is a script which starts gdb in a xterm for each process)
Process 0 (on aim-plankton) passes the barrier and gets stuck in PMPI_Finalize,
all other processes get stuck in PMPI_Barrier,
Process 1 (on aim-plankton) displays the message
   
[aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
Process 2 on (aim-plankton) displays the same message twice.

Any ideas?

  Thanks Jody

On Thu, Apr 10, 2008 at 1:05 PM, jody  wrote:

Hi
 Using a more realistic application than a simple "Hello, world"
 even the --host version doesn't work correctly
 Called this way

 mpirun -np 3 --host aim-plankton ./QHGLauncher
 --read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
 ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt

 the application starts but seems to hang after a while.

 Running the application in gdb:

 mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher
 --read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
 -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg
 -o bruzlopf -n 12
 --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim

 i can see that the processes on aim-fanta4 have indeed gotten stuck
 after a few initial outputs,
 and the processes on aim-plankton all have a messsage:

 
[aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
 connect() failed with errno=113

 If i opnly use aim-plankton alone or aim-fanta4 alone everythiung runs
 as expected.

 BTW: i'm, using open MPI 1.2.2

 Thanks
  Jody


On Thu, Apr 10, 2008 at 12:40 PM, jody  wrote:
 > HI
 >  In my network i have some 32 bit machines and some 64 bit machines.
 >  With --host i successfully call my application:
 >   mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest :
 >  -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
 >  (MPITest64 has the same code as MPITest, but was compiled on the 64 bit 
machine)
 >
 >  But when i use hostfiles:
 >   mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest :
 >  -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
 >  all 6 processes are started on the 64 bit machine aim-fanta4.
 >
 >  hosts32:
 >aim-plankton slots=3
 >  hosts64
 >   aim-fanta4 slots
 >
 >  Is this a bug or a feature?  ;)
 >
 >  Jody
 >





--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] problems with hostfile when doing MPMD

2008-04-10 Thread jody
Rolf,
I was able to run hostname on the two nodes that way,
and also a simplified version of my test program (without a barrier)
works. Only MPI_Barrier shows bad behaviour.

Do you know what this message means?
[aim-plankton][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
Does it give an idea what could be the problem?

Jody

On Thu, Apr 10, 2008 at 2:20 PM, Rolf Vandevaart
 wrote:
>
>  This worked for me although I am not sure how extensive our 32/64
>  interoperability support is.  I tested on Solaris using the TCP
>  interconnect and a 1.2.5 version of Open MPI.  Also, we configure with
>  the --enable-heterogeneous flag which may make a difference here.  Also
>  this did not work for me over the sm btl.
>
>  By the way, can you run a simple /bin/hostname across the two nodes?
>
>
>   burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o
>  simple.32
>   burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o
>  simple.64
>   burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca
>  btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3
>  simple.32 : -host burl-ct-v20z-5 -np 3 simple.64
>  [burl-ct-v20z-4]I am #0/6 before the barrier
>  [burl-ct-v20z-5]I am #3/6 before the barrier
>  [burl-ct-v20z-5]I am #4/6 before the barrier
>  [burl-ct-v20z-4]I am #1/6 before the barrier
>  [burl-ct-v20z-4]I am #2/6 before the barrier
>  [burl-ct-v20z-5]I am #5/6 before the barrier
>  [burl-ct-v20z-5]I am #3/6 after the barrier
>  [burl-ct-v20z-4]I am #1/6 after the barrier
>  [burl-ct-v20z-5]I am #5/6 after the barrier
>  [burl-ct-v20z-5]I am #4/6 after the barrier
>  [burl-ct-v20z-4]I am #2/6 after the barrier
>  [burl-ct-v20z-4]I am #0/6 after the barrier
>   burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V mpirun (Open
>  MPI) 1.2.5r16572
>
>  Report bugs to http://www.open-mpi.org/community/help/
>   burl-ct-v20z-4 65 =>
>
>
>
>
>  jody wrote:
>  > i narrowed it down:
>  > The majority of processes get stuck in MPI_Barrier.
>  > My Test application looks like this:
>  >
>  > #include 
>  > #include 
>  > #include "mpi.h"
>  >
>  > int main(int iArgC, char *apArgV[]) {
>  > int iResult = 0;
>  > int iRank1;
>  > int iNum1;
>  >
>  > char sName[256];
>  > gethostname(sName, 255);
>  >
>  > MPI_Init(&iArgC, &apArgV);
>  >
>  > MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
>  > MPI_Comm_size(MPI_COMM_WORLD, &iNum1);
>  >
>  > printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
>  > MPI_Barrier(MPI_COMM_WORLD);
>  > printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);
>  >
>  > MPI_Finalize();
>  >
>  > return iResult;
>  > }
>  >
>  >
>  > If i make this call:
>  > mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY
>  > ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY
>  > ./run_gdb.sh ./MPITest64
>  >
>  > (run_gdb.sh is a script which starts gdb in a xterm for each process)
>  > Process 0 (on aim-plankton) passes the barrier and gets stuck in 
> PMPI_Finalize,
>  > all other processes get stuck in PMPI_Barrier,
>  > Process 1 (on aim-plankton) displays the message
>  >
> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>  > connect() failed with errno=113
>  > Process 2 on (aim-plankton) displays the same message twice.
>  >
>  > Any ideas?
>  >
>  >   Thanks Jody
>  >
>  > On Thu, Apr 10, 2008 at 1:05 PM, jody  wrote:
>  >> Hi
>  >>  Using a more realistic application than a simple "Hello, world"
>  >>  even the --host version doesn't work correctly
>  >>  Called this way
>  >>
>  >>  mpirun -np 3 --host aim-plankton ./QHGLauncher
>  >>  --read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
>  >>  ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt
>  >>
>  >>  the application starts but seems to hang after a while.
>  >>
>  >>  Running the application in gdb:
>  >>
>  >>  mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher
>  >>  --read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
>  >>  -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg
>  >>  -o bruzlopf -n 12
>  >>  --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim
>  >>
>  >>  i can see that the processes on aim-fanta4 have indeed gotten stuck
>  >>  after a few initial outputs,
>  >>  and the processes on aim-plankton all have a messsage:
>  >>
>  >>  
> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>  >>  connect() failed with errno=113
>  >>
>  >>  If i opnly use aim-plankton alone or aim-fanta4 alone everythiung runs
>  >>  as expected.
>  >>
>  >>  BTW: i'm, using open MPI 1.2.2
>  >>
>  >>  Thanks
>  >>   Jody
>  >>
>  >>
>  >> On Thu, Apr 10, 2008 at 12:40 PM, jody  wrote:
>  >>  > HI
>  >>  >  In my network i have some 32 bit machines and some 64 bit machines.
>  >>  >  With --host 

Re: [OMPI users] problems with hostfile when doing MPMD

2008-04-10 Thread Rolf Vandevaart


On a CentOS Linux box, I see the following:

> grep 113 /usr/include/asm-i386/errno.h
#define EHOSTUNREACH    113 /* No route to host */

I have also seen folks do this to figure out the errno.

> perl -e 'die$!=113'
No route to host at -e line 1.

I am not sure why this is happening, but you could also check the Open 
MPI User's Mailing List Archives where there are other examples of 
people running into this error.  A search of "113" had a few hits.


http://www.open-mpi.org/community/lists/users

Also, I assume you would see this problem with or without the 
MPI_Barrier if you add this parameter to your mpirun line:


--mca mpi_preconnect_all 1

The MPI_Barrier is causing the bad behavior because by default 
connections are set up lazily. Therefore, only when the MPI_Barrier 
call is made and we start communicating and establishing connections do 
we start seeing the communication problems.
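For example (a sketch based on the command lines from earlier in this thread):

mpirun --mca mpi_preconnect_all 1 -np 3 --host aim-plankton ./MPITest32 : -np 3 --host aim-fanta4 ./MPITest64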


Rolf

jody wrote:

Rolf,
I was able to run hostname on the two noes that way,
and also a simplified version of my testprogram (without a barrier)
works. Only MPI_Barrier shows bad behaviour.

Do you know what this message means?
[aim-plankton][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
Does it give an idea what could be the problem?

Jody

On Thu, Apr 10, 2008 at 2:20 PM, Rolf Vandevaart
 wrote:

 This worked for me although I am not sure how extensive our 32/64
 interoperability support is.  I tested on Solaris using the TCP
 interconnect and a 1.2.5 version of Open MPI.  Also, we configure with
 the --enable-heterogeneous flag which may make a difference here.  Also
 this did not work for me over the sm btl.

 By the way, can you run a simple /bin/hostname across the two nodes?


  burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o
 simple.32
  burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o
 simple.64
  burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca
 btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3
 simple.32 : -host burl-ct-v20z-5 -np 3 simple.64
 [burl-ct-v20z-4]I am #0/6 before the barrier
 [burl-ct-v20z-5]I am #3/6 before the barrier
 [burl-ct-v20z-5]I am #4/6 before the barrier
 [burl-ct-v20z-4]I am #1/6 before the barrier
 [burl-ct-v20z-4]I am #2/6 before the barrier
 [burl-ct-v20z-5]I am #5/6 before the barrier
 [burl-ct-v20z-5]I am #3/6 after the barrier
 [burl-ct-v20z-4]I am #1/6 after the barrier
 [burl-ct-v20z-5]I am #5/6 after the barrier
 [burl-ct-v20z-5]I am #4/6 after the barrier
 [burl-ct-v20z-4]I am #2/6 after the barrier
 [burl-ct-v20z-4]I am #0/6 after the barrier
  burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V mpirun (Open
 MPI) 1.2.5r16572

 Report bugs to http://www.open-mpi.org/community/help/
  burl-ct-v20z-4 65 =>




 jody wrote:
 > i narrowed it down:
 > The majority of processes get stuck in MPI_Barrier.
 > My Test application looks like this:
 >
 > #include 
 > #include 
 > #include "mpi.h"
 >
 > int main(int iArgC, char *apArgV[]) {
 > int iResult = 0;
 > int iRank1;
 > int iNum1;
 >
 > char sName[256];
 > gethostname(sName, 255);
 >
 > MPI_Init(&iArgC, &apArgV);
 >
 > MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
 > MPI_Comm_size(MPI_COMM_WORLD, &iNum1);
 >
 > printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
 > MPI_Barrier(MPI_COMM_WORLD);
 > printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);
 >
 > MPI_Finalize();
 >
 > return iResult;
 > }
 >
 >
 > If i make this call:
 > mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY
 > ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY
 > ./run_gdb.sh ./MPITest64
 >
 > (run_gdb.sh is a script which starts gdb in a xterm for each process)
 > Process 0 (on aim-plankton) passes the barrier and gets stuck in 
PMPI_Finalize,
 > all other processes get stuck in PMPI_Barrier,
 > Process 1 (on aim-plankton) displays the message
 >
[aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
 > connect() failed with errno=113
 > Process 2 on (aim-plankton) displays the same message twice.
 >
 > Any ideas?
 >
 >   Thanks Jody
 >
 > On Thu, Apr 10, 2008 at 1:05 PM, jody  wrote:
 >> Hi
 >>  Using a more realistic application than a simple "Hello, world"
 >>  even the --host version doesn't work correctly
 >>  Called this way
 >>
 >>  mpirun -np 3 --host aim-plankton ./QHGLauncher
 >>  --read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
 >>  ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt
 >>
 >>  the application starts but seems to hang after a while.
 >>
 >>  Running the application in gdb:
 >>
 >>  mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher
 >>  --read-config=pureveg_new.cfg -o output.txt  : -np 3 --host aim-fanta4
 >>  -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg
 >>  -o bruzlopf -n 12
 >>  --

[OMPI users] cross compiler make problem with mpi 1.2.6

2008-04-10 Thread Bailey, Eric
Hi,
 
I found an archive email with the same basic error I am running into for
MPI 1.2.6; unfortunately, other than the question and a request for the
output, there was no email response on how it was solved.
 
The error:
 
../../../opal/.libs/libopen-pal.so: undefined reference to
`lt_libltdlc_LTX_preloaded_symbols'
 
Here is the email link for the 1.2.4 problem:
 
http://www.open-mpi.org/community/lists/users/2007/10/4310.php
 
The email is a response by Jeff Squyres to the originator Jorge Parra.
Can either of you help?
 
Here is my make output failure; it is basically identical to the one reported
for MPI 1.2.4:
 
make[2]: Entering directory
`/tmp/MPI/openmpi-1.2.6-7448/opal/tools/wrappers'
/bin/sh ../../../libtool --tag=CC   --mode=link ppc74xx-linux-gcc  -O3
-DNDEBUG -finline-functions -fno-strict-aliasing -pthread
-export-dynamic   -o opal_wrapper opal_wrapper.o
../../../opal/libopen-pal.la -lnsl -lutil  -lm 
libtool: link: ppc74xx-linux-gcc -O3 -DNDEBUG -finline-functions
-fno-strict-aliasing -pthread -o .libs/opal_wrapper opal_wrapper.o
-Wl,--export-dynamic  ../../../opal/.libs/libopen-pal.so -ldl -lnsl
-lutil -lm -pthread -Wl,-rpath
-Wl,/home/MPI/openmpi-1.2.6-install-7448/lib
../../../opal/.libs/libopen-pal.so: undefined reference to
`lt_libltdlc_LTX_preloaded_symbols'
collect2: ld returned 1 exit status
make[2]: *** [opal_wrapper] Error 1
make[2]: Leaving directory
`/tmp/MPI/openmpi-1.2.6-7448/opal/tools/wrappers'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/tmp/MPI/openmpi-1.2.6-7448/opal'
make: *** [all-recursive] Error 1

Any help is greatly appreciated.
 
thanks,
Eric Bailey


[OMPI users] configuring with --enable-mpi-profile option

2008-04-10 Thread Swati Kher
Hi,

 

If I configure Open MPI with the "--enable-mpi-profile" option:

 

1)   Once the build is complete, how do I specify the profile name and
location in the "mpirun" command? Do I have to set any flags with the
"mpirun" command to view the profile?

2)   If VampirTrace is built with Open MPI by default, and I set the
VT_CC flag for compiling my application, where can I view the ".vtf" files
after a parallel run?

 

Thanks in advance

 

--

Swati Kher

Application Performance Optimization Engineer

Mellanox Technologies

Work: 408-916-0037 x337

sw...@mellanox.com

 



Re: [OMPI users] Problem with MPI_Scatter() on inter-communicator...

2008-04-10 Thread Edgar Gabriel
Thanks for reporting the bug; it is fixed on the trunk. The problem was 
this time not in the algorithm, but in the checking of the 
preconditions. If recvcount was zero and the rank not equal to the rank 
of the root, then we did not even start the scatter, assuming that there 
was nothing to do. For inter-communicators, however, the check has to be 
extended to accept recvcount=0 for root=MPI_ROOT. The fix is in the 
trunk in rev. 18123.


Thanks
Edgar

Edgar Gabriel wrote:
I don't think that anybody has answered your email so far; I'll have a 
look at it on Thursday...


Thanks
Edgar

Audet, Martin wrote:

Hi,

I don't know if it is my sample code or if it is a problem with MPI_Scatter() 
on an inter-communicator (maybe similar to the problem we found with 
MPI_Allgather() on an inter-communicator a few weeks ago), but a simple program I 
wrote freezes during its second iteration of a loop doing an MPI_Scatter() over 
an inter-communicator.

For example if I compile as follows:

  mpicc -Wall scatter_bug.c -o scatter_bug

I get no error or warning. Then if I start it with np=2 as follows:

mpiexec -n 2 ./scatter_bug

it prints:

   beginning Scatter i_root_group=0
   ending Scatter i_root_group=0
   beginning Scatter i_root_group=1

and then hang...

Note also that if I change the for loop to execute only the MPI_Scatter() of the second iteration 
(e.g. replacing "i_root_group=0;" by "i_root_group=1;"), it prints:

beginning Scatter i_root_group=1

and then hang...

The problem therefore seems to be related with the second iteration itself.

Please note that this program runs fine with mpich2 1.0.7rc2 (ch3:sock device) 
for many different numbers of processes (np), whether the executable is run with or 
without valgrind.

The OpenMPI version I use is 1.2.6rc3 and was configured as follows:

   ./configure --prefix=/usr/local/openmpi-1.2.6rc3 --disable-mpi-f77 
--disable-mpi-f90 --disable-mpi-cxx --disable-cxx-exceptions 
--with-io-romio-flags=--with-file-system=ufs+nfs

Note also that all processes (when using OpenMPI or mpich2) were started on the 
same machine.

Also if you look at source code, you will notice that some arguments to MPI_Scatter() are 
NULL or 0. This may look strange and problematic when using a normal intra-communicator. 
However according to the book "MPI - The complete reference" vol 2 about MPI-2, 
for MPI_Scatter() with an inter-communicator:

  "The sendbuf, sendcount and sendtype arguments are significant only at the root 
process. The recvbuf, recvcount, and recvtype arguments are significant only at the 
processes of the leaf group."

If anyone else can have a look at this program and try it it would be helpful.

Thanks,

Martin


#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   int ret_code = 0;
   int comm_size, comm_rank;

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
   MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);

   if (comm_size > 1) {
  MPI_Comm subcomm, intercomm;
  const int group_id = comm_rank % 2;
  int i_root_group;

  /* split process in two groups:  even and odd comm_ranks. */
  MPI_Comm_split(MPI_COMM_WORLD, group_id, 0, &subcomm);

  /* The remote leader comm_rank for even and odd groups are respectively: 
1 and 0 */
  MPI_Intercomm_create(subcomm, 0, MPI_COMM_WORLD, 1-group_id, 0, 
&intercomm);

  /* for i_root_group==0 process with comm_rank==0 scatter data to all 
process with odd  comm_rank */
  /* for i_root_group==1 process with comm_rank==1 scatter data to all 
process with even comm_rank */
  for (i_root_group=0; i_root_group < 2; i_root_group++) {
 if (comm_rank == 0) {
printf("beginning Scatter i_root_group=%d\n",i_root_group);
 }
 if (group_id == i_root_group) {
const int  is_root  = (comm_rank == i_root_group);
int   *send_buf = NULL;
if (is_root) {
   const int nbr_other = (comm_size+i_root_group)/2;
   int   ii;
   send_buf = malloc(nbr_other*sizeof(*send_buf));
   for (ii=0; ii < nbr_other; ii++) {
   send_buf[ii] = ii;
   }
}
MPI_Scatter(send_buf, 1, MPI_INT,
NULL, 0, MPI_INT, (is_root ? MPI_ROOT : 
MPI_PROC_NULL), intercomm);

if (is_root) {
   free(send_buf);
}
 }
 else {
int an_int;
MPI_Scatter(NULL,0, MPI_INT,
&an_int, 1, MPI_INT, 0, intercomm);
 }
 if (comm_rank == 0) {
printf("ending Scatter i_root_group=%d\n",i_root_group);
 }
  }

  MPI_Comm_free(&intercomm);
  MPI_Comm_free(&subcomm);
   }
   else {
  fprintf(stderr, "%s: error this program must be started np > 1\n", 
argv[0]);
  ret_code = 1;
   }

   MPI_Finalize();

   return ret_code;
}


Re: [OMPI users] configuring with --enable-mpi-profile option

2008-04-10 Thread George Bosilca
I think you're expecting something that the MPI profiling interface is  
not supposed to provide you. There is no tool to dump or print any  
profile information by default (and it is not mandated by the  
standard). What this option does is compile the profiling interface  
(as defined by the MPI standard), allowing external tools to gather  
information about the MPI application.


But you need an extra tool.
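For illustration, such tools typically interpose on MPI calls through the name-shifted PMPI_ entry points. A minimal sketch of a hypothetical wrapper (MPI-2 era prototypes; not part of any shipped tool):

/* count_send.c: link this ahead of the MPI library to intercept MPI_Send */
#include <stdio.h>
#include <mpi.h>

static long send_count = 0;            /* number of MPI_Send calls observed */

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    send_count++;                      /* gather whatever statistics you like */
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize(void)
{
    printf("MPI_Send was called %ld times\n", send_count);
    return PMPI_Finalize();
}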

  george.

On Apr 10, 2008, at 10:41 AM, Swati Kher wrote:

Hi,

If I configure openmpi with "--enable-mpi-profile" option:

1)   Once build is complete, how do I specify profile name and  
location in the “mpirun” command? Do I have to set any flags with  
the “mpirun” command to view profile?
2)   If vampire trace by default is built with openmpi, if I set  
VT_CC flag for compiling my application, where I can view “.vtf”  
files after a parallel run ?


Thanks in advance

--
Swati Kher
Application Performance Optimization Engineer
Mellanox Technologies
Work: 408-916-0037 x337
sw...@mellanox.com







Re: [OMPI users] Problem with MPI_Scatter() on inter-communicator...

2008-04-10 Thread Jeff Squyres

Edgar --

Can you file a CMR for v1.2?

On Apr 10, 2008, at 8:10 AM, Edgar Gabriel wrote:
thanks for reporting the bug, it is fixed on the trunk. The problem  
was

this time not in the algorithm, but in the checking of the
preconditions. If recvcount was zero and the rank not equal to the  
rank
of the root, than we did not even start the scatter, assuming that  
there

was nothing to do. For inter-communicators the check has to be however
extended to accept recvcount=0 for root=MPI_ROOT. The fix is in the
trunk in rev. 18123.

Thanks
Edgar

Edgar Gabriel wrote:

I don't think that anybody answered to your email so far, I'll have a
look at it on thursday...

Thanks
Edgar

Audet, Martin wrote:

Hi,

I don't know if it is my sample code or if it is a problem whit  
MPI_Scatter() on inter-communicator (maybe similar to the problem  
we found with MPI_Allgather() on inter-communicator a few weeks  
ago) but a simple program I wrote freeze during its second  
iteration of a loop doing an MPI_Scatter() over an inter- 
communicator.


For example if I compile as follows:

 mpicc -Wall scatter_bug.c -o scatter_bug

I get no error or warning. Then if a start it with np=2 as follows:

   mpiexec -n 2 ./scatter_bug

it prints:

  beginning Scatter i_root_group=0
  ending Scatter i_root_group=0
  beginning Scatter i_root_group=1

and then hang...

Note also that if I change the for loop to execute only the  
MPI_Scatter() of the second iteration (e.g. replacing  
"i_root_group=0;" by "i_root_group=1;"), it prints:


   beginning Scatter i_root_group=1

and then hang...

The problem therefore seems to be related with the second  
iteration itself.


Please note that this program run fine with mpich2 1.0.7rc2  
(ch3:sock device) for many different number of process (np) when  
the executable is ran with or without valgrind.


The OpenMPI version I use is 1.2.6rc3 and was configured as follows:

  ./configure --prefix=/usr/local/openmpi-1.2.6rc3 --disable-mpi- 
f77 --disable-mpi-f90 --disable-mpi-cxx --disable-cxx-exceptions -- 
with-io-romio-flags=--with-file-system=ufs+nfs


Note also that all process (when using OpenMPI or mpich2) were  
started on the same machine.


Also if you look at source code, you will notice that some  
arguments to MPI_Scatter() are NULL or 0. This may look strange  
and problematic when using a normal intra-communicator. However  
according to the book "MPI - The complete reference" vol 2 about  
MPI-2, for MPI_Scatter() with an inter-communicator:


 "The sendbuf, sendcount and sendtype arguments are significant  
only at the root process. The recvbuf, recvcount, and recvtype  
arguments are significant only at the processes of the leaf group."


If anyone else can have a look at this program and try it it would  
be helpful.


Thanks,

Martin


#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int ret_code = 0;
  int comm_size, comm_rank;

  MPI_Init(&argc, &argv);

  MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);

  if (comm_size > 1) {
 MPI_Comm subcomm, intercomm;
 const int group_id = comm_rank % 2;
 int i_root_group;

 /* split process in two groups:  even and odd comm_ranks. */
 MPI_Comm_split(MPI_COMM_WORLD, group_id, 0, &subcomm);

 /* The remote leader comm_rank for even and odd groups are  
respectively: 1 and 0 */
 MPI_Intercomm_create(subcomm, 0, MPI_COMM_WORLD, 1-group_id,  
0, &intercomm);


 /* for i_root_group==0 process with comm_rank==0 scatter data  
to all process with odd  comm_rank */
 /* for i_root_group==1 process with comm_rank==1 scatter data  
to all process with even comm_rank */

 for (i_root_group=0; i_root_group < 2; i_root_group++) {
if (comm_rank == 0) {
   printf("beginning Scatter i_root_group=%d 
\n",i_root_group);

}
if (group_id == i_root_group) {
   const int  is_root  = (comm_rank == i_root_group);
   int   *send_buf = NULL;
   if (is_root) {
  const int nbr_other = (comm_size+i_root_group)/2;
  int   ii;
  send_buf = malloc(nbr_other*sizeof(*send_buf));
  for (ii=0; ii < nbr_other; ii++) {
  send_buf[ii] = ii;
  }
   }
   MPI_Scatter(send_buf, 1, MPI_INT,
   NULL, 0, MPI_INT, (is_root ? MPI_ROOT :  
MPI_PROC_NULL), intercomm);


   if (is_root) {
  free(send_buf);
   }
}
else {
   int an_int;
   MPI_Scatter(NULL,0, MPI_INT,
   &an_int, 1, MPI_INT, 0, intercomm);
}
if (comm_rank == 0) {
   printf("ending Scatter i_root_group=%d\n",i_root_group);
}
 }

 MPI_Comm_free(&intercomm);
 MPI_Comm_free(&subcomm);
  }
  else {
 fprintf(stderr, "%s: error this program must be started np >  
1\n", argv[0]);

 ret_code = 1;
  }

  MPI_Finalize();

  retur

Re: [OMPI users] Problem with MPI_Scatter() on inter-communicator...

2008-04-10 Thread Edgar Gabriel

done...

Jeff Squyres wrote:

Edgar --

Can you file a CMR for v1.2?

On Apr 10, 2008, at 8:10 AM, Edgar Gabriel wrote:
thanks for reporting the bug, it is fixed on the trunk. The problem  
was

this time not in the algorithm, but in the checking of the
preconditions. If recvcount was zero and the rank not equal to the  
rank
of the root, than we did not even start the scatter, assuming that  
there

was nothing to do. For inter-communicators the check has to be however
extended to accept recvcount=0 for root=MPI_ROOT. The fix is in the
trunk in rev. 18123.

Thanks
Edgar

Edgar Gabriel wrote:

I don't think that anybody answered to your email so far, I'll have a
look at it on thursday...

Thanks
Edgar

Audet, Martin wrote:

Hi,

I don't know if it is my sample code or if it is a problem whit  
MPI_Scatter() on inter-communicator (maybe similar to the problem  
we found with MPI_Allgather() on inter-communicator a few weeks  
ago) but a simple program I wrote freeze during its second  
iteration of a loop doing an MPI_Scatter() over an inter- 
communicator.


For example if I compile as follows:

 mpicc -Wall scatter_bug.c -o scatter_bug

I get no error or warning. Then if a start it with np=2 as follows:

   mpiexec -n 2 ./scatter_bug

it prints:

  beginning Scatter i_root_group=0
  ending Scatter i_root_group=0
  beginning Scatter i_root_group=1

and then hang...

Note also that if I change the for loop to execute only the  
MPI_Scatter() of the second iteration (e.g. replacing  
"i_root_group=0;" by "i_root_group=1;"), it prints:


   beginning Scatter i_root_group=1

and then hang...

The problem therefore seems to be related with the second  
iteration itself.


Please note that this program run fine with mpich2 1.0.7rc2  
(ch3:sock device) for many different number of process (np) when  
the executable is ran with or without valgrind.


The OpenMPI version I use is 1.2.6rc3 and was configured as follows:

  ./configure --prefix=/usr/local/openmpi-1.2.6rc3 --disable-mpi- 
f77 --disable-mpi-f90 --disable-mpi-cxx --disable-cxx-exceptions -- 
with-io-romio-flags=--with-file-system=ufs+nfs


Note also that all process (when using OpenMPI or mpich2) were  
started on the same machine.


Also if you look at source code, you will notice that some  
arguments to MPI_Scatter() are NULL or 0. This may look strange  
and problematic when using a normal intra-communicator. However  
according to the book "MPI - The complete reference" vol 2 about  
MPI-2, for MPI_Scatter() with an inter-communicator:


 "The sendbuf, sendcount and sendtype arguments are significant  
only at the root process. The recvbuf, recvcount, and recvtype  
arguments are significant only at the processes of the leaf group."


If anyone else can have a look at this program and try it it would  
be helpful.


Thanks,

Martin


#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int ret_code = 0;
  int comm_size, comm_rank;

  MPI_Init(&argc, &argv);

  MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);

  if (comm_size > 1) {
 MPI_Comm subcomm, intercomm;
 const int group_id = comm_rank % 2;
 int i_root_group;

 /* split process in two groups:  even and odd comm_ranks. */
 MPI_Comm_split(MPI_COMM_WORLD, group_id, 0, &subcomm);

 /* The remote leader comm_rank for even and odd groups are  
respectively: 1 and 0 */
 MPI_Intercomm_create(subcomm, 0, MPI_COMM_WORLD, 1-group_id,  
0, &intercomm);


 /* for i_root_group==0 process with comm_rank==0 scatter data  
to all process with odd  comm_rank */
 /* for i_root_group==1 process with comm_rank==1 scatter data  
to all process with even comm_rank */

 for (i_root_group=0; i_root_group < 2; i_root_group++) {
if (comm_rank == 0) {
   printf("beginning Scatter i_root_group=%d 
\n",i_root_group);

}
if (group_id == i_root_group) {
   const int  is_root  = (comm_rank == i_root_group);
   int   *send_buf = NULL;
   if (is_root) {
  const int nbr_other = (comm_size+i_root_group)/2;
  int   ii;
  send_buf = malloc(nbr_other*sizeof(*send_buf));
  for (ii=0; ii < nbr_other; ii++) {
  send_buf[ii] = ii;
  }
   }
   MPI_Scatter(send_buf, 1, MPI_INT,
   NULL, 0, MPI_INT, (is_root ? MPI_ROOT :  
MPI_PROC_NULL), intercomm);


   if (is_root) {
  free(send_buf);
   }
}
else {
   int an_int;
   MPI_Scatter(NULL,0, MPI_INT,
   &an_int, 1, MPI_INT, 0, intercomm);
}
if (comm_rank == 0) {
   printf("ending Scatter i_root_group=%d\n",i_root_group);
}
 }

 MPI_Comm_free(&intercomm);
 MPI_Comm_free(&subcomm);
  }
  else {
 fprintf(stderr, "%s: error this program must be started np >  
1\n", argv[0]);

 ret_code = 1;
 

Re: [OMPI users] cross compiler make problem with mpi 1.2.6

2008-04-10 Thread Brian W. Barrett
Well, as a quick hack, you can try adding --disable-dlopen to the 
configure line.  It will disable the building of individual components 
(instead linking them into the main shared libraries).  It means that you 
have to be slightly more careful about which components you build, but in 
practice it usually makes things a little bit easier, especially when cross-compiling 
(fewer things to move around).
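A configure line along those lines might look like the following sketch (the --host/--build triplets are guesses based on the compiler name in your output and may need adjusting):

  ./configure CC=ppc74xx-linux-gcc --host=ppc74xx-linux --build=i686-pc-linux-gnu \
      --prefix=/home/MPI/openmpi-1.2.6-install-7448 --disable-dlopen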


Brian

On Thu, 10 Apr 2008, Bailey, Eric wrote:


Hi,

I found an archive email with the same basic error I am running into for
mpi 1.2.6, unfortunately other then the question and request for the
output, there was not an email response on how it was solved.

the error

../../../opal/.libs/libopen-pal.so: undefined reference to
`lt_libltdlc_LTX_preloaded_symbols'

Here is the email link for the 1.2.4 problem..

http://www.open-mpi.org/community/lists/users/2007/10/4310.php

The email is a response by Jeff Squyres to the originator Jorge Parra.
Can either of you help?

here is my make output failure.. basically identical to the one reported
for mpi 1.2.4

make[2]: Entering directory
`/tmp/MPI/openmpi-1.2.6-7448/opal/tools/wrappers'
/bin/sh ../../../libtool --tag=CC   --mode=link ppc74xx-linux-gcc  -O3
-DNDEBUG -finline-functions -fno-strict-aliasing -pthread
-export-dynamic   -o opal_wrapper opal_wrapper.o
../../../opal/libopen-pal.la -lnsl -lutil  -lm
libtool: link: ppc74xx-linux-gcc -O3 -DNDEBUG -finline-functions
-fno-strict-aliasing -pthread -o .libs/opal_wrapper opal_wrapper.o
-Wl,--export-dynamic  ../../../opal/.libs/libopen-pal.so -ldl -lnsl
-lutil -lm -pthread -Wl,-rpath
-Wl,/home/MPI/openmpi-1.2.6-install-7448/lib
../../../opal/.libs/libopen-pal.so: undefined reference to
`lt_libltdlc_LTX_preloaded_symbols'
collect2: ld returned 1 exit status
make[2]: *** [opal_wrapper] Error 1
make[2]: Leaving directory
`/tmp/MPI/openmpi-1.2.6-7448/opal/tools/wrappers'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/tmp/MPI/openmpi-1.2.6-7448/opal'
make: *** [all-recursive] Error 1

Any help is greatly appreciated.

thanks,
Eric Bailey



Re: [OMPI users] configuring with --enable-mpi-profile option

2008-04-10 Thread Swati Kher

But if Open MPI is installed, I can automatically instrument my
application with Vampir (i.e. I don't have to install vtf separately,
right?)

And I can view the results of my app's parallel run with VampirTrace?

-Original Message-
From: George Bosilca [mailto:bosi...@eecs.utk.edu] 
Sent: Thursday, April 10, 2008 8:31 AM
To: Open MPI Users
Cc: Swati Kher
Subject: Re: [OMPI users] configuring with --enable-mpi-profile option

I think you're expect something that the MPI profiling interface is  
not supposed to provide you. There is no tool to dump or print any  
profile information by default (and it is not mandated by the  
standard). What this option does, is compile the profiling interface  
(as defined by the MPI standard) allowing external tools to gather  
information about the MPI application.

But you need an extra tool.

   george.

On Apr 10, 2008, at 10:41 AM, Swati Kher wrote:
> Hi,
>
> If I configure openmpi with "-enable-mpi-profile" option:
>
> 1)   Once build is complete, how do I specify profile name and  
> location in the "mpirun" command? Do I have to set any flags with  
> the "mpirun" command to view profile?
> 2)   If vampire trace by default is built with openmpi, if I set  
> VT_CC flag for compiling my application, where I can view ".vtf"  
> files after a parallel run ?
>
> Thanks in advance
>
> --
> Swati Kher
> Application Performance Optimization Engineer
> Mellanox Technologies
> Work: 408-916-0037 x337
> sw...@mellanox.com
>




Re: [OMPI users] submitted job stops

2008-04-10 Thread Danesh Daroui
Thanks Reuti. It works now. I just disabled the firewall on all machines 
since Open MPI uses random ports each time.
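(If disabling the firewall entirely is not desirable, another option is to pin the TCP traffic to a fixed port range and open only that range in the firewall; a sketch, assuming MCA parameters that exist in recent Open MPI releases and may differ in older ones:

mpirun --mca btl_tcp_port_min_v4 46000 --mca btl_tcp_port_range_v4 100 -np 4 --hostfile hosts ./hw
)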


Thanks again!

Danesh



Reuti wrote:

Hi,

On 09.04.2008 at 22:17, Danesh Daroui wrote:
  

Mark Kosmowski wrote:


Danesh:

Have you tried "mpirun -np 4 --hostfile hosts hostname" to verify  
that

ompi is working?

  

When I run "mpirun -np 4 --hostfile hosts hostname" same thing happens
and it just hangs. Can it be a clue?



Can you remote access from each node to each other node?

  

Yes all nodes can have access to each other via SSH and can login
without being prompted for password.



If any node has more than 1 network device, are you using the ompi
options to specify which device to use?

  

Each node has one network interface which works properly.



do you have any firewall on the machines, blocking certain ports?

-- Reuti


  

Regards,

Danesh




Good luck,

Mark


  

Message: 5
Date: Wed, 9 Apr 2008 14:15:34 +0200 (CEST)
From: "danes...@bredband.net" 
Subject: [OMPI users] Ang: Re:  submitted job stops


Actually my program is very simple MPI program "Hello World" which
just prints rank of each processor and then terminates. When I run
my program on a single processor machine with e.g 4 processors
(oversubscribing) it shows:

Hello world from processor with rank 0
Hello world from processor with rank 3
Hello world from processor with rank 1
Hello world from processor with rank 2

but when I use my remote machines everything just stops when
I run the program.

No I do not use any queuing system. I simply run it like this:

mpirun -np 4 --hostfile hosts ./hw

and then it just stops until I terminate it manually. As I said,
I monitored all machines (master+2 slaves) and found out that
in all machines, "orted" daemon starts when I run the program, but
after few seconds the daemon is terminated. What can be the reason?

Thanks,

Danesh






Original message
From: re...@staff.uni-marburg.de
Date: 09-04-2008 13:26
To: "Open MPI Users"
Subject: Re: [OMPI users] submitted job stops

Hi,

On 08.04.2008 at 21:58, Danesh Daroui wrote:

  
I had posted a message about my problem and I did all solutions  
but

the
problem is not solved it. The problem is that
I have installed Open-MPI on three machines (1 master+2 slaves).
When I
submit a job to master I can see that
"orted" daemon is launched on all machines (by running "top" on  
all

machines) but all "orted" daemons terminate after
few seconds and nothing will happen. First I thought that it  
can be

because remote machines can not launch "orted" but
now I am sure that it can be run on all machines without  
problem. Any

suggestion?


the question is more: is your MPI program running successfully  
or is

there simply no output from mpiexec/-run? And: by "submit" you mean
you use any queuingsystem?

-- Reuti


  








  





  


[OMPI users] Troubles with MPI-IO Test and Torque/PVFS

2008-04-10 Thread Davi Vercillo C. Garcia
Hi all,

I have a cluster with Torque and PVFS. I'm trying to test my
environment with MPI-IO Test but some segfaults are occurring.
Does anyone know what is happening? The error output is below:

Rank 1 Host campogrande03.dcc.ufrj.br WARNING ERROR 1207853304: 1 bad
bytes at file offset 0.  Expected (null), received (null)
Rank 2 Host campogrande02.dcc.ufrj.br WARNING ERROR 1207853304: 1 bad
bytes at file offset 0.  Expected (null), received (null)
[campogrande01:10646] *** Process received signal ***
Rank 0 Host campogrande04.dcc.ufrj.br WARNING ERROR 1207853304: 1 bad
bytes at file offset 0.  Expected (null), received (null)
Rank 0 Host campogrande04.dcc.ufrj.br WARNING ERROR 1207853304: 65537
bad bytes at file offset 0.  Expected (null), received (null)
[campogrande04:05192] *** Process received signal ***
[campogrande04:05192] Signal: Segmentation fault (11)
[campogrande04:05192] Signal code: Address not mapped (1)
[campogrande04:05192] Failing at address: 0x1
Rank 1 Host campogrande03.dcc.ufrj.br WARNING ERROR 1207853304: 65537
bad bytes at file offset 0.  Expected (null), received (null)
[campogrande03:05377] *** Process received signal ***
[campogrande03:05377] Signal: Segmentation fault (11)
[campogrande03:05377] Signal code: Address not mapped (1)
[campogrande03:05377] Failing at address: 0x1
[campogrande03:05377] [ 0] [0xe440]
[campogrande03:05377] [ 1]
/lib/tls/i686/cmov/libc.so.6(vsnprintf+0xb4) [0xb7d5fef4]
[campogrande03:05377] [ 2] mpiIO_test(make_error_messages+0xcf) [0x80502e4]
[campogrande03:05377] [ 3] mpiIO_test(warning_msg+0x8c) [0x8050569]
[campogrande03:05377] [ 4] mpiIO_test(report_errs+0xe2) [0x804d413]
[campogrande03:05377] [ 5] mpiIO_test(read_write_file+0x594) [0x804d9c2]
[campogrande03:05377] [ 6] mpiIO_test(main+0x1d0) [0x804aa14]
[campogrande03:05377] [ 7]
/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe0) [0xb7d15050]
[campogrande03:05377] [ 8] mpiIO_test [0x804a7e1]
[campogrande03:05377] *** End of error message ***
Rank 2 Host campogrande02.dcc.ufrj.br WARNING ERROR 1207853304: 65537
bad bytes at file offset 0.  Expected (null), received (null)
[campogrande02:05187] *** Process received signal ***
[campogrande02:05187] Signal: Segmentation fault (11)
[campogrande02:05187] Signal code: Address not mapped (1)
[campogrande02:05187] Failing at address: 0x1
[campogrande01:10646] Signal: Segmentation fault (11)
[campogrande01:10646] Signal code: Address not mapped (1)
[campogrande01:10646] Failing at address: 0x1a
[campogrande02:05187] [ 0] [0xe440]
[campogrande02:05187] [ 1]
/lib/tls/i686/cmov/libc.so.6(vsnprintf+0xb4) [0xb7d5fef4]
[campogrande02:05187] [ 2] mpiIO_test(make_error_messages+0xcf) [0x80502e4]
[campogrande02:05187] [ 3] mpiIO_test(warning_msg+0x8c) [0x8050569]
[campogrande02:05187] [ 4] mpiIO_test(report_errs+0xe2) [0x804d413]
[campogrande02:05187] [ 5] mpiIO_test(read_write_file+0x594) [0x804d9c2]
[campogrande02:05187] [ 6] mpiIO_test(main+0x1d0) [0x804aa14]
[campogrande02:05187] [ 7]
/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe0) [0xb7d15050]
[campogrande02:05187] [ 8] mpiIO_test [0x804a7e1]
[campogrande02:05187] *** End of error message ***
[campogrande04:05192] [ 0] [0xe440]
[campogrande04:05192] [ 1]
/lib/tls/i686/cmov/libc.so.6(vsnprintf+0xb4) [0xb7d5fef4]
[campogrande04:05192] [ 2] mpiIO_test(make_error_messages+0xcf) [0x80502e4]
[campogrande04:05192] [ 3] mpiIO_test(warning_msg+0x8c) [0x8050569]
[campogrande04:05192] [ 4] mpiIO_test(report_errs+0xe2) [0x804d413]
[campogrande04:05192] [ 5] mpiIO_test(read_write_file+0x594) [0x804d9c2]
[campogrande04:05192] [ 6] mpiIO_test(main+0x1d0) [0x804aa14]
[campogrande04:05192] [ 7]
/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe0) [0xb7d15050]
[campogrande04:05192] [ 8] mpiIO_test [0x804a7e1]
[campogrande04:05192] *** End of error message ***
[campogrande01:10646] [ 0] [0xe440]
[campogrande01:10646] [ 1]
/lib/tls/i686/cmov/libc.so.6(vsnprintf+0xb4) [0xb7d5fef4]
[campogrande01:10646] [ 2] mpiIO_test(make_error_messages+0xcf) [0x80502e4]
[campogrande01:10646] [ 3] mpiIO_test(warning_msg+0x8c) [0x8050569]
[campogrande01:10646] [ 4] mpiIO_test(report_errs+0xe2) [0x804d413]
[campogrande01:10646] [ 5] mpiIO_test(read_write_file+0x594) [0x804d9c2]
[campogrande01:10646] [ 6] mpiIO_test(main+0x1d0) [0x804aa14]
[campogrande01:10646] [ 7]
/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe0) [0xb7d15050]
[campogrande01:10646] [ 8] mpiIO_test [0x804a7e1]
[campogrande01:10646] *** End of error message ***
mpiexec noticed that job rank 0 with PID 5192 on node campogrande04
exited on signal 11 (Segmentation fault).

-- 
Davi Vercillo Carneiro Garcia

Universidade Federal do Rio de Janeiro
Departamento de Ciência da Computação
DCC-IM/UFRJ - http://www.dcc.ufrj.br

"Good things come to those who... wait." - Debian Project

"A computer is like air conditioning: it becomes useless when you open
windows." - Linus Torvalds

"Há duas coisas infinitas, o universo e a burrice humana. E eu estou
em dú

[OMPI users] i386 with x64

2008-04-10 Thread clarkmpi
Thanks to those who answered my post in the past. I have to admit that you lost 
me about halfway through the thread.

I was able to get 2 of my systems cranked up and was about to put a third 
system online when I remembered it was running the x64 version of the OS.
Can I just recompile the code on the x64 system and put it in the same home 
directory used by all the systems? I'm not sharing the directory across 
systems, but after doing this three or four times across just 2 systems, I can 
see why sharing would be advantageous.





Re: [OMPI users] i386 with x64

2008-04-10 Thread Aurélien Bouteiller
Open MPI can manage heterogeneous systems, though you may prefer to avoid  
this because it has a performance penalty. I suggest you compile on  
the 32-bit machine and use the same version everywhere.
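A minimal sketch of that workflow (hostnames and paths are placeholders, and it assumes the x64 node can execute 32-bit binaries):

  # compile on the 32-bit machine
  mpicc hw.c -o $HOME/mpi/hw
  # copy the same binary to the same path on the x64 node
  scp $HOME/mpi/hw x64node:$HOME/mpi/hw
  # launch across both machines
  mpirun -np 4 --hostfile hosts $HOME/mpi/hw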


Aurelien
On 10 Apr 2008, at 18:09, clark...@clarktx.com wrote:
Thanks to those who answered my post in the past.  I have to admit  
that you lost me about half way through the thread.


I was able to get 2 of my systems cranked up and was about to put a  
third system online when I remembered it was running x64 version of  
OS.
Can I just recompile the code on the x64 system and put it in the  
same home directory used by all the systems?  I'm not sharing the  
directory across systems, but after doing this three or four times  
across just 2 systems, I can see why sharing would be advantages.







Re: [OMPI users] i386 with x64

2008-04-10 Thread clarkmpi
Thanks for the information. I'll try it out.


>Open MPI can manage heterogeneous system. Though you prefer to avoid
>this because it has a performance penalty. I suggest you compile on
>the 32bit machine and use the same version everywhere.

Aurelien
On 10 Apr 2008, at 18:09, clarkmpi_at_[hidden] wrote:
>> Thanks to those who answered my post in the past. I have to admit
>> that you lost me about half way through the thread.
>>
>> I was able to get 2 of my systems cranked up and was about to put a
>> third system online when I remembered it was running x64 version of
>> OS.
>> Can I just recompile the code on the x64 system and put it in the
>> same home directory used by all the systems? I'm not sharing the
>> directory across systems, but after doing this three or four times
>> across just 2 systems, I can see why sharing would be advantages.
>>