Re: [OMPI users] submitted job stops
Hi,

On 09.04.2008, at 22:17, Danesh Daroui wrote:

> Mark Kosmowski wrote:
>> Danesh: Have you tried "mpirun -np 4 --hostfile hosts hostname" to verify that ompi is working?
>
> When I run "mpirun -np 4 --hostfile hosts hostname" the same thing happens and it just hangs. Can that be a clue?
>
>> Can you remote access from each node to each other node?
>
> Yes, all nodes have access to each other via SSH and can log in without being prompted for a password.
>
>> If any node has more than 1 network device, are you using the ompi options to specify which device to use?
>
> Each node has one network interface, which works properly.

do you have any firewall on the machines, blocking certain ports?

-- Reuti

> Regards,
> Danesh
>
>> Good luck,
>> Mark
>>
>> Message: 5
>> Date: Wed, 9 Apr 2008 14:15:34 +0200 (CEST)
>> From: "danes...@bredband.net"
>> Subject: [OMPI users] Ang: Re: submitted job stops
>> To:
>> Message-ID: <24351656.56761207743334738.JavaMail.defaultUser@defaultHost>
>> Content-Type: text/plain; charset="ISO-8859-15"
>>
>> Actually my program is a very simple MPI "Hello World" program which just prints the rank of each processor and then terminates. When I run my program on a single-processor machine with e.g. 4 processes (oversubscribing) it shows:
>>
>> Hello world from processor with rank 0
>> Hello world from processor with rank 3
>> Hello world from processor with rank 1
>> Hello world from processor with rank 2
>>
>> but when I use my remote machines everything just stops when I run the program. No, I do not use any queuing system. I simply run it like this:
>>
>> mpirun -np 4 --hostfile hosts ./hw
>>
>> and then it just stops until I terminate it manually. As I said, I monitored all machines (master + 2 slaves) and found out that on all machines the "orted" daemon starts when I run the program, but after a few seconds the daemon is terminated. What can be the reason?
>>
>> Thanks,
>> Danesh
>>
>> Original message
>> From: re...@staff.uni-marburg.de
>> Date: 09-04-2008 13:26
>> To: "Open MPI Users"
>> Subject: Re: [OMPI users] submitted job stops
>>
>> Hi,
>>
>> On 08.04.2008, at 21:58, Danesh Daroui wrote:
>>
>>> I had posted a message about my problem and tried all the suggested solutions, but the problem is not solved. The problem is that I have installed Open MPI on three machines (1 master + 2 slaves). When I submit a job to the master I can see that the "orted" daemon is launched on all machines (by running "top" on all machines), but all "orted" daemons terminate after a few seconds and nothing happens. First I thought it could be because the remote machines cannot launch "orted", but now I am sure that it can be run on all machines without problems. Any suggestion?
>>
>> the question is more: is your MPI program running successfully, or is there simply no output from mpiexec/-run? And: by "submit" do you mean that you use any queuing system?
>>
>> -- Reuti
[OMPI users] problems with hostfile when doing MPMD
Hi,

In my network I have some 32-bit machines and some 64-bit machines. With --host I successfully call my application:

mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64

(MPITest64 has the same code as MPITest, but was compiled on the 64-bit machine.)

But when I use hostfiles:

mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest : -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64

all 6 processes are started on the 64-bit machine aim-fanta4.

hosts32:
aim-plankton slots=3

hosts64:
aim-fanta4 slots

Is this a bug or a feature? ;)

Jody
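A note on the hostfile format used above: Open MPI hostfiles take one host per line with an optional slot count, e.g. "aim-plankton slots=3". The quoted hosts64 shows "slots" with no value; assuming that is just a transcription artifact, the two files would normally look like the sketch below (the hostnames are taken from the post, the slot count for aim-fanta4 is a guess):

    # hosts32
    aim-plankton slots=3

    # hosts64 (slot count assumed here)
    aim-fanta4 slots=3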
Re: [OMPI users] problems with hostfile when doing MPMD
Hi,

Using a more realistic application than a simple "Hello, world", even the --host version doesn't work correctly. Called this way:

mpirun -np 3 --host aim-plankton ./QHGLauncher --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt

the application starts but seems to hang after a while.

Running the application in gdb:

mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg -o bruzlopf -n 12 --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim

I can see that the processes on aim-fanta4 have indeed gotten stuck after a few initial outputs, and the processes on aim-plankton all have a message:

[aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

If I use aim-plankton alone or aim-fanta4 alone, everything runs as expected.

BTW: I'm using Open MPI 1.2.2.

Thanks
Jody

On Thu, Apr 10, 2008 at 12:40 PM, jody wrote:
> HI
> In my network i have some 32 bit machines and some 64 bit machines.
> With --host i successfully call my application:
> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest :
> -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
> (MPITest64 has the same code as MPITest, but was compiled on the 64 bit machine)
>
> But when i use hostfiles:
> mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest :
> -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
> all 6 processes are started on the 64 bit machine aim-fanta4.
>
> hosts32:
>    aim-plankton slots=3
> hosts64
> aim-fanta4 slots
>
> Is this a bug or a feature? ;)
>
> Jody
Re: [OMPI users] problems with hostfile when doing MPMD
I narrowed it down: the majority of processes get stuck in MPI_Barrier. My test application looks like this:

#include <stdio.h>
#include <unistd.h>
#include "mpi.h"

int main(int iArgC, char *apArgV[]) {
    int iResult = 0;
    int iRank1;
    int iNum1;

    char sName[256];
    gethostname(sName, 255);

    MPI_Init(&iArgC, &apArgV);

    MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
    MPI_Comm_size(MPI_COMM_WORLD, &iNum1);

    printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);

    MPI_Finalize();

    return iResult;
}

If I make this call:

mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64

(run_gdb.sh is a script which starts gdb in an xterm for each process)

Process 0 (on aim-plankton) passes the barrier and gets stuck in PMPI_Finalize; all other processes get stuck in PMPI_Barrier. Process 1 (on aim-plankton) displays the message

[aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

Process 2 (on aim-plankton) displays the same message twice.

Any ideas?

Thanks Jody

On Thu, Apr 10, 2008 at 1:05 PM, jody wrote:
> Hi
> Using a more realistic application than a simple "Hello, world"
> even the --host version doesn't work correctly
> Called this way
>
> mpirun -np 3 --host aim-plankton ./QHGLauncher
> --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4
> ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt
>
> the application starts but seems to hang after a while.
>
> Running the application in gdb:
>
> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher
> --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4
> -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg
> -o bruzlopf -n 12
> --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim
>
> i can see that the processes on aim-fanta4 have indeed gotten stuck
> after a few initial outputs,
> and the processes on aim-plankton all have a messsage:
>
> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
> connect() failed with errno=113
>
> If i opnly use aim-plankton alone or aim-fanta4 alone everythiung runs
> as expected.
>
> BTW: i'm, using open MPI 1.2.2
>
> Thanks
> Jody
>
> On Thu, Apr 10, 2008 at 12:40 PM, jody wrote:
> > HI
> > In my network i have some 32 bit machines and some 64 bit machines.
> > With --host i successfully call my application:
> > mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest :
> > -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
> > (MPITest64 has the same code as MPITest, but was compiled on the 64 bit machine)
> >
> > But when i use hostfiles:
> > mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest :
> > -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
> > all 6 processes are started on the 64 bit machine aim-fanta4.
> >
> > hosts32:
> >    aim-plankton slots=3
> > hosts64
> > aim-fanta4 slots
> >
> > Is this a bug or a feature? ;)
> >
> > Jody
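For readers trying to reproduce this setup: the run_gdb.sh wrapper itself was not posted. A minimal sketch of such a script, assuming xterm and gdb are available on every node (a hypothetical reconstruction, not the poster's actual script):

    #!/bin/sh
    # Start one xterm per rank and run the given program under gdb,
    # forwarding any remaining arguments to the program.
    prog="$1"
    shift
    exec xterm -e gdb --args "$prog" "$@"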
Re: [OMPI users] problems with hostfile when doing MPMD
This worked for me although I am not sure how extensive our 32/64 interoperability support is. I tested on Solaris using the TCP interconnect and a 1.2.5 version of Open MPI. Also, we configure with the --enable-heterogeneous flag which may make a difference here. Also this did not work for me over the sm btl. By the way, can you run a simple /bin/hostname across the two nodes? burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o simple.32 burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o simple.64 burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3 simple.32 : -host burl-ct-v20z-5 -np 3 simple.64 [burl-ct-v20z-4]I am #0/6 before the barrier [burl-ct-v20z-5]I am #3/6 before the barrier [burl-ct-v20z-5]I am #4/6 before the barrier [burl-ct-v20z-4]I am #1/6 before the barrier [burl-ct-v20z-4]I am #2/6 before the barrier [burl-ct-v20z-5]I am #5/6 before the barrier [burl-ct-v20z-5]I am #3/6 after the barrier [burl-ct-v20z-4]I am #1/6 after the barrier [burl-ct-v20z-5]I am #5/6 after the barrier [burl-ct-v20z-5]I am #4/6 after the barrier [burl-ct-v20z-4]I am #2/6 after the barrier [burl-ct-v20z-4]I am #0/6 after the barrier burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V mpirun (Open MPI) 1.2.5r16572 Report bugs to http://www.open-mpi.org/community/help/ burl-ct-v20z-4 65 => jody wrote: i narrowed it down: The majority of processes get stuck in MPI_Barrier. My Test application looks like this: #include #include #include "mpi.h" int main(int iArgC, char *apArgV[]) { int iResult = 0; int iRank1; int iNum1; char sName[256]; gethostname(sName, 255); MPI_Init(&iArgC, &apArgV); MPI_Comm_rank(MPI_COMM_WORLD, &iRank1); MPI_Comm_size(MPI_COMM_WORLD, &iNum1); printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1); MPI_Barrier(MPI_COMM_WORLD); printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1); MPI_Finalize(); return iResult; } If i make this call: mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64 (run_gdb.sh is a script which starts gdb in a xterm for each process) Process 0 (on aim-plankton) passes the barrier and gets stuck in PMPI_Finalize, all other processes get stuck in PMPI_Barrier, Process 1 (on aim-plankton) displays the message [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113 Process 2 on (aim-plankton) displays the same message twice. Any ideas? Thanks Jody On Thu, Apr 10, 2008 at 1:05 PM, jody wrote: Hi Using a more realistic application than a simple "Hello, world" even the --host version doesn't work correctly Called this way mpirun -np 3 --host aim-plankton ./QHGLauncher --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt the application starts but seems to hang after a while. 
Running the application in gdb: mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg -o bruzlopf -n 12 --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim i can see that the processes on aim-fanta4 have indeed gotten stuck after a few initial outputs, and the processes on aim-plankton all have a messsage: [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113 If i opnly use aim-plankton alone or aim-fanta4 alone everythiung runs as expected. BTW: i'm, using open MPI 1.2.2 Thanks Jody On Thu, Apr 10, 2008 at 12:40 PM, jody wrote: > HI > In my network i have some 32 bit machines and some 64 bit machines. > With --host i successfully call my application: > mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest : > -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64 > (MPITest64 has the same code as MPITest, but was compiled on the 64 bit machine) > > But when i use hostfiles: > mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest : > -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64 > all 6 processes are started on the 64 bit machine aim-fanta4. > > hosts32: >aim-plankton slots=3 > hosts64 > aim-fanta4 slots > > Is this a bug or a feature? ;) > > Jody > ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- = rolf.vandeva...@sun.com 781-442-3043 =
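Worth noting for anyone hitting the same connect() failure: errno=113 comes from the TCP BTL trying to open a connection between the two hosts, and the "-gmca btl_tcp_if_include bge1" flag in Rolf's command above restricts Open MPI to one specific network interface. If the aim-* machines have any interface without a route to the other host (a second NIC, a virtual bridge, and so on), pinning the TCP BTL to the known-good interface is a common first test. A hedged example, where eth0 is only a placeholder for whatever interface actually connects aim-plankton and aim-fanta4:

    mpirun --mca btl self,tcp --mca btl_tcp_if_include eth0 -np 3 --host aim-plankton ./MPITest : -np 3 --host aim-fanta4 ./MPITest64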
Re: [OMPI users] problems with hostfile when doing MPMD
Rolf, I was able to run hostname on the two noes that way, and also a simplified version of my testprogram (without a barrier) works. Only MPI_Barrier shows bad behaviour. Do you know what this message means? [aim-plankton][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113 Does it give an idea what could be the problem? Jody On Thu, Apr 10, 2008 at 2:20 PM, Rolf Vandevaart wrote: > > This worked for me although I am not sure how extensive our 32/64 > interoperability support is. I tested on Solaris using the TCP > interconnect and a 1.2.5 version of Open MPI. Also, we configure with > the --enable-heterogeneous flag which may make a difference here. Also > this did not work for me over the sm btl. > > By the way, can you run a simple /bin/hostname across the two nodes? > > > burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o > simple.32 > burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o > simple.64 > burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca > btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3 > simple.32 : -host burl-ct-v20z-5 -np 3 simple.64 > [burl-ct-v20z-4]I am #0/6 before the barrier > [burl-ct-v20z-5]I am #3/6 before the barrier > [burl-ct-v20z-5]I am #4/6 before the barrier > [burl-ct-v20z-4]I am #1/6 before the barrier > [burl-ct-v20z-4]I am #2/6 before the barrier > [burl-ct-v20z-5]I am #5/6 before the barrier > [burl-ct-v20z-5]I am #3/6 after the barrier > [burl-ct-v20z-4]I am #1/6 after the barrier > [burl-ct-v20z-5]I am #5/6 after the barrier > [burl-ct-v20z-5]I am #4/6 after the barrier > [burl-ct-v20z-4]I am #2/6 after the barrier > [burl-ct-v20z-4]I am #0/6 after the barrier > burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V mpirun (Open > MPI) 1.2.5r16572 > > Report bugs to http://www.open-mpi.org/community/help/ > burl-ct-v20z-4 65 => > > > > > jody wrote: > > i narrowed it down: > > The majority of processes get stuck in MPI_Barrier. > > My Test application looks like this: > > > > #include > > #include > > #include "mpi.h" > > > > int main(int iArgC, char *apArgV[]) { > > int iResult = 0; > > int iRank1; > > int iNum1; > > > > char sName[256]; > > gethostname(sName, 255); > > > > MPI_Init(&iArgC, &apArgV); > > > > MPI_Comm_rank(MPI_COMM_WORLD, &iRank1); > > MPI_Comm_size(MPI_COMM_WORLD, &iNum1); > > > > printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1); > > MPI_Barrier(MPI_COMM_WORLD); > > printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1); > > > > MPI_Finalize(); > > > > return iResult; > > } > > > > > > If i make this call: > > mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY > > ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY > > ./run_gdb.sh ./MPITest64 > > > > (run_gdb.sh is a script which starts gdb in a xterm for each process) > > Process 0 (on aim-plankton) passes the barrier and gets stuck in > PMPI_Finalize, > > all other processes get stuck in PMPI_Barrier, > > Process 1 (on aim-plankton) displays the message > > > [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] > > connect() failed with errno=113 > > Process 2 on (aim-plankton) displays the same message twice. > > > > Any ideas? 
> > > > Thanks Jody > > > > On Thu, Apr 10, 2008 at 1:05 PM, jody wrote: > >> Hi > >> Using a more realistic application than a simple "Hello, world" > >> even the --host version doesn't work correctly > >> Called this way > >> > >> mpirun -np 3 --host aim-plankton ./QHGLauncher > >> --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 > >> ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt > >> > >> the application starts but seems to hang after a while. > >> > >> Running the application in gdb: > >> > >> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher > >> --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 > >> -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg > >> -o bruzlopf -n 12 > >> --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim > >> > >> i can see that the processes on aim-fanta4 have indeed gotten stuck > >> after a few initial outputs, > >> and the processes on aim-plankton all have a messsage: > >> > >> > [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] > >> connect() failed with errno=113 > >> > >> If i opnly use aim-plankton alone or aim-fanta4 alone everythiung runs > >> as expected. > >> > >> BTW: i'm, using open MPI 1.2.2 > >> > >> Thanks > >> Jody > >> > >> > >> On Thu, Apr 10, 2008 at 12:40 PM, jody wrote: > >> > HI > >> > In my network i have some 32 bit machines and some 64 bit machines. > >> > With --host
Re: [OMPI users] problems with hostfile when doing MPMD
On a CentOS Linux box, I see the following: > grep 113 /usr/include/asm-i386/errno.h #define EHOSTUNREACH113 /* No route to host */ I have also seen folks do this to figure out the errno. > perl -e 'die$!=113' No route to host at -e line 1. I am not sure why this is happening, but you could also check the Open MPI User's Mailing List Archives where there are other examples of people running into this error. A search of "113" had a few hits. http://www.open-mpi.org/community/lists/users Also, I assume you would see this problem with or without the MPI_Barrier if you add this parameter to your mpirun line: --mca mpi_preconnect_all 1 The MPI_Barrier is causing the bad behavior because by default connections are setup up lazily. Therefore only when the MPI_Barrier call is made and we start communicating and establishing connections do we start seeing the communication problems. Rolf jody wrote: Rolf, I was able to run hostname on the two noes that way, and also a simplified version of my testprogram (without a barrier) works. Only MPI_Barrier shows bad behaviour. Do you know what this message means? [aim-plankton][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113 Does it give an idea what could be the problem? Jody On Thu, Apr 10, 2008 at 2:20 PM, Rolf Vandevaart wrote: This worked for me although I am not sure how extensive our 32/64 interoperability support is. I tested on Solaris using the TCP interconnect and a 1.2.5 version of Open MPI. Also, we configure with the --enable-heterogeneous flag which may make a difference here. Also this did not work for me over the sm btl. By the way, can you run a simple /bin/hostname across the two nodes? burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o simple.32 burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o simple.64 burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3 simple.32 : -host burl-ct-v20z-5 -np 3 simple.64 [burl-ct-v20z-4]I am #0/6 before the barrier [burl-ct-v20z-5]I am #3/6 before the barrier [burl-ct-v20z-5]I am #4/6 before the barrier [burl-ct-v20z-4]I am #1/6 before the barrier [burl-ct-v20z-4]I am #2/6 before the barrier [burl-ct-v20z-5]I am #5/6 before the barrier [burl-ct-v20z-5]I am #3/6 after the barrier [burl-ct-v20z-4]I am #1/6 after the barrier [burl-ct-v20z-5]I am #5/6 after the barrier [burl-ct-v20z-5]I am #4/6 after the barrier [burl-ct-v20z-4]I am #2/6 after the barrier [burl-ct-v20z-4]I am #0/6 after the barrier burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V mpirun (Open MPI) 1.2.5r16572 Report bugs to http://www.open-mpi.org/community/help/ burl-ct-v20z-4 65 => jody wrote: > i narrowed it down: > The majority of processes get stuck in MPI_Barrier. 
> My Test application looks like this: > > #include > #include > #include "mpi.h" > > int main(int iArgC, char *apArgV[]) { > int iResult = 0; > int iRank1; > int iNum1; > > char sName[256]; > gethostname(sName, 255); > > MPI_Init(&iArgC, &apArgV); > > MPI_Comm_rank(MPI_COMM_WORLD, &iRank1); > MPI_Comm_size(MPI_COMM_WORLD, &iNum1); > > printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1); > MPI_Barrier(MPI_COMM_WORLD); > printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1); > > MPI_Finalize(); > > return iResult; > } > > > If i make this call: > mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY > ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY > ./run_gdb.sh ./MPITest64 > > (run_gdb.sh is a script which starts gdb in a xterm for each process) > Process 0 (on aim-plankton) passes the barrier and gets stuck in PMPI_Finalize, > all other processes get stuck in PMPI_Barrier, > Process 1 (on aim-plankton) displays the message > [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] > connect() failed with errno=113 > Process 2 on (aim-plankton) displays the same message twice. > > Any ideas? > > Thanks Jody > > On Thu, Apr 10, 2008 at 1:05 PM, jody wrote: >> Hi >> Using a more realistic application than a simple "Hello, world" >> even the --host version doesn't work correctly >> Called this way >> >> mpirun -np 3 --host aim-plankton ./QHGLauncher >> --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 >> ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt >> >> the application starts but seems to hang after a while. >> >> Running the application in gdb: >> >> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher >> --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 >> -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg >> -o bruzlopf -n 12 >> --
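For anyone who prefers a tiny C helper to the grep/perl tricks above, the same errno lookup can be done with strerror() (generic POSIX code, nothing Open MPI specific):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Print the system's message for an errno value, e.g. 113 -> "No route to host". */
    int main(int argc, char **argv)
    {
        int err = (argc > 1) ? atoi(argv[1]) : 113;
        printf("errno %d: %s\n", err, strerror(err));
        return 0;
    }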
[OMPI users] cross compiler make problem with mpi 1.2.6
Hi,

I found an archive email with the same basic error I am running into with Open MPI 1.2.6; unfortunately, other than the question and a request for the output, there was no email response on how it was solved. The error:

../../../opal/.libs/libopen-pal.so: undefined reference to `lt_libltdlc_LTX_preloaded_symbols'

Here is the email link for the 1.2.4 problem: http://www.open-mpi.org/community/lists/users/2007/10/4310.php

The email is a response by Jeff Squyres to the originator, Jorge Parra. Can either of you help?

Here is my make output failure, basically identical to the one reported for MPI 1.2.4:

make[2]: Entering directory `/tmp/MPI/openmpi-1.2.6-7448/opal/tools/wrappers'
/bin/sh ../../../libtool --tag=CC --mode=link ppc74xx-linux-gcc -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -export-dynamic -o opal_wrapper opal_wrapper.o ../../../opal/libopen-pal.la -lnsl -lutil -lm
libtool: link: ppc74xx-linux-gcc -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -o .libs/opal_wrapper opal_wrapper.o -Wl,--export-dynamic ../../../opal/.libs/libopen-pal.so -ldl -lnsl -lutil -lm -pthread -Wl,-rpath -Wl,/home/MPI/openmpi-1.2.6-install-7448/lib
../../../opal/.libs/libopen-pal.so: undefined reference to `lt_libltdlc_LTX_preloaded_symbols'
collect2: ld returned 1 exit status
make[2]: *** [opal_wrapper] Error 1
make[2]: Leaving directory `/tmp/MPI/openmpi-1.2.6-7448/opal/tools/wrappers'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/tmp/MPI/openmpi-1.2.6-7448/opal'
make: *** [all-recursive] Error 1

Any help is greatly appreciated.

Thanks,
Eric Bailey
[OMPI users] configuring with --enable-mpi-profile option
Hi,

If I configure Open MPI with the "--enable-mpi-profile" option:

1) Once the build is complete, how do I specify the profile name and location in the "mpirun" command? Do I have to set any flags with the "mpirun" command to view the profile?
2) If VampirTrace is built with Open MPI by default, and I set the VT_CC flag for compiling my application, where can I view the ".vtf" files after a parallel run?

Thanks in advance

--
Swati Kher
Application Performance Optimization Engineer
Mellanox Technologies
Work: 408-916-0037 x337
sw...@mellanox.com
Re: [OMPI users] Problem with MPI_Scatter() on inter-communicator...
thanks for reporting the bug, it is fixed on the trunk. The problem was this time not in the algorithm, but in the checking of the preconditions. If recvcount was zero and the rank was not equal to the rank of the root, then we did not even start the scatter, assuming that there was nothing to do. For inter-communicators, however, the check has to be extended to accept recvcount=0 for root=MPI_ROOT. The fix is in the trunk in rev. 18123.

Thanks
Edgar

Edgar Gabriel wrote:

I don't think that anybody answered your email so far; I'll have a look at it on Thursday...

Thanks
Edgar

Audet, Martin wrote:

Hi,

I don't know if it is my sample code or if it is a problem with MPI_Scatter() on an inter-communicator (maybe similar to the problem we found with MPI_Allgather() on an inter-communicator a few weeks ago), but a simple program I wrote freezes during the second iteration of a loop doing an MPI_Scatter() over an inter-communicator.

For example if I compile it as follows:

mpicc -Wall scatter_bug.c -o scatter_bug

I get no error or warning. Then if I start it with np=2 as follows:

mpiexec -n 2 ./scatter_bug

it prints:

beginning Scatter i_root_group=0
ending Scatter i_root_group=0
beginning Scatter i_root_group=1

and then hangs...

Note also that if I change the for loop to execute only the MPI_Scatter() of the second iteration (e.g. replacing "i_root_group=0;" by "i_root_group=1;"), it prints:

beginning Scatter i_root_group=1

and then hangs... The problem therefore seems to be related to the second iteration itself.

Please note that this program runs fine with mpich2 1.0.7rc2 (ch3:sock device) for many different numbers of processes (np), whether the executable is run with or without valgrind.

The OpenMPI version I use is 1.2.6rc3 and was configured as follows:

./configure --prefix=/usr/local/openmpi-1.2.6rc3 --disable-mpi-f77 --disable-mpi-f90 --disable-mpi-cxx --disable-cxx-exceptions --with-io-romio-flags=--with-file-system=ufs+nfs

Note also that all processes (when using OpenMPI or mpich2) were started on the same machine.

Also if you look at the source code, you will notice that some arguments to MPI_Scatter() are NULL or 0. This may look strange and problematic when using a normal intra-communicator. However, according to the book "MPI - The Complete Reference" vol. 2 about MPI-2, for MPI_Scatter() with an inter-communicator: "The sendbuf, sendcount and sendtype arguments are significant only at the root process. The recvbuf, recvcount, and recvtype arguments are significant only at the processes of the leaf group."

If anyone else can have a look at this program and try it, it would be helpful.

Thanks,

Martin

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   int ret_code = 0;
   int comm_size, comm_rank;

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
   MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);

   if (comm_size > 1) {
      MPI_Comm subcomm, intercomm;
      const int group_id = comm_rank % 2;
      int i_root_group;

      /* split processes in two groups: even and odd comm_ranks. */
      MPI_Comm_split(MPI_COMM_WORLD, group_id, 0, &subcomm);

      /* The remote leader comm_rank for even and odd groups are respectively: 1 and 0 */
      MPI_Intercomm_create(subcomm, 0, MPI_COMM_WORLD, 1-group_id, 0, &intercomm);

      /* for i_root_group==0 process with comm_rank==0 scatter data to all process with odd comm_rank */
      /* for i_root_group==1 process with comm_rank==1 scatter data to all process with even comm_rank */
      for (i_root_group=0; i_root_group < 2; i_root_group++) {
         if (comm_rank == 0) {
            printf("beginning Scatter i_root_group=%d\n", i_root_group);
         }
         if (group_id == i_root_group) {
            const int is_root = (comm_rank == i_root_group);
            int *send_buf = NULL;
            if (is_root) {
               const int nbr_other = (comm_size+i_root_group)/2;
               int ii;
               send_buf = malloc(nbr_other*sizeof(*send_buf));
               for (ii=0; ii < nbr_other; ii++) {
                  send_buf[ii] = ii;
               }
            }
            MPI_Scatter(send_buf, 1, MPI_INT, NULL, 0, MPI_INT,
                        (is_root ? MPI_ROOT : MPI_PROC_NULL), intercomm);
            if (is_root) {
               free(send_buf);
            }
         }
         else {
            int an_int;
            MPI_Scatter(NULL, 0, MPI_INT, &an_int, 1, MPI_INT, 0, intercomm);
         }
         if (comm_rank == 0) {
            printf("ending Scatter i_root_group=%d\n", i_root_group);
         }
      }
      MPI_Comm_free(&intercomm);
      MPI_Comm_free(&subcomm);
   }
   else {
      fprintf(stderr, "%s: error this program must be started with np > 1\n", argv[0]);
      ret_code = 1;
   }

   MPI_Finalize();

   return ret_code;
}
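To make the description of the fix concrete, the early-return check Edgar describes amounts to something like the sketch below. This is only an illustration of the logic, not the actual Open MPI source:

    #include <mpi.h>

    /* Sketch: may a rank skip the scatter entirely because it has nothing
     * to receive?  Before the fix, recvcount == 0 on any non-root rank
     * caused an immediate return.  On an inter-communicator the root-group
     * side passes root = MPI_ROOT (its recvcount is not significant and may
     * be 0), yet it still has to perform the sends, so it must not return
     * early. */
    static int scatter_may_return_early(int recvcount, int rank, int root, int is_intercomm)
    {
        if (is_intercomm) {
            if (MPI_PROC_NULL == root) {
                return 1;   /* non-participating member of the root group */
            }
            return (0 == recvcount && MPI_ROOT != root);  /* leaf rank with nothing to receive */
        }
        return (0 == recvcount && rank != root);          /* intra-communicator case */
    }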
Re: [OMPI users] configuring with --enable-mpi-profile option
I think you're expecting something that the MPI profiling interface is not supposed to provide. There is no tool to dump or print any profile information by default (and none is mandated by the standard). What this option does is compile the profiling interface (as defined by the MPI standard), allowing external tools to gather information about the MPI application. But you need an extra tool.

  george.

On Apr 10, 2008, at 10:41 AM, Swati Kher wrote:

> Hi,
>
> If I configure openmpi with the "--enable-mpi-profile" option:
>
> 1) Once the build is complete, how do I specify the profile name and location in the "mpirun" command? Do I have to set any flags with the "mpirun" command to view the profile?
> 2) If VampirTrace is built with openmpi by default, and I set the VT_CC flag for compiling my application, where can I view the ".vtf" files after a parallel run?
>
> Thanks in advance
>
> --
> Swati Kher
> Application Performance Optimization Engineer
> Mellanox Technologies
> Work: 408-916-0037 x337
> sw...@mellanox.com
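To make the distinction concrete: with the profiling layer built, every MPI function is also reachable under a PMPI_ name, and a tracing tool interposes its own MPI_ symbols that record whatever they want and then call through. A minimal sketch of such a wrapper (a generic illustration, not the code of any particular tool):

    #include <stdio.h>
    #include <mpi.h>

    /* Intercept MPI_Send: log the call, then forward to the real
     * implementation via the PMPI profiling entry point. */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        int rank;
        PMPI_Comm_rank(comm, &rank);
        fprintf(stderr, "[rank %d] MPI_Send: %d elements to rank %d (tag %d)\n",
                rank, count, dest, tag);
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }

Built into a library and linked ahead of the MPI library, this is essentially the mechanism tools such as VampirTrace rely on; the --enable-mpi-profile option only guarantees that the PMPI_ entry points exist.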
Re: [OMPI users] Problem with MPI_Scatter() on inter-communicator...
Edgar -- Can you file a CMR for v1.2? On Apr 10, 2008, at 8:10 AM, Edgar Gabriel wrote: thanks for reporting the bug, it is fixed on the trunk. The problem was this time not in the algorithm, but in the checking of the preconditions. If recvcount was zero and the rank not equal to the rank of the root, than we did not even start the scatter, assuming that there was nothing to do. For inter-communicators the check has to be however extended to accept recvcount=0 for root=MPI_ROOT. The fix is in the trunk in rev. 18123. Thanks Edgar Edgar Gabriel wrote: I don't think that anybody answered to your email so far, I'll have a look at it on thursday... Thanks Edgar Audet, Martin wrote: Hi, I don't know if it is my sample code or if it is a problem whit MPI_Scatter() on inter-communicator (maybe similar to the problem we found with MPI_Allgather() on inter-communicator a few weeks ago) but a simple program I wrote freeze during its second iteration of a loop doing an MPI_Scatter() over an inter- communicator. For example if I compile as follows: mpicc -Wall scatter_bug.c -o scatter_bug I get no error or warning. Then if a start it with np=2 as follows: mpiexec -n 2 ./scatter_bug it prints: beginning Scatter i_root_group=0 ending Scatter i_root_group=0 beginning Scatter i_root_group=1 and then hang... Note also that if I change the for loop to execute only the MPI_Scatter() of the second iteration (e.g. replacing "i_root_group=0;" by "i_root_group=1;"), it prints: beginning Scatter i_root_group=1 and then hang... The problem therefore seems to be related with the second iteration itself. Please note that this program run fine with mpich2 1.0.7rc2 (ch3:sock device) for many different number of process (np) when the executable is ran with or without valgrind. The OpenMPI version I use is 1.2.6rc3 and was configured as follows: ./configure --prefix=/usr/local/openmpi-1.2.6rc3 --disable-mpi- f77 --disable-mpi-f90 --disable-mpi-cxx --disable-cxx-exceptions -- with-io-romio-flags=--with-file-system=ufs+nfs Note also that all process (when using OpenMPI or mpich2) were started on the same machine. Also if you look at source code, you will notice that some arguments to MPI_Scatter() are NULL or 0. This may look strange and problematic when using a normal intra-communicator. However according to the book "MPI - The complete reference" vol 2 about MPI-2, for MPI_Scatter() with an inter-communicator: "The sendbuf, sendcount and sendtype arguments are significant only at the root process. The recvbuf, recvcount, and recvtype arguments are significant only at the processes of the leaf group." If anyone else can have a look at this program and try it it would be helpful. Thanks, Martin #include #include #include int main(int argc, char **argv) { int ret_code = 0; int comm_size, comm_rank; MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, &comm_size); MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank); if (comm_size > 1) { MPI_Comm subcomm, intercomm; const int group_id = comm_rank % 2; int i_root_group; /* split process in two groups: even and odd comm_ranks. 
*/ MPI_Comm_split(MPI_COMM_WORLD, group_id, 0, &subcomm); /* The remote leader comm_rank for even and odd groups are respectively: 1 and 0 */ MPI_Intercomm_create(subcomm, 0, MPI_COMM_WORLD, 1-group_id, 0, &intercomm); /* for i_root_group==0 process with comm_rank==0 scatter data to all process with odd comm_rank */ /* for i_root_group==1 process with comm_rank==1 scatter data to all process with even comm_rank */ for (i_root_group=0; i_root_group < 2; i_root_group++) { if (comm_rank == 0) { printf("beginning Scatter i_root_group=%d \n",i_root_group); } if (group_id == i_root_group) { const int is_root = (comm_rank == i_root_group); int *send_buf = NULL; if (is_root) { const int nbr_other = (comm_size+i_root_group)/2; int ii; send_buf = malloc(nbr_other*sizeof(*send_buf)); for (ii=0; ii < nbr_other; ii++) { send_buf[ii] = ii; } } MPI_Scatter(send_buf, 1, MPI_INT, NULL, 0, MPI_INT, (is_root ? MPI_ROOT : MPI_PROC_NULL), intercomm); if (is_root) { free(send_buf); } } else { int an_int; MPI_Scatter(NULL,0, MPI_INT, &an_int, 1, MPI_INT, 0, intercomm); } if (comm_rank == 0) { printf("ending Scatter i_root_group=%d\n",i_root_group); } } MPI_Comm_free(&intercomm); MPI_Comm_free(&subcomm); } else { fprintf(stderr, "%s: error this program must be started np > 1\n", argv[0]); ret_code = 1; } MPI_Finalize(); retur
Re: [OMPI users] Problem with MPI_Scatter() on inter-communicator...
done... Jeff Squyres wrote: Edgar -- Can you file a CMR for v1.2? On Apr 10, 2008, at 8:10 AM, Edgar Gabriel wrote: thanks for reporting the bug, it is fixed on the trunk. The problem was this time not in the algorithm, but in the checking of the preconditions. If recvcount was zero and the rank not equal to the rank of the root, than we did not even start the scatter, assuming that there was nothing to do. For inter-communicators the check has to be however extended to accept recvcount=0 for root=MPI_ROOT. The fix is in the trunk in rev. 18123. Thanks Edgar Edgar Gabriel wrote: I don't think that anybody answered to your email so far, I'll have a look at it on thursday... Thanks Edgar Audet, Martin wrote: Hi, I don't know if it is my sample code or if it is a problem whit MPI_Scatter() on inter-communicator (maybe similar to the problem we found with MPI_Allgather() on inter-communicator a few weeks ago) but a simple program I wrote freeze during its second iteration of a loop doing an MPI_Scatter() over an inter- communicator. For example if I compile as follows: mpicc -Wall scatter_bug.c -o scatter_bug I get no error or warning. Then if a start it with np=2 as follows: mpiexec -n 2 ./scatter_bug it prints: beginning Scatter i_root_group=0 ending Scatter i_root_group=0 beginning Scatter i_root_group=1 and then hang... Note also that if I change the for loop to execute only the MPI_Scatter() of the second iteration (e.g. replacing "i_root_group=0;" by "i_root_group=1;"), it prints: beginning Scatter i_root_group=1 and then hang... The problem therefore seems to be related with the second iteration itself. Please note that this program run fine with mpich2 1.0.7rc2 (ch3:sock device) for many different number of process (np) when the executable is ran with or without valgrind. The OpenMPI version I use is 1.2.6rc3 and was configured as follows: ./configure --prefix=/usr/local/openmpi-1.2.6rc3 --disable-mpi- f77 --disable-mpi-f90 --disable-mpi-cxx --disable-cxx-exceptions -- with-io-romio-flags=--with-file-system=ufs+nfs Note also that all process (when using OpenMPI or mpich2) were started on the same machine. Also if you look at source code, you will notice that some arguments to MPI_Scatter() are NULL or 0. This may look strange and problematic when using a normal intra-communicator. However according to the book "MPI - The complete reference" vol 2 about MPI-2, for MPI_Scatter() with an inter-communicator: "The sendbuf, sendcount and sendtype arguments are significant only at the root process. The recvbuf, recvcount, and recvtype arguments are significant only at the processes of the leaf group." If anyone else can have a look at this program and try it it would be helpful. Thanks, Martin #include #include #include int main(int argc, char **argv) { int ret_code = 0; int comm_size, comm_rank; MPI_Init(&argc, &argv); MPI_Comm_size(MPI_COMM_WORLD, &comm_size); MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank); if (comm_size > 1) { MPI_Comm subcomm, intercomm; const int group_id = comm_rank % 2; int i_root_group; /* split process in two groups: even and odd comm_ranks. 
*/ MPI_Comm_split(MPI_COMM_WORLD, group_id, 0, &subcomm); /* The remote leader comm_rank for even and odd groups are respectively: 1 and 0 */ MPI_Intercomm_create(subcomm, 0, MPI_COMM_WORLD, 1-group_id, 0, &intercomm); /* for i_root_group==0 process with comm_rank==0 scatter data to all process with odd comm_rank */ /* for i_root_group==1 process with comm_rank==1 scatter data to all process with even comm_rank */ for (i_root_group=0; i_root_group < 2; i_root_group++) { if (comm_rank == 0) { printf("beginning Scatter i_root_group=%d \n",i_root_group); } if (group_id == i_root_group) { const int is_root = (comm_rank == i_root_group); int *send_buf = NULL; if (is_root) { const int nbr_other = (comm_size+i_root_group)/2; int ii; send_buf = malloc(nbr_other*sizeof(*send_buf)); for (ii=0; ii < nbr_other; ii++) { send_buf[ii] = ii; } } MPI_Scatter(send_buf, 1, MPI_INT, NULL, 0, MPI_INT, (is_root ? MPI_ROOT : MPI_PROC_NULL), intercomm); if (is_root) { free(send_buf); } } else { int an_int; MPI_Scatter(NULL,0, MPI_INT, &an_int, 1, MPI_INT, 0, intercomm); } if (comm_rank == 0) { printf("ending Scatter i_root_group=%d\n",i_root_group); } } MPI_Comm_free(&intercomm); MPI_Comm_free(&subcomm); } else { fprintf(stderr, "%s: error this program must be started np > 1\n", argv[0]); ret_code = 1;
Re: [OMPI users] cross compiler make problem with mpi 1.2.6
Well, as a quick hack, you can try adding --disable-dlopen to the configure line. It will disable the building of individual components (instead linking them into the main shared libraries). It means that you have to be slightly more careful about which components you build, but in practice usually makes things a little bit easier, especially when cross compiling (less things to move around). Brian On Thu, 10 Apr 2008, Bailey, Eric wrote: Hi, I found an archive email with the same basic error I am running into for mpi 1.2.6, unfortunately other then the question and request for the output, there was not an email response on how it was solved. the error ../../../opal/.libs/libopen-pal.so: undefined reference to `lt_libltdlc_LTX_preloaded_symbols' Here is the email link for the 1.2.4 problem.. http://www.open-mpi.org/community/lists/users/2007/10/4310.php The email is a response by Jeff Squyres to the originator Jorge Parra. Can either of you help? here is my make output failure.. basically identical to the one reported for mpi 1.2.4 make[2]: Entering directory `/tmp/MPI/openmpi-1.2.6-7448/opal/tools/wrappers' /bin/sh ../../../libtool --tag=CC --mode=link ppc74xx-linux-gcc -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -export-dynamic -o opal_wrapper opal_wrapper.o ../../../opal/libopen-pal.la -lnsl -lutil -lm libtool: link: ppc74xx-linux-gcc -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -o .libs/opal_wrapper opal_wrapper.o -Wl,--export-dynamic ../../../opal/.libs/libopen-pal.so -ldl -lnsl -lutil -lm -pthread -Wl,-rpath -Wl,/home/MPI/openmpi-1.2.6-install-7448/lib ../../../opal/.libs/libopen-pal.so: undefined reference to `lt_libltdlc_LTX_preloaded_symbols' collect2: ld returned 1 exit status make[2]: *** [opal_wrapper] Error 1 make[2]: Leaving directory `/tmp/MPI/openmpi-1.2.6-7448/opal/tools/wrappers' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/tmp/MPI/openmpi-1.2.6-7448/opal' make: *** [all-recursive] Error 1 Any help is greatly appreciated. thanks, Eric Bailey
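For reference, a cross-compile configure line with that workaround might look like the sketch below. The target prefix ppc74xx-linux and the install prefix are taken from the make output earlier in the thread; the --build triplet and the rest are assumptions about this particular toolchain, not a tested recipe:

    ./configure --prefix=/home/MPI/openmpi-1.2.6-install-7448 \
        --host=ppc74xx-linux --build=i686-pc-linux-gnu \
        CC=ppc74xx-linux-gcc CXX=ppc74xx-linux-g++ \
        --disable-dlopen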
Re: [OMPI users] configuring with --enable-mpi-profile option
But if openmpi is installed, I can automatically instrument my application with Vampir (ie I don't have to install vtf separately - right?) And I can view with Vampir Trace the results of my app's parallel run? -Original Message- From: George Bosilca [mailto:bosi...@eecs.utk.edu] Sent: Thursday, April 10, 2008 8:31 AM To: Open MPI Users Cc: Swati Kher Subject: Re: [OMPI users] configuring with --enable-mpi-profile option I think you're expect something that the MPI profiling interface is not supposed to provide you. There is no tool to dump or print any profile information by default (and it is not mandated by the standard). What this option does, is compile the profiling interface (as defined by the MPI standard) allowing external tools to gather information about the MPI application. But you need an extra tool. george. On Apr 10, 2008, at 10:41 AM, Swati Kher wrote: > Hi, > > If I configure openmpi with "-enable-mpi-profile" option: > > 1) Once build is complete, how do I specify profile name and > location in the "mpirun" command? Do I have to set any flags with > the "mpirun" command to view profile? > 2) If vampire trace by default is built with openmpi, if I set > VT_CC flag for compiling my application, where I can view ".vtf" > files after a parallel run ? > > Thanks in advance > > -- > Swati Kher > Application Performance Optimization Engineer > Mellanox Technologies > Work: 408-916-0037 x337 > sw...@mellanox.com > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] submitted job stops
Thanks Reuti. It works now. I just disabled the firewall on all machines, since Open MPI uses a random port each time.

Thanks again!

Danesh

Reuti wrote:
> Hi,
>
> [...]
>
> do you have any firewall on the machines, blocking certain ports?
>
> -- Reuti
[OMPI users] Troubles with MPI-IO Test and Torque/PVFS
Hi all, I have a Cluster with Torque and PVFS. I'm trying to test my environment with MPI-IO Test but some segfault are occurring. Does anyone know what is happening ? The error output is below: Rank 1 Host campogrande03.dcc.ufrj.br WARNING ERROR 1207853304: 1 bad bytes at file offset 0. Expected (null), received (null) Rank 2 Host campogrande02.dcc.ufrj.br WARNING ERROR 1207853304: 1 bad bytes at file offset 0. Expected (null), received (null) [campogrande01:10646] *** Process received signal *** Rank 0 Host campogrande04.dcc.ufrj.br WARNING ERROR 1207853304: 1 bad bytes at file offset 0. Expected (null), received (null) Rank 0 Host campogrande04.dcc.ufrj.br WARNING ERROR 1207853304: 65537 bad bytes at file offset 0. Expected (null), received (null) [campogrande04:05192] *** Process received signal *** [campogrande04:05192] Signal: Segmentation fault (11) [campogrande04:05192] Signal code: Address not mapped (1) [campogrande04:05192] Failing at address: 0x1 Rank 1 Host campogrande03.dcc.ufrj.br WARNING ERROR 1207853304: 65537 bad bytes at file offset 0. Expected (null), received (null) [campogrande03:05377] *** Process received signal *** [campogrande03:05377] Signal: Segmentation fault (11) [campogrande03:05377] Signal code: Address not mapped (1) [campogrande03:05377] Failing at address: 0x1 [campogrande03:05377] [ 0] [0xe440] [campogrande03:05377] [ 1] /lib/tls/i686/cmov/libc.so.6(vsnprintf+0xb4) [0xb7d5fef4] [campogrande03:05377] [ 2] mpiIO_test(make_error_messages+0xcf) [0x80502e4] [campogrande03:05377] [ 3] mpiIO_test(warning_msg+0x8c) [0x8050569] [campogrande03:05377] [ 4] mpiIO_test(report_errs+0xe2) [0x804d413] [campogrande03:05377] [ 5] mpiIO_test(read_write_file+0x594) [0x804d9c2] [campogrande03:05377] [ 6] mpiIO_test(main+0x1d0) [0x804aa14] [campogrande03:05377] [ 7] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe0) [0xb7d15050] [campogrande03:05377] [ 8] mpiIO_test [0x804a7e1] [campogrande03:05377] *** End of error message *** Rank 2 Host campogrande02.dcc.ufrj.br WARNING ERROR 1207853304: 65537 bad bytes at file offset 0. 
Expected (null), received (null) [campogrande02:05187] *** Process received signal *** [campogrande02:05187] Signal: Segmentation fault (11) [campogrande02:05187] Signal code: Address not mapped (1) [campogrande02:05187] Failing at address: 0x1 [campogrande01:10646] Signal: Segmentation fault (11) [campogrande01:10646] Signal code: Address not mapped (1) [campogrande01:10646] Failing at address: 0x1a [campogrande02:05187] [ 0] [0xe440] [campogrande02:05187] [ 1] /lib/tls/i686/cmov/libc.so.6(vsnprintf+0xb4) [0xb7d5fef4] [campogrande02:05187] [ 2] mpiIO_test(make_error_messages+0xcf) [0x80502e4] [campogrande02:05187] [ 3] mpiIO_test(warning_msg+0x8c) [0x8050569] [campogrande02:05187] [ 4] mpiIO_test(report_errs+0xe2) [0x804d413] [campogrande02:05187] [ 5] mpiIO_test(read_write_file+0x594) [0x804d9c2] [campogrande02:05187] [ 6] mpiIO_test(main+0x1d0) [0x804aa14] [campogrande02:05187] [ 7] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe0) [0xb7d15050] [campogrande02:05187] [ 8] mpiIO_test [0x804a7e1] [campogrande02:05187] *** End of error message *** [campogrande04:05192] [ 0] [0xe440] [campogrande04:05192] [ 1] /lib/tls/i686/cmov/libc.so.6(vsnprintf+0xb4) [0xb7d5fef4] [campogrande04:05192] [ 2] mpiIO_test(make_error_messages+0xcf) [0x80502e4] [campogrande04:05192] [ 3] mpiIO_test(warning_msg+0x8c) [0x8050569] [campogrande04:05192] [ 4] mpiIO_test(report_errs+0xe2) [0x804d413] [campogrande04:05192] [ 5] mpiIO_test(read_write_file+0x594) [0x804d9c2] [campogrande04:05192] [ 6] mpiIO_test(main+0x1d0) [0x804aa14] [campogrande04:05192] [ 7] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe0) [0xb7d15050] [campogrande04:05192] [ 8] mpiIO_test [0x804a7e1] [campogrande04:05192] *** End of error message *** [campogrande01:10646] [ 0] [0xe440] [campogrande01:10646] [ 1] /lib/tls/i686/cmov/libc.so.6(vsnprintf+0xb4) [0xb7d5fef4] [campogrande01:10646] [ 2] mpiIO_test(make_error_messages+0xcf) [0x80502e4] [campogrande01:10646] [ 3] mpiIO_test(warning_msg+0x8c) [0x8050569] [campogrande01:10646] [ 4] mpiIO_test(report_errs+0xe2) [0x804d413] [campogrande01:10646] [ 5] mpiIO_test(read_write_file+0x594) [0x804d9c2] [campogrande01:10646] [ 6] mpiIO_test(main+0x1d0) [0x804aa14] [campogrande01:10646] [ 7] /lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe0) [0xb7d15050] [campogrande01:10646] [ 8] mpiIO_test [0x804a7e1] [campogrande01:10646] *** End of error message *** mpiexec noticed that job rank 0 with PID 5192 on node campogrande04 exited on signal 11 (Segmentation fault). -- Davi Vercillo Carneiro Garcia Universidade Federal do Rio de Janeiro Departamento de Ciência da Computação DCC-IM/UFRJ - http://www.dcc.ufrj.br "Good things come to those who... wait." - Debian Project "A computer is like air conditioning: it becomes useless when you open windows." - Linus Torvalds "Há duas coisas infinitas, o universo e a burrice humana. E eu estou em dú
[OMPI users] i386 with x64
Thanks to those who answered my post in the past. I have to admit that you lost me about half way through the thread.

I was able to get 2 of my systems cranked up and was about to put a third system online when I remembered it was running the x64 version of the OS. Can I just recompile the code on the x64 system and put it in the same home directory used by all the systems? I'm not sharing the directory across systems, but after doing this three or four times across just 2 systems, I can see why sharing would be advantageous.
Re: [OMPI users] i386 with x64
Open MPI can manage heterogeneous systems, though you may prefer to avoid this because it has a performance penalty. I suggest you compile on the 32-bit machine and use the same version everywhere.

Aurelien

On 10 Apr 2008, at 18:09, clark...@clarktx.com wrote:

> Thanks to those who answered my post in the past. I have to admit that you lost me about half way through the thread.
>
> I was able to get 2 of my systems cranked up and was about to put a third system online when I remembered it was running the x64 version of the OS. Can I just recompile the code on the x64 system and put it in the same home directory used by all the systems? I'm not sharing the directory across systems, but after doing this three or four times across just 2 systems, I can see why sharing would be advantageous.
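When mixing i386 and x64 hosts it can help to confirm what each rank is actually running before chasing anything else; a small plain-MPI check like the one below prints the word size seen by every process (nothing Open MPI specific, just an illustration):

    #include <stdio.h>
    #include <mpi.h>

    /* Report each rank's pointer and long sizes so 32- vs 64-bit ranks are obvious. */
    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d: sizeof(void*)=%d, sizeof(long)=%d\n",
               rank, (int)sizeof(void *), (int)sizeof(long));
        MPI_Finalize();
        return 0;
    }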
Re: [OMPI users] i386 with x64
Thanks for the information. I'll try it out. >Open MPI can manage heterogeneous system. Though you prefer to avoid >this because it has a performance penalty. I suggest you compile on >the 32bit machine and use the same version everywhere. Aurelien Le 10 avr. 08 à 18:09, clarkmpi_at_[hidden] a écrit : >> Thanks to those who answered my post in the past. I have to admit >> that you lost me about half way through the thread. >> >> I was able to get 2 of my systems cranked up and was about to put a >> third system online when I remembered it was running x64 version of >> OS. >> Can I just recompile the code on the x64 system and put it in the >> same home directory used by all the systems? I'm not sharing the >> directory across systems, but after doing this three or four times >> across just 2 systems, I can see why sharing would be advantages. >> > ___ > users mailing list > users_at_[hidden] > http://www.open-mpi.org/mailman/listinfo.cgi/users