[OMPI users] mpirun example program fails on multiple nodes - unable to launch specified application on client node
Dear Sir/Madam,

I'm having problems running the example program. Please kindly help --- I've been fooling with it for days and am getting a bit lost.

MPIRUN fails on the example hello program - unable to launch the specified application on the client node.

1) I'm trying to run OpenMPI with the following setup: 1 PC (as master node) and 1 notebook (as client node) connected to an Ethernet router through Ethernet cable. Both running Ubuntu 8.10. There are no other connections. --- Is this setup OK for running OpenMPI?

2) Prerequisites: SSH has been set up so that the master node can access the client node through passwordless ssh. I do notice that it takes 10~15 seconds between entering the 'ssh' command and getting onto the client node. --- Could this be too slow for OpenMPI to run properly? I do not have programs like a network file system, network time protocol, resource management, scheduler, etc. installed. --- Does OpenMPI need any prerequisites other than passwordless ssh?

3) OpenMPI is installed on both nodes - downloaded from open-mpi.org, configured and built (configure / make all) with default settings.

4) PATH and LD_LIBRARY_PATH: On both nodes, PATH is /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games, which is the default setting in Ubuntu. LD_LIBRARY_PATH is set in ~/.bashrc - I added one line at the end of the file: 'export LD_LIBRARY_PATH=usr/local/lib:usr/lib'. So when I echo them on both nodes, I get:

>echo $PATH
>/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
>echo $LD_LIBRARY_PATH
>usr/local/lib:usr/lib

But if I do >ssh 'echo $LD_LIBRARY_PATH', nothing comes back, while >ssh 'echo $PATH' comes back with the right path. Is that a problem?

5) Problem: I compiled the example hello_c using >mpicc hello_c.c -o hello_c.out and ran it on both nodes locally; everything works fine. But when I tried to run it on 2 nodes (-np 2) with >mpirun -machinefile machine.linux -np 2 $(pwd)/hello_c.out, I got the following error:

gordon@gordon-desktop:~/Desktop/openmpi-1.3.3/examples$ mpirun --machinefile machine.linux -np 2 $(pwd)/hello_c.out
--
mpirun was unable to launch the specified application as it could not access or execute an executable:

Executable: /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out
Node: 192.168.0.194

while attempting to start process rank 1.
--

Sometimes I get one other error message after that:
--
[gordon-desktop:30748] [[25975,0],0]-[[25975,1],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
--

6) Information attached:
ifconfig_masternode - output of ifconfig on the master node
ifconfig_slavenode - output of ifconfig on the slave node
ompi_info.txt - output of ompi_info -all
config.log - OpenMPI configure logfile
machine.linux - the machinefile used in the mpirun command

Sincerely,
Qing Pang
(601) 979 0270

Attachment: mpirun_info.tar.gz (application/gzip)
Re: [OMPI users] mpirun example program fails on multiple nodes - unable to launch specified application on client node
Thank you Jeff! That solves the problem. :-) You are a lifesaver! So does that mean I always need to copy my application to all the nodes? Or should I give the pathname of my executable in a different way to avoid this? Do I need a network file system for that?

Jeff Squyres wrote:

The short version of the answer is to check that the executable is in the same location on both nodes (apparently: /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out). Open MPI is complaining that it can't find that specific executable on the .194 node. See below for more detail.

On Nov 5, 2009, at 3:19 PM, qing pang wrote:

> 1) I'm trying to run OpenMPI with the following setup: 1 PC (as master node) and 1 notebook (as client node) connected to an Ethernet router through Ethernet cable. Both running Ubuntu 8.10. There are no other connections. --- Is this setup OK for running OpenMPI?

Yes.

> 2) Prerequisites: SSH has been set up so that the master node can access the client node through passwordless ssh. I do notice that it takes 10~15 seconds between entering the 'ssh' command and getting onto the client node. --- Could this be too slow for OpenMPI to run properly?

Nope -- should be ok.

> I do not have programs like a network file system, network time protocol, resource management, scheduler, etc. installed. --- Does OpenMPI need any prerequisites other than passwordless ssh?

Not in this case, no.

> 3) OpenMPI is installed on both nodes - downloaded from open-mpi.org, configured and built with default settings. 4) PATH and LD_LIBRARY_PATH: On both nodes, PATH is /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games, which is the default setting in Ubuntu. LD_LIBRARY_PATH is set in ~/.bashrc - I added one line at the end of the file: 'export LD_LIBRARY_PATH=usr/local/lib:usr/lib'. [...] But if I do >ssh 'echo $LD_LIBRARY_PATH', nothing comes back, while >ssh 'echo $PATH' comes back with the right path. Is that a problem?

No.

> 5) Problem: I compiled the example hello_c using >mpicc hello_c.c -o hello_c.out and ran it on both nodes locally; everything works fine. But when I tried to run it on 2 nodes (-np 2) with >mpirun -machinefile machine.linux -np 2 $(pwd)/hello_c.out, I got the following error:
>
> mpirun was unable to launch the specified application as it could not access or execute an executable:
>
> Executable: /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out
> Node: 192.168.0.194

You are giving an absolute pathname on the mpirun command line:

mpirun -machinefile machine.linux -np 2 $(pwd)/hello_c.out

Hence, it's looking for exactly /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out on both nodes. If the executable is in a different directory on the other node, that's where you're probably running into the problem.
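A quick way to verify that a multi-node launch really works, before chasing path problems, is a hostname-reporting program. Below is a minimal sketch (using only standard MPI calls; it is not one of the shipped examples): each rank prints the host it landed on, so you can see immediately whether rank 1 really started on the client node.

/* Minimal sketch: each rank reports the host it is running on.
 * Uses only standard MPI calls; compile with mpicc and place the
 * binary at the same path on every node, like hello_c.out. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);   /* hostname of this rank's node */

    printf("rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

If both hostnames print, launching itself works, and any remaining failure is about where the executable lives on each node.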
[OMPI users] Problem with mpirun -preload-binary option
I'm having problems getting the mpirun "--preload-binary" option to work.

I'm using Ubuntu 8.10 with OpenMPI 1.3.3, nodes connected with Ethernet cable. If I copy the executable to the client node using scp, then do mpirun, everything works. But I really want to avoid the copying, so I tried the --preload-binary option.

When I typed the command on my master node as below (gordon-desktop is my master node, and gordon-laptop is the client node):

gordon@gordon-desktop:~/Desktop/openmpi-1.3.3/examples$ mpirun -machinefile machine.linux -np 2 --preload-binary $(pwd)/hello_c.out

I got the following:

gordon@gordon-desktop's password: (I entered my password here - why am I asked for the password? I am working under this account anyway)

WARNING: Remote peer ([[18118,0],1]) failed to preload a file.

Exit Status: 256
Local File: /tmp/openmpi-sessions-gordon@gordon-laptop_0/18118/0/hello_c.out
Remote File: /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out
Command:
scp gordon-desktop:/home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out /tmp/openmpi-sessions-gordon@gordon-laptop_0/18118/0/hello_c.out

Will continue attempting to launch the process(es).
--
mpirun was unable to launch the specified application as it could not access or execute an executable:

Executable: /home/gordon/Desktop/openmpi-1.3.3/examples/hello_c.out
Node: node1

while attempting to start process rank 1.
--

Has anyone succeeded with the '--preload-binary' option with similar settings? I assume this mpirun option should work when compiling OpenMPI with default options? Anything I need to set?

--qing
Re: [OMPI users] Problem with mpirun -preload-binary option
Thank you very much for your help! I believe I do have password-less ssh set up, at least from the master node to the client node (desktop -> laptop in my case). If I type >ssh node1 on my desktop terminal, I am able to get to the laptop node without being asked for a password. And as I mentioned, if I copy the example executable from the desktop to the laptop node using scp, then I am able to run it from the desktop using both nodes.

Back to the preload-binary problem - I am asked for the password of my master node - the node I am working on - not the remote client node. Do you mean that I should set up password-less ssh in both directions? Does the client node need to access the master node through password-less ssh to make the --preload-binary option work?

Ralph Castain wrote:

It -should- work, but you need password-less ssh set up. See our FAQ for how to do that, if you are unfamiliar with it.

On Nov 10, 2009, at 2:02 PM, Qing Pang wrote:

> I'm having problems getting the mpirun "--preload-binary" option to work. [...]
Re: [OMPI users] Problem with mpirun -preload-binary option
Now that I have passwordless ssh set up in both directions, and verified working - I still have the same problem. I'm able to run ssh/scp on both master and client nodes (at this point, they are pretty much the same) without being asked for a password. And mpirun works fine if I have the executable put in the same directory on both nodes. But when I try the --preload-binary option, I still have the same problem - it asks me for the password of the node running mpirun, and then tells me that scp failed.

Josh wrote:

Though the --preload-binary option was created while building the checkpoint/restart functionality, it does not depend on the checkpoint/restart function in any way (just a side effect of the initial development). The problem you are seeing is a result of the computing environment setup of password-less ssh. The --preload-binary command uses 'scp' (at the moment) to copy the files from the node running mpirun to the compute nodes. The compute nodes are the ones that call 'scp', so you will need to set up password-less ssh in both directions.

-- Josh

On Nov 11, 2009, at 8:38 AM, Ralph Castain wrote:

I'm no expert on the preload-binary option - but I would suspect that is the case given your observations. That option was created to support checkpoint/restart, not for what you are attempting to do. Like I said, you -should- be able to use it for that purpose, but I expect you may hit a few quirks like this along the way.

On Nov 11, 2009, at 9:16 AM, Qing Pang wrote:

> Thank you very much for your help! I believe I do have password-less ssh set up, at least from the master node to the client node (desktop -> laptop in my case). [...]
> Back to the preload-binary problem - I am asked for the password of my master node - the node I am working on - not the remote client node. Do you mean that I should set up password-less ssh in both directions? Does the client node need to access the master node through password-less ssh to make the --preload-binary option work?
[OMPI users] benchmark - mpi_reduce() called only once but takes a long time - proportional to calculation time
Dear users,

I'm running the popular Calculate-PI program on a 2-node setup running Ubuntu 8.10 and OpenMPI 1.3.3 (with default settings). Password-less ssh is set up, but there is no cluster-management software such as a network file system, network time protocol, resource manager, scheduler, etc. The two nodes are connected through TCP/IP only.

When I tried to benchmark the program, it shows that the time spent in MPI_Reduce() is proportional to the number of intervals (n) used in the calculation. For example, when n = 1,000,000, MPI_Reduce costs 15.65 milliseconds; when n = 1,000,000,000, MPI_Reduce costs 15526 milliseconds.

This confused me - in this Calc-PI program, MPI_Reduce is used only once. No matter what number of intervals is used, MPI_Reduce is called just once, after both nodes have their results, to merge them. So the time cost by MPI_Reduce (although it might be slow over a TCP/IP connection) should be roughly constant. But obviously that's not what I saw. Has anyone had a similar problem before? I'm not sure how MPI_Reduce() works internally. Does the fact that I don't have a network file system, network time protocol, resource manager, scheduler, etc. installed matter?

Below is the program - I did feed "n" to it more than once to warm it up.

#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int numprocs, myid, rc;
    double ACCUPI = 3.1415926535897932384626433832795;
    double mypi, pi, h, sum, x;
    int n, i;
    double starttime, endtime;
    double time, told, bcasttime, reducetime, comptime, totaltime;

    rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS) {
        printf("Error starting MPI program. Terminating.\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
    }
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    while (1) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) \n");
            scanf("%d", &n);
            starttime = MPI_Wtime();
        }
        time = MPI_Wtime();
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        told = time;
        time = MPI_Wtime();
        bcasttime = time - told;
        if (n == 0)
            break;
        else {
            h = 1.0 / (double)n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                sum += (4.0 / (1.0 + x * x));
            }
            mypi = sum * h;
            told = time;
            time = MPI_Wtime();
            comptime = time - told;
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            told = time;
            time = MPI_Wtime();
            reducetime = time - told;
            if (myid == 0) {
                totaltime = MPI_Wtime() - starttime;
                printf("\nElapsed time (total): %f milliseconds\n", totaltime * 1000);
                printf("Elapsed time (Bcast): %f milliseconds (%5.2f%%)\n", bcasttime * 1000, bcasttime * 100 / totaltime);
                printf("Elapsed time (Reduce): %f milliseconds (%5.2f%%)\n", reducetime * 1000, reducetime * 100 / totaltime);
                printf("Elapsed time (Comput): %f milliseconds (%5.2f%%)\n", comptime * 1000, comptime * 100 / totaltime);
                printf("\nApproximated pi is %.16f, Error is %.4e\n", pi, fabs(pi - ACCUPI));
            }
        }
    }
    MPI_Finalize();
    return 0;
}
Re: [OMPI users] benchmark - mpi_reduce() called only once but takes a long time - proportional to calculation time
Thank you so much! It is a synchronization issue. In my case, one node actually runs slower than the other. Adding MPI_Barrier() helped to straighten things out. Thank you for your help!

Eugene Loh wrote:

Your processes are probably running asynchronously. You could perhaps try tracing program execution and look at the timeline. E.g., http://www.open-mpi.org/faq/?category=perftools#free-tools . Or, where you have MPI_Wtime calls, just capture those timestamps on each process and dump the results at the end of your run. Or, report timings for all ranks instead of just for rank 0.

Put another way, rank 0 must broadcast n. So, no one starts computation until they get the Bcast result. Rank 0 probably starts its computations before anyone else does. So, it gets to the Reduce before anyone else does, but it can't exit until the other ranks have finished their computations. So, the Reduce time on rank 0 includes some amount of the other ranks' compute times.

Yet another approach is to insert MPI_Barrier calls at each phase of the program so that the various phases are synchronized. This adds some overhead to the program, but helps simplify interpretation of the timing results.

Qing Pang wrote:

> I'm running the popular Calculate-PI program on a 2-node setup running Ubuntu 8.10 and OpenMPI 1.3.3 (with default settings). [...] When I tried to benchmark the program, it shows that the time spent in MPI_Reduce() is proportional to the number of intervals (n) used in the calculation. [...] Below is the program - I did feed "n" to it more than once to warm it up. [...]