Re: [OMPI users] mpi problems/many cpus per node
I will give this a try, but wouldn't that be an issue as well if the process was run on the head node or another node? So long as the mpi job is not started on either of these two nodes, it works fine. Dan On 12/14/2012 11:46 PM, Ralph Castain wrote: It must be making contact or ORTE wouldn't be attempting to launch your application's procs. Looks more like it never received the launch command. Looking at the code, I suspect you're getting caught in a race condition that causes the message to get "stuck". Just to see if that's the case, you might try running this with the 1.7 release candidate, or even the developer's nightly build. Both use a different timing mechanism intended to resolve such situations. On Dec 14, 2012, at 2:49 PM, Daniel Davidson wrote: Thank you for the help so far. Here is the information that the debugging gives me. Looks like the daemon on on the non-local node never makes contact. If I step NP back two though, it does. Dan [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -v -np 34 --leave-session-attached -mca odls_base_verbose 5 hostname [compute-2-1.local:44855] mca:base:select:( odls) Querying component [default] [compute-2-1.local:44855] mca:base:select:( odls) Query of component [default] set priority to 1 [compute-2-1.local:44855] mca:base:select:( odls) Selected component [default] [compute-2-0.local:29282] mca:base:select:( odls) Querying component [default] [compute-2-0.local:29282] mca:base:select:( odls) Query of component [default] set priority to 1 [compute-2-0.local:29282] mca:base:select:( odls) Selected component [default] [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info updating nidmap [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking data to launch job [49524,1] [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list adding new jobdat for job [49524,1] [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking 1 app_contexts [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],0] on daemon 1 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],1] on daemon 0 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],1] for me! [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local list [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],2] on daemon 1 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],3] on daemon 0 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],3] for me! [compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local list [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],4] on daemon 1 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],5] on daemon 0 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],5] for me! 
[compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local list [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],6] on daemon 1 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],7] on daemon 0 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],7] for me! [compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my local list [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],8] on daemon 1 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],9] on daemon 0 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],9] for me! [compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my local list [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],10] on daemon 1 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],11] on daemon 0 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],11] for me! [compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my local list [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],12] on daemon 1 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],13] on daemon 0 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],13] for me! [compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my local list [compute-2-1.local:44855] [[49524,0],0] odls:constructing child
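For reference, a minimal sketch of trying the 1.7 release candidate as Ralph suggests, assuming a plain tarball build; the install prefix matches the path used later in this thread, and any other configure options are assumptions (reuse whatever the 1.6.3 build used):

  # Build the 1.7 release candidate into its own prefix so it can be tested
  # alongside the existing 1.6.3 install.
  tar xjf openmpi-1.7rc5.tar.bz2
  cd openmpi-1.7rc5
  ./configure --prefix=/home/apps/openmpi-1.7rc5
  make -j 8 all
  make install

  # Re-run the failing case with the new build:
  /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -np 34 hostname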
Re: [OMPI users] EXTERNAL: Re: Problems with shared libraries while launching jobs
Ralph, Unfortunately I didn't see the ssh output. The output I got was pretty much as before. You know, the fact that the error message is not prefixed with a host name makes me think it could be happening on the host where the job is placed by PBS. If there is something wrong in the user environment prior to mpirun, that is not an OpenMPI problem. And yet, in one of the jobs that failed, I have also printed out the results of 'ldd' on the mpirun executable just prior to executing the command, and all the shared libraries were resolved: ldd /release/cfd/openmpi-intel/bin/mpirun linux-vdso.so.1 => (0x7fffbbb39000) libopen-rte.so.0 => /release/cfd/openmpi-intel/lib/libopen-rte.so.0 (0x2abdf75d2000) libopen-pal.so.0 => /release/cfd/openmpi-intel/lib/libopen-pal.so.0 (0x2abdf7887000) libdl.so.2 => /lib64/libdl.so.2 (0x2abdf7b39000) libnsl.so.1 => /lib64/libnsl.so.1 (0x2abdf7d3d000) libutil.so.1 => /lib64/libutil.so.1 (0x2abdf7f56000) libm.so.6 => /lib64/libm.so.6 (0x2abdf8159000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2abdf83af000) libpthread.so.0 => /lib64/libpthread.so.0 (0x2abdf85c7000) libc.so.6 => /lib64/libc.so.6 (0x2abdf87e4000) libimf.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libimf.so (0x2abdf8b42000) libsvml.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libsvml.so (0x2abdf8ed7000) libintlc.so.5 => /appserv/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 (0x2abdf90ed000) /lib64/ld-linux-x86-64.so.2 (0x2abdf73b1000) Hence my initial assumption that the shared-library problem was happening with one of the child processes on a remote node. So at this point I have more questions than answers. I still don't know if this message comes from the main mpirun process or one of the child processes, although it seems that it should not be the main process because of the output of ldd above. Any more suggestions are welcomed of course. Thanks /release/cfd/openmpi-intel/bin/mpirun --machinefile /var/spool/PBS/aux/20804.maruhpc4-mgt -np 160 -x LD_LIBRARY_PATH -x MPI_ENVIRONMENT=1 --mca plm_base_verbose 5 --leave-session-attached /tmp/fv420804.maruhpc4-mgt/test_jsgl -v -cycles 1 -ri restart.5000 -ro /tmp/fv420804.maruhpc4-mgt/restart.5000 [c6n38:16219] mca:base:select:( plm) Querying component [rsh] [c6n38:16219] mca:base:select:( plm) Query of component [rsh] set priority to 10 [c6n38:16219] mca:base:select:( plm) Selected component [rsh] Warning: Permanently added 'c6n39' (RSA) to the list of known hosts.^M Warning: Permanently added 'c6n40' (RSA) to the list of known hosts.^M Warning: Permanently added 'c6n41' (RSA) to the list of known hosts.^M Warning: Permanently added 'c6n42' (RSA) to the list of known hosts.^M Warning: Permanently added 'c5n26' (RSA) to the list of known hosts.^M Warning: Permanently added 'c3n20' (RSA) to the list of known hosts.^M Warning: Permanently added 'c4n10' (RSA) to the list of known hosts.^M Warning: Permanently added 'c4n40' (RSA) to the list of known hosts.^M /release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory -- A daemon (pid 16227) died unexpectedly with status 127 while attempting to launch so we are aborting. There may be more information reported by the environment (see above). This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes. 
-- -- mpirun noticed that the job aborted, but has no info as to the process that caused that situation. -- Warning: Permanently added 'c3n27' (RSA) to the list of known hosts.^M -- mpirun was unable to cleanly terminate the daemons on the nodes shown below. Additional manual cleanup may be required - please refer to the "orte-clean" tool for assistance. -- c6n39 - daemon did not report back when launched c6n40 - daemon did not report back when launched c6n41 - daemon did not report back when launched c6n42 - daemon did not report back when launched From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Friday, December 14, 2012 2:25 PM To: Open MPI Users Subject: EXTERNAL: Re: [OMPI users] Problems with shared libraries whil
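Note that "-x LD_LIBRARY_PATH" only exports the variable to the application processes; orted itself inherits whatever the non-interactive ssh shell provides (plus the Open MPI paths the rsh launcher prepends). One common remedy, shown here only as a sketch using the Intel 11.1 path from the ldd output above (shell and file names are assumptions), is to make libimf.so resolvable on every compute node:

  # Make the Intel runtime visible to non-interactive ssh shells so that orted
  # itself can resolve libimf.so (path copied from the ldd output above).
  echo 'export LD_LIBRARY_PATH=/appserv/intel/Compiler/11.1/072/lib/intel64:$LD_LIBRARY_PATH' >> ~/.bashrc

  # Verify from the node running mpirun that a remote non-interactive shell
  # now resolves every library orted needs (no output means nothing is missing):
  ssh c6n39 'ldd /release/cfd/openmpi-intel/bin/orted | grep "not found"'

Another option, not shown in this thread, is to bake the Intel runtime path into the Open MPI build at configure time via rpath so that no per-node environment setup is needed.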
Re: [OMPI users] EXTERNAL: Re: Problems with shared libraries while launching jobs
On Dec 17, 2012, at 7:42 AM, "Blosch, Edwin L" wrote: > Ralph, > > Unfortunately I didn’t see the ssh output. The output I got was pretty much > as before. Sorry - I forgot that you built from a tarball, and so the debug is "off" by default. You need to reconfigure with --enable-debug to get debug output. > > You know, the fact that the error message is not prefixed with a host name > makes me think it could be happening on the host where the job is placed by > PBS. If there is something wrong in the user environment prior to mpirun, > that is not an OpenMPI problem. And yet, in one of the jobs that failed, I > have also printed out the results of ‘ldd’ on the mpirun executable just > prior to executing the command, and all the shared libraries were resolved: No, it isn't mpirun as the error message specifically says it is "orted" that is failing to resolve the library. Try running with the debug enabled and the same command line and let's see what the actual ssh command looks like - my guess is that the path to that intel library is missing. > > ldd /release/cfd/openmpi-intel/bin/mpirun > linux-vdso.so.1 => (0x7fffbbb39000) > libopen-rte.so.0 => /release/cfd/openmpi-intel/lib/libopen-rte.so.0 > (0x2abdf75d2000) > libopen-pal.so.0 => /release/cfd/openmpi-intel/lib/libopen-pal.so.0 > (0x2abdf7887000) > libdl.so.2 => /lib64/libdl.so.2 (0x2abdf7b39000) > libnsl.so.1 => /lib64/libnsl.so.1 (0x2abdf7d3d000) > libutil.so.1 => /lib64/libutil.so.1 (0x2abdf7f56000) > libm.so.6 => /lib64/libm.so.6 (0x2abdf8159000) > libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2abdf83af000) > libpthread.so.0 => /lib64/libpthread.so.0 (0x2abdf85c7000) > libc.so.6 => /lib64/libc.so.6 (0x2abdf87e4000) > libimf.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libimf.so > (0x2abdf8b42000) > libsvml.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libsvml.so > (0x2abdf8ed7000) > libintlc.so.5 => > /appserv/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 > (0x2abdf90ed000) > /lib64/ld-linux-x86-64.so.2 (0x2abdf73b1000) > > Hence my initial assumption that the shared-library problem was happening > with one of the child processes on a remote node. > > So at this point I have more questions than answers. I still don’t know if > this message comes from the main mpirun process or one of the child > processes, although it seems that it should not be the main process because > of the output of ldd above. > > Any more suggestions are welcomed of course. 
> > Thanks > > > /release/cfd/openmpi-intel/bin/mpirun --machinefile > /var/spool/PBS/aux/20804.maruhpc4-mgt -np 160 -x LD_LIBRARY_PATH -x > MPI_ENVIRONMENT=1 --mca plm_base_verbose 5 --leave-session-attached > /tmp/fv420804.maruhpc4-mgt/test_jsgl -v -cycles 1 -ri restart.5000 -ro > /tmp/fv420804.maruhpc4-mgt/restart.5000 > > [c6n38:16219] mca:base:select:( plm) Querying component [rsh] > [c6n38:16219] mca:base:select:( plm) Query of component [rsh] set priority > to 10 > [c6n38:16219] mca:base:select:( plm) Selected component [rsh] > Warning: Permanently added 'c6n39' (RSA) to the list of known hosts.^M > Warning: Permanently added 'c6n40' (RSA) to the list of known hosts.^M > Warning: Permanently added 'c6n41' (RSA) to the list of known hosts.^M > Warning: Permanently added 'c6n42' (RSA) to the list of known hosts.^M > Warning: Permanently added 'c5n26' (RSA) to the list of known hosts.^M > Warning: Permanently added 'c3n20' (RSA) to the list of known hosts.^M > Warning: Permanently added 'c4n10' (RSA) to the list of known hosts.^M > Warning: Permanently added 'c4n40' (RSA) to the list of known hosts.^M > /release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: > libimf.so: cannot open shared object file: No such file or directory > -- > A daemon (pid 16227) died unexpectedly with status 127 while attempting > to launch so we are aborting. > > There may be more information reported by the environment (see above). > > This may be because the daemon was unable to find all the needed shared > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the > location of the shared libraries on the remote nodes and this will > automatically be forwarded to the remote nodes. > -- > -- > mpirun noticed that the job aborted, but has no info as to the process > that caused that situation. > -- > Warning: Permanently added 'c3n27' (RSA) to the list of known hosts.^M > -- > mpirun was unable
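A sketch of the rebuild Ralph describes, assuming a tarball build into the existing prefix; compiler settings and any other configure options are assumptions and should match the original build:

  # From the Open MPI source tree: reconfigure with debug support so that
  # "--mca plm_base_verbose 5" prints the full ssh command used to launch orted.
  # Add whatever compiler and configure options the original build used.
  ./configure --prefix=/release/cfd/openmpi-intel --enable-debug
  make -j 8 all
  make install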
Re: [OMPI users] mpi problems/many cpus per node
This looks to be having issues as well, and I cannot get any number of processors to give me a different result with the new version. [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v -np 50 --leave-session-attached -mca odls_base_verbose 5 hostname [compute-2-1.local:69417] mca:base:select:( odls) Querying component [default] [compute-2-1.local:69417] mca:base:select:( odls) Query of component [default] set priority to 1 [compute-2-1.local:69417] mca:base:select:( odls) Selected component [default] [compute-2-0.local:24486] mca:base:select:( odls) Querying component [default] [compute-2-0.local:24486] mca:base:select:( odls) Query of component [default] set priority to 1 [compute-2-0.local:24486] mca:base:select:( odls) Selected component [default] [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on WILDCARD [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on WILDCARD [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on WILDCARD [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on WILDCARD [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on WILDCARD However from the head node: [root@biocluster openmpi-1.7rc5]# /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v -np 50 hostname Displays 25 hostnames from each system. Thank you again for the help so far, Dan On 12/17/2012 08:31 AM, Daniel Davidson wrote: I will give this a try, but wouldn't that be an issue as well if the process was run on the head node or another node? So long as the mpi job is not started on either of these two nodes, it works fine. Dan On 12/14/2012 11:46 PM, Ralph Castain wrote: It must be making contact or ORTE wouldn't be attempting to launch your application's procs. Looks more like it never received the launch command. Looking at the code, I suspect you're getting caught in a race condition that causes the message to get "stuck". Just to see if that's the case, you might try running this with the 1.7 release candidate, or even the developer's nightly build. Both use a different timing mechanism intended to resolve such situations. On Dec 14, 2012, at 2:49 PM, Daniel Davidson wrote: Thank you for the help so far. Here is the information that the debugging gives me. Looks like the daemon on on the non-local node never makes contact. If I step NP back two though, it does. 
Dan [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -v -np 34 --leave-session-attached -mca odls_base_verbose 5 hostname [compute-2-1.local:44855] mca:base:select:( odls) Querying component [default] [compute-2-1.local:44855] mca:base:select:( odls) Query of component [default] set priority to 1 [compute-2-1.local:44855] mca:base:select:( odls) Selected component [default] [compute-2-0.local:29282] mca:base:select:( odls) Querying component [default] [compute-2-0.local:29282] mca:base:select:( odls) Query of component [default] set priority to 1 [compute-2-0.local:29282] mca:base:select:( odls) Selected component [default] [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info updating nidmap [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking data to launch job [49524,1] [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list adding new jobdat for job [49524,1] [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking 1 app_contexts [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],0] on daemon 1 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],1] on daemon 0 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],1] for me! [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local list [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],2] on daemon 1 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],3] on daemon 0 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],3] for me! [compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local list [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],4] on daemon 1 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],5] on daemon 0 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],5] for me! [compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local list [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],6] on daemon 1 [compute-2-1.local:44855] [[49524,0],0] o
Re: [OMPI users] mpi problems/many cpus per node
?? That was all the output? If so, then something is indeed quite wrong as it didn't even attempt to launch the job. Try adding -mca plm_base_verbose 5 to the cmd line. I was assuming you were using ssh as the launcher, but I wonder if you are in some managed environment? If so, then it could be that launch from a backend node isn't allowed (e.g., on gridengine). On Dec 17, 2012, at 8:28 AM, Daniel Davidson wrote: > This looks to be having issues as well, and I cannot get any number of > processors to give me a different result with the new version. > > [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host > compute-2-0,compute-2-1 -v -np 50 --leave-session-attached -mca > odls_base_verbose 5 hostname > [compute-2-1.local:69417] mca:base:select:( odls) Querying component [default] > [compute-2-1.local:69417] mca:base:select:( odls) Query of component > [default] set priority to 1 > [compute-2-1.local:69417] mca:base:select:( odls) Selected component [default] > [compute-2-0.local:24486] mca:base:select:( odls) Querying component [default] > [compute-2-0.local:24486] mca:base:select:( odls) Query of component > [default] set priority to 1 > [compute-2-0.local:24486] mca:base:select:( odls) Selected component [default] > [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on > WILDCARD > [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on > WILDCARD > [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on > WILDCARD > [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on > WILDCARD > [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on > WILDCARD > > However from the head node: > > [root@biocluster openmpi-1.7rc5]# /home/apps/openmpi-1.7rc5/bin/mpirun -host > compute-2-0,compute-2-1 -v -np 50 hostname > > Displays 25 hostnames from each system. > > Thank you again for the help so far, > > Dan > > > > > > > On 12/17/2012 08:31 AM, Daniel Davidson wrote: >> I will give this a try, but wouldn't that be an issue as well if the process >> was run on the head node or another node? So long as the mpi job is not >> started on either of these two nodes, it works fine. >> >> Dan >> >> On 12/14/2012 11:46 PM, Ralph Castain wrote: >>> It must be making contact or ORTE wouldn't be attempting to launch your >>> application's procs. Looks more like it never received the launch command. >>> Looking at the code, I suspect you're getting caught in a race condition >>> that causes the message to get "stuck". >>> >>> Just to see if that's the case, you might try running this with the 1.7 >>> release candidate, or even the developer's nightly build. Both use a >>> different timing mechanism intended to resolve such situations. >>> >>> >>> On Dec 14, 2012, at 2:49 PM, Daniel Davidson wrote: >>> Thank you for the help so far. Here is the information that the debugging gives me. Looks like the daemon on on the non-local node never makes contact. If I step NP back two though, it does. 
Dan [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -v -np 34 --leave-session-attached -mca odls_base_verbose 5 hostname [compute-2-1.local:44855] mca:base:select:( odls) Querying component [default] [compute-2-1.local:44855] mca:base:select:( odls) Query of component [default] set priority to 1 [compute-2-1.local:44855] mca:base:select:( odls) Selected component [default] [compute-2-0.local:29282] mca:base:select:( odls) Querying component [default] [compute-2-0.local:29282] mca:base:select:( odls) Query of component [default] set priority to 1 [compute-2-0.local:29282] mca:base:select:( odls) Selected component [default] [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info updating nidmap [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking data to launch job [49524,1] [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list adding new jobdat for job [49524,1] [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking 1 app_contexts [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],0] on daemon 1 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],1] on daemon 0 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found proc [[49524,1],1] for me! [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local list [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking proc [[49524,1],2] on daemon 1 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checki
Re: [OMPI users] mpi problems/many cpus per node
These nodes have not been locked down yet so that jobs cannot be launched from the backend, at least on purpose anyway. The added logging returns the information below: [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v -np 10 --leave-session-attached -mca odls_base_verbose 5 -mca plm_base_verbose 5 hostname [compute-2-1.local:69655] mca:base:select:( plm) Querying component [rsh] [compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL [compute-2-1.local:69655] mca:base:select:( plm) Query of component [rsh] set priority to 10 [compute-2-1.local:69655] mca:base:select:( plm) Querying component [slurm] [compute-2-1.local:69655] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module [compute-2-1.local:69655] mca:base:select:( plm) Querying component [tm] [compute-2-1.local:69655] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module [compute-2-1.local:69655] mca:base:select:( plm) Selected component [rsh] [compute-2-1.local:69655] plm:base:set_hnp_name: initial bias 69655 nodename hash 3634869988 [compute-2-1.local:69655] plm:base:set_hnp_name: final jobfam 32341 [compute-2-1.local:69655] [[32341,0],0] plm:rsh_setup on agent ssh : rsh path NULL [compute-2-1.local:69655] [[32341,0],0] plm:base:receive start comm [compute-2-1.local:69655] mca:base:select:( odls) Querying component [default] [compute-2-1.local:69655] mca:base:select:( odls) Query of component [default] set priority to 1 [compute-2-1.local:69655] mca:base:select:( odls) Selected component [default] [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_job [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm creating map [compute-2-1.local:69655] [[32341,0],0] setup:vm: working unmanaged allocation [compute-2-1.local:69655] [[32341,0],0] using dash_host [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-0 [compute-2-1.local:69655] [[32341,0],0] adding compute-2-0 to list [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-1.local [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm add new daemon [[32341,0],1] [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm assigning new daemon [[32341,0],1] to node compute-2-0 [compute-2-1.local:69655] [[32341,0],0] plm:rsh: launching vm [compute-2-1.local:69655] [[32341,0],0] plm:rsh: local shell: 0 (bash) [compute-2-1.local:69655] [[32341,0],0] plm:rsh: assuming same remote shell as local shell [compute-2-1.local:69655] [[32341,0],0] plm:rsh: remote shell: 0 (bash) [compute-2-1.local:69655] [[32341,0],0] plm:rsh: final template argv: /usr/bin/ssh PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted -mca ess env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid -mca orte_ess_num_procs 2 -mca orte_hnp_uri "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 5 -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1 [compute-2-1.local:69655] [[32341,0],0] plm:rsh:launch daemon 0 not a child of mine [compute-2-1.local:69655] [[32341,0],0] plm:rsh: adding node compute-2-0 to launch list [compute-2-1.local:69655] [[32341,0],0] plm:rsh: activating 
launch event [compute-2-1.local:69655] [[32341,0],0] plm:rsh: recording launch of daemon [[32341,0],1] [compute-2-1.local:69655] [[32341,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh compute-2-0 PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted -mca ess env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca orte_hnp_uri "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 5 -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1] Warning: untrusted X11 forwarding setup failed: xauth key data not generated Warning: No xauth data; using fake authentication data for X11 forwarding. [compute-2-0.local:24659] mca:base:select:( plm) Querying component [rsh] [compute-2-0.local:24659] [[32341,0],1] plm:rsh_lookup on agent ssh : rsh path NULL [compute-2-0.local:24659] mca:base:select:( plm) Query of component [rsh] set priority to 10 [compute-2-0.local:24659] mca:base:select:( plm) Selected component [rsh] [compute-2-0.local:24659] mca:base:select:( odls) Querying component [def
Re: [OMPI users] mpi problems/many cpus per node
Hmmm...and that is ALL the output? If so, then it never succeeded in sending a message back, which leads one to suspect some kind of firewall in the way. Looking at the ssh line, we are going to attempt to send a message from tnode 2-0 to node 2-1 on the 10.1.255.226 address. Is that going to work? Anything preventing it? On Dec 17, 2012, at 8:56 AM, Daniel Davidson wrote: > These nodes have not been locked down yet so that jobs cannot be launched > from the backend, at least on purpose anyway. The added logging returns the > information below: > > [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host > compute-2-0,compute-2-1 -v -np 10 --leave-session-attached -mca > odls_base_verbose 5 -mca plm_base_verbose 5 hostname > [compute-2-1.local:69655] mca:base:select:( plm) Querying component [rsh] > [compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : > rsh path NULL > [compute-2-1.local:69655] mca:base:select:( plm) Query of component [rsh] > set priority to 10 > [compute-2-1.local:69655] mca:base:select:( plm) Querying component [slurm] > [compute-2-1.local:69655] mca:base:select:( plm) Skipping component [slurm]. > Query failed to return a module > [compute-2-1.local:69655] mca:base:select:( plm) Querying component [tm] > [compute-2-1.local:69655] mca:base:select:( plm) Skipping component [tm]. > Query failed to return a module > [compute-2-1.local:69655] mca:base:select:( plm) Selected component [rsh] > [compute-2-1.local:69655] plm:base:set_hnp_name: initial bias 69655 nodename > hash 3634869988 > [compute-2-1.local:69655] plm:base:set_hnp_name: final jobfam 32341 > [compute-2-1.local:69655] [[32341,0],0] plm:rsh_setup on agent ssh : rsh path > NULL > [compute-2-1.local:69655] [[32341,0],0] plm:base:receive start comm > [compute-2-1.local:69655] mca:base:select:( odls) Querying component [default] > [compute-2-1.local:69655] mca:base:select:( odls) Query of component > [default] set priority to 1 > [compute-2-1.local:69655] mca:base:select:( odls) Selected component [default] > [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_job > [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm > [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm creating map > [compute-2-1.local:69655] [[32341,0],0] setup:vm: working unmanaged allocation > [compute-2-1.local:69655] [[32341,0],0] using dash_host > [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-0 > [compute-2-1.local:69655] [[32341,0],0] adding compute-2-0 to list > [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-1.local > [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm add new daemon > [[32341,0],1] > [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm assigning new > daemon [[32341,0],1] to node compute-2-0 > [compute-2-1.local:69655] [[32341,0],0] plm:rsh: launching vm > [compute-2-1.local:69655] [[32341,0],0] plm:rsh: local shell: 0 (bash) > [compute-2-1.local:69655] [[32341,0],0] plm:rsh: assuming same remote shell > as local shell > [compute-2-1.local:69655] [[32341,0],0] plm:rsh: remote shell: 0 (bash) > [compute-2-1.local:69655] [[32341,0],0] plm:rsh: final template argv: >/usr/bin/ssh PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; > export PATH ; LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH > ; export LD_LIBRARY_PATH ; > DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; export > DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted -mca ess env -mca > orte_ess_jobid 2119499776 -mca orte_ess_vpid -mca > 
orte_ess_num_procs 2 -mca orte_hnp_uri > "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca > orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 5 > -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1 > [compute-2-1.local:69655] [[32341,0],0] plm:rsh:launch daemon 0 not a child > of mine > [compute-2-1.local:69655] [[32341,0],0] plm:rsh: adding node compute-2-0 to > launch list > [compute-2-1.local:69655] [[32341,0],0] plm:rsh: activating launch event > [compute-2-1.local:69655] [[32341,0],0] plm:rsh: recording launch of daemon > [[32341,0],1] > [compute-2-1.local:69655] [[32341,0],0] plm:rsh: executing: (//usr/bin/ssh) > [/usr/bin/ssh compute-2-0 PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export > PATH ; LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; > export LD_LIBRARY_PATH ; > DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; export > DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted -mca ess env -mca > orte_ess_jobid 2119499776 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca > orte_hnp_uri "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" > -mca orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose > 5 -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1] > Warning: untrusted X11 forwarding setup fai
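One way to answer Ralph's question directly is a quick reachability check from the remote node; the address and port below are copied from the orte_hnp_uri in the log above (the port changes on every run, so this is only a sketch of the test):

  # From compute-2-0, while the hung mpirun is still running on compute-2-1,
  # confirm the HNP address is reachable and its TCP port accepts connections:
  ping -c 3 10.1.255.226
  nc -zv 10.1.255.226 46314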
Re: [OMPI users] mpi problems/many cpus per node
A very long time (15 mintues or so) I finally received the following in addition to what I just sent earlier: [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD [compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1 [compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending orted_exit commands [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on WILDCARD [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on WILDCARD Firewalls are down: [root@compute-2-1 /]# iptables -L Chain INPUT (policy ACCEPT) target prot opt source destination Chain FORWARD (policy ACCEPT) target prot opt source destination Chain OUTPUT (policy ACCEPT) target prot opt source destination [root@compute-2-0 ~]# iptables -L Chain INPUT (policy ACCEPT) target prot opt source destination Chain FORWARD (policy ACCEPT) target prot opt source destination Chain OUTPUT (policy ACCEPT) target prot opt source destination On 12/17/2012 11:09 AM, Ralph Castain wrote: Hmmm...and that is ALL the output? If so, then it never succeeded in sending a message back, which leads one to suspect some kind of firewall in the way. Looking at the ssh line, we are going to attempt to send a message from tnode 2-0 to node 2-1 on the 10.1.255.226 address. Is that going to work? Anything preventing it? On Dec 17, 2012, at 8:56 AM, Daniel Davidson wrote: These nodes have not been locked down yet so that jobs cannot be launched from the backend, at least on purpose anyway. The added logging returns the information below: [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v -np 10 --leave-session-attached -mca odls_base_verbose 5 -mca plm_base_verbose 5 hostname [compute-2-1.local:69655] mca:base:select:( plm) Querying component [rsh] [compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL [compute-2-1.local:69655] mca:base:select:( plm) Query of component [rsh] set priority to 10 [compute-2-1.local:69655] mca:base:select:( plm) Querying component [slurm] [compute-2-1.local:69655] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module [compute-2-1.local:69655] mca:base:select:( plm) Querying component [tm] [compute-2-1.local:69655] mca:base:select:( plm) Skipping component [tm]. 
Query failed to return a module [compute-2-1.local:69655] mca:base:select:( plm) Selected component [rsh] [compute-2-1.local:69655] plm:base:set_hnp_name: initial bias 69655 nodename hash 3634869988 [compute-2-1.local:69655] plm:base:set_hnp_name: final jobfam 32341 [compute-2-1.local:69655] [[32341,0],0] plm:rsh_setup on agent ssh : rsh path NULL [compute-2-1.local:69655] [[32341,0],0] plm:base:receive start comm [compute-2-1.local:69655] mca:base:select:( odls) Querying component [default] [compute-2-1.local:69655] mca:base:select:( odls) Query of component [default] set priority to 1 [compute-2-1.local:69655] mca:base:select:( odls) Selected component [default] [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_job [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm creating map [compute-2-1.local:69655] [[32341,0],0] setup:vm: working unmanaged allocation [compute-2-1.local:69655] [[32341,0],0] using dash_host [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-0 [compute-2-1.local:69655] [[32341,0],0] adding compute-2-0 to list [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-1.local [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm add new daemon [[32341,0],1] [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm assigning new daemon [[32341,0],1] to node compute-2-0 [compute-2-1.local:69655] [[32341,0],0] plm:rsh: launching vm [compute-2-1.local:69655] [[32341,0],0] plm:rsh: local shell: 0 (bash) [compute-2-1.local:69655] [[32341,0],0] plm:rsh: assuming same remote shell as local shell [compute-2-1.local:69655] [[32341,0],0] plm:rsh: remote shell: 0 (bash) [compute-2-1.local:69655] [[32341,0],0] plm:rsh: final template argv: /usr/bin/ssh PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted -mca ess env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid -mca orte_ess_num_procs 2 -mca orte_hnp_uri "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 5 -mca p
Re: [OMPI users] mpi problems/many cpus per node
I would also add that scp seems to be creating the file in the /tmp directory of compute-2-0, and that /var/log secure is showing ssh connections being accepted. Is there anything in ssh that can limit connections that I need to look out for? My guess is that it is part of the client prefs and not the server prefs since I can initiate the mpi command from another machine and it works fine, even when it uses compute-2-0 and 1. Dan [root@compute-2-1 /]# date Mon Dec 17 15:11:50 CST 2012 [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v -np 10 --leave-session-attached -mca odls_base_verbose 5 -mca plm_base_verbose 5 hostname [compute-2-1.local:70237] mca:base:select:( plm) Querying component [rsh] [compute-2-1.local:70237] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL [root@compute-2-0 tmp]# ls -ltr total 24 -rw---. 1 rootroot 0 Nov 28 08:42 yum.log -rw---. 1 rootroot5962 Nov 29 10:50 yum_save_tx-2012-11-29-10-50SRba9s.yumtx drwx--. 3 danield danield 4096 Dec 12 14:56 openmpi-sessions-danield@compute-2-0_0 drwx--. 3 rootroot4096 Dec 13 15:38 openmpi-sessions-root@compute-2-0_0 drwx-- 18 danield danield 4096 Dec 14 09:48 openmpi-sessions-danield@compute-2-0.local_0 drwx-- 44 rootroot4096 Dec 17 15:14 openmpi-sessions-root@compute-2-0.local_0 [root@compute-2-0 tmp]# tail -10 /var/log/secure Dec 17 15:13:40 compute-2-0 sshd[24834]: Accepted publickey for root from 10.1.255.226 port 49483 ssh2 Dec 17 15:13:40 compute-2-0 sshd[24834]: pam_unix(sshd:session): session opened for user root by (uid=0) Dec 17 15:13:42 compute-2-0 sshd[24834]: Received disconnect from 10.1.255.226: 11: disconnected by user Dec 17 15:13:42 compute-2-0 sshd[24834]: pam_unix(sshd:session): session closed for user root Dec 17 15:13:50 compute-2-0 sshd[24851]: Accepted publickey for root from 10.1.255.226 port 49484 ssh2 Dec 17 15:13:50 compute-2-0 sshd[24851]: pam_unix(sshd:session): session opened for user root by (uid=0) Dec 17 15:13:55 compute-2-0 sshd[24851]: Received disconnect from 10.1.255.226: 11: disconnected by user Dec 17 15:13:55 compute-2-0 sshd[24851]: pam_unix(sshd:session): session closed for user root Dec 17 15:14:01 compute-2-0 sshd[24868]: Accepted publickey for root from 10.1.255.226 port 49485 ssh2 Dec 17 15:14:01 compute-2-0 sshd[24868]: pam_unix(sshd:session): session opened for user root by (uid=0) On 12/17/2012 11:16 AM, Daniel Davidson wrote: A very long time (15 mintues or so) I finally received the following in addition to what I just sent earlier: [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD [compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1 [compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending orted_exit commands [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on WILDCARD [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on WILDCARD Firewalls are down: [root@compute-2-1 /]# iptables -L Chain INPUT (policy ACCEPT) target prot opt source destination Chain FORWARD (policy ACCEPT) target prot opt source destination Chain OUTPUT (policy ACCEPT) target prot opt source destination [root@compute-2-0 ~]# iptables -L Chain INPUT (policy ACCEPT) target prot opt source destination Chain FORWARD (policy ACCEPT) target prot opt source destination Chain OUTPUT (policy ACCEPT) target 
prot opt source destination On 12/17/2012 11:09 AM, Ralph Castain wrote: Hmmm...and that is ALL the output? If so, then it never succeeded in sending a message back, which leads one to suspect some kind of firewall in the way. Looking at the ssh line, we are going to attempt to send a message from tnode 2-0 to node 2-1 on the 10.1.255.226 address. Is that going to work? Anything preventing it? On Dec 17, 2012, at 8:56 AM, Daniel Davidson wrote: These nodes have not been locked down yet so that jobs cannot be launched from the backend, at least on purpose anyway. The added logging returns the information below: [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v -np 10 --leave-session-attached -mca odls_base_verbose 5 -mca plm_base_verbose 5 hostname [compute-2-1.local:69655] mca:base:select:( plm) Querying component [rsh] [compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL [compute-2-1.local:69655] mca:base:select:( plm) Query of component [rsh] set priority to 10 [compute-2-1.local:69655] mca:base:select:( plm) Querying component [slurm] [compute-2-1.local:69655] mca:base:select:( plm) Sk
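On the ssh side of Daniel's question, the usual places to look for per-connection limits are sshd's own settings and TCP wrappers; a generic checklist (not a diagnosis of this cluster), run on compute-2-0:

  # Look for connection-rate or session limits in the sshd configuration:
  grep -Ei 'maxstartups|maxsessions' /etc/ssh/sshd_config

  # Check TCP-wrapper rules that could refuse or throttle connections:
  cat /etc/hosts.allow /etc/hosts.deny

  # Watch sshd's log while re-running mpirun to see whether connections are
  # being refused rather than accepted:
  tail -f /var/log/secure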
Re: [OMPI users] mpi problems/many cpus per node
Daniel, Does passwordless ssh work. You need to make sure that it is. Doug On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote: > I would also add that scp seems to be creating the file in the /tmp directory > of compute-2-0, and that /var/log secure is showing ssh connections being > accepted. Is there anything in ssh that can limit connections that I need to > look out for? My guess is that it is part of the client prefs and not the > server prefs since I can initiate the mpi command from another machine and it > works fine, even when it uses compute-2-0 and 1. > > Dan > > > [root@compute-2-1 /]# date > Mon Dec 17 15:11:50 CST 2012 > [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host > compute-2-0,compute-2-1 -v -np 10 --leave-session-attached -mca > odls_base_verbose 5 -mca plm_base_verbose 5 hostname > [compute-2-1.local:70237] mca:base:select:( plm) Querying component [rsh] > [compute-2-1.local:70237] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : > rsh path NULL > > [root@compute-2-0 tmp]# ls -ltr > total 24 > -rw---. 1 rootroot 0 Nov 28 08:42 yum.log > -rw---. 1 rootroot5962 Nov 29 10:50 > yum_save_tx-2012-11-29-10-50SRba9s.yumtx > drwx--. 3 danield danield 4096 Dec 12 14:56 > openmpi-sessions-danield@compute-2-0_0 > drwx--. 3 rootroot4096 Dec 13 15:38 > openmpi-sessions-root@compute-2-0_0 > drwx-- 18 danield danield 4096 Dec 14 09:48 > openmpi-sessions-danield@compute-2-0.local_0 > drwx-- 44 rootroot4096 Dec 17 15:14 > openmpi-sessions-root@compute-2-0.local_0 > > [root@compute-2-0 tmp]# tail -10 /var/log/secure > Dec 17 15:13:40 compute-2-0 sshd[24834]: Accepted publickey for root from > 10.1.255.226 port 49483 ssh2 > Dec 17 15:13:40 compute-2-0 sshd[24834]: pam_unix(sshd:session): session > opened for user root by (uid=0) > Dec 17 15:13:42 compute-2-0 sshd[24834]: Received disconnect from > 10.1.255.226: 11: disconnected by user > Dec 17 15:13:42 compute-2-0 sshd[24834]: pam_unix(sshd:session): session > closed for user root > Dec 17 15:13:50 compute-2-0 sshd[24851]: Accepted publickey for root from > 10.1.255.226 port 49484 ssh2 > Dec 17 15:13:50 compute-2-0 sshd[24851]: pam_unix(sshd:session): session > opened for user root by (uid=0) > Dec 17 15:13:55 compute-2-0 sshd[24851]: Received disconnect from > 10.1.255.226: 11: disconnected by user > Dec 17 15:13:55 compute-2-0 sshd[24851]: pam_unix(sshd:session): session > closed for user root > Dec 17 15:14:01 compute-2-0 sshd[24868]: Accepted publickey for root from > 10.1.255.226 port 49485 ssh2 > Dec 17 15:14:01 compute-2-0 sshd[24868]: pam_unix(sshd:session): session > opened for user root by (uid=0) > > > > > > > On 12/17/2012 11:16 AM, Daniel Davidson wrote: >> A very long time (15 mintues or so) I finally received the following in >> addition to what I just sent earlier: >> >> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on >> WILDCARD >> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on >> WILDCARD >> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on >> WILDCARD >> [compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1 >> [compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending >> orted_exit commands >> [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on >> WILDCARD >> [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on >> WILDCARD >> >> Firewalls are down: >> >> [root@compute-2-1 /]# iptables -L >> Chain INPUT (policy ACCEPT) >> target prot opt source destination >> >> Chain 
FORWARD (policy ACCEPT) >> target prot opt source destination >> >> Chain OUTPUT (policy ACCEPT) >> target prot opt source destination >> [root@compute-2-0 ~]# iptables -L >> Chain INPUT (policy ACCEPT) >> target prot opt source destination >> >> Chain FORWARD (policy ACCEPT) >> target prot opt source destination >> >> Chain OUTPUT (policy ACCEPT) >> target prot opt source destination >> >> On 12/17/2012 11:09 AM, Ralph Castain wrote: >>> Hmmm...and that is ALL the output? If so, then it never succeeded in >>> sending a message back, which leads one to suspect some kind of firewall in >>> the way. >>> >>> Looking at the ssh line, we are going to attempt to send a message from >>> tnode 2-0 to node 2-1 on the 10.1.255.226 address. Is that going to work? >>> Anything preventing it? >>> >>> >>> On Dec 17, 2012, at 8:56 AM, Daniel Davidson wrote: >>> These nodes have not been locked down yet so that jobs cannot be launched from the backend, at least on purpose anyway. The added logging returns the information below: [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v -np 10 --leave-session-attached -mca odls_base_v
Re: [OMPI users] [Open MPI] #3351: JAVA scatter error
On Dec 15, 2012, at 10:46 AM, Siegmar Gross wrote: >> 1. The datatypes passed to Scatter are not valid MPI datatypes >> (MPI.OBJECT). You need to construct a datatype that is specific to the >> !MyData class, just like you would in C/C++. I think that this is the >> first error that you are seeing (i.e., that OMPI is trying to treat >> MPI.OBJECT as an MPI Datatype object, and failing (and therefore throwing >> an !ClassCastException exception). > > Perhaps you are right and my small example program ist not a valid MPI > program. The problem is that I couldn't find any good documentation or > example programs how to write a program which uses a structured data > type. In Java, that's probably true. Remember: there are no official MPI Java bindings. What is included in Open MPI is a research project from several years ago. We picked what appeared to be the best one, freshened it up a little, updated its build system to incorporate into ours, verified its basic functionality, and went with that. In C, there should be plenty of google-able examples about how to use Scatter (and friends). You might want to have a look at a few of those to get an idea how to use MPI_Scatter in general, and then apply that knowledge to a Java program. Make sense? > Therefore I sticked to the mpiJava specification which states > for derived datatypes in chapter 3.12 that the effect for MPI_Type_struct > can be achieved by using MPI.OBJECT as the buffer type and relying on > Java object serialization. "dataItem" is a serializable Java object and > I used MPI.OBJECT as buffer type. How can I create a valid MPI datatype > MPI.OBJECT so that I get a working example program? /me reads some Java implementation code... It looks like they allow passing MPI.OBJECT as the datatype argument; sorry, I guess I was wrong about that. >MPI.COMM_WORLD.Scatter (dataItem, 0, 1, MPI.OBJECT, >objBuffer, 0, 1, MPI.OBJECT, 0); What I think you're running into here is that you're still using Scatter wrong, per my other point, below: >> 1. It looks like you're trying to Scatter a single object to N peers. >> That's invalid MPI -- you need to scatter (N*M) objects to N peers, where >> M is a positive integer value (e.g., 1 or 2). Are you trying to >> broadcast? > > It is the very first version of the program where I scatter one object > to the process itself (at this point it is not the normal application > area for scatter, but should nevertheless work). I didn't continue due > to the error. I get the same error when I broadcast my data item. > > tyr java 116 mpiexec -np 1 java -cp $DIRPREFIX_LOCAL/mpi_classfiles \ > ObjectScatterMain > Exception in thread "main" java.lang.ClassCastException: MyData cannot > be cast to [Ljava.lang.Object; >at mpi.Intracomm.copyBuffer(Intracomm.java:119) >at mpi.Intracomm.Scatter(Intracomm.java:389) >at ObjectScatterMain.main(ObjectScatterMain.java:45) I don't know Java, but it looks like it's complaining about the type of dataItem, not the type of MPI.OBJECT. It says it can't cast dataItem to a Ljava.lang.Object -- which appears to be the type of the first argument to Scatter. Do you need to have MyData inherit from the Java base Object type, or some such? > "Broadcast" works if I have only a root process and it fails when I have > one more process. If I change MPI.COMM_WORLD.Scatter(...) to MPI.COMM_WORLD.Bcast(dataItem, 0, 1, MPI.OBJECT, 0); I get the same casting error. I'm sorry; I really don't know Java, and don't know how to fix this offhand. 
-- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] mpi problems/many cpus per node
Yes, it does. Dan [root@compute-2-1 ~]# ssh compute-2-0 Warning: untrusted X11 forwarding setup failed: xauth key data not generated Warning: No xauth data; using fake authentication data for X11 forwarding. Last login: Mon Dec 17 16:13:00 2012 from compute-2-1.local [root@compute-2-0 ~]# ssh compute-2-1 Warning: untrusted X11 forwarding setup failed: xauth key data not generated Warning: No xauth data; using fake authentication data for X11 forwarding. Last login: Mon Dec 17 16:12:32 2012 from biocluster.local [root@compute-2-1 ~]# On 12/17/2012 03:39 PM, Doug Reeder wrote: Daniel, Does passwordless ssh work. You need to make sure that it is. Doug On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote: I would also add that scp seems to be creating the file in the /tmp directory of compute-2-0, and that /var/log secure is showing ssh connections being accepted. Is there anything in ssh that can limit connections that I need to look out for? My guess is that it is part of the client prefs and not the server prefs since I can initiate the mpi command from another machine and it works fine, even when it uses compute-2-0 and 1. Dan [root@compute-2-1 /]# date Mon Dec 17 15:11:50 CST 2012 [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v -np 10 --leave-session-attached -mca odls_base_verbose 5 -mca plm_base_verbose 5 hostname [compute-2-1.local:70237] mca:base:select:( plm) Querying component [rsh] [compute-2-1.local:70237] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL [root@compute-2-0 tmp]# ls -ltr total 24 -rw---. 1 rootroot 0 Nov 28 08:42 yum.log -rw---. 1 rootroot5962 Nov 29 10:50 yum_save_tx-2012-11-29-10-50SRba9s.yumtx drwx--. 3 danield danield 4096 Dec 12 14:56 openmpi-sessions-danield@compute-2-0_0 drwx--. 
3 rootroot4096 Dec 13 15:38 openmpi-sessions-root@compute-2-0_0 drwx-- 18 danield danield 4096 Dec 14 09:48 openmpi-sessions-danield@compute-2-0.local_0 drwx-- 44 rootroot4096 Dec 17 15:14 openmpi-sessions-root@compute-2-0.local_0 [root@compute-2-0 tmp]# tail -10 /var/log/secure Dec 17 15:13:40 compute-2-0 sshd[24834]: Accepted publickey for root from 10.1.255.226 port 49483 ssh2 Dec 17 15:13:40 compute-2-0 sshd[24834]: pam_unix(sshd:session): session opened for user root by (uid=0) Dec 17 15:13:42 compute-2-0 sshd[24834]: Received disconnect from 10.1.255.226: 11: disconnected by user Dec 17 15:13:42 compute-2-0 sshd[24834]: pam_unix(sshd:session): session closed for user root Dec 17 15:13:50 compute-2-0 sshd[24851]: Accepted publickey for root from 10.1.255.226 port 49484 ssh2 Dec 17 15:13:50 compute-2-0 sshd[24851]: pam_unix(sshd:session): session opened for user root by (uid=0) Dec 17 15:13:55 compute-2-0 sshd[24851]: Received disconnect from 10.1.255.226: 11: disconnected by user Dec 17 15:13:55 compute-2-0 sshd[24851]: pam_unix(sshd:session): session closed for user root Dec 17 15:14:01 compute-2-0 sshd[24868]: Accepted publickey for root from 10.1.255.226 port 49485 ssh2 Dec 17 15:14:01 compute-2-0 sshd[24868]: pam_unix(sshd:session): session opened for user root by (uid=0) On 12/17/2012 11:16 AM, Daniel Davidson wrote: A very long time (15 mintues or so) I finally received the following in addition to what I just sent earlier: [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD [compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1 [compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending orted_exit commands [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on WILDCARD [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on WILDCARD Firewalls are down: [root@compute-2-1 /]# iptables -L Chain INPUT (policy ACCEPT) target prot opt source destination Chain FORWARD (policy ACCEPT) target prot opt source destination Chain OUTPUT (policy ACCEPT) target prot opt source destination [root@compute-2-0 ~]# iptables -L Chain INPUT (policy ACCEPT) target prot opt source destination Chain FORWARD (policy ACCEPT) target prot opt source destination Chain OUTPUT (policy ACCEPT) target prot opt source destination On 12/17/2012 11:09 AM, Ralph Castain wrote: Hmmm...and that is ALL the output? If so, then it never succeeded in sending a message back, which leads one to suspect some kind of firewall in the way. Looking at the ssh line, we are going to attempt to send a message from tnode 2-0 to node 2-1 on the 10.1.255.226 address. Is that going to work? Anything preventing it? On Dec 17, 2012, at 8:56 AM, Daniel Davidson wrote: These nodes have not been locked down yet so th
Re: [OMPI users] [Open MPI] #3351: JAVA scatter error
On Dec 15, 2012, at 10:46 AM, Siegmar Gross wrote: > "Broadcast" works if I have only a root process and it fails when I have > one more process. I'm sorry; I didn't clarify this error. In a broadcast of only 1 process, it's effectively a no-op. So it doesn't need to do anything to the buffer. Hence, it succeeds. But when you include >1 process, then it needs to do something to the buffer, so the same casting issue arises. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] [Open MPI] #3351: JAVA scatter error
On Dec 15, 2012, at 10:46 AM, Siegmar Gross wrote:

> If I misunderstood the mpiJava specification and I must create a special
> MPI object from my Java object: How do I create it? Thank you very much
> for any help in advance.

You sent me a source code listing off-list, but I want to reply on-list for a few reasons:

1. We actively do want feedback on these Java bindings
2. There is some documentation about this, but these bindings *are* different than the C bindings (and Java just behaves differently from C), so it's worth documenting here in a Google-able location

I attached the source code listing you sent. It is closer to correct, but I still don't think it's quite right. There are two issues here:

1. I have no idea how Java stores 2D arrays of doubles. I.e., you're using "double matrix[][]". I don't know if all P*Q values are stored contiguously in memory (or, more specifically, if the Java language *guarantees* that that will always be so).

2. Your MPI vector is almost right, but there's a subtle issue about MPI vectors that you're missing.

Because of #1, I changed your program to use matrix[], and have it malloc a single P*Q array. Then I always accessed the {i,j} element via matrix[i * Q + j]. In this way, Java seems to keep all the values contiguously in memory. That leads to this conversion of your program:

-----
import mpi.*;

public class ColumnScatterMain {
  static final int P = 4;
  static final int Q = 6;
  static final int NUM_ELEM_PER_LINE = 6;

  public static void main (String args[])
      throws MPIException, InterruptedException {
    int ntasks, mytid, i, j, tmp;
    double matrix[], column[];
    Datatype column_t;

    MPI.Init (args);
    matrix = new double[P * Q];
    column = new double[P];
    mytid  = MPI.COMM_WORLD.Rank ();
    ntasks = MPI.COMM_WORLD.Size ();
    if (mytid == 0) {
      if (ntasks != Q) {
        System.err.println ("\n\nI need exactly " + Q + " processes.\n\n" +
                            "Usage:\n" +
                            "  mpiexec -np " + Q + " java \n");
      }
    }
    if (ntasks != Q) {
      MPI.Finalize ();
      System.exit (0);
    }
    column_t = Datatype.Vector (P, 1, Q, MPI.DOUBLE);
    column_t.Commit ();
    if (mytid == 0) {
      tmp = 1;
      System.out.println ("\nmatrix:\n");
      for (i = 0; i < P; ++i) {
        for (j = 0; j < Q; ++j) {
          matrix[i * Q + j] = tmp++;
          System.out.printf ("%10.2f", matrix[i * Q + j]);
        }
        System.out.println ();
      }
      System.out.println ();
    }
    MPI.COMM_WORLD.Scatter (matrix, 0, 1, column_t,
                            column, 0, P, MPI.DOUBLE, 0);
    Thread.sleep(1000 * mytid);   // Sleep to get ordered output
    System.out.println ("\nColumn of process " + mytid + "\n");
    for (i = 0; i < P; ++i) {
      if (((i + 1) % NUM_ELEM_PER_LINE) == 0) {
        System.out.printf ("%10.2f\n", column[i]);
      } else {
        System.out.printf ("%10.2f", column[i]);
      }
    }
    System.out.println ();
    column_t.finalize ();
    MPI.Finalize();
  }
}
-----

Notice that the output for process 0 after the scatter is correct -- it shows that it received values 1, 7, 13, 19 for its column. But all other processes are wrong. Why? Because of #2.

Notice that process 1 got values 20, 0, 0, 0 (or, more specifically, 20, junk, junk, junk). That's because the vector datatype you created ended right at element 19. So it started the next vector (i.e., to send to process 1) at the next element -- element 20. And then went on in the same memory pattern from there, but that was already beyond the end of the array. Go google a tutorial on MPI_Type_vector and you'll see what I mean.

In C or Fortran, the solution would be to use an MPI_TYPE_UB at the end of the vector to artificially make the "next" vector be at element 1 (vs. element 20).
By the description in 3.12, it looks like they explicitly disallowed this (or, I guess, they didn't implement LB/UB properly -- but MPI_LB and MPI_UB are deprecated in MPI-3.0, anyway). But I think it could be done with MPI_TYPE_CREATE_RESIZED, which, unfortunately, doesn't look like it is implemented in these java bindings yet.

Make sense?

-- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

[Attachment: ColumnScatterMain.java]