On 06.04.2012 at 12:17, Eloi Gaudry wrote:

> > - Can you please post, while it's running, the relevant lines from:
> >   ps -e f --cols=500
> >   (f w/o -) from both machines.
> > It's allocated between the nodes more like in a round-robin fashion.
>
> [eg: ] I'll try to do this tomorrow, as soon as some slots become free.
> Thanks for your feedback Reuti, I appreciate it.
>
> hi reuti, here is the information related to another run that is failing in the same way:
>
> qstat -g t:
> ------------
> ---------------------------------------------------------------------------------
> smp...@barney.fft              BIP   0/3/4          3.37     lx-amd64
>         hc:mem_available=1.715G
>         hc:proc_available=1
>    1416 0.60500 semi_green jj           r     04/06/2012 11:57:34     SLAVE
>                                                                       SLAVE
>                                                                       SLAVE
> ---------------------------------------------------------------------------------
> smp...@carl.fft                BIP   0/3/4          3.44     lx-amd64
>         hc:mem_available=1.715G
>         hc:proc_available=1
>    1416 0.60500 semi_green jj           r     04/06/2012 11:57:34     SLAVE
>                                                                       SLAVE
>                                                                       SLAVE
> ---------------------------------------------------------------------------------
> smp...@charlie.fft             BIP   0/6/8          3.46     lx-amd64
>         hc:mem_available=4.018G
>         hc:proc_available=2
>    1416 0.60500 semi_green jj           r     04/06/2012 11:57:34     MASTER
>                                                                       SLAVE
>                                                                       SLAVE
>                                                                       SLAVE
>                                                                       SLAVE
>                                                                       SLAVE
>                                                                       SLAVE
Thx. This is the allocation, which is also confirmed by the Open MPI output.

- Was the application compiled with the same version of Open MPI?

- Does the application start anything on its own besides the tasks granted by mpiexec/orterun? You want 12 ranks in total, and for barney.fft and carl.fft there is also "-mca orte_ess_num_procs 3" given to the qrsh_starter. In total I count only 10 ranks in the example given (4+4+2) - do you observe the same?

It looks like Open MPI is doing the right thing, but the application decided to start with a different allocation. Does the application use OpenMP in addition, or other kinds of threads? The suffix "_mp" in the name "actranpy_mp" makes me suspicious about it.

-- Reuti
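If the "_mp" suspicion is right and the solver also starts OpenMP threads, a plain "ps -e f" will not show them - each of the processes below may carry several threads. A quick cross-check (a sketch only, assuming the solver honours the standard OMP_NUM_THREADS variable) is to pin the thread count in the job script before the orterun line and re-run:

    export OMP_NUM_THREADS=1    # assumption: the solver reads the standard OpenMP variable
    orterun -x OMP_NUM_THREADS ... /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp ...

Here "..." stands for the existing orterun options; "-x" makes Open MPI export the variable to all ranks. Alternatively, "ps -eLf" lists the threads of the running processes directly.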
> barney: ps -e f --cols=500:
> -----------------------------------
>  2048 ?        Sl     3:33 /opt/sge/bin/lx-amd64/sge_execd
> 27502 ?        Sl     0:00  \_ sge_shepherd-1416 -bg
> 27503 ?        Ss     0:00      \_ /opt/sge/utilbin/lx-amd64/qrsh_starter /opt/sge/default/spool/barney/active_jobs/1416.1/1.barney
> 27510 ?        S      0:00          \_ bash -c PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
> 27511 ?        S      0:00              \_ /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
> 27512 ?        Rl    12:54                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 27513 ?        Rl    12:54                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>
> carl: ps -e f --cols=500:
> -------------------------------
>  1928 ?        Sl     3:10 /opt/sge/bin/lx-amd64/sge_execd
> 29022 ?        Sl     0:00  \_ sge_shepherd-1416 -bg
> 29023 ?        Ss     0:00      \_ /opt/sge/utilbin/lx-amd64/qrsh_starter /opt/sge/default/spool/carl/active_jobs/1416.1/1.carl
> 29030 ?        S      0:00          \_ bash -c PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
> 29031 ?        S      0:00              \_ /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
> 29032 ?        Rl    13:49                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 29033 ?        Rl    13:50                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 29034 ?        Rl    13:49                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 29035 ?        Rl    13:49                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>
> charlie: ps -e f --cols=500:
> -----------------------------------
>  1591 ?        Sl     3:13 /opt/sge/bin/lx-amd64/sge_execd
>  8793 ?        S      0:00  \_ sge_shepherd-1416 -bg
>  8795 ?        Ss     0:00      \_ -bash /opt/sge/default/spool/charlie/job_scripts/1416
>  8800 ?        S      0:00          \_ /opt/openmpi-1.4.4/bin/orterun --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1 --bynode -report-bindings -display-map -display-devel-map -display-allocation -display-devel-allocation -np 12 -x ACTRAN_LICENSE -x ACTRAN_PRODUCTLINE -x LD_LIBRARY_PATH -x PATH -x ACTRAN_DEBUG /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parall
>  8801 ?        Sl     0:00              \_ /opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V barney.fft PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose
>  8802 ?        Sl     0:00              \_ /opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V carl.fft PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
>  8807 ?        Rl    14:23              \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>  8808 ?        Rl    14:23              \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>  8809 ?        Rl    14:23              \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>  8810 ?        Rl    14:23              \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
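A quick way to tally how many solver ranks are actually running on each node (a sketch only - it assumes password-less ssh between the nodes and that pgrep is available; the host names are the ones from this job):

    # count the processes named actranpy_mp on every node of the job
    for h in charlie barney carl; do
        printf '%s: ' "$h"
        ssh "$h" pgrep -c actranpy_mp
    done

For the listings above this should print 4 for charlie, 2 for barney and 4 for carl - the 10 ranks counted in the reply, two short of the 12 that were requested.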
> orterun information:
> --------------------------
> [charlie:08800] ras:gridengine: JOB_ID: 1416
> [charlie:08800] ras:gridengine: PE_HOSTFILE: /opt/sge/default/spool/charlie/active_jobs/1416.1/pe_hostfile
> [charlie:08800] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=6
> [charlie:08800] ras:gridengine: barney.fft: PE_HOSTFILE shows slots=3
> [charlie:08800] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=3
>
> ======================   ALLOCATED NODES   ======================
>
>  Data for node: Name: charlie  Launch id: -1  Arch: ffc91200  State: 2
>         Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
>         Daemon: [[57989,0],0]  Daemon launched: True
>         Num slots: 6  Slots in use: 0
>         Num slots allocated: 6  Max slots: 0
>         Username on node: NULL
>         Num procs: 0  Next node_rank: 0
>  Data for node: Name: barney.fft  Launch id: -1  Arch: 0  State: 2
>         Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
>         Daemon: Not defined  Daemon launched: False
>         Num slots: 3  Slots in use: 0
>         Num slots allocated: 3  Max slots: 0
>         Username on node: NULL
>         Num procs: 0  Next node_rank: 0
>  Data for node: Name: carl.fft  Launch id: -1  Arch: 0  State: 2
>         Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
>         Daemon: Not defined  Daemon launched: False
>         Num slots: 3  Slots in use: 0
>         Num slots allocated: 3  Max slots: 0
>         Username on node: NULL
>         Num procs: 0  Next node_rank: 0
>
> =================================================================
>
>  Map generated by mapping policy: 0200
>         Npernode: 0  Oversubscribe allowed: TRUE  CPU Lists: FALSE
>         Num new daemons: 2  New daemon starting vpid 1
>         Num nodes: 3
>
>  Data for node: Name: charlie  Launch id: -1  Arch: ffc91200  State: 2
>         Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
>         Daemon: [[57989,0],0]  Daemon launched: True
>         Num slots: 6  Slots in use: 4
>         Num slots allocated: 6  Max slots: 0
>         Username on node: NULL
>         Num procs: 4  Next node_rank: 4
>         Data for proc: [[57989,1],0]
>                 Pid: 0  Local rank: 0  Node rank: 0
>                 State: 0  App_context: 0  Slot list: NULL
>         Data for proc: [[57989,1],3]
>                 Pid: 0  Local rank: 1  Node rank: 1
>                 State: 0  App_context: 0  Slot list: NULL
>         Data for proc: [[57989,1],6]
>                 Pid: 0  Local rank: 2  Node rank: 2
>                 State: 0  App_context: 0  Slot list: NULL
>         Data for proc: [[57989,1],9]
>                 Pid: 0  Local rank: 3  Node rank: 3
>                 State: 0  App_context: 0  Slot list: NULL
>
>  Data for node: Name: barney.fft  Launch id: -1  Arch: 0  State: 2
>         Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
>         Daemon: [[57989,0],1]  Daemon launched: False
>         Num slots: 3  Slots in use: 4
>         Num slots allocated: 3  Max slots: 0
>         Username on node: NULL
>         Num procs: 4  Next node_rank: 4
>         Data for proc: [[57989,1],1]
>                 Pid: 0  Local rank: 0  Node rank: 0
>                 State: 0  App_context: 0  Slot list: NULL
>         Data for proc: [[57989,1],4]
>                 Pid: 0  Local rank: 1  Node rank: 1
>                 State: 0  App_context: 0  Slot list: NULL
>         Data for proc: [[57989,1],7]
>                 Pid: 0  Local rank: 2  Node rank: 2
>                 State: 0  App_context: 0  Slot list: NULL
>         Data for proc: [[57989,1],10]
>                 Pid: 0  Local rank: 3  Node rank: 3
>                 State: 0  App_context: 0  Slot list: NULL
>
>  Data for node: Name: carl.fft  Launch id: -1  Arch: 0  State: 2
>         Num boards: 1  Num sockets/board: 2  Num cores/socket: 4
>         Daemon: [[57989,0],2]  Daemon launched: False
>         Num slots: 3  Slots in use: 4
>         Num slots allocated: 3  Max slots: 0
>         Username on node: NULL
>         Num procs: 4  Next node_rank: 4
>         Data for proc: [[57989,1],2]
>                 Pid: 0  Local rank: 0  Node rank: 0
>                 State: 0  App_context: 0  Slot list: NULL
>         Data for proc: [[57989,1],5]
>                 Pid: 0  Local rank: 1  Node rank: 1
>                 State: 0  App_context: 0  Slot list: NULL
>         Data for proc: [[57989,1],8]
>                 Pid: 0  Local rank: 2  Node rank: 2
>                 State: 0  App_context: 0  Slot list: NULL
>         Data for proc: [[57989,1],11]
>                 Pid: 0  Local rank: 3  Node rank: 3
>                 State: 0  App_context: 0  Slot list: NULL
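The map above is what the "--bynode" option in the job's orterun line requests: the 12 ranks are dealt out round-robin over the three nodes, four per node, even though the PE_HOSTFILE granted 6/3/3 slots - note "Slots in use: 4" versus "Num slots allocated: 3" on barney.fft and carl.fft, with "Oversubscribe allowed: TRUE". A minimal way to look at the placement alone, without the solver (assuming the same parallel environment and the Open MPI 1.4.4 install used above), would be:

    # replace the solver by hostname to see only where the ranks land
    orterun --bynode -np 12 hostname | sort | uniq -c

which should report the same four ranks per node that the map shows.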
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users