On 06.04.2012 at 12:17, Eloi Gaudry wrote:

> > - Can you please post, while it's running, the relevant lines from:
> > ps -e f --cols=500
> > (f without -) from both machines.
> > It's allocated between the nodes more or less in a round-robin fashion.
> > [eg: ] I'll try to do this tomorrow, as soon as some slots become free. 
> > Thanks for your feedback, Reuti, I appreciate it.
>  
> Hi Reuti, here is the information related to another run that is failing in 
> the same way:
>  
> qstat -g t:
> ------------
> ---------------------------------------------------------------------------------
> smp...@barney.fft              BIP   0/3/4          3.37     lx-amd64
>   hc:mem_available=1.715G
>   hc:proc_available=1
>    1416 0.60500 semi_green jj           r     04/06/2012 11:57:34 SLAVE
>                                                                   SLAVE
>                                                                   SLAVE
> ---------------------------------------------------------------------------------
> smp...@carl.fft                BIP   0/3/4          3.44     lx-amd64
>   hc:mem_available=1.715G
>   hc:proc_available=1
>    1416 0.60500 semi_green jj           r     04/06/2012 11:57:34 SLAVE
>                                                                   SLAVE
>                                                                   SLAVE
> ---------------------------------------------------------------------------------
> smp...@charlie.fft             BIP   0/6/8          3.46     lx-amd64
>   hc:mem_available=4.018G
>   hc:proc_available=2
>    1416 0.60500 semi_green jj           r     04/06/2012 11:57:34 MASTER
>                                                                   SLAVE
>                                                                   SLAVE
>                                                                   SLAVE
>                                                                   SLAVE
>                                                                   SLAVE
>                                                                   SLAVE

Thanks. This is the allocation, which is also confirmed by the Open MPI output.

- Was the application compiled with the same version of Open MPI? (A quick check 
is sketched below.)
- Does the application start anything on its own besides the tasks granted by 
mpiexec/orterun?
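
One way to check the first point (just a sketch, assuming the binary is 
dynamically linked and using the paths from your ps output) would be to compare 
on one of the nodes:

ldd /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp | grep -i mpi
/opt/openmpi-1.4.4/bin/ompi_info | grep "Open MPI:"

The first command shows which libmpi the executable actually resolves to, the 
second the version of the installation used for launching; the two should agree.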

You want 12 ranks in total, and for barney.fft and carl.fft "-mca 
orte_ess_num_procs 3" is also given to the qrsh_starter. In this example I 
count only 10 ranks in total (4+4+2) - do you observe the same?
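
To double-check the live count, a quick sketch (assuming the binary name from 
your ps output, run on each of the three nodes while the job is active):

ps -C actranpy_mp --no-headers | wc -l

From your listings this gives 2 on barney, 4 on carl and 4 on charlie, i.e. the 
10 ranks mentioned above, while the map further down plans 4 processes per node.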

It looks like Open MPI is doing the right thing, but the application processes 
started with a different allocation.

Does the application additionally use OpenMP or other kinds of threads? The 
suffix "_mp" in the name "actranpy_mp" makes me suspicious.
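
If it does, the number of threads per process would give a hint. A sketch to 
check this on one of the nodes while the job runs (nlwp is the thread count per 
process):

ps -o pid,nlwp,args -C actranpy_mp

If nlwp is well above 1, the application is threading internally; exporting 
OMP_NUM_THREADS=1 in the job script would then be one way to test whether that 
interferes with the MPI startup.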

-- Reuti


> barney: ps -e f --cols=500:
> -----------------------------------
>  2048 ?        Sl     3:33 /opt/sge/bin/lx-amd64/sge_execd
> 27502 ?        Sl     0:00  \_ sge_shepherd-1416 -bg
> 27503 ?        Ss     0:00      \_ /opt/sge/utilbin/lx-amd64/qrsh_starter 
> /opt/sge/default/spool/barney/active_jobs/1416.1/1.barney
> 27510 ?        S      0:00          \_ bash -c  
> PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; 
> LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export 
> LD_LIBRARY_PATH ;  /opt/openmpi-1.4.4/bin/orted -mca ess env 
> -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca 
> orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca 
> pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca 
> ras_gridengine_verbose 1 
> 27511 ?        S      0:00              \_ /opt/openmpi-1.4.4/bin/orted -mca 
> ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca 
> orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca 
> pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca 
> ras_gridengine_verbose 1
> 27512 ?        Rl    12:54                  \_ 
> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
> --parallel=frequency --scratch=/scratch/cluster/1416 
> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 27513 ?        Rl    12:54                  \_ 
> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
> --parallel=frequency --scratch=/scratch/cluster/1416 
> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>  
> carl: ps -e f --cols=500:
> -------------------------------
>  1928 ?        Sl     3:10 /opt/sge/bin/lx-amd64/sge_execd
> 29022 ?        Sl     0:00  \_ sge_shepherd-1416 -bg
> 29023 ?        Ss     0:00      \_ /opt/sge/utilbin/lx-amd64/qrsh_starter 
> /opt/sge/default/spool/carl/active_jobs/1416.1/1.carl
> 29030 ?        S      0:00          \_ bash -c  
> PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; 
> LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export 
> LD_LIBRARY_PATH ;  /opt/openmpi-1.4.4/bin/orted -mca ess env 
> -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca 
> orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca 
> pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca 
> ras_gridengine_verbose 1 
> 29031 ?        S      0:00              \_ /opt/openmpi-1.4.4/bin/orted -mca 
> ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca 
> orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca 
> pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca 
> ras_gridengine_verbose 1
> 29032 ?        Rl    13:49                  \_ 
> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
> --parallel=frequency --scratch=/scratch/cluster/1416 
> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 29033 ?        Rl    13:50                  \_ 
> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
> --parallel=frequency --scratch=/scratch/cluster/1416 
> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 29034 ?        Rl    13:49                  \_ 
> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
> --parallel=frequency --scratch=/scratch/cluster/1416 
> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 29035 ?        Rl    13:49                  \_ 
> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
> --parallel=frequency --scratch=/scratch/cluster/1416 
> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>  
>  
> charlie: ps -e f --cols=500:
> -----------------------------------
>  1591 ?        Sl     3:13 /opt/sge/bin/lx-amd64/sge_execd
>  8793 ?        S      0:00  \_ sge_shepherd-1416 -bg
>  8795 ?        Ss     0:00      \_ -bash 
> /opt/sge/default/spool/charlie/job_scripts/1416
>  8800 ?        S      0:00          \_ /opt/openmpi-1.4.4/bin/orterun --mca 
> pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca 
> ras_gridengine_verbose 1 --bynode -report-bindings -display-map 
> -display-devel-map -display-allocation -display-devel-allocation -np 12 -x 
> ACTRAN_LICENSE -x ACTRAN_PRODUCTLINE -x LD_LIBRARY_PATH -x PATH -x 
> ACTRAN_DEBUG /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
> --parall
>  8801 ?        Sl     0:00              \_ /opt/sge/bin/lx-amd64/qrsh 
> -inherit -nostdin -V barney.fft  PATH=/opt/openmpi-1.4.4/bin:$PATH ; export 
> PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export 
> LD_LIBRARY_PATH ;  /opt/openmpi-1.4.4/bin/orted -mca ess env -mca 
> orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 
> --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca 
> pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca 
> ras_gridengine_verbose
>  8802 ?        Sl     0:00              \_ /opt/sge/bin/lx-amd64/qrsh 
> -inherit -nostdin -V carl.fft  PATH=/opt/openmpi-1.4.4/bin:$PATH ; export 
> PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export 
> LD_LIBRARY_PATH ;  /opt/openmpi-1.4.4/bin/orted -mca ess env -mca 
> orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 
> --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca 
> pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca 
> ras_gridengine_verbose 1
>  8807 ?        Rl    14:23              \_ 
> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
> --parallel=frequency --scratch=/scratch/cluster/1416 
> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>  8808 ?        Rl    14:23              \_ 
> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
> --parallel=frequency --scratch=/scratch/cluster/1416 
> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>  8809 ?        Rl    14:23              \_ 
> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
> --parallel=frequency --scratch=/scratch/cluster/1416 
> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>  8810 ?        Rl    14:23              \_ 
> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
> --parallel=frequency --scratch=/scratch/cluster/1416 
> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>  
> orterun information:
> --------------------------
> [charlie:08800] ras:gridengine: JOB_ID: 1416
> [charlie:08800] ras:gridengine: PE_HOSTFILE: 
> /opt/sge/default/spool/charlie/active_jobs/1416.1/pe_hostfile
> [charlie:08800] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=6
> [charlie:08800] ras:gridengine: barney.fft: PE_HOSTFILE shows slots=3
> [charlie:08800] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=3
> 
> ======================   ALLOCATED NODES   ======================
> 
>  Data for node: Name: charlie   Launch id: -1 Arch: ffc91200  State: 2
>   Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
>   Daemon: [[57989,0],0] Daemon launched: True
>   Num slots: 6  Slots in use: 0
>   Num slots allocated: 6  Max slots: 0
>   Username on node: NULL
>   Num procs: 0  Next node_rank: 0
>  Data for node: Name: barney.fft    Launch id: -1 Arch: 0 State: 2
>   Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
>   Daemon: Not defined Daemon launched: False
>   Num slots: 3  Slots in use: 0
>   Num slots allocated: 3  Max slots: 0
>   Username on node: NULL
>   Num procs: 0  Next node_rank: 0
>  Data for node: Name: carl.fft    Launch id: -1 Arch: 0 State: 2
>   Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
>   Daemon: Not defined Daemon launched: False
>   Num slots: 3  Slots in use: 0
>   Num slots allocated: 3  Max slots: 0
>   Username on node: NULL
>   Num procs: 0  Next node_rank: 0
> 
> =================================================================
> 
>  Map generated by mapping policy: 0200
>   Npernode: 0 Oversubscribe allowed: TRUE CPU Lists: FALSE
>   Num new daemons: 2  New daemon starting vpid 1
>   Num nodes: 3
> 
>  Data for node: Name: charlie   Launch id: -1 Arch: ffc91200  State: 2
>   Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
>   Daemon: [[57989,0],0] Daemon launched: True
>   Num slots: 6  Slots in use: 4
>   Num slots allocated: 6  Max slots: 0
>   Username on node: NULL
>   Num procs: 4  Next node_rank: 4
>   Data for proc: [[57989,1],0]
>     Pid: 0  Local rank: 0 Node rank: 0
>     State: 0  App_context: 0  Slot list: NULL
>   Data for proc: [[57989,1],3]
>     Pid: 0  Local rank: 1 Node rank: 1
>     State: 0  App_context: 0  Slot list: NULL
>   Data for proc: [[57989,1],6]
>     Pid: 0  Local rank: 2 Node rank: 2
>     State: 0  App_context: 0  Slot list: NULL
>   Data for proc: [[57989,1],9]
>     Pid: 0  Local rank: 3 Node rank: 3
>     State: 0  App_context: 0  Slot list: NULL
> 
>  Data for node: Name: barney.fft    Launch id: -1 Arch: 0 State: 2
>   Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
>   Daemon: [[57989,0],1] Daemon launched: False
>   Num slots: 3  Slots in use: 4
>   Num slots allocated: 3  Max slots: 0
>   Username on node: NULL
>   Num procs: 4  Next node_rank: 4
>   Data for proc: [[57989,1],1]
>     Pid: 0  Local rank: 0 Node rank: 0
>     State: 0  App_context: 0  Slot list: NULL
>   Data for proc: [[57989,1],4]
>     Pid: 0  Local rank: 1 Node rank: 1
>     State: 0  App_context: 0  Slot list: NULL
>   Data for proc: [[57989,1],7]
>     Pid: 0  Local rank: 2 Node rank: 2
>     State: 0  App_context: 0  Slot list: NULL
>   Data for proc: [[57989,1],10]
>     Pid: 0  Local rank: 3 Node rank: 3
>     State: 0  App_context: 0  Slot list: NULL
> 
>  Data for node: Name: carl.fft    Launch id: -1 Arch: 0 State: 2
>   Num boards: 1 Num sockets/board: 2  Num cores/socket: 4
>   Daemon: [[57989,0],2] Daemon launched: False
>   Num slots: 3  Slots in use: 4
>   Num slots allocated: 3  Max slots: 0
>   Username on node: NULL
>   Num procs: 4  Next node_rank: 4
>   Data for proc: [[57989,1],2]
>     Pid: 0  Local rank: 0 Node rank: 0
>     State: 0  App_context: 0  Slot list: NULL
>   Data for proc: [[57989,1],5]
>     Pid: 0  Local rank: 1 Node rank: 1
>     State: 0  App_context: 0  Slot list: NULL
>   Data for proc: [[57989,1],8]
>     Pid: 0  Local rank: 2 Node rank: 2
>     State: 0  App_context: 0  Slot list: NULL
>   Data for proc: [[57989,1],11]
>     Pid: 0  Local rank: 3 Node rank: 3
>     State: 0  App_context: 0  Slot list: NULL
>  
>  

