> - Can you please post while it's running the relevant lines from:
> ps -e f --cols=500
> (f w/o -) from both machines.
> It's allocated between the nodes more like in a round-robin fashion.
> [eg: ] I'll try to do this tomorrow, as soon as some slots become free.
> Thanks for your feedback Reuti, I appreciate.

Hi Reuti,

here is the information related to another run that is failing in the same way:

qstat -g t:
------------
---------------------------------------------------------------------------------
smp...@barney.fft              BIP   0/3/4          3.37     lx-amd64
        hc:mem_available=1.715G
        hc:proc_available=1
   1416 0.60500 semi_green jj           r     04/06/2012 11:57:34 SLAVE
                                                                  SLAVE
                                                                  SLAVE
---------------------------------------------------------------------------------
smp...@carl.fft                BIP   0/3/4          3.44     lx-amd64
        hc:mem_available=1.715G
        hc:proc_available=1
   1416 0.60500 semi_green jj           r     04/06/2012 11:57:34 SLAVE
                                                                  SLAVE
                                                                  SLAVE
---------------------------------------------------------------------------------
smp...@charlie.fft             BIP   0/6/8          3.46     lx-amd64
        hc:mem_available=4.018G
        hc:proc_available=2
   1416 0.60500 semi_green jj           r     04/06/2012 11:57:34 MASTER
                                                                  SLAVE
                                                                  SLAVE
                                                                  SLAVE
                                                                  SLAVE
                                                                  SLAVE
                                                                  SLAVE

barney: ps -e f --cols=500:
-----------------------------------
 2048 ?  Sl   3:33 /opt/sge/bin/lx-amd64/sge_execd
27502 ?  Sl   0:00  \_ sge_shepherd-1416 -bg
27503 ?  Ss   0:00      \_ /opt/sge/utilbin/lx-amd64/qrsh_starter /opt/sge/default/spool/barney/active_jobs/1416.1/1.barney
27510 ?  S    0:00          \_ bash -c PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
27511 ?  S    0:00              \_ /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
27512 ?  Rl  12:54                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
27513 ?  Rl  12:54                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat

carl: ps -e f --cols=500:
-------------------------------
 1928 ?  Sl   3:10 /opt/sge/bin/lx-amd64/sge_execd
29022 ?  Sl   0:00  \_ sge_shepherd-1416 -bg
29023 ?  Ss   0:00      \_ /opt/sge/utilbin/lx-amd64/qrsh_starter /opt/sge/default/spool/carl/active_jobs/1416.1/1.carl
29030 ?  S    0:00          \_ bash -c PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
29031 ?  S    0:00              \_ /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
29032 ?  Rl  13:49                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
29033 ?  Rl  13:50                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
29034 ?  Rl  13:49                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
29035 ?  Rl  13:49                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat

charlie: ps -e f --cols=500:
-----------------------------------
 1591 ?  Sl   3:13 /opt/sge/bin/lx-amd64/sge_execd
 8793 ?  S    0:00  \_ sge_shepherd-1416 -bg
 8795 ?  Ss   0:00      \_ -bash /opt/sge/default/spool/charlie/job_scripts/1416
 8800 ?  S    0:00          \_ /opt/openmpi-1.4.4/bin/orterun --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1 --bynode -report-bindings -display-map -display-devel-map -display-allocation -display-devel-allocation -np 12 -x ACTRAN_LICENSE -x ACTRAN_PRODUCTLINE -x LD_LIBRARY_PATH -x PATH -x ACTRAN_DEBUG /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parall
 8801 ?  Sl   0:00              \_ /opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V barney.fft PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose
 8802 ?  Sl   0:00              \_ /opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V carl.fft PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
 8807 ?  Rl  14:23              \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
 8808 ?  Rl  14:23              \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
 8809 ?  Rl  14:23              \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
 8810 ?  Rl  14:23              \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat

orterun information:
--------------------------
[charlie:08800] ras:gridengine: JOB_ID: 1416
[charlie:08800] ras:gridengine: PE_HOSTFILE: /opt/sge/default/spool/charlie/active_jobs/1416.1/pe_hostfile
[charlie:08800] ras:gridengine: charlie.fft: PE_HOSTFILE shows slots=6
[charlie:08800] ras:gridengine: barney.fft: PE_HOSTFILE shows slots=3
[charlie:08800] ras:gridengine: carl.fft: PE_HOSTFILE shows slots=3

======================   ALLOCATED NODES   ======================

 Data for node: Name: charlie    Launch id: -1    Arch: ffc91200    State: 2
    Num boards: 1    Num sockets/board: 2    Num cores/socket: 4
    Daemon: [[57989,0],0]    Daemon launched: True
    Num slots: 6    Slots in use: 0
    Num slots allocated: 6    Max slots: 0
    Username on node: NULL
    Num procs: 0    Next node_rank: 0
 Data for node: Name: barney.fft    Launch id: -1    Arch: 0    State: 2
    Num boards: 1    Num sockets/board: 2    Num cores/socket: 4
    Daemon: Not defined    Daemon launched: False
    Num slots: 3    Slots in use: 0
    Num slots allocated: 3    Max slots: 0
    Username on node: NULL
    Num procs: 0    Next node_rank: 0
 Data for node: Name: carl.fft    Launch id: -1    Arch: 0    State: 2
    Num boards: 1    Num sockets/board: 2    Num cores/socket: 4
    Daemon: Not defined    Daemon launched: False
    Num slots: 3    Slots in use: 0
    Num slots allocated: 3    Max slots: 0
    Username on node: NULL
    Num procs: 0    Next node_rank: 0

=================================================================

 Map generated by mapping policy: 0200
    Npernode: 0    Oversubscribe allowed: TRUE    CPU Lists: FALSE
    Num new daemons: 2    New daemon starting vpid 1
    Num nodes: 3

 Data for node: Name: charlie    Launch id: -1    Arch: ffc91200    State: 2
    Num boards: 1    Num sockets/board: 2    Num cores/socket: 4
    Daemon: [[57989,0],0]    Daemon launched: True
    Num slots: 6    Slots in use: 4
    Num slots allocated: 6    Max slots: 0
    Username on node: NULL
    Num procs: 4    Next node_rank: 4
    Data for proc: [[57989,1],0]
        Pid: 0    Local rank: 0    Node rank: 0
        State: 0    App_context: 0    Slot list: NULL
    Data for proc: [[57989,1],3]
        Pid: 0    Local rank: 1    Node rank: 1
        State: 0    App_context: 0    Slot list: NULL
    Data for proc: [[57989,1],6]
        Pid: 0    Local rank: 2    Node rank: 2
        State: 0    App_context: 0    Slot list: NULL
    Data for proc: [[57989,1],9]
        Pid: 0    Local rank: 3    Node rank: 3
        State: 0    App_context: 0    Slot list: NULL

 Data for node: Name: barney.fft    Launch id: -1    Arch: 0    State: 2
    Num boards: 1    Num sockets/board: 2    Num cores/socket: 4
    Daemon: [[57989,0],1]    Daemon launched: False
    Num slots: 3    Slots in use: 4
    Num slots allocated: 3    Max slots: 0
    Username on node: NULL
    Num procs: 4    Next node_rank: 4
    Data for proc: [[57989,1],1]
        Pid: 0    Local rank: 0    Node rank: 0
        State: 0    App_context: 0    Slot list: NULL
    Data for proc: [[57989,1],4]
        Pid: 0    Local rank: 1    Node rank: 1
        State: 0    App_context: 0    Slot list: NULL
    Data for proc: [[57989,1],7]
        Pid: 0    Local rank: 2    Node rank: 2
        State: 0    App_context: 0    Slot list: NULL
    Data for proc: [[57989,1],10]
        Pid: 0    Local rank: 3    Node rank: 3
        State: 0    App_context: 0    Slot list: NULL

 Data for node: Name: carl.fft    Launch id: -1    Arch: 0    State: 2
    Num boards: 1    Num sockets/board: 2    Num cores/socket: 4
    Daemon: [[57989,0],2]    Daemon launched: False
    Num slots: 3    Slots in use: 4
    Num slots allocated: 3    Max slots: 0
    Username on node: NULL
    Num procs: 4    Next node_rank: 4
    Data for proc: [[57989,1],2]
        Pid: 0    Local rank: 0    Node rank: 0
        State: 0    App_context: 0    Slot list: NULL
    Data for proc: [[57989,1],5]
        Pid: 0    Local rank: 1    Node rank: 1
        State: 0    App_context: 0    Slot list: NULL
    Data for proc: [[57989,1],8]
        Pid: 0    Local rank: 2    Node rank: 2
        State: 0    App_context: 0    Slot list: NULL
    Data for proc: [[57989,1],11]
        Pid: 0    Local rank: 3    Node rank: 3
        State: 0    App_context: 0    Slot list: NULL
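
To make the mismatch in the map easier to see: the rank placement shown above is what a plain round-robin over the three nodes would produce if the per-node slot counts were not consulted. The following is only a minimal sketch of that idea, not Open MPI's actual mapper. The function names `round_robin_by_node` and `oversubscribed` are made up for illustration; the slot counts are the ones ras:gridengine reports from the PE_HOSTFILE, and 12 is the `-np` value from the orterun command line.

```python
from collections import defaultdict

# Slot counts as reported by ras:gridengine from the PE_HOSTFILE,
# listed in the order the nodes appear in the allocation.
SLOTS = {"charlie": 6, "barney.fft": 3, "carl.fft": 3}


def round_robin_by_node(num_procs, nodes):
    """Place ranks on nodes one at a time, cycling through the node list.

    Hypothetical helper: it deliberately ignores slot counts, which is
    the assumption being illustrated here.
    """
    placement = defaultdict(list)
    for rank in range(num_procs):
        placement[nodes[rank % len(nodes)]].append(rank)
    return dict(placement)


def oversubscribed(placement, slots):
    """Return {node: excess} for nodes holding more ranks than slots."""
    return {
        node: len(ranks) - slots[node]
        for node, ranks in placement.items()
        if len(ranks) > slots[node]
    }


if __name__ == "__main__":
    placement = round_robin_by_node(12, list(SLOTS))
    for node, ranks in placement.items():
        print(f"{node}: ranks {ranks} ({len(ranks)} procs, {SLOTS[node]} slots)")
    print("over the granted slots:", oversubscribed(placement, SLOTS))
```

With these inputs each node gets 4 ranks (0,3,6,9 on charlie; 1,4,7,10 on barney.fft; 2,5,8,11 on carl.fft), so barney and carl each end up one process over their 3 granted slots while charlie uses only 4 of its 6. That is the same picture as "Slots in use: 4 / Num slots allocated: 3" in the map.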