Could well be a bug in OMPI - I can take a look, though it may be a while before I get to it. Have you tried one of the 1.5 series releases?
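In the meantime, one quick cross-check you can run inside the SGE job is to compare what SGE actually granted with what mpirun thinks it has before any ranks launch. A rough sketch is below - the rank count and application name are just placeholders for your real command line:

    # What SGE granted to this job (columns: host, slots, queue, processor range)
    cat $PE_HOSTFILE

    # Ask mpirun to print its view of the allocation and the rank-to-node map
    # (add the two --display options to your existing command line)
    mpirun --display-allocation --display-map -np 12 ./your_app

If those two already disagree, that would help narrow down where the mismatch starts.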
On Apr 10, 2012, at 3:42 AM, Eloi Gaudry wrote:

> Thx. This is the allocation which is also confirmed by the Open MPI output.
> [eg: ] exactly, but not the one used afterwards by openmpi
>
> - The application was compiled with the same version of Open MPI?
> [eg: ] yes, version 1.4.4 for all
>
> - Does the application start something on its own besides the tasks granted by mpiexec/orterun?
> [eg: ] no
>
> You want 12 ranks in total, and to barney.fft and carl.fft there is also "-mca orte_ess_num_procs 3" given to the qrsh_starter. In total I count only 10 ranks in this example - 4+4+2 - do you observe the same?
> [eg: ] i don't know why the -mca orte_ess_num_procs 3 is added here...
> In the "Map generated by mapping policy" output in my last email, I see that 4 processes were started on each node (barney, carl and charlie), but yes, in the ps -elf output, two of them are missing for one node (barney)... sorry about that, a bad copy/paste. Here is the actual output for this node:
>  2048 ?  Sl   3:33 /opt/sge/bin/lx-amd64/sge_execd
> 27502 ?  Sl   0:00  \_ sge_shepherd-1416 -bg
> 27503 ?  Ss   0:00      \_ /opt/sge/utilbin/lx-amd64/qrsh_starter /opt/sge/default/spool/barney/active_jobs/1416.1/1.barney
> 27510 ?  S    0:00          \_ bash -c PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
> 27511 ?  S    0:00              \_ /opt/openmpi-1.4.4/bin/orted -mca ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca ras_gridengine_verbose 1
> 27512 ?  Rl  12:54                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 27513 ?  Rl  12:54                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 27514 ?  Rl  12:54                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
> 27515 ?  Rl  12:53                  \_ /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 --parallel=frequency --scratch=/scratch/cluster/1416 --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>
> It looks like Open MPI is doing the right thing, but the applications decided to start in a different allocation.
> [eg: ] if the "Map generated by mapping policy" is different from the sge allocation, then openmpi is not doing the right thing, don't you think ?
>
> Does the application use OpenMP in addition or other kinds of threads? The suffix "_mp" in the name "actranpy_mp" makes me suspicious about it.
> [eg: ] no, the suffix _mp stands for "parallel".