On Apr 10, 2012, at 8:55 AM, Eloi Gaudry wrote:

> Hi Ralf,
> 
> I haven't tried any of the 1.5 series yet (we have chosen not to use the 
> feature releases), but if it is mandatory for you to work on this topic, I 
> will.

Not mandatory, no - however, the 1.4 series has been closed out, so any fix 
will go into 1.6 (the 1.5 series is about to go "stable").

> 
> This might be of interest to Reuti and you: it seems that we cannot 
> reproduce the problem anymore if we don't provide the "-np N" option on the 
> orterun command line. Of course, we need to launch a few more runs to be 
> really sure, because the allocation error was not always observable. Actually, 
> I recently understood (from Reuti) that the tight integration mode supplies 
> all the necessary bits to the launcher, so I removed the '-np N' 
> that was there... Could it be that using '-np N' together with SGE 
> tight integration mode is pathological?

No, it should work just fine. Sounds like something weird is going on.
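For reference, a minimal tight-integration launch might look like the sketch below. The PE name, job name, and application path are assumptions for illustration, not taken from this thread; the point is simply that under a tightly integrated parallel environment, orterun discovers the granted slot count and host list from SGE, so "-np N" is redundant (though it should be harmless).

```shell
#!/bin/sh
# Hypothetical SGE submission script (a sketch, not the poster's actual job).
#$ -N actran_job
#$ -pe round_robin 12    # PE name is an assumption; use your site's tight-integration PE
#$ -cwd

# Open MPI 1.4.x discovers the SGE allocation via its gridengine RAS module,
# so no "-np" is needed; passing an explicit "-np 12" should also work.
/opt/openmpi-1.4.4/bin/orterun ./my_mpi_app
```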

> 
> Regards,
> Eloi
> 
> 
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Ralph Castain
> Sent: Tuesday, April 10, 2012 4:43 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] sge tight integration leads to bad allocation
> 
> Could well be a bug in OMPI - I can take a look, though it may be awhile 
> before I get to it. Have you tried one of the 1.5 series releases?
> 
> On Apr 10, 2012, at 3:42 AM, Eloi Gaudry wrote:
> 
>> Thx. This is the allocation which is also confirmed by the Open MPI output.
>> [eg: ] exactly, but not the one used afterwards by Open MPI
>> 
>> - The application was compiled with the same version of Open MPI?
>> [eg: ] yes, version 1.4.4 for all
>> 
>> - Does the application start something on its own besides the tasks granted 
>> by mpiexec/orterun?
>> [eg: ] no
>> 
>> You want 12 ranks in total, and for barney.fft and carl.fft there is also 
>> "-mca orte_ess_num_procs 3" passed to qrsh_starter. In this example I count 
>> only 10 ranks in total (4+4+2); do you observe the same?
>> [eg: ] I don't know why the -mca orte_ess_num_procs 3 is added here...
>> In the "Map generated by mapping policy" output in my last email, I see that 
>> 4 processes were started on each node (barney, carl and charlie), but yes, 
>> in the ps -elf output, two of them are missing for one node (barney)... 
>> sorry about that, a bad copy/paste. Here is the actual output for this node:
>> 2048 ?        Sl     3:33 /opt/sge/bin/lx-amd64/sge_execd
>> 27502 ?        Sl     0:00  \_ sge_shepherd-1416 -bg
>> 27503 ?        Ss     0:00      \_ /opt/sge/utilbin/lx-amd64/qrsh_starter 
>> /opt/sge/default/spool/barney/active_jobs/1416.1/1.barney
>> 27510 ?        S      0:00          \_ bash -c  
>> PATH=/opt/openmpi-1.4.4/bin:$PATH ; export PATH ; 
>> LD_LIBRARY_PATH=/opt/openmpi-1.4.4/lib:$LD_LIBRARY_PATH ; export 
>> LD_LIBRARY_PATH ;  /opt/openmpi-1.4.4/bin/orted -mca ess env -mca 
>> orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 
>> --hnp-uri "3800367104.0;tcp://192.168.0.20:57233" --mca 
>> pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca 
>> ras_gridengine_verbose 1
>> 27511 ?        S      0:00              \_ /opt/openmpi-1.4.4/bin/orted -mca 
>> ess env -mca orte_ess_jobid 3800367104 -mca orte_ess_vpid 1 -mca 
>> orte_ess_num_procs 3 --hnp-uri 3800367104.0;tcp://192.168.0.20:57233 --mca 
>> pls_gridengine_verbose 1 --mca ras_gridengine_show_jobid 1 --mca 
>> ras_gridengine_verbose 1
>> 27512 ?        Rl    12:54                  \_ 
>> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
>> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
>> --parallel=frequency --scratch=/scratch/cluster/1416 
>> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>> 27513 ?        Rl    12:54                  \_ 
>> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
>> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
>> --parallel=frequency --scratch=/scratch/cluster/1416 
>> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>> 27514 ?        Rl    12:54                  \_ 
>> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
>> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
>> --parallel=frequency --scratch=/scratch/cluster/1416 
>> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>> 27515 ?        Rl    12:53                  \_ 
>> /opt/fft/actran_product/Actran_13.0.b.57333/bin/actranpy_mp 
>> --apl=/opt/fft/actran_product/Actran_13.0.b.57333 -e radiation -m 10000 
>> --parallel=frequency --scratch=/scratch/cluster/1416 
>> --inputfile=/home/jj/Projects/Toyota/REFERENCE_JPC/semi_green_PML_06/semi_green_coarse.edat
>> 
>> It looks like Open MPI is doing the right thing, but the application 
>> processes started under a different allocation.
>> [eg: ] if the "Map generated by mapping policy" is different from the SGE 
>> allocation, then Open MPI is not doing the right thing, don't you think?
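One way to check this point directly: under SGE tight integration, the granted allocation is written to the file named by $PE_HOSTFILE, which Open MPI's gridengine RAS module reads. Summarizing that file gives the allocation that the "Map generated by mapping policy" output should match. The sample hostnames and slot counts below are fabricated to mirror the 3-node, 4-slots-each layout discussed in this thread; inside a real job you would point at "$PE_HOSTFILE" instead.

```shell
# Each $PE_HOSTFILE line has the form: <host> <slots> <queue> <processor-range>.
# Fabricate a sample matching the allocation discussed in this thread:
cat > /tmp/sample_pe_hostfile <<'EOF'
charlie.fft 4 all.q@charlie.fft UNDEFINED
carl.fft 4 all.q@carl.fft UNDEFINED
barney.fft 4 all.q@barney.fft UNDEFINED
EOF

# Print slots per host and the total; compare this against Open MPI's
# "Map generated by mapping policy" output to spot a mismatch.
awk '{ total += $2; printf "%s: %s slots\n", $1, $2 }
     END { printf "total: %d slots\n", total }' /tmp/sample_pe_hostfile
```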
>> 
>> Does the application use OpenMP in addition or other kinds of threads? The 
>> suffix "_mp" in the name "actranpy_mp" makes me suspicious about it.
>> [eg: ] no, the suffix _mp stands for "parallel".
>> 
>> 
>> 
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users

