On Sep 3, 2012, at 3:50 PM, Reuti <re...@staff.uni-marburg.de> wrote:

> Am 04.09.2012 um 00:07 schrieb Ralph Castain:
> 
>> I'm leaning towards fixing it - it came about due to discussions on how to 
>> handle hostfiles when there is an allocation. For now, though, that should work.
> 
> Oh, did I miss this on the list? If there is a hostfile given as an argument, 
> it should override the default hostfile IMO. 

This was several years ago now - first showed up in the 1.5 series. Unless 
someone objects, I'll change it.

> 
> 
>>> 
>>> 
>>>>> ==
>>>>> 
>>>>> SGE issue
>>>>> 
>>>>> I usually don't install new versions instantly, so I only noticed just now 
>>>>> that in 1.4.5 I get a wrong allocation inside SGE (always one process less 
>>>>> than requested with `qsub -pe orted N ...`). I only tried that version 
>>>>> because with 1.6.1 I get:
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> There are no nodes allocated to this job.
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> all the time.
>>>> 
>>>> Weird - I'm not sure I understand what you are saying. Is this happening 
>>>> with 1.6.1 as well? Or just with 1.4.5?
>>> 
>>> 1.6.1 = no nodes allocated
>>> 1.4.5 = one process less than requested
>>> 1.4.1 = works as it should
>>> 
>> 
>> Well that seems strange! Can you run 1.6.1 with the following on the mpirun 
>> cmd line:
>> 
>> -mca ras_gridengine_debug 1 -mca ras_gridengine_verbose 10 -mca 
>> ras_base_verbose 10

I'll take a look at this and see what's going on - have to get back to you on 
it.

Thx!

> 
> [pc15381:06250] mca: base: components_open: Looking for ras components
> [pc15381:06250] mca: base: components_open: opening ras components
> [pc15381:06250] mca: base: components_open: found loaded component cm
> [pc15381:06250] mca: base: components_open: component cm has no register 
> function
> [pc15381:06250] mca: base: components_open: component cm open function 
> successful
> [pc15381:06250] mca: base: components_open: found loaded component gridengine
> [pc15381:06250] mca: base: components_open: component gridengine has no 
> register function
> [pc15381:06250] mca: base: components_open: component gridengine open 
> function successful
> [pc15381:06250] mca: base: components_open: found loaded component loadleveler
> [pc15381:06250] mca: base: components_open: component loadleveler has no 
> register function
> [pc15381:06250] mca: base: components_open: component loadleveler open 
> function successful
> [pc15381:06250] mca: base: components_open: found loaded component slurm
> [pc15381:06250] mca: base: components_open: component slurm has no register 
> function
> [pc15381:06250] mca: base: components_open: component slurm open function 
> successful
> [pc15381:06250] mca:base:select: Auto-selecting ras components
> [pc15381:06250] mca:base:select:(  ras) Querying component [cm]
> [pc15381:06250] mca:base:select:(  ras) Skipping component [cm]. Query failed 
> to return a module
> [pc15381:06250] mca:base:select:(  ras) Querying component [gridengine]
> [pc15381:06250] mca:base:select:(  ras) Query of component [gridengine] set 
> priority to 100
> [pc15381:06250] mca:base:select:(  ras) Querying component [loadleveler]
> [pc15381:06250] mca:base:select:(  ras) Skipping component [loadleveler]. 
> Query failed to return a module
> [pc15381:06250] mca:base:select:(  ras) Querying component [slurm]
> [pc15381:06250] mca:base:select:(  ras) Skipping component [slurm]. Query 
> failed to return a module
> [pc15381:06250] mca:base:select:(  ras) Selected component [gridengine]
> [pc15381:06250] mca: base: close: unloading component cm
> [pc15381:06250] mca: base: close: unloading component loadleveler
> [pc15381:06250] mca: base: close: unloading component slurm
> [pc15381:06250] ras:gridengine: JOB_ID: 4636
> [pc15381:06250] ras:gridengine: PE_HOSTFILE: 
> /var/spool/sge/pc15381/active_jobs/4636.1/pe_hostfile
> [pc15381:06250] ras:gridengine: pc15381: PE_HOSTFILE shows slots=1
> [pc15381:06250] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
> --------------------------------------------------------------------------
> There are no nodes allocated to this job.
> --------------------------------------------------------------------------
> [pc15381:06250] mca: base: close: component gridengine closed
> [pc15381:06250] mca: base: close: unloading component gridengine
> 
> The actual hostfile contains:
> 
> pc15381 1 all.q@pc15381 UNDEFINED
> pc15370 2 extra.q@pc15370 UNDEFINED
> pc15381 1 extra.q@pc15381 UNDEFINED
> 
> and it was submitted with `qsub -pe orted 4 ...`.
> 
> 
> Aha, I remember an issue on the list: if a job gets slots from several queues, 
> they weren't added up. So that was the issue in 1.4.5, ok. Wasn't it fixed 
> later on? But here it's getting no allocation at all.
> 
> ==
> 
> If I force it to get jobs only from one queue:
> 
> [pc15370:30447] mca: base: components_open: Looking for ras components
> [pc15370:30447] mca: base: components_open: opening ras components
> [pc15370:30447] mca: base: components_open: found loaded component cm
> [pc15370:30447] mca: base: components_open: component cm has no register 
> function
> [pc15370:30447] mca: base: components_open: component cm open function 
> successful
> [pc15370:30447] mca: base: components_open: found loaded component gridengine
> [pc15370:30447] mca: base: components_open: component gridengine has no 
> register function
> [pc15370:30447] mca: base: components_open: component gridengine open 
> function successful
> [pc15370:30447] mca: base: components_open: found loaded component loadleveler
> [pc15370:30447] mca: base: components_open: component loadleveler has no 
> register function
> [pc15370:30447] mca: base: components_open: component loadleveler open 
> function successful
> [pc15370:30447] mca: base: components_open: found loaded component slurm
> [pc15370:30447] mca: base: components_open: component slurm has no register 
> function
> [pc15370:30447] mca: base: components_open: component slurm open function 
> successful
> [pc15370:30447] mca:base:select: Auto-selecting ras components
> [pc15370:30447] mca:base:select:(  ras) Querying component [cm]
> [pc15370:30447] mca:base:select:(  ras) Skipping component [cm]. Query failed 
> to return a module
> [pc15370:30447] mca:base:select:(  ras) Querying component [gridengine]
> [pc15370:30447] mca:base:select:(  ras) Query of component [gridengine] set 
> priority to 100
> [pc15370:30447] mca:base:select:(  ras) Querying component [loadleveler]
> [pc15370:30447] mca:base:select:(  ras) Skipping component [loadleveler]. 
> Query failed to return a module
> [pc15370:30447] mca:base:select:(  ras) Querying component [slurm]
> [pc15370:30447] mca:base:select:(  ras) Skipping component [slurm]. Query 
> failed to return a module
> [pc15370:30447] mca:base:select:(  ras) Selected component [gridengine]
> [pc15370:30447] mca: base: close: unloading component cm
> [pc15370:30447] mca: base: close: unloading component loadleveler
> [pc15370:30447] mca: base: close: unloading component slurm
> [pc15370:30447] ras:gridengine: JOB_ID: 4638
> [pc15370:30447] ras:gridengine: PE_HOSTFILE: 
> /var/spool/sge/pc15370/active_jobs/4638.1/pe_hostfile
> [pc15370:30447] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
> [pc15370:30447] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2
> 
> But: it starts only 2 processes instead of 4:
> 
> Total: 2
> Universe: 4
> Hello World from Rank 0.
> Hello World from Rank 1.
> 
> Yes, I can add `mpiexec -np $NSLOTS ...` to get 4, but then all of them run on 
> pc15370; pc15381 is ignored completely.
> 
> ==
> 
> If I go back to 1.4.1:
> 
> [pc15370:31052] mca: base: components_open: Looking for ras components
> [pc15370:31052] mca: base: components_open: opening ras components
> [pc15370:31052] mca: base: components_open: found loaded component gridengine
> [pc15370:31052] mca: base: components_open: component gridengine has no 
> register function
> [pc15370:31052] mca: base: components_open: component gridengine open 
> function successful
> [pc15370:31052] mca: base: components_open: found loaded component slurm
> [pc15370:31052] mca: base: components_open: component slurm has no register 
> function
> [pc15370:31052] mca: base: components_open: component slurm open function 
> successful
> [pc15370:31052] mca:base:select: Auto-selecting ras components
> [pc15370:31052] mca:base:select:(  ras) Querying component [gridengine]
> [pc15370:31052] mca:base:select:(  ras) Query of component [gridengine] set 
> priority to 100
> [pc15370:31052] mca:base:select:(  ras) Querying component [slurm]
> [pc15370:31052] mca:base:select:(  ras) Skipping component [slurm]. Query 
> failed to return a module
> [pc15370:31052] mca:base:select:(  ras) Selected component [gridengine]
> [pc15370:31052] mca: base: close: unloading component slurm
> [pc15370:31052] ras:gridengine: JOB_ID: 4640
> [pc15370:31052] ras:gridengine: PE_HOSTFILE: 
> /var/spool/sge/pc15370/active_jobs/4640.1/pe_hostfile
> [pc15370:31052] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
> [pc15370:31052] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2
> 
> Total: 4
> Universe: 4
> Hello World from Rank 0.
> Hello World from Rank 1.
> Hello World from Rank 2.
> Hello World from Rank 3.
> 
> And there's no `-np $NSLOTS` in the command, just a plain `mpiexec ./mpihello`.
> 
> -- Reuti
> 
> 
>> My guess is that something in the pe_hostfile syntax may have changed and we 
>> didn't pick up on it.
>> 
>> 
>>> -- Reuti
>>> 
>>> 
>>>> 
>>>>> 
>>>>> ==
>>>>> 
>>>>> I configured with:
>>>>> 
>>>>> ./configure --prefix=$HOME/local/... --enable-static --disable-shared 
>>>>> --with-sge
>>>>> 
>>>>> and adjusted my PATHs accordingly (at least, I hope so).
>>>>> 
>>>>> -- Reuti
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

