On 04.09.2012 at 01:38, Ralph Castain wrote:

>>> <snip>
>>> Well that seems strange! Can you run 1.6.1 with the following on the mpirun 
>>> cmd line:
>>> 
>>> -mca ras_gridengine_debug 1 -mca ras_gridengine_verbose 10 -mca 
>>> ras_base_verbose 10
> 
> I'll take a look at this and see what's going on - have to get back to you on 
> it.

In "ras_gridengine_module.c" I added the following between the `found = true` and `break` statements:

                 opal_output(mca_ras_gridengine_component.verbose,
                        "ras:gridengine: %s: PE_HOSTFILE increased to slots=%d",
                        node->name, node->slots);

Then I get:

[pc15370:13630] ras:gridengine: JOB_ID: 4644
[pc15370:13630] ras:gridengine: PE_HOSTFILE: 
/var/spool/sge/pc15370/active_jobs/4644.1/pe_hostfile
[pc15370:13630] ras:gridengine: pc15370: PE_HOSTFILE shows slots=1
[pc15370:13630] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2
[pc15370:13630] ras:gridengine: pc15370: PE_HOSTFILE increased to slots=2

And the allocation is correct. I'll continue to investigate what is different 
today.
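
For anyone following along, the merging step that the extra debug line reports can be sketched roughly as below. This is a stand-alone illustration under my own assumptions, not the Open MPI source: `node_entry`, `add_hostfile_line`, `total_slots` and the fixed-size table are hypothetical names and simplifications, and only the first two fields of each pe_hostfile line (host and slot count) are looked at.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_NODES 64
#define MAX_NAME  64

/* Hypothetical stand-in for a node entry; NOT Open MPI's own node type. */
struct node_entry {
    char name[MAX_NAME];
    int  slots;
};

/*
 * Merge one pe_hostfile line ("host slots queue processor-range") into the
 * node table. Returns the new number of entries, or -1 if the line is
 * malformed or the table is full.
 */
int add_hostfile_line(struct node_entry *nodes, int count, const char *line)
{
    char host[MAX_NAME];
    int slots;

    assert(nodes != NULL && line != NULL);

    /* Only the host name and the slot count are needed here. */
    if (sscanf(line, "%63s %d", host, &slots) != 2 || slots < 0) {
        return -1;
    }

    /* Host already seen (e.g. via another queue): add to its slot count. */
    for (int i = 0; i < count; i++) {
        if (strcmp(nodes[i].name, host) == 0) {
            nodes[i].slots += slots;
            return count;
        }
    }

    /* First appearance of this host: create a new entry. */
    if (count >= MAX_NODES) {
        return -1;
    }
    strcpy(nodes[count].name, host);  /* fits: host holds at most 63 chars */
    nodes[count].slots = slots;
    return count + 1;
}

/* Sum of the slots over all entries in the table. */
int total_slots(const struct node_entry *nodes, int count)
{
    int total = 0;

    for (int i = 0; i < count; i++) {
        total += nodes[i].slots;
    }
    return total;
}
```

Feeding it the three lines of the hostfile quoted further down (pc15381 1, pc15370 2, pc15381 1) gives two entries with 2 slots each, 4 in total, which is what `qsub -pe orted 4` asked for.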

-- Reuti



> Thx!
> 
>> 
>> [pc15381:06250] mca: base: components_open: Looking for ras components
>> [pc15381:06250] mca: base: components_open: opening ras components
>> [pc15381:06250] mca: base: components_open: found loaded component cm
>> [pc15381:06250] mca: base: components_open: component cm has no register 
>> function
>> [pc15381:06250] mca: base: components_open: component cm open function 
>> successful
>> [pc15381:06250] mca: base: components_open: found loaded component gridengine
>> [pc15381:06250] mca: base: components_open: component gridengine has no 
>> register function
>> [pc15381:06250] mca: base: components_open: component gridengine open 
>> function successful
>> [pc15381:06250] mca: base: components_open: found loaded component 
>> loadleveler
>> [pc15381:06250] mca: base: components_open: component loadleveler has no 
>> register function
>> [pc15381:06250] mca: base: components_open: component loadleveler open 
>> function successful
>> [pc15381:06250] mca: base: components_open: found loaded component slurm
>> [pc15381:06250] mca: base: components_open: component slurm has no register 
>> function
>> [pc15381:06250] mca: base: components_open: component slurm open function 
>> successful
>> [pc15381:06250] mca:base:select: Auto-selecting ras components
>> [pc15381:06250] mca:base:select:(  ras) Querying component [cm]
>> [pc15381:06250] mca:base:select:(  ras) Skipping component [cm]. Query 
>> failed to return a module
>> [pc15381:06250] mca:base:select:(  ras) Querying component [gridengine]
>> [pc15381:06250] mca:base:select:(  ras) Query of component [gridengine] set 
>> priority to 100
>> [pc15381:06250] mca:base:select:(  ras) Querying component [loadleveler]
>> [pc15381:06250] mca:base:select:(  ras) Skipping component [loadleveler]. 
>> Query failed to return a module
>> [pc15381:06250] mca:base:select:(  ras) Querying component [slurm]
>> [pc15381:06250] mca:base:select:(  ras) Skipping component [slurm]. Query 
>> failed to return a module
>> [pc15381:06250] mca:base:select:(  ras) Selected component [gridengine]
>> [pc15381:06250] mca: base: close: unloading component cm
>> [pc15381:06250] mca: base: close: unloading component loadleveler
>> [pc15381:06250] mca: base: close: unloading component slurm
>> [pc15381:06250] ras:gridengine: JOB_ID: 4636
>> [pc15381:06250] ras:gridengine: PE_HOSTFILE: 
>> /var/spool/sge/pc15381/active_jobs/4636.1/pe_hostfile
>> [pc15381:06250] ras:gridengine: pc15381: PE_HOSTFILE shows slots=1
>> [pc15381:06250] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
>> --------------------------------------------------------------------------
>> There are no nodes allocated to this job.
>> --------------------------------------------------------------------------
>> [pc15381:06250] mca: base: close: component gridengine closed
>> [pc15381:06250] mca: base: close: unloading component gridengine
>> 
>> The actual hostfile contains:
>> 
>> pc15381 1 all.q@pc15381 UNDEFINED
>> pc15370 2 extra.q@pc15370 UNDEFINED
>> pc15381 1 extra.q@pc15381 UNDEFINED
>> 
>> and it was submitted with `qsub -pe orted 4 ...`.
>> 
>> 
>> Aha, I remember an issue on the list: if a job got slots from several 
>> queues, they weren't added up. That was the issue in 1.4.5, ok. Wasn't it 
>> fixed later on? But here it's getting no allocation at all.
>> 
>> ==
>> 
>> If I force it to get slots only from one queue:
>> 
>> [pc15370:30447] mca: base: components_open: Looking for ras components
>> [pc15370:30447] mca: base: components_open: opening ras components
>> [pc15370:30447] mca: base: components_open: found loaded component cm
>> [pc15370:30447] mca: base: components_open: component cm has no register 
>> function
>> [pc15370:30447] mca: base: components_open: component cm open function 
>> successful
>> [pc15370:30447] mca: base: components_open: found loaded component gridengine
>> [pc15370:30447] mca: base: components_open: component gridengine has no 
>> register function
>> [pc15370:30447] mca: base: components_open: component gridengine open 
>> function successful
>> [pc15370:30447] mca: base: components_open: found loaded component 
>> loadleveler
>> [pc15370:30447] mca: base: components_open: component loadleveler has no 
>> register function
>> [pc15370:30447] mca: base: components_open: component loadleveler open 
>> function successful
>> [pc15370:30447] mca: base: components_open: found loaded component slurm
>> [pc15370:30447] mca: base: components_open: component slurm has no register 
>> function
>> [pc15370:30447] mca: base: components_open: component slurm open function 
>> successful
>> [pc15370:30447] mca:base:select: Auto-selecting ras components
>> [pc15370:30447] mca:base:select:(  ras) Querying component [cm]
>> [pc15370:30447] mca:base:select:(  ras) Skipping component [cm]. Query 
>> failed to return a module
>> [pc15370:30447] mca:base:select:(  ras) Querying component [gridengine]
>> [pc15370:30447] mca:base:select:(  ras) Query of component [gridengine] set 
>> priority to 100
>> [pc15370:30447] mca:base:select:(  ras) Querying component [loadleveler]
>> [pc15370:30447] mca:base:select:(  ras) Skipping component [loadleveler]. 
>> Query failed to return a module
>> [pc15370:30447] mca:base:select:(  ras) Querying component [slurm]
>> [pc15370:30447] mca:base:select:(  ras) Skipping component [slurm]. Query 
>> failed to return a module
>> [pc15370:30447] mca:base:select:(  ras) Selected component [gridengine]
>> [pc15370:30447] mca: base: close: unloading component cm
>> [pc15370:30447] mca: base: close: unloading component loadleveler
>> [pc15370:30447] mca: base: close: unloading component slurm
>> [pc15370:30447] ras:gridengine: JOB_ID: 4638
>> [pc15370:30447] ras:gridengine: PE_HOSTFILE: 
>> /var/spool/sge/pc15370/active_jobs/4638.1/pe_hostfile
>> [pc15370:30447] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
>> [pc15370:30447] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2
>> 
>> But: it starts only 2 processes instead of 4:
>> 
>> Total: 2
>> Universe: 4
>> Hello World from Rank 0.
>> Hello World from Rank 1.
>> 
>> Yes, I can use `mpiexec -np $NSLOTS ..` to get 4, but then all of them run 
>> on pc15370; pc15381 is ignored completely.
>> 
>> ==
>> 
>> If I go back to 1.4.1:
>> 
>> [pc15370:31052] mca: base: components_open: Looking for ras components
>> [pc15370:31052] mca: base: components_open: opening ras components
>> [pc15370:31052] mca: base: components_open: found loaded component gridengine
>> [pc15370:31052] mca: base: components_open: component gridengine has no 
>> register function
>> [pc15370:31052] mca: base: components_open: component gridengine open 
>> function successful
>> [pc15370:31052] mca: base: components_open: found loaded component slurm
>> [pc15370:31052] mca: base: components_open: component slurm has no register 
>> function
>> [pc15370:31052] mca: base: components_open: component slurm open function 
>> successful
>> [pc15370:31052] mca:base:select: Auto-selecting ras components
>> [pc15370:31052] mca:base:select:(  ras) Querying component [gridengine]
>> [pc15370:31052] mca:base:select:(  ras) Query of component [gridengine] set 
>> priority to 100
>> [pc15370:31052] mca:base:select:(  ras) Querying component [slurm]
>> [pc15370:31052] mca:base:select:(  ras) Skipping component [slurm]. Query 
>> failed to return a module
>> [pc15370:31052] mca:base:select:(  ras) Selected component [gridengine]
>> [pc15370:31052] mca: base: close: unloading component slurm
>> [pc15370:31052] ras:gridengine: JOB_ID: 4640
>> [pc15370:31052] ras:gridengine: PE_HOSTFILE: 
>> /var/spool/sge/pc15370/active_jobs/4640.1/pe_hostfile
>> [pc15370:31052] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
>> [pc15370:31052] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2
>> 
>> Total: 4
>> Universe: 4
>> Hello World from Rank 0.
>> Hello World from Rank 1.
>> Hello World from Rank 2.
>> Hello World from Rank 3.
>> 
>> And no "-np $NSLOTS" in the command, just a plain `mpiexec ./mpihello`.
>> 
>> -- Reuti
>> 
>> 
>>> My guess is that something in the pe_hostfile syntax may have changed and 
>>> we didn't pick up on it.
>>> 
>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> ==
>>>>>> 
>>>>>> I configured with:
>>>>>> 
>>>>>> ./configure --prefix=$HOME/local/... --enable-static --disable-shared 
>>>>>> --with-sge
>>>>>> 
>>>>>> and adjusted my PATHs accordingly (at least: I hope so).
>>>>>> 
>>>>>> -- Reuti
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

