On 04.09.2012 at 00:07, Ralph Castain wrote:

> I'm leaning towards fixing it - it came due to discussions on how to handle
> hostfile when there is an allocation. For now, though, that should work.
Oh, did I miss this on the list? If there is a hostfile given as argument, it should override the default hostfile IMO.

>>
>>
>>>> ==
>>>>
>>>> SGE issue
>>>>
>>>> I usually don't install new versions instantly, so I only noticed right
>>>> now, that in 1.4.5 I get a wrong allocation inside SGE (always one process
>>>> less than requested with `qsub -pe orted N ...`. This I tried only, as
>>>> with 1.6.1 I get:
>>>>
>>>> --------------------------------------------------------------------------
>>>> There are no nodes allocated to this job.
>>>> --------------------------------------------------------------------------
>>>>
>>>> all the time.
>>>
>>> Weird - I'm not sure I understand what you are saying. Is this happening
>>> with 1.6.1 as well? Or just with 1.4.5?
>>
>> 1.6.1 = no nodes allocated
>> 1.4.5 = one process less than requested
>> 1.4.1 = works as it should
>>
>
> Well that seems strange! Can you run 1.6.1 with the following on the mpirun
> cmd line:
>
> -mca ras_gridengine_debug 1 -mca ras_gridengine_verbose 10 -mca
> ras_base_verbose 10

[pc15381:06250] mca: base: components_open: Looking for ras components
[pc15381:06250] mca: base: components_open: opening ras components
[pc15381:06250] mca: base: components_open: found loaded component cm
[pc15381:06250] mca: base: components_open: component cm has no register function
[pc15381:06250] mca: base: components_open: component cm open function successful
[pc15381:06250] mca: base: components_open: found loaded component gridengine
[pc15381:06250] mca: base: components_open: component gridengine has no register function
[pc15381:06250] mca: base: components_open: component gridengine open function successful
[pc15381:06250] mca: base: components_open: found loaded component loadleveler
[pc15381:06250] mca: base: components_open: component loadleveler has no register function
[pc15381:06250] mca: base: components_open: component loadleveler open function successful
[pc15381:06250] mca: base: components_open: found loaded component slurm
[pc15381:06250] mca: base: components_open: component slurm has no register function
[pc15381:06250] mca: base: components_open: component slurm open function successful
[pc15381:06250] mca:base:select: Auto-selecting ras components
[pc15381:06250] mca:base:select:( ras) Querying component [cm]
[pc15381:06250] mca:base:select:( ras) Skipping component [cm]. Query failed to return a module
[pc15381:06250] mca:base:select:( ras) Querying component [gridengine]
[pc15381:06250] mca:base:select:( ras) Query of component [gridengine] set priority to 100
[pc15381:06250] mca:base:select:( ras) Querying component [loadleveler]
[pc15381:06250] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
[pc15381:06250] mca:base:select:( ras) Querying component [slurm]
[pc15381:06250] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
[pc15381:06250] mca:base:select:( ras) Selected component [gridengine]
[pc15381:06250] mca: base: close: unloading component cm
[pc15381:06250] mca: base: close: unloading component loadleveler
[pc15381:06250] mca: base: close: unloading component slurm
[pc15381:06250] ras:gridengine: JOB_ID: 4636
[pc15381:06250] ras:gridengine: PE_HOSTFILE: /var/spool/sge/pc15381/active_jobs/4636.1/pe_hostfile
[pc15381:06250] ras:gridengine: pc15381: PE_HOSTFILE shows slots=1
[pc15381:06250] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
--------------------------------------------------------------------------
There are no nodes allocated to this job.
--------------------------------------------------------------------------
[pc15381:06250] mca: base: close: component gridengine closed
[pc15381:06250] mca: base: close: unloading component gridengine

The actual hostfile contains:

pc15381 1 all.q@pc15381 UNDEFINED
pc15370 2 extra.q@pc15370 UNDEFINED
pc15381 1 extra.q@pc15381 UNDEFINED

and it was submitted with `qsub -pe orted 4 ...`.
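The pe_hostfile above lists pc15381 twice, once per queue, and the per-host slot counts only reach the requested 4 if both entries are added together. As a rough illustration of that aggregation (a minimal Python sketch, not Open MPI's actual gridengine code; the function name `parse_pe_hostfile` is made up for this example, and it assumes the usual four-column layout of host, slots, queue, processor range):

```python
def parse_pe_hostfile(text):
    """Sum the slot counts per host over all queue entries.

    Hypothetical helper for illustration only; it is not part of
    Open MPI or SGE.
    """
    slots = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, count = fields[0], int(fields[1])
        # A host may appear once per queue, so accumulate instead of overwrite.
        slots[host] = slots.get(host, 0) + count
    return slots


# Contents of the pe_hostfile quoted in this mail.
PE_HOSTFILE = """\
pc15381 1 all.q@pc15381 UNDEFINED
pc15370 2 extra.q@pc15370 UNDEFINED
pc15381 1 extra.q@pc15381 UNDEFINED
"""

allocation = parse_pe_hostfile(PE_HOSTFILE)
for host, n in allocation.items():
    print(host, n)
print("total", sum(allocation.values()))
```

With the three entries summed per host, both machines end up with 2 slots each, which matches the 4 slots requested via `qsub -pe orted 4 ...`.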
Aha, I remember an issue on the list: if a job gets slots from several queues, they weren't added up. That was the issue in 1.4.5, OK. Wasn't it fixed later on? But here it's getting no allocation at all.

==

If I force it to get jobs only from one queue:

[pc15370:30447] mca: base: components_open: Looking for ras components
[pc15370:30447] mca: base: components_open: opening ras components
[pc15370:30447] mca: base: components_open: found loaded component cm
[pc15370:30447] mca: base: components_open: component cm has no register function
[pc15370:30447] mca: base: components_open: component cm open function successful
[pc15370:30447] mca: base: components_open: found loaded component gridengine
[pc15370:30447] mca: base: components_open: component gridengine has no register function
[pc15370:30447] mca: base: components_open: component gridengine open function successful
[pc15370:30447] mca: base: components_open: found loaded component loadleveler
[pc15370:30447] mca: base: components_open: component loadleveler has no register function
[pc15370:30447] mca: base: components_open: component loadleveler open function successful
[pc15370:30447] mca: base: components_open: found loaded component slurm
[pc15370:30447] mca: base: components_open: component slurm has no register function
[pc15370:30447] mca: base: components_open: component slurm open function successful
[pc15370:30447] mca:base:select: Auto-selecting ras components
[pc15370:30447] mca:base:select:( ras) Querying component [cm]
[pc15370:30447] mca:base:select:( ras) Skipping component [cm]. Query failed to return a module
[pc15370:30447] mca:base:select:( ras) Querying component [gridengine]
[pc15370:30447] mca:base:select:( ras) Query of component [gridengine] set priority to 100
[pc15370:30447] mca:base:select:( ras) Querying component [loadleveler]
[pc15370:30447] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
[pc15370:30447] mca:base:select:( ras) Querying component [slurm]
[pc15370:30447] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
[pc15370:30447] mca:base:select:( ras) Selected component [gridengine]
[pc15370:30447] mca: base: close: unloading component cm
[pc15370:30447] mca: base: close: unloading component loadleveler
[pc15370:30447] mca: base: close: unloading component slurm
[pc15370:30447] ras:gridengine: JOB_ID: 4638
[pc15370:30447] ras:gridengine: PE_HOSTFILE: /var/spool/sge/pc15370/active_jobs/4638.1/pe_hostfile
[pc15370:30447] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
[pc15370:30447] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2

But it starts only 2 processes instead of 4:

Total: 2 Universe: 4
Hello World from Rank 0.
Hello World from Rank 1.

Yes, I can add `mpiexec -np $NSLOTS ..` to get 4, but then all of them run on pc15370; pc15381 is ignored completely.

==

If I go back to 1.4.1:

[pc15370:31052] mca: base: components_open: Looking for ras components
[pc15370:31052] mca: base: components_open: opening ras components
[pc15370:31052] mca: base: components_open: found loaded component gridengine
[pc15370:31052] mca: base: components_open: component gridengine has no register function
[pc15370:31052] mca: base: components_open: component gridengine open function successful
[pc15370:31052] mca: base: components_open: found loaded component slurm
[pc15370:31052] mca: base: components_open: component slurm has no register function
[pc15370:31052] mca: base: components_open: component slurm open function successful
[pc15370:31052] mca:base:select: Auto-selecting ras components
[pc15370:31052] mca:base:select:( ras) Querying component [gridengine]
[pc15370:31052] mca:base:select:( ras) Query of component [gridengine] set priority to 100
[pc15370:31052] mca:base:select:( ras) Querying component [slurm]
[pc15370:31052] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
[pc15370:31052] mca:base:select:( ras) Selected component [gridengine]
[pc15370:31052] mca: base: close: unloading component slurm
[pc15370:31052] ras:gridengine: JOB_ID: 4640
[pc15370:31052] ras:gridengine: PE_HOSTFILE: /var/spool/sge/pc15370/active_jobs/4640.1/pe_hostfile
[pc15370:31052] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
[pc15370:31052] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2

Total: 4 Universe: 4
Hello World from Rank 0.
Hello World from Rank 1.
Hello World from Rank 2.
Hello World from Rank 3.

And there is no "-np $NSLOTS" in the command, just a plain `mpiexec ./mpihello`.

-- Reuti

> My guess is that something in the pe_hostfile syntax may have changed and we
> didn't pick up on it.
>
>
>> -- Reuti
>>
>>
>>>
>>>>
>>>> ==
>>>>
>>>> I configured with:
>>>>
>>>> ./configure --prefix=$HOME/local/... --enable-static --disable-shared
>>>> --with-sge
>>>>
>>>> and adjusted my PATHs accordingly (at least: I hope so).
>>>>
>>>> -- Reuti
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users