On Sep 3, 2012, at 3:50 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> On 04.09.2012 at 00:07, Ralph Castain wrote:
>
>> I'm leaning towards fixing it - it came up in discussions on how to handle
>> the hostfile when there is an allocation. For now, though, that should work.
>
> Oh, did I miss this on the list? If a hostfile is given as an argument, it
> should override the default hostfile IMO.

This was several years ago now - it first showed up in the 1.5 series. Unless someone objects, I'll change it.

>>>>> ==
>>>>>
>>>>> SGE issue
>>>>>
>>>>> I usually don't install new versions right away, so I only noticed just
>>>>> now that in 1.4.5 I get a wrong allocation inside SGE (always one
>>>>> process less than requested with `qsub -pe orted N ...`). I only tried
>>>>> that because with 1.6.1 I get:
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> There are no nodes allocated to this job.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> all the time.
>>>>
>>>> Weird - I'm not sure I understand what you are saying. Is this happening
>>>> with 1.6.1 as well? Or just with 1.4.5?
>>>
>>> 1.6.1 = no nodes allocated
>>> 1.4.5 = one process less than requested
>>> 1.4.1 = works as it should
>>
>> Well, that seems strange! Can you run 1.6.1 with the following on the mpirun
>> cmd line:
>>
>>   -mca ras_gridengine_debug 1 -mca ras_gridengine_verbose 10 -mca ras_base_verbose 10

I'll take a look at this and see what's going on - have to get back to you on it. Thx!
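[For reference, a full invocation combining those debug flags might look as follows. This is a sketch only: the flags are the ones named above, but `./mpihello` stands in for whatever test program is used, and it needs an Open MPI build with SGE support to actually run.]

```shell
# Hypothetical mpirun command line inside the SGE job script,
# combining the RAS debug/verbosity flags requested above.
mpirun -mca ras_gridengine_debug 1 \
       -mca ras_gridengine_verbose 10 \
       -mca ras_base_verbose 10 \
       ./mpihello
```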
> [pc15381:06250] mca: base: components_open: Looking for ras components
> [pc15381:06250] mca: base: components_open: opening ras components
> [pc15381:06250] mca: base: components_open: found loaded component cm
> [pc15381:06250] mca: base: components_open: component cm has no register function
> [pc15381:06250] mca: base: components_open: component cm open function successful
> [pc15381:06250] mca: base: components_open: found loaded component gridengine
> [pc15381:06250] mca: base: components_open: component gridengine has no register function
> [pc15381:06250] mca: base: components_open: component gridengine open function successful
> [pc15381:06250] mca: base: components_open: found loaded component loadleveler
> [pc15381:06250] mca: base: components_open: component loadleveler has no register function
> [pc15381:06250] mca: base: components_open: component loadleveler open function successful
> [pc15381:06250] mca: base: components_open: found loaded component slurm
> [pc15381:06250] mca: base: components_open: component slurm has no register function
> [pc15381:06250] mca: base: components_open: component slurm open function successful
> [pc15381:06250] mca:base:select: Auto-selecting ras components
> [pc15381:06250] mca:base:select:( ras) Querying component [cm]
> [pc15381:06250] mca:base:select:( ras) Skipping component [cm]. Query failed to return a module
> [pc15381:06250] mca:base:select:( ras) Querying component [gridengine]
> [pc15381:06250] mca:base:select:( ras) Query of component [gridengine] set priority to 100
> [pc15381:06250] mca:base:select:( ras) Querying component [loadleveler]
> [pc15381:06250] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
> [pc15381:06250] mca:base:select:( ras) Querying component [slurm]
> [pc15381:06250] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
> [pc15381:06250] mca:base:select:( ras) Selected component [gridengine]
> [pc15381:06250] mca: base: close: unloading component cm
> [pc15381:06250] mca: base: close: unloading component loadleveler
> [pc15381:06250] mca: base: close: unloading component slurm
> [pc15381:06250] ras:gridengine: JOB_ID: 4636
> [pc15381:06250] ras:gridengine: PE_HOSTFILE: /var/spool/sge/pc15381/active_jobs/4636.1/pe_hostfile
> [pc15381:06250] ras:gridengine: pc15381: PE_HOSTFILE shows slots=1
> [pc15381:06250] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
> --------------------------------------------------------------------------
> There are no nodes allocated to this job.
> --------------------------------------------------------------------------
> [pc15381:06250] mca: base: close: component gridengine closed
> [pc15381:06250] mca: base: close: unloading component gridengine
>
> The actual PE_HOSTFILE contains:
>
> pc15381 1 all.q@pc15381 UNDEFINED
> pc15370 2 extra.q@pc15370 UNDEFINED
> pc15381 1 extra.q@pc15381 UNDEFINED
>
> and it was submitted with `qsub -pe orted 4 ...`.
>
> Aha, I remember an issue on the list: if a job got its slots from several queues, they weren't added up. So that was the issue in 1.4.5, OK - but wasn't it fixed later on? Here, however, it's getting no allocation at all.
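[To illustrate what "added up" means here - a sketch only, not the actual gridengine RAS code: the PE_HOSTFILE above lists pc15381 twice because its slots come from two queues, and the duplicate entries have to be summed per host. With the file contents taken verbatim from the job above:]

```shell
# Sketch: sum PE_HOSTFILE slot counts per host, collapsing the
# duplicate host entries that come from different queues.
# Format per line: host  slots  queue@host  processors
cat > pe_hostfile <<'EOF'
pc15381 1 all.q@pc15381 UNDEFINED
pc15370 2 extra.q@pc15370 UNDEFINED
pc15381 1 extra.q@pc15381 UNDEFINED
EOF
# Column 1 is the host, column 2 its slot count; sort for stable output.
awk '{slots[$1] += $2} END {for (h in slots) print h, slots[h]}' pe_hostfile | sort
```

Summed this way, the job has its full 4 slots (pc15370: 2, pc15381: 2); taking only one of the duplicate pc15381 lines instead would match the "one process less than requested" behaviour seen in 1.4.5.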
> ==
>
> If I force it to get its slots from only one queue:
>
> [pc15370:30447] mca: base: components_open: Looking for ras components
> [pc15370:30447] mca: base: components_open: opening ras components
> [pc15370:30447] mca: base: components_open: found loaded component cm
> [pc15370:30447] mca: base: components_open: component cm has no register function
> [pc15370:30447] mca: base: components_open: component cm open function successful
> [pc15370:30447] mca: base: components_open: found loaded component gridengine
> [pc15370:30447] mca: base: components_open: component gridengine has no register function
> [pc15370:30447] mca: base: components_open: component gridengine open function successful
> [pc15370:30447] mca: base: components_open: found loaded component loadleveler
> [pc15370:30447] mca: base: components_open: component loadleveler has no register function
> [pc15370:30447] mca: base: components_open: component loadleveler open function successful
> [pc15370:30447] mca: base: components_open: found loaded component slurm
> [pc15370:30447] mca: base: components_open: component slurm has no register function
> [pc15370:30447] mca: base: components_open: component slurm open function successful
> [pc15370:30447] mca:base:select: Auto-selecting ras components
> [pc15370:30447] mca:base:select:( ras) Querying component [cm]
> [pc15370:30447] mca:base:select:( ras) Skipping component [cm]. Query failed to return a module
> [pc15370:30447] mca:base:select:( ras) Querying component [gridengine]
> [pc15370:30447] mca:base:select:( ras) Query of component [gridengine] set priority to 100
> [pc15370:30447] mca:base:select:( ras) Querying component [loadleveler]
> [pc15370:30447] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
> [pc15370:30447] mca:base:select:( ras) Querying component [slurm]
> [pc15370:30447] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
> [pc15370:30447] mca:base:select:( ras) Selected component [gridengine]
> [pc15370:30447] mca: base: close: unloading component cm
> [pc15370:30447] mca: base: close: unloading component loadleveler
> [pc15370:30447] mca: base: close: unloading component slurm
> [pc15370:30447] ras:gridengine: JOB_ID: 4638
> [pc15370:30447] ras:gridengine: PE_HOSTFILE: /var/spool/sge/pc15370/active_jobs/4638.1/pe_hostfile
> [pc15370:30447] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
> [pc15370:30447] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2
>
> But: it starts only 2 processes instead of 4:
>
> Total: 2
> Universe: 4
> Hello World from Rank 0.
> Hello World from Rank 1.
>
> Yes, I can add `mpiexec -np $NSLOTS ...` to get 4, but then all of them end up
> on pc15370; pc15381 is ignored completely.
>
> ==
>
> If I go back to 1.4.1:
>
> [pc15370:31052] mca: base: components_open: Looking for ras components
> [pc15370:31052] mca: base: components_open: opening ras components
> [pc15370:31052] mca: base: components_open: found loaded component gridengine
> [pc15370:31052] mca: base: components_open: component gridengine has no register function
> [pc15370:31052] mca: base: components_open: component gridengine open function successful
> [pc15370:31052] mca: base: components_open: found loaded component slurm
> [pc15370:31052] mca: base: components_open: component slurm has no register function
> [pc15370:31052] mca: base: components_open: component slurm open function successful
> [pc15370:31052] mca:base:select: Auto-selecting ras components
> [pc15370:31052] mca:base:select:( ras) Querying component [gridengine]
> [pc15370:31052] mca:base:select:( ras) Query of component [gridengine] set priority to 100
> [pc15370:31052] mca:base:select:( ras) Querying component [slurm]
> [pc15370:31052] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
> [pc15370:31052] mca:base:select:( ras) Selected component [gridengine]
> [pc15370:31052] mca: base: close: unloading component slurm
> [pc15370:31052] ras:gridengine: JOB_ID: 4640
> [pc15370:31052] ras:gridengine: PE_HOSTFILE: /var/spool/sge/pc15370/active_jobs/4640.1/pe_hostfile
> [pc15370:31052] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
> [pc15370:31052] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2
>
> Total: 4
> Universe: 4
> Hello World from Rank 0.
> Hello World from Rank 1.
> Hello World from Rank 2.
> Hello World from Rank 3.
>
> And with no "-np $NSLOTS" in the command - just a plain `mpiexec ./mpihello`.
>
> -- Reuti
>
>> My guess is that something in the pe_hostfile syntax may have changed and we
>> didn't pick up on it.
>
>>> -- Reuti
>
>>>>> ==
>>>>>
>>>>> I configured with:
>>>>>
>>>>> ./configure --prefix=$HOME/local/... --enable-static --disable-shared --with-sge
>>>>>
>>>>> and adjusted my PATHs accordingly (at least: I hope so).
>>>>>
>>>>> -- Reuti
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
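[For completeness, the submission described in the thread would look roughly like the following job script. This is a sketch: the PE name `orted`, the slot count 4, and the `mpihello` program are taken from the thread; the remaining directives are assumptions and the script needs an SGE cluster with Open MPI to actually run.]

```shell
#!/bin/sh
# Hypothetical SGE job script matching `qsub -pe orted 4 ...` above.
#$ -pe orted 4
#$ -cwd
# With a working gridengine RAS (as in 1.4.1), no -np is needed:
# mpiexec reads $PE_HOSTFILE itself and starts one process per slot.
mpiexec ./mpihello
# Workaround tried with 1.6.1 (forces 4 ranks, but they all land on
# the first host because the allocation itself is wrong):
#   mpiexec -np $NSLOTS ./mpihello
```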