On 04.09.2012, at 01:38, Ralph Castain wrote:

>>> <snip>
>>> Well that seems strange! Can you run 1.6.1 with the following on the mpirun cmd line:
>>>
>>> -mca ras_gridengine_debug 1 -mca ras_gridengine_verbose 10 -mca ras_base_verbose 10
>
> I'll take a look at this and see what's going on - have to get back to you on it.
In "ras_gridengine_module.c" I added between the found = true / break: opal_output(mca_ras_gridengine_component.verbose, "ras:gridengine: %s: PE_HOSTFILE increased to slots=%d", node->name, node->slots); Then I get: [pc15370:13630] ras:gridengine: JOB_ID: 4644 [pc15370:13630] ras:gridengine: PE_HOSTFILE: /var/spool/sge/pc15370/active_jobs/4644.1/pe_hostfile [pc15370:13630] ras:gridengine: pc15370: PE_HOSTFILE shows slots=1 [pc15370:13630] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2 [pc15370:13630] ras:gridengine: pc15370: PE_HOSTFILE increased to slots=2 And the allocation is correct. I'll continue to investigate what is different today. -- Reuti > Thx! > >> >> [pc15381:06250] mca: base: components_open: Looking for ras components >> [pc15381:06250] mca: base: components_open: opening ras components >> [pc15381:06250] mca: base: components_open: found loaded component cm >> [pc15381:06250] mca: base: components_open: component cm has no register >> function >> [pc15381:06250] mca: base: components_open: component cm open function >> successful >> [pc15381:06250] mca: base: components_open: found loaded component gridengine >> [pc15381:06250] mca: base: components_open: component gridengine has no >> register function >> [pc15381:06250] mca: base: components_open: component gridengine open >> function successful >> [pc15381:06250] mca: base: components_open: found loaded component >> loadleveler >> [pc15381:06250] mca: base: components_open: component loadleveler has no >> register function >> [pc15381:06250] mca: base: components_open: component loadleveler open >> function successful >> [pc15381:06250] mca: base: components_open: found loaded component slurm >> [pc15381:06250] mca: base: components_open: component slurm has no register >> function >> [pc15381:06250] mca: base: components_open: component slurm open function >> successful >> [pc15381:06250] mca:base:select: Auto-selecting ras components >> [pc15381:06250] mca:base:select:( ras) Querying component [cm] >> [pc15381:06250] mca:base:select:( ras) Skipping component [cm]. Query >> failed to return a module >> [pc15381:06250] mca:base:select:( ras) Querying component [gridengine] >> [pc15381:06250] mca:base:select:( ras) Query of component [gridengine] set >> priority to 100 >> [pc15381:06250] mca:base:select:( ras) Querying component [loadleveler] >> [pc15381:06250] mca:base:select:( ras) Skipping component [loadleveler]. >> Query failed to return a module >> [pc15381:06250] mca:base:select:( ras) Querying component [slurm] >> [pc15381:06250] mca:base:select:( ras) Skipping component [slurm]. Query >> failed to return a module >> [pc15381:06250] mca:base:select:( ras) Selected component [gridengine] >> [pc15381:06250] mca: base: close: unloading component cm >> [pc15381:06250] mca: base: close: unloading component loadleveler >> [pc15381:06250] mca: base: close: unloading component slurm >> [pc15381:06250] ras:gridengine: JOB_ID: 4636 >> [pc15381:06250] ras:gridengine: PE_HOSTFILE: >> /var/spool/sge/pc15381/active_jobs/4636.1/pe_hostfile >> [pc15381:06250] ras:gridengine: pc15381: PE_HOSTFILE shows slots=1 >> [pc15381:06250] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2 >> -------------------------------------------------------------------------- >> There are no nodes allocated to this job. 
>> --------------------------------------------------------------------------
>> [pc15381:06250] mca: base: close: component gridengine closed
>> [pc15381:06250] mca: base: close: unloading component gridengine
>>
>> The actual hostfile contains:
>>
>> pc15381 1 all.q@pc15381 UNDEFINED
>> pc15370 2 extra.q@pc15370 UNDEFINED
>> pc15381 1 extra.q@pc15381 UNDEFINED
>>
>> and it was submitted with `qsub -pe orted 4 ...`.
>>
>> Aha, I remember an issue on the list: if a job got slots from several queues, they weren't added up. That was the issue in 1.4.5, ok. Wasn't it fixed later on? But here it's getting no allocation at all.
>>
>> ==
>>
>> If I force it to get slots from only one queue:
>>
>> [pc15370:30447] mca: base: components_open: Looking for ras components
>> [pc15370:30447] mca: base: components_open: opening ras components
>> [pc15370:30447] mca: base: components_open: found loaded component cm
>> [pc15370:30447] mca: base: components_open: component cm has no register function
>> [pc15370:30447] mca: base: components_open: component cm open function successful
>> [pc15370:30447] mca: base: components_open: found loaded component gridengine
>> [pc15370:30447] mca: base: components_open: component gridengine has no register function
>> [pc15370:30447] mca: base: components_open: component gridengine open function successful
>> [pc15370:30447] mca: base: components_open: found loaded component loadleveler
>> [pc15370:30447] mca: base: components_open: component loadleveler has no register function
>> [pc15370:30447] mca: base: components_open: component loadleveler open function successful
>> [pc15370:30447] mca: base: components_open: found loaded component slurm
>> [pc15370:30447] mca: base: components_open: component slurm has no register function
>> [pc15370:30447] mca: base: components_open: component slurm open function successful
>> [pc15370:30447] mca:base:select: Auto-selecting ras components
>> [pc15370:30447] mca:base:select:( ras) Querying component [cm]
>> [pc15370:30447] mca:base:select:( ras) Skipping component [cm]. Query failed to return a module
>> [pc15370:30447] mca:base:select:( ras) Querying component [gridengine]
>> [pc15370:30447] mca:base:select:( ras) Query of component [gridengine] set priority to 100
>> [pc15370:30447] mca:base:select:( ras) Querying component [loadleveler]
>> [pc15370:30447] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
>> [pc15370:30447] mca:base:select:( ras) Querying component [slurm]
>> [pc15370:30447] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
>> [pc15370:30447] mca:base:select:( ras) Selected component [gridengine]
>> [pc15370:30447] mca: base: close: unloading component cm
>> [pc15370:30447] mca: base: close: unloading component loadleveler
>> [pc15370:30447] mca: base: close: unloading component slurm
>> [pc15370:30447] ras:gridengine: JOB_ID: 4638
>> [pc15370:30447] ras:gridengine: PE_HOSTFILE: /var/spool/sge/pc15370/active_jobs/4638.1/pe_hostfile
>> [pc15370:30447] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
>> [pc15370:30447] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2
>>
>> But it starts only 2 processes instead of 4:
>>
>> Total: 2
>> Universe: 4
>> Hello World from Rank 0.
>> Hello World from Rank 1.
>>
>> Yes, I can add `mpiexec -np $NSLOTS ..` to get 4, but then all of them are on pc15370; pc15381 is ignored completely.
>>
>> ==
>>
>> If I go back to 1.4.1:
>>
>> [pc15370:31052] mca: base: components_open: Looking for ras components
>> [pc15370:31052] mca: base: components_open: opening ras components
>> [pc15370:31052] mca: base: components_open: found loaded component gridengine
>> [pc15370:31052] mca: base: components_open: component gridengine has no register function
>> [pc15370:31052] mca: base: components_open: component gridengine open function successful
>> [pc15370:31052] mca: base: components_open: found loaded component slurm
>> [pc15370:31052] mca: base: components_open: component slurm has no register function
>> [pc15370:31052] mca: base: components_open: component slurm open function successful
>> [pc15370:31052] mca:base:select: Auto-selecting ras components
>> [pc15370:31052] mca:base:select:( ras) Querying component [gridengine]
>> [pc15370:31052] mca:base:select:( ras) Query of component [gridengine] set priority to 100
>> [pc15370:31052] mca:base:select:( ras) Querying component [slurm]
>> [pc15370:31052] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
>> [pc15370:31052] mca:base:select:( ras) Selected component [gridengine]
>> [pc15370:31052] mca: base: close: unloading component slurm
>> [pc15370:31052] ras:gridengine: JOB_ID: 4640
>> [pc15370:31052] ras:gridengine: PE_HOSTFILE: /var/spool/sge/pc15370/active_jobs/4640.1/pe_hostfile
>> [pc15370:31052] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
>> [pc15370:31052] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2
>>
>> Total: 4
>> Universe: 4
>> Hello World from Rank 0.
>> Hello World from Rank 1.
>> Hello World from Rank 2.
>> Hello World from Rank 3.
>>
>> And no "-np $NSLOTS" in the command, just a plain `mpiexec ./mpihello`.
>>
>> -- Reuti
>>
>>
>>> My guess is that something in the pe_hostfile syntax may have changed and we didn't pick up on it.
>>>
>>>
>>>> -- Reuti
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> ==
>>>>>>
>>>>>> I configured with:
>>>>>>
>>>>>> ./configure --prefix=$HOME/local/... --enable-static --disable-shared --with-sge
>>>>>>
>>>>>> and adjusted my PATHs accordingly (at least: I hope so).
>>>>>>
>>>>>> -- Reuti
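
For reference, roughly the spot in ras_gridengine_module.c that the added opal_output refers to -- sketched from memory, not the verbatim 1.6.1 source: nodelist, item, ptr and num_slots are placeholder names for whatever the file really uses; only node, found and the opal_output call itself come from the mail above.

    /* Duplicate-host branch of the PE_HOSTFILE loop (sketch, placeholder names):
     * when a host from the pe_hostfile is already on the node list, its slot
     * count is increased instead of adding a second entry for the same host. */
    for (item  = opal_list_get_first(nodelist);
         item != opal_list_get_end(nodelist);
         item  = opal_list_get_next(item)) {
        orte_node_t *node = (orte_node_t *) item;
        if (0 == strcmp(node->name, ptr)) {    /* ptr: hostname parsed from the current line */
            node->slots += (int) num_slots;    /* slots column of the additional queue */
            found = true;
            /* the debug line added between "found = true" and "break": */
            opal_output(mca_ras_gridengine_component.verbose,
                        "ras:gridengine: %s: PE_HOSTFILE increased to slots=%d",
                        node->name, node->slots);
            break;
        }
    }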
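
And as a plain check of the multi-queue accumulation itself: a tiny standalone program (a toy re-implementation for this thread, not Open MPI code) that applies the same rule to the pe_hostfile of job 4636 quoted above. It ends up with pc15381 = 2 slots and pc15370 = 2 slots, i.e. 4 slots in total, which is what the allocation should report.

    /* slots_demo.c: accumulate PE_HOSTFILE slots per host across queues.
     * Toy code for this mail thread only, not Open MPI source. */
    #include <stdio.h>
    #include <string.h>

    struct node { char name[64]; int slots; };

    int main(void)
    {
        /* the pe_hostfile lines quoted above: host, slots, queue, processor range */
        const char *pe_hostfile[] = {
            "pc15381 1 all.q@pc15381 UNDEFINED",
            "pc15370 2 extra.q@pc15370 UNDEFINED",
            "pc15381 1 extra.q@pc15381 UNDEFINED",
        };
        struct node nodes[16];
        int nnodes = 0, total = 0;

        for (size_t i = 0; i < sizeof(pe_hostfile) / sizeof(pe_hostfile[0]); i++) {
            char host[64], queue[64], range[64];
            int slots, found = 0;

            if (sscanf(pe_hostfile[i], "%63s %d %63s %63s", host, &slots, queue, range) < 2)
                continue;
            total += slots;

            for (int j = 0; j < nnodes; j++) {
                if (0 == strcmp(nodes[j].name, host)) {
                    nodes[j].slots += slots;   /* same host, another queue: add the slots */
                    found = 1;
                    printf("ras:gridengine: %s: PE_HOSTFILE increased to slots=%d\n",
                           nodes[j].name, nodes[j].slots);
                    break;
                }
            }
            if (!found && nnodes < 16) {
                strcpy(nodes[nnodes].name, host);
                nodes[nnodes].slots = slots;
                printf("ras:gridengine: %s: PE_HOSTFILE shows slots=%d\n", host, slots);
                nnodes++;
            }
        }
        printf("total slots: %d\n", total);    /* should match $NSLOTS, i.e. 4 */
        return 0;
    }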
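
The "Total:" / "Universe:" lines come from my mpihello test. The original source isn't in this thread, but a minimal guess at what such a program looks like (Total = size of MPI_COMM_WORLD, Universe = the MPI_UNIVERSE_SIZE attribute) would be:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, flag, *universe;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        /* MPI_UNIVERSE_SIZE: how many slots the runtime thinks it was given */
        MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &universe, &flag);

        if (0 == rank) {
            printf("Total: %d\n", size);
            printf("Universe: %d\n", flag ? *universe : -1);
        }
        printf("Hello World from Rank %d.\n", rank);

        MPI_Finalize();
        return 0;
    }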