"Lane, William" <william.l...@cshs.org> writes:

> I can successfully run my OpenMPI 1.8.7 jobs outside of Son-of-Gridengine
> but not via qrsh. We're using CentOS 6.3 and a heterogeneous cluster of
> hyperthreaded and non-hyperthreaded blades and x3550 chassis. OpenMPI 1.8.7
> has been built w/the debug switch as well.

I think you want to explain exactly why you need this world of pain.  It
seems unlikely that MPI programs will run efficiently in it.  Our Intel
nodes mostly have hyperthreading on in BIOS -- or what passes for BIOS
on them -- but disabled at startup, and we only run MPI across identical
nodes in the heterogeneous system.

> Here's my latest errors:
> qrsh -V -now yes -pe mpi 209 mpirun -np 209 -display-devel-map --prefix 
> /hpc/apps/mpi/openmpi/1.8.7/ --mca btl ^sm --hetero-nodes --bind-to core 
> /hpc/home/lanew/mpi/openmpi/ProcessColors3

[What does --hetero-nodes do?  It's undocumented as far as I can tell.]

> error: executing task of job 211298 failed: execution daemon on host "csclprd3-0-4" didn't accept task
> error: executing task of job 211298 failed: execution daemon on host "csclprd3-4-1" didn't accept task

So you need to find out why that was (probably lack of slots on the exec
host, which might be explained in the execd messages).
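
A rough sketch of how one might look into that, not a definitive recipe.
It assumes the SGE client tools are on PATH and that execd spools to the
default location under $SGE_ROOT; the function name check_exec_host is
made up for illustration, and the host name comes from the error output.

```shell
# Hypothetical helper: gather slot and execd information for one exec host.
# Assumes the SGE client commands (qstat, qconf) are on PATH.
check_exec_host() {
    host="$1"
    # Queue instances on the host, with reserved/used/total slots.
    qstat -f -q "*@${host}"
    # Exec host configuration, including any complex_values (e.g. slots).
    qconf -se "${host}"
    # Where this cluster actually keeps the execd spool directories.
    qconf -sconf | grep execd_spool_dir
    # Recent execd messages. ASSUMPTION: default spool layout; adjust the
    # path if execd_spool_dir above points somewhere else (e.g. local disk).
    tail -n 50 "${SGE_ROOT}/${SGE_CELL:-default}/spool/${host}/messages"
}
```

For example, "check_exec_host csclprd3-0-4" right after a failed run, so
the relevant messages are still near the end of the file.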

> [...]

> NOTE: the hosts that "didn't accept task" were different in two different
> runs but the errors were the same.
>
> Here's the definition of the mpi Parallel Environment on our
> Son-of-Gridengine cluster:
>
> pe_name            mpi
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /opt/sge/mpi/startmpi.sh $pe_hostfile
> stop_proc_args     /opt/sge/mpi/stopmpi.sh

Why are those two not NONE?

> allocation_rule    $fill_up

As I said, that doesn't seem wise (unless you use -l exclusive).

> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary TRUE
> qsort_args         NONE
>
> Qsort_args is set to NONE, but it's supposed to be set to TRUE right?

No, see sge_pe(5).  (I think the text I supplied for the FAQ is accurate,
but reuti might confirm if he's reading this.)

> -Bill L.
>
> If I can run my OpenMPI 1.8.7 jobs outside of Son-of-Gridengine w/no
> issues it has to be Son-of-Gridengine that's the issue, right?

I don't see any evidence of an SGE bug, if that's what you mean, but
clearly you have a problem if execds won't accept the jobs, and this
isn't the place to discuss it.  I asked about SGE core binding, and it's
presumably also relevant how slots are defined on the compute nodes, but
I'd just say "Don't do that" without a pressing reason.
