"Lane, William" <william.l...@cshs.org> writes:

> I'm running a mixed cluster of Blades (HS21 and HS22 chassis), x3550-M3 and 
> x3550-M4 systems, some of which support hyperthreading, while others
> don't (specifically the HS21 blades), all on CentOS 6.3 w/SGE.

Do you mean jobs are split across nodes which have hyperthreading on,
and ones which don't (and you're trying to use the threads where they're
on)?  That doesn't seem a good idea.  (You could turn off threads
per-job in a root-privileged prolog, pe_starter, or shepherd_cmd; or it
would probably work to set the slot count to the core count and bind to
cores.)
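
For the slot-count option, what matters is the number of physical cores
on each host rather than the number of logical CPUs. A minimal sketch of
counting both from /proc/cpuinfo-style text, assuming x86 Linux where
each processor stanza carries "physical id" and "core id" fields (the
function name is made up for illustration):

```python
def count_cpus(cpuinfo_text):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo-style text.

    Assumes each stanza has "processor", "physical id" and "core id"
    lines, as on typical x86 Linux. Where those fields are missing
    (some VMs, other architectures) the core count comes back as 0.
    """
    logical = 0
    cores = set()
    physical_id = core_id = None
    # The trailing "" makes sure the last stanza gets flushed.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends a stanza: record its (socket, core) pair.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        value = value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
    return logical, len(cores)
```

Fed the contents of /proc/cpuinfo on an execution host, the second
number is the candidate slot count; if the two numbers differ, threading
is on for that host.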

> I have no problems running my simple OpenMPI 1.8.7 test code outside of SGE 
> (with or without the --bind-to core switch), but can only run the jobs within
> SGE via qrsh on a limited number of slots (4 at most) successfully. The 
> errors are very similar to the ones I was getting running OpenMPI 1.8.5 - 
> 1.8.6 outside of SGE on this same cluster.
>
> Strangely, when running the test code outside of SGE w/the --bind-to core 
> switch, mpirun still binds processes to hyperthreading cores. Additionally,
> the --bind-to core switch prevents the OpenMPI 1.8.7 test code from running 
> at all within SGE (it outputs warnings about missing NUMA libraries reducing 
> performance, then exits).

Are you doing SGE core binding
<http://arc.liv.ac.uk/SGE/howto/sge-configs.html#_core_binding>?

> We would rather run our OpenMPI jobs from within SGE so that we can get 
> accounting data on OpenMPI jobs for administrative purposes.
>
> The orte PE I've been using seems to meet all the requirements for previous 
> versions of OpenMPI:
> the allocation rule is fill-up, rather than round-robin (I'm not sure if this 
> makes a difference or not)

If you're really going to have heterogeneous threading, I'd guess you'd
best allocate only whole nodes and let openmpi do the binding.

[procenv is recommended for comparing the job's generalized environment
with the environment outside the resource manager
<http://arc.liv.ac.uk/SGE/howto/troubleshooting.html>.]
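
If procenv isn't to hand, a rough stand-in for just the
environment-variable part of that comparison is to capture `env` output
inside and outside the job and diff the two. A minimal sketch, assuming
one NAME=VALUE pair per line (multi-line values will be mis-parsed); the
function names are made up for illustration:

```python
def parse_env(text):
    """Parse `env`-style output into a dict, skipping malformed lines."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def diff_env(outside_text, inside_text):
    """Map each differing variable to its (outside, inside) values.

    A value of None means the variable is unset on that side.
    """
    outside = parse_env(outside_text)
    inside = parse_env(inside_text)
    diff = {}
    for name in sorted(set(outside) | set(inside)):
        a = outside.get(name)
        b = inside.get(name)
        if a != b:
            diff[name] = (a, b)
    return diff
```

This only covers environment variables; it says nothing about resource
limits, CPU affinity and the like, which can also differ under the
resource manager.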
