"Lane, William" <william.l...@cshs.org> writes: > I'm running a mixed cluster of Blades (HS21 and HS22 chassis), x3550-M3 and > X3550-M4 systems, some of which support hyperthreading, while others > don't (specifically the HS21 blades) all on CentOS 6.3 w/SGE.
Do you mean jobs are split across nodes which have hyperthreading on
and ones which don't, and you're trying to use the threads where
they're enabled?  That doesn't seem a good idea.  (You could turn off
the threads per-job in a root-privileged prolog, pe_starter, or
shepherd_cmd; there's a rough sketch at the end of this message.
Alternatively, it would probably work to set the slot count to the
core count and bind to cores.)

> I have no problems running my simple OpenMPI 1.8.7 test code outside
> of SGE (with or without the --bind-to core switch), but can only run
> the jobs within SGE via qrsh on a limited number of slots (4 at most)
> successfully.  The errors are very similar to the ones I was getting
> running OpenMPI 1.8.5 - 1.8.6 outside of SGE on this same cluster.
>
> Strangely, when running the test code outside of SGE w/the --bind-to
> core switch, mpirun still binds processes to hyperthreading cores.
> Additionally, the --bind-to core switch prevents the OpenMPI 1.8.7
> test code from running at all within SGE (it outputs warnings about
> missing NUMA libraries reducing performance, then exits).

Are you doing SGE core binding
<http://arc.liv.ac.uk/SGE/howto/sge-configs.html#_core_binding>?

> We would rather run our OpenMPI jobs from within SGE so that we can
> get accounting data on OpenMPI jobs for administrative purposes.
>
> The orte PE I've been using seems to meet all the requirements for
> previous versions of OpenMPI: the allocation rule is fill-up, rather
> than round-robin (I'm not sure if this makes a difference or not).

If you're really going to have heterogeneous threading, I'd guess you'd
best allocate only whole nodes and let Open MPI do the binding; see the
example at the end of this message.

[procenv is recommended for comparing the job's generalized environment
with the environment outside the resource manager
<http://arc.liv.ac.uk/SGE/howto/troubleshooting.html>.]
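For instance, assuming procenv is installed on the execution hosts
(the PE name and slot count here are just placeholders):

  # on a login node, outside the resource manager:
  $ procenv > outside.txt
  # the same command run under SGE on an execution host:
  $ qrsh -pe orte 4 procenv > inside.txt
  # compare environment, limits, affinity, etc.:
  $ diff outside.txt inside.txt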
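The per-job hyperthread-disabling prolog I mentioned could look
something like this (a rough, untested sketch assuming the Linux sysfs
CPU topology files; install it as "prolog root@/path/to/script" in the
queue configuration, and re-online the threads in a matching epilog,
since this affects the whole node):

  #!/bin/sh
  # offline every hyperthread sibling, leaving one thread per core
  for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
      list=$cpu/topology/thread_siblings_list
      [ -f "$list" ] || continue   # cpu already offline
      # first entry of e.g. "0,8" or "0-1" is the core's primary thread
      first=$(sed 's/[,-].*//' "$list")
      n=${cpu##*cpu}
      [ "$n" = "$first" ] || echo 0 > "$cpu/online"
  done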
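For whole-node allocation on a cluster with varying core counts, a
fixed-integer allocation_rule won't fit every host, so I'd use host
exclusivity instead, i.e. a boolean consumable with the EXCL relop
(the complex name here is only a suggestion, and this is untested):

  # qconf -mc: add a complex along these lines
  #name      shortcut  type  relop  requestable  consumable  default  urgency
  exclusive  excl      BOOL  EXCL   YES          YES         0        1000

  # qconf -me <host>, for each execution host:
  complex_values exclusive=true

  # then request whole nodes, e.g.:
  $ qsub -l exclusive=true -pe orte 16 myjob.sh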