Hi,

On 14.03.2012, at 04:02, Joshua Baker-LePain wrote:

> On Tue, 13 Mar 2012 at 5:31pm, Ralph Castain wrote
> 
>> FWIW: I have a Centos6 system myself, and I have no problems running OMPI on 
>> it (1.4 or 1.5). I can try building it the same way you do and see what 
>> happens.
> 
> I can run as many threads as I like on a single system with no problems, even 
> if those threads are running at different nice levels.

How do they get different nice levels - do you renice them? I would assume that 
they all start at the nice level of the parent. In the test program you posted 
there are no threads.
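
Just as background: on Linux each thread has its own nice value, so threads can 
be reniced individually. A minimal sketch (<PID> and <TID> are placeholders):

  # list the per-thread nice values (NI column) of a process
  ps -Lo pid,tid,ni,comm -p <PID>

  # on Linux, renice also accepts a thread ID and then affects only that thread
  renice -n 10 -p <TID>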


>  The problem seems to arise when I'm both a) running across multiple machines 
> and b) running threads at differing nice levels (which often happens as a 
> result of our queueing setup).

This sounds like you are getting slots from different queues assigned to one 
and the same job. My experience: don't do it unless you need it. The problem 
is that SGE can't decide, in its `qrsh -inherit ...` call, which queue is the 
correct one for that particular call. As a result, all calls to a slave machine 
can end up in one and the same queue. Although this is not correct, it won't 
oversubscribe the node, as the overall slot count is usually limited already; 
it's more a matter of the names SGE sets in the environment of the job:

https://arc.liv.ac.uk/trac/SGE/ticket/813
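
You can check in which queue instances the slots of a running job actually 
ended up (a quick check, assuming a plain SGE setup):

  # -g t lists each MASTER/SLAVE task of the job together with its queue instance
  qstat -g t -u <user>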

As a result, the $TMPDIR set by SGE can differ between the master of the 
parallel job and a slave, as the name of the queue is part of $TMPDIR. When a 
wrong $TMPDIR is set on a node (by Open MPI's forwarding?), strange things can 
happen depending on the application.
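
To verify this, you could print $TMPDIR on every node of a running job; a 
sketch, assuming Open MPI's mpirun (the single quotes keep the expansion on 
the remote side):

  # one process per node, each reporting its hostname and the SGE-set $TMPDIR
  mpirun --pernode sh -c 'echo "$(hostname): $TMPDIR"'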

Do you face the same problem if you stay in one and the same queue across the 
machines? If you want to limit the number of PEs available to the user in your 
setup, you can request a PE by a wildcard; once a PE is selected, SGE will stay 
in this PE. Attaching each PE to only one queue then avoids the mixture of 
slots from different queues (orte1 PE => all.q, orte2 PE => extra.q, and you 
request orte*); see the sketch below.
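
A sketch with the names from above (the qconf calls assume the PEs orte1 and 
orte2 already exist, and replace the queues' pe_list):

  # attach each PE to exactly one queue
  qconf -mattr queue pe_list orte1 all.q
  qconf -mattr queue pe_list orte2 extra.q

  # request the PE by wildcard; SGE picks one PE and stays within it,
  # so all slots come from the queue that PE is attached to
  qsub -pe 'orte*' 16 job.sh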

 -- Reuti


>  I can't guarantee that the problem *never* happens when I run across 
> multiple machines with all the threads un-niced, but I haven't been able to 
> reproduce that at will like I can for the other case.
> 
> -- 
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF

