On 03/26/2014 05:26 PM, Ross Boylan wrote:
[Main part is at the bottom]
On Wed, 2014-03-26 at 19:28 +0100, Andreas Schäfer wrote:
Ross-

On 09:08 Wed 26 Mar, Ross Boylan wrote:
On Wed, 2014-03-26 at 10:27 +0000, Jeff Squyres (jsquyres) wrote:
On Mar 26, 2014, at 1:31 AM, Andreas Schäfer <gent...@gmx.de> wrote:
...
This seems to restate the premise of my question.  Is it meant to lead
to the answer "A process in busy wait blocks other users of the CPU to
the same extent as any other process at 100%"?
Yes.
Thanks for confirming.
At any rate, my question is whether, if I have processes that spend
most of their time waiting to receive a message, I can run more of
them than I have physical cores without much slowdown.
AFAICS there will always be a certain slowdown. Is there a reason why
you would want to oversubscribe your nodes?
Agreed -- this is not a good idea.  It suggests that you should make your 
existing code more efficient -- perhaps by overlapping communication and 
computation.
My motivation was to get more work done with a given number of CPUs,
and also to find out how much of a burden I was imposing on other
users.

My application consists of processes that have different roles.  Some
of those roles are important but involve little computation.  My hope
was that I could add those processes without imposing much of a
burden.
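For ranks that mostly sit in a blocking receive, one knob that may
help (assuming an Open MPI of this vintage) is the mpi_yield_when_idle
MCA parameter, which makes idle processes yield the CPU while they
poll, trading some message latency for less CPU pressure on other
users.  A sketch; the application name is a placeholder:

    # Ask idle ranks to yield the CPU instead of spinning at 100%
    # while they poll for incoming messages.
    mpirun --mca mpi_yield_when_idle 1 -np 16 ./my_app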
If you have a complex workflow with varying computational loads, then
you might want to take a look at runtime systems that allow you to
express this directly through their API, e.g. HPX[1]. HPX has proven
to run with high efficiency on a wide range of architectures and with
a multitude of different workloads.
Thanks for the pointer.
Second, we do not operate in a batch queuing environment
Why not fix that?
I'm not the sysadmin, though I'm involved in the group that sets policy.
At one point we were using Sun's grid engine, but I don't think it's
installed now.  I'm not sure why.

We have discussed putting in a batch queuing system, but nobody was
really pushing for it.  My impression was (and probably still is) that
it was more pain than gain.  There is hassle not only for the sysadmin,
who has to set it up (and, I suppose, monitor it), but also for users.
Personally I run a lot of interactive parallel jobs (the interaction is
on rank 0 only), and I have the impression that won't work under a
batch system, though I could be wrong.  I also had the impression we'd
need an estimate of how long the job would run when we submit it, and
we don't always know.

But I've never really used such a system, and may not appreciate what it
would get us.  The other reason we haven't bothered is that the load on
the cluster was relatively light and contention was low.  That is less
and less true, which probably starts tipping the balance toward a
queuing system.

This is wandering off topic, but if you or anyone else could say more
about why you regard the absence of a queuing system as a problem that
should be fixed, I'd love to hear it.

Ross

Hi Ross,

Some pros:
(I don't know of any cons.)

Torque+Maui, SGE/OGE, and Slurm are free.
There are commercial products as well.

Installation and initial configuration may take some effort,
but after that it is mostly peace of mind, with occasional tuning
to match the workload.

You can build Open MPI with integrated support for them (no need for
a hostfile when you submit jobs; Open MPI will use whatever nodes the
queue system gives you).
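For example (configure options for an Open MPI of this era; the
install paths are placeholders):

    # Torque/PBS: link against the TM library
    ./configure --prefix=/opt/openmpi --with-tm=/opt/torque
    # Grid Engine (SGE/OGE)
    ./configure --prefix=/opt/openmpi --with-sge
    # Slurm (usually detected and enabled by default)
    ./configure --prefix=/opt/openmpi --with-slurm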

If you build the queue system with cpuset control, a node can be
shared among several jobs, but the cpus/cores will be assigned
specifically to each job's processes, so that nobody steps on anyone
else's toes.  (There is similar control over the memory used per job
as well.)
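As a sketch, a Torque/PBS job script could pin down cores and memory
like this (resource names and limits vary by site and scheduler
version; my_app is a placeholder):

    #!/bin/sh
    #PBS -N myjob
    #PBS -l nodes=1:ppn=4       # 4 cores on one node; cpusets confine the job to them
    #PBS -l mem=8gb             # per-job memory limit
    #PBS -l walltime=02:00:00   # run-time estimate; generous is fine
    cd $PBS_O_WORKDIR
    # With TM-integrated Open MPI, no -np or hostfile is needed:
    # mpirun uses exactly the slots the queue system granted.
    mpirun ./my_app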

Queue systems won't allow resources to be oversubscribed.
As it is now, what else but courtesy and a great deal of coordination
would guarantee that you and your colleagues won't launch several
computationally demanding jobs on the same node, using the same cpus,
perhaps using more memory than the available RAM, maybe forcing the
system to swap to disk, and ruining performance?
I've been to an organization that didn't want to use a queue system,
and where people would have to go knocking on doors to ask things
like: "Would you please release nodes 01 to 32?  You have leftover
processes from a dead job running on them for a week, taking 100% CPU,
and there are no nodes available."
The queue system avoids that; it has courtesy and coordination built
in, so to speak.

You can configure the queue system with anything from very simple to
quite complex resource-use policies, with queues for specific types of
jobs, etc.  You can start with a single queue and a first-in-first-out
job policy, then make it more complex as the workload increases.
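A minimal starting point, assuming Torque (Maui or another scheduler
can layer fancier policies on top later):

    # One execution queue, jobs served roughly first-in-first-out
    qmgr -c "create queue batch queue_type=execution"
    qmgr -c "set queue batch enabled = true"
    qmgr -c "set queue batch started = true"
    qmgr -c "set server default_queue = batch"
    qmgr -c "set server scheduling = true"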

Queue systems do support interactive jobs (even with X11 GUIs, if
needed).  You submit the interactive job, the queue system puts you on
a free node, and you work normally there.
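For example (Torque syntax; the Slurm line is a rough equivalent):

    # Interactive shell on an allocated node, with X11 forwarding
    qsub -I -X -l nodes=1:ppn=2
    # Roughly equivalent in Slurm:
    salloc -N 1 -n 2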

I hope this helps,
Gus Correa

