Dear Veronica

The answer depends to a large extent on your own, or your group's, mode of
working.  Specifically, are you interested in the fastest turn-around of
individual jobs, or are you interested in the highest throughput of many jobs
running simultaneously, i.e. draining the batch queues as quickly as
possible?  The latter of course assumes that you and/or your group will
normally be submitting a sufficient number of jobs to keep the batch queues
filled.

To take an example, let's say you have a 4-node cluster with 8 CPUs per
node.  The fastest turn-around for a single job, assuming of course that a
parallelised version of the code is available, is probably to run it in
parallel over as many CPUs as possible, at least within a single node.
However, as Peter pointed out, the speed-up is unlikely to be linear for
multi-threaded or MPI jobs, so for example you may see only a 4 times
speed-up when running on an 8-CPU node, and using MPI over additional nodes
may or may not improve on that, depending on the application.  All the
same, any speed-up is better than no speed-up if all you want is for
individual jobs to finish in the shortest possible time.
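
One common way to model sub-linear speed-up is Amdahl's law.  That is an
assumption brought in here for illustration, not something measured on any
particular program: if a fraction s of the run time cannot be parallelised,
the speed-up on n CPUs is 1/(s + (1-s)/n).  The sketch below (function names
are illustrative) works out which serial fraction would give the 4 times
speed-up on 8 CPUs used in the example, and what the same model would
predict for all 32 CPUs if MPI communication costs were ignored.

```python
def amdahl_speedup(serial_fraction, n_cpus):
    """Speed-up predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)


def serial_fraction_for(speedup, n_cpus):
    """Invert Amdahl's law: 1/S = s + (1 - s)/n  =>  s = (n/S - 1)/(n - 1)."""
    return (n_cpus / speedup - 1.0) / (n_cpus - 1.0)


# The example in the text: a 4 times speed-up on one 8-CPU node.
s = serial_fraction_for(4.0, 8)
print(f"serial fraction: {s:.3f}")                             # about 0.143, i.e. 1/7
print(f"speed-up on 32 CPUs: {amdahl_speedup(s, 32):.1f}")     # about 5.9
```

Under this model, going from 8 to 32 CPUs would buy rather little, and a
real MPI job may do better or worse depending on the application.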

Now let's say you submit 32 independent jobs to the batch queue.  Assume
for simplicity that they all take around the same time T when run
individually on a single CPU: what's the best way to run them for the
fastest throughput?

If you disable parallelisation and just run 1 job at a time per node, the
first 4 jobs will finish at about time T, then the 2nd set of 4 jobs in a
further time T, up to the 8th set of 4 jobs, so 8T in all.  That's because
you used only 1/8th of the available computing power (7 out of 8 CPUs per
node were never used).

Now say you turn on parallelisation and again run 4 jobs at a time, with
each job multi-threaded over 8 CPUs.  Each job is unlikely to finish in
time T/8 for the reason above: let's say it takes T/4.  If your batch-queue
manager is set up like mine, you have to request a fixed number of CPUs to
be allocated to the job, and you must also tell the program that that is
the maximum number of CPUs it can use.  Those CPUs are allocated to you and
won't be assigned to another job for the entire duration of your job.  Of
course you could dispense with the batch queues and just run everything in
the background, but a free-for-all is unlikely to be the most efficient or
fairest way of working.  So in this case the first 4 jobs finish at around
T/4, the 2nd set of 4 after another T/4, and so on up to the 8th set of 4,
so 2T altogether (this time the loss is a factor of 2, because half of the
CPU power allocated to each job was effectively unused).

Now say you turn parallelisation back off and simply run all 32 jobs
simultaneously, 8 per node across the 4 nodes.  Each job is independent of
the others so they will all finish at around time T.  The main reason for
non-linearity in a multi-threaded job is that the threads usually have to
synchronise at certain points in the code, and anyway it may not be
possible to run multi-threaded for some portions, so some threads are
forced to wait for others to catch up, and this waiting wastes CPU power.
Independent jobs don't have to wait for synchronisation (I'm assuming that
the memory bandwidth is sufficient so that contention for shared memory is
not significant and there's sufficient RAM per node to run one job on each
CPU).
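
The arithmetic for the three ways of running the 32 jobs can be checked
with a short sketch.  It assumes, as in the text, that all jobs take the
same time T on one CPU, that the batch queue reserves a fixed number of
CPUs per job, and that an 8-thread job runs 4 times faster than a
single-threaded one; the function and parameter names are made up for
illustration.

```python
import math


def makespan(n_jobs, nodes, cpus_per_node, cpus_per_job, speedup, t_single=1.0):
    """Wall-clock time, in units of T, to finish n_jobs equal jobs.

    cpus_per_job: CPUs reserved by the batch queue for each job.
    speedup: how much faster a job runs on its reserved CPUs than on 1 CPU.
    """
    jobs_at_once = nodes * (cpus_per_node // cpus_per_job)
    waves = math.ceil(n_jobs / jobs_at_once)
    return waves * t_single / speedup


# 1 single-threaded job per node (the whole node is tied up by each job).
print(makespan(32, 4, 8, cpus_per_job=8, speedup=1.0))  # 8.0, i.e. 8T
# 1 job per node, multi-threaded over 8 CPUs with a 4 times speed-up.
print(makespan(32, 4, 8, cpus_per_job=8, speedup=4.0))  # 2.0, i.e. 2T
# 8 single-threaded jobs per node, one per CPU.
print(makespan(32, 4, 8, cpus_per_job=1, speedup=1.0))  # 1.0, i.e. T
```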

The fastest throughput in the case of frequently-full job queues, if RAM is
not an issue, is therefore obtained by disabling within-job parallelisation
and simply running multiple jobs simultaneously over all available CPUs,
with exactly one job per CPU (running more than one job per CPU is likely
to be penalised by frequent context-switching in the OS).  In that
situation it's irrelevant whether a particular code is parallelised since
you're better off without it!  Indeed in that situation use of
parallelisation in batch jobs could be regarded as anti-social since the
batch-queue manager may have allocated 8 CPUs to your job but you are only
effectively using 4 of them, so you may be preventing other jobs from
making better use of the other 4.  Of course the cluster may not be fully
used all the time so in that case you may benefit from using the spare
capacity by enabling multi-threading.
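
As a rough illustration of the two caveats above (RAM per node, and CPUs
that are allocated but not effectively used), here is a minimal sketch.
The RAM figures are hypothetical and the function names are only for
illustration; a real batch-queue manager will have its own way of
expressing these limits.

```python
def jobs_per_node(cpus, ram_gb, ram_per_job_gb):
    """Single-CPU jobs to run at once on a node: one per CPU, capped by RAM."""
    return min(cpus, int(ram_gb // ram_per_job_gb))


def cpu_efficiency(speedup, cpus_allocated):
    """Fraction of the allocated CPU power that a job effectively uses."""
    return speedup / cpus_allocated


# Hypothetical node with 8 CPUs and 64 GB RAM, jobs needing 4 GB each:
print(jobs_per_node(8, 64, 4))   # 8: RAM is not the limit, one job per CPU
# Same node with only 16 GB RAM: RAM now limits the number of jobs.
print(jobs_per_node(8, 16, 4))   # 4
# The example in the text: 8 CPUs allocated, 4 times speed-up.
print(cpu_efficiency(4.0, 8))    # 0.5, i.e. only 4 CPUs effectively used
```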

Cheers

-- Ian


On Fri, 23 Nov 2018 at 10:31, V F <veronicapfiorent...@gmail.com> wrote:

> Dear all,
> Which programs benefit from a multi-CPU cluster? Since the physics
> department is getting rid of an old 32-compute-node cluster, I was
> hoping to find some benefit in using it for crystallographic work.
> Looking at the ccp4wiki and google-fu did not help.
> Many thanks
> Veronica
>
> ########################################################################
>
> To unsubscribe from the CCP4BB list, click the following link:
> https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=CCP4BB&A=1
>
