Ralph Castain wrote:
> Bottom line for users: the results remain the same. If no other
process wants time, you'll continue to see near 100% utilization even if
we yield because we will always poll for some time before deciding to yield.

Not surprisingly, I am seeing this with recv/send too, at least when
nothing else is running.  This is true even though all the workers are
on different nodes (so no shared-memory connections are needed between
them).

Is there a tool in Open MPI that will reveal how much "spin time" the
processes are using?  The previous version of the program I'm currently
working on used PVM, and for that implementation gstat, top, etc.
gave a good idea of the percent activity on the compute nodes.  Not
so here.  At the moment our cluster is heterogeneous, with 3 nodes about
3X faster than the other 20.  Because of a lack of load balancing
(which is what I am trying to address now), the fast nodes must be idle
around 60% of the time, since they finish their tasks long before
the other nodes do, but I can't see that idleness, and neither can you.
Here are the relevant columns from one gstat reading; the idle values
jump around between machines with no apparent pattern.  The 3 faster
nodes are 02, 05, and 15, but there is no way to tell that from this
data:

[  User,  Nice, System, Idle, Wio]
01 [  49.7,   0.0,  50.3,   0.0,   0.0]
02 [  41.4,   0.0,  58.6,   0.0,   0.0]
03 [  43.2,   0.0,  49.7,   7.0,   0.0]
04 [  38.8,   0.0,  46.0,  15.2,   0.0]
05 [  38.6,   0.0,  46.4,  15.0,   0.0]
06 [  48.3,   0.0,  51.7,   0.0,   0.0]
07 [  38.5,   0.0,  46.6,  14.9,   0.0]
08 [  43.8,   0.0,  51.3,   4.8,   0.0]
09 [  44.9,   0.0,  48.8,   6.3,   0.0]
10 [  48.9,   0.0,  49.1,   2.0,   0.0]
11 [  50.7,   0.0,  49.3,   0.0,   0.0]
12 [  46.8,   0.0,  53.2,   0.0,   0.0]
13 [  48.4,   0.0,  51.6,   0.0,   0.0]
14 [  44.2,   0.0,  48.2,   7.6,   0.0]
15 [  43.3,   0.0,  56.7,   0.0,   0.0]
16 [  44.7,   0.0,  50.3,   5.0,   0.0]
17 [  42.8,   0.0,  57.2,   0.0,   0.0]
18 [  50.7,   0.0,  49.3,   0.0,   0.0]
19 [  46.9,   0.0,  45.2,   7.9,   0.0]
20 [  46.0,   0.0,  48.9,   5.1,   0.0]

top is even less helpful; it just shows the worker process on each node
at >98% CPU.
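Lacking such a tool, one workaround would be to have each worker measure
it itself: timestamp every blocking receive, accumulate the wall-clock
time spent waiting, and report the fraction at the end of the run.  A
minimal sketch of that idea (Python for brevity; in the real worker the
timed call would be MPI_Recv, and fake_blocking_recv here is just a
hypothetical stand-in so the example runs on its own):

```python
import time

wait_seconds = 0.0  # accumulated wall-clock time spent blocked in receives

def timed_recv(recv_fn, *args):
    """Wrap a blocking receive and accumulate its wall-clock wait time."""
    global wait_seconds
    t0 = time.monotonic()
    result = recv_fn(*args)
    wait_seconds += time.monotonic() - t0
    return result

def fake_blocking_recv():
    # Stand-in for the real MPI_Recv: the "message" takes ~0.1 s to arrive.
    time.sleep(0.1)
    return "task"

start = time.monotonic()
for _ in range(5):
    task = timed_recv(fake_blocking_recv)
    # ... do the real work on `task` here ...
total = time.monotonic() - start
print(f"waited {100.0 * wait_seconds / total:.0f}% of runtime")
```

Since the library busy-polls inside the receive, this wait time shows up
as CPU time to the OS, but the wrapper still sees it as waiting, which
is exactly the number gstat and top cannot provide.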

Thanks,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
