Todd:

I assume the system time is being consumed by
the calls to send and receive data over the TCP sockets.
As the number of processes in the job increases, more time is
spent waiting for data from the other processes.
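
(A quick way to confirm that assumption on Solaris would be to let
truss summarize syscall time for one of the ranks; the PID below is
just a placeholder.)

> truss -c -p <pid>
  # -c prints a per-syscall count/time summary instead of a
  # line-by-line trace; the summary appears when truss is interrupted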

I did a little experiment on a single node to see the difference
in system time consumed when running over TCP vs. when
running over shared memory.  When running on a single node and
using the sm btl, I see almost 100% user time. I assume this is
because the sm btl handles sending and receiving its data within
a shared memory segment. However, when I switch over to TCP,
I see my system time go up.  Note that this is on Solaris.

RUNNING OVER SELF,SM
> mpirun -np 8 -mca btl self,sm hpcc.amd64

  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
 3505 rolfv    100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  75 182   0 hpcc.amd64/1
 3503 rolfv    100 0.0 0.0 0.0 0.0 0.0 0.0 0.2   0  69 116   0 hpcc.amd64/1
 3499 rolfv     99 0.0 0.0 0.0 0.0 0.0 0.0 0.5   0 106 236   0 hpcc.amd64/1
 3497 rolfv     99 0.0 0.0 0.0 0.0 0.0 0.0 1.0   0 169 200   0 hpcc.amd64/1
 3501 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 1.9   0 127 158   0 hpcc.amd64/1
 3507 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 244 200   0 hpcc.amd64/1
 3509 rolfv     98 0.0 0.0 0.0 0.0 0.0 0.0 2.0   0 282 212   0 hpcc.amd64/1
 3495 rolfv     97 0.0 0.0 0.0 0.0 0.0 0.0 3.2   0 237  98   0 hpcc.amd64/1

RUNNING OVER SELF,TCP
>mpirun -np 8 -mca btl self,tcp hpcc.amd64

  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
 4316 rolfv     93 6.9 0.0 0.0 0.0 0.0 0.0 0.2   5 346 .6M   0 hpcc.amd64/1
 4328 rolfv     91 8.4 0.0 0.0 0.0 0.0 0.0 0.4   3  59 .15   0 hpcc.amd64/1
 4324 rolfv     98 1.1 0.0 0.0 0.0 0.0 0.0 0.7   2 270 .1M   0 hpcc.amd64/1
 4320 rolfv     88  12 0.0 0.0 0.0 0.0 0.0 0.8   4 244 .15   0 hpcc.amd64/1
 4322 rolfv     94 5.1 0.0 0.0 0.0 0.0 0.0 1.3   2 150 .2M   0 hpcc.amd64/1
 4318 rolfv     92 6.7 0.0 0.0 0.0 0.0 0.0 1.4   5 236 .9M   0 hpcc.amd64/1
 4326 rolfv     93 5.3 0.0 0.0 0.0 0.0 0.0 1.7   7 117 .2M   0 hpcc.amd64/1
 4314 rolfv     91 6.6 0.0 0.0 0.0 0.0 1.3 0.9  19 150 .10   0 hpcc.amd64/1

I also ran HPL over a larger cluster of 6 nodes and noticed even
higher system times. Lastly, I ran a simple MPI test over a cluster
of 64 nodes, 2 procs per node, using Sun HPC ClusterTools 6, and saw
about a 50/50 split between user and system time.

  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
11525 rolfv     55  44 0.1 0.0 0.0 0.0 0.1 0.4  76 960 .3M   0 maxtrunc_ct6/1
11526 rolfv     54  45 0.0 0.0 0.0 0.0 0.0 1.0   0 362 .4M   0 maxtrunc_ct6/1
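
(For reference, output in the format above comes from Solaris prstat
with microstate accounting; the sampling interval below is arbitrary.)

> prstat -mL 5
  # -m reports microstate percentages (USR/SYS/TRP/...), -L gives
  # one line per LWP, sampled every 5 seconds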

Is it possible that everything is working just as it should?

Rolf

Heywood, Todd wrote on 03/22/07 13:30:

Ralph,

Well, according to the FAQ, aggressive mode can be "forced", so I did try
setting OMPI_MCA_mpi_yield_when_idle=0 before running. I also tried turning
processor/memory affinity on. Effects were minor. The MPI tasks still cycle
between run and sleep states, driving up system time well over user time.
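
(Concretely, what I set was along these lines; the binary name is just
a stand-in for my application. Passing the same parameter via mpirun's
-mca flag should be equivalent.)

> export OMPI_MCA_mpi_yield_when_idle=0    # environment form
> mpirun -np 800 ./myapp

> mpirun -np 800 -mca mpi_yield_when_idle 0 ./myapp    # command-line form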

Mpstat shows SGE is indeed giving 4 or 2 slots per node as appropriate
(depending on memory) and the MPI tasks are using 4 or 2 cores, but to be
sure, I also tried running directly with a hostfile with slots=4 or slots=2.
The same behavior occurs.
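
(The hostfile entries look roughly like this; the hostnames and the
file name are placeholders for the real ones.)

node001 slots=4
node002 slots=4

> mpirun -np 800 -hostfile myhosts ./myapp    # myhosts is the file above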

This behavior is a function of the size of the job: as I scale from 200
to 800 tasks, the run/sleep cycling increases, so that system time grows
from maybe half the user time to maybe 5 times the user time.

This is for TCP/gigE.

Todd


On 3/22/07 12:19 PM, "Ralph Castain" <r...@lanl.gov> wrote:

Just for clarification: ompi_info only shows the *default* value of the MCA
parameter. In this case, mpi_yield_when_idle defaults to aggressive, but
that value is reset internally if the system sees an "oversubscribed"
condition.

The issue here isn't how many cores are on the node, but rather how many
were specifically allocated to this job. If the allocation wasn't at least 2
(in your example), then we would automatically reset mpi_yield_when_idle to
be non-aggressive, regardless of how many cores are actually on the node.

Ralph


On 3/22/07 7:14 AM, "Heywood, Todd" <heyw...@cshl.edu> wrote:

Yes, I'm using SGE. I also just noticed that when 2 tasks/slots run on a
4-core node, the 2 tasks are still cycling between run and sleep, with
higher system time than user time.

ompi_info shows the MCA parameter mpi_yield_when_idle to be 0 (aggressive),
so that suggests the tasks aren't swapping out on blocking calls.
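
(I pulled that value with something like the following; the grep just
trims the output down.)

> ompi_info --param mpi all | grep yield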

Still puzzled.

Thanks,
Todd


On 3/22/07 7:36 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:

Are you using a scheduler on your system?

More specifically, does Open MPI know that you have four process slots
on each node?  If you are using a hostfile and didn't specify
"slots=4" for each host, Open MPI will think that it's
oversubscribing and will therefore call sched_yield() in the depths
of its progress engine.
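
(That is, one line per host in the hostfile, something like the
following; the hostname is a placeholder.)

node001 slots=4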


On Mar 21, 2007, at 5:08 PM, Heywood, Todd wrote:

P.S. I should have said that this is a pretty coarse-grained
application, and netstat doesn't show much communication going on
(except in stages).


On 3/21/07 4:21 PM, "Heywood, Todd" <heyw...@cshl.edu> wrote:

I noticed that my Open MPI processes are using larger amounts of
system time than user time (via vmstat, top). I'm running on
dual-core, dual-CPU Opterons, with 4 slots per node, where the
program has the nodes to themselves. A closer look showed that they
are constantly switching between run and sleep states with 4-8 page
faults per second.

Why would this be? It doesn't happen with 4 sequential jobs running
on a node, where I get 99% user time, maybe 1% system time.

The processes have plenty of memory. This behavior occurs whether I
use processor/memory affinity or not (there is no oversubscription).

Thanks,

Todd

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--

=========================
rolf.vandeva...@sun.com
781-442-3043
=========================
