I have tried kernels up to 2.6.33.1 on both architectures (Core2 Duo and
i5) with the same results. The "slow" results also appear when the
processes are distributed across the 4 cores of one single node.
We use
btl = self,sm,tcp
in
/etc/openmpi/openmpi-mca-params.conf
Distributing several processes, one per core, across several machines is
fast and shows "normal" communication times. So I guess TCP communication
shouldn't be the problem.
Multiple instances of the program, started on one "master" node, with
each instance distributing its processes to one core per "slave" node,
don't seem to be a problem either. In effect, 4 instances of the program
occupy all 4 cores on each node, which doesn't influence communication
and overall calculation time much.
But running 4 processes from the same "master" instance on 4 cores of
the same node does.
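
Since it is the node-local case that is slow, one further check I could
do (I haven't tried it yet) would be to take the shared-memory BTL out of
the picture for a single run and force even the node-local ranks over TCP
loopback, overriding the setting from openmpi-mca-params.conf on the
command line, roughly like this:

mpirun --mca btl self,tcp -np 4 -host cluster-05 connectivity_c

If that run shows the same slowdown, sm is probably not the culprit; if
it behaves normally, that would point at the shared-memory path on the
newer kernels.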


Do you have any more ideas what I could test for? I ran connectivity_c
from the Open MPI examples on 8 nodes / 32 processes. It is hard to get
reliable/consistent figures from 'top' since the program terminates quite
quickly and the interesting usage window is very short, but here are some
snapshots of 'top' (master and slave nodes show similar pictures).

System time and/or I/O wait time is up.
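
Since the program exits so quickly, for a future run I could also start
top in batch mode on each node just before launching the test, so the
figures are logged continuously instead of being momentary snapshots;
something along these lines (untested as written):

# on each node of interest, started before the job:
top -b -d 1 -n 60 > /tmp/top-$(hostname).log &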

sh-3.2$ mpirun -np 4 -host cluster-05 connectivity_c : -np 28 -host
cluster-06,cluster-07,cluster-08,cluster-09,cluster-10,cluster-11,cluster-12
connectivity_c
Connectivity test on 32 processes PASSED.


Cpu(s): 37.5%us, 46.6%sy,  0.0%ni,  0.0%id, 15.9%wa,  0.0%hi,  0.0%si,
0.0%st
Mem:   8181236k total,   168200k used,  8013036k free,        0k buffers
Swap:        0k total,        0k used,        0k free,   132092k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
25179 oli       20   0  143m 3436 2196 R   43  0.0   0:00.57 0
25180 oli       20   0  142m 3392 2180 R  100  0.0   0:00.85 3
25182 oli       20   0  142m 3312 2172 R  100  0.0   0:00.93 2
25181 oli       20   0  134m 3052 2172 R  100  0.0   0:00.93 1

Cpu(s): 10.3%us,  8.7%sy,  0.0%ni, 21.4%id, 58.7%wa,  0.8%hi,  0.0%si,
0.0%st
Mem:   8181236k total,   171352k used,  8009884k free,        0k buffers
Swap:        0k total,        0k used,        0k free,   130572k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
29496 oli       20   0  142m 3300 2176 D   33  0.0   0:00.21 2
29497 oli       20   0  142m 3280 2160 R   25  0.0   0:00.17 0
29494 oli       20   0  134m 3044 2180 D    0  0.0   0:00.01 1
29495 oli       20   0  134m 3036 2172 R   16  0.0   0:00.11 3

Cpu(s): 18.3%us, 36.3%sy,  0.0%ni, 38.0%id,  6.3%wa,  1.1%hi,  0.0%si,
0.0%st
Mem:   8181236k total,   141704k used,  8039532k free,        0k buffers
Swap:        0k total,        0k used,        0k free,    99828k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
29452 oli       20   0  143m 3452 2212 R   52  0.0   0:00.37 1
29455 oli       20   0  143m 3452 2212 S   57  0.0   0:00.41 3
29453 oli       20   0  143m 3440 2200 S   55  0.0   0:00.39 0
29454 oli       20   0  143m 3440 2200 R   55  0.0   0:00.39 2


Thanks for your thoughts; any input is appreciated.

Oli




On 3/31/2010 8:38 AM, Jeff Squyres wrote:
> I have a very dim recollection of some kernel TCP issues back in some older 
> kernel versions -- such issues affected all TCP communications, not just MPI. 
>  Can you try a newer kernel, perchance?
> 
> 
> On Mar 30, 2010, at 1:26 PM, <open...@docawk.org> wrote:
> 
>> Hello List,
>>
>> I hope you can help us out on this one, as we have been trying to
>> figure it out for weeks.
>>
>> The situation: We have a program capable of splitting into several
>> processes that are shared across the nodes of a cluster network using
>> Open MPI.
>> We were running that system on "older" cluster hardware (Intel Core2
>> Duo based, 2 GB RAM) using an "older" kernel (2.6.18.6). All nodes boot
>> diskless over the network. Recently we upgraded the hardware (Intel i5,
>> 8 GB RAM), which also required an upgrade to a recent kernel version
>> (2.6.26+).
>>
>> Here is the problem: We experience an overall performance loss on the
>> new hardware and think we can narrow it down to a communication issue
>> between the processes.
>>
>> We also found out that the issue arises in the transition from kernel
>> 2.6.23 to 2.6.24 (tested on the Core2 Duo system).
>>
>> Here is some output from our program:
>>
>> 2.6.23.17 (64bit), MPI 1.2.7
>> 5 iterations (Core2 Duo) 6 CPU:
>>     93.33 seconds per iteration.
>>  Node   0 communication/computation time:      6.83 /    647.64 seconds.
>>  Node   1 communication/computation time:     10.09 /    644.36 seconds.
>>  Node   2 communication/computation time:      7.27 /    645.03 seconds.
>>  Node   3 communication/computation time:    165.02 /    485.52 seconds.
>>  Node   4 communication/computation time:      6.50 /    643.82 seconds.
>>  Node   5 communication/computation time:      7.80 /    627.63 seconds.
>>  Computation time:    897.00 seconds.
>>
>> 2.6.24.7 (64bit) .. re-evaluated, MPI 1.2.7
>> 5 iterations (Core2 Duo) 6 CPU:
>>    131.33 seconds per iteration.
>>  Node   0 communication/computation time:    364.15 /    645.24 seconds.
>>  Node   1 communication/computation time:    362.83 /    645.26 seconds.
>>  Node   2 communication/computation time:    349.39 /    645.07 seconds.
>>  Node   3 communication/computation time:    508.34 /    485.53 seconds.
>>  Node   4 communication/computation time:    349.94 /    643.81 seconds.
>>  Node   5 communication/computation time:    349.07 /    627.47 seconds.
>>  Computation time:   1251.00 seconds.
>>
>> The program is 32-bit software, but it doesn't make any difference
>> whether the kernel is 64- or 32-bit. Open MPI version 1.4.1 was also
>> tested; it cut communication times in half (which is still too high),
>> but the improvement decreased with increasing kernel version number.
>>
>> The communication time is meant to be the time the master process
>> spends distributing the data portions for calculation and collecting
>> the results from the slave processes. The value also includes the time
>> a slave has to wait to communicate with the master while the master is
>> busy. This explains the extended communication time of node #3, as its
>> calculation time is reduced (due to the nature of the data).
>>
>> The command to start the calculation:
>> mpirun -np 2 -host cluster-17 invert-master -b -s -p inv_grav.inp : -np
>> 4 -host cluster-18,cluster-19
>>
>> Using top (with 'f' and 'j' to show the P column) we could track which
>> process runs on which core. We found that processes stayed on their
>> initial core under kernel 2.6.23, but started to flip around with
>> 2.6.24. Using the --bind-to-core option of Open MPI 1.4.1 kept the
>> processes on their cores again, but that didn't influence the overall
>> outcome and didn't fix the issue.
>>
>> We found top showing ~25% CPU wait time, and processes in state 'D',
>> also on slave-only nodes. According to our programmer, communication
>> happens only between the master process and its slaves, not among the
>> slaves. On kernel 2.6.23 and lower, CPU usage is 100% user time, with
>> no wait or system percentage.
>>
>> Example from top:
>>
>> Cpu(s): 75.3%us,  0.6%sy,  0.0%ni,  0.0%id, 23.1%wa,  0.7%hi,  0.3%si,
>> 0.0%st
>> Mem:   8181236k total,   131224k used,  8050012k free,        0k buffers
>> Swap:        0k total,        0k used,        0k free,    49868k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
>>  3386 oli       20   0 90512  20m 3988 R   74  0.3  12:31.80 0 invert-
>>  3387 oli       20   0 85072  15m 3780 D   67  0.2  11:59.30 1 invert-
>>  3388 oli       20   0 85064  14m 3588 D   77  0.2  12:56.90 2 invert-
>>  3389 oli       20   0 84936  14m 3436 R   85  0.2  13:28.30 3 invert-
>>
>>
>> Some system information that might be helpful:
>>
>> Nodes Hardware:
>> 1. "older": Intel Core2 Duo, (2x1)GB RAM
>> 2. "newer": Intel(R) Core(TM) i5 CPU, Mainboard ASUS RS100-E6, (4x2)GB RAM
>>
>> Debian stable (lenny) distribution with
>> ii  libc6                             2.7-18lenny2
>> ii  libopenmpi1                       1.2.7~rc2-2
>> ii  openmpi-bin                       1.2.7~rc2-2
>> ii  openmpi-common                    1.2.7~rc2-2
>>
>> The nodes boot diskless with an NFS root and a kernel with all needed
>> drivers compiled in.
>>
>> Information on the program using Open MPI and the tools used to compile it:
>>
>> mpirun --version:
>> mpirun (Open MPI) 1.2.7rc2
>>
>> libopenmpi-dev 1.2.7~rc2-2
>> depends on:
>>  libc6 (2.7-18lenny2)
>>  libopenmpi1 (1.2.7~rc2-2)
>>  openmpi-common (1.2.7~rc2-2)
>>
>>
>> Compilation command:
>> mpif90
>>
>>
>> FORTRAN compiler (FC):
>> gfortran --version:
>> GNU Fortran (Debian 4.3.2-1.1) 4.3.2
>>
>>
>> Open MPI functions called (Fortran bindings):
>> mpi_comm_rank
>> mpi_comm_size
>>
>> mpi_bcast
>> mpi_reduce
>>
>> mpi_isend
>> mpi_wait
>>
>> mpi_send
>> mpi_probe
>> mpi_recv
>>
>> MPI_Wtime
>>
>>
>> Additionally linked ncurses library:
>> libncurses5-dev (5.7+20081213-1)
>> On remote nodes no calls are ever made to this library. On local nodes
>> such calls (coded in C) are only optional, and usually they are skipped
>> too (i.e. not even initscr() is called).
>>
>>
>> A signal handler (coded in C) is integrated that reacts specifically to
>> the SIGTERM and SIGUSR1 signals.
>>
>>
>> If you need more information (e.g. the kernel config) please ask.
>> I hope you can provide some ideas to test and resolve the issue.
>> Thanks anyway.
>>
>> Oli
>>
>>
> 
> 


