Hello List, I hope you can help us out on this one, as we have been trying to figure it out for weeks.
The situation: We have a program that is capable of splitting into several processes, distributed across nodes within a cluster network using Open MPI. We were running that system on "older" cluster hardware (Intel Core2 Duo based, 2 GB RAM) with an "older" kernel (2.6.18.6). All nodes boot diskless over the network. Recently we upgraded the hardware (Intel i5, 8 GB RAM), which also required an upgrade to a recent kernel version (2.6.26+).

Here is the problem: We experience an overall performance loss on the new hardware and think we can break it down to a communication issue between the processes. We also found that the issue arises in the transition from kernel 2.6.23 to 2.6.24 (tested on the Core2 Duo system).

Here is the output from our program:

2.6.23.17 (64 bit), MPI 1.2.7, 5 iterations (Core2 Duo), 6 CPUs:
93.33 seconds per iteration.
Node 0 communication/computation time: 6.83 / 647.64 seconds.
Node 1 communication/computation time: 10.09 / 644.36 seconds.
Node 2 communication/computation time: 7.27 / 645.03 seconds.
Node 3 communication/computation time: 165.02 / 485.52 seconds.
Node 4 communication/computation time: 6.50 / 643.82 seconds.
Node 5 communication/computation time: 7.80 / 627.63 seconds.
Computation time: 897.00 seconds.

2.6.24.7 (64 bit), re-evaluated, MPI 1.2.7, 5 iterations (Core2 Duo), 6 CPUs:
131.33 seconds per iteration.
Node 0 communication/computation time: 364.15 / 645.24 seconds.
Node 1 communication/computation time: 362.83 / 645.26 seconds.
Node 2 communication/computation time: 349.39 / 645.07 seconds.
Node 3 communication/computation time: 508.34 / 485.53 seconds.
Node 4 communication/computation time: 349.94 / 643.81 seconds.
Node 5 communication/computation time: 349.07 / 627.47 seconds.
Computation time: 1251.00 seconds.

The program is 32-bit software, but it makes no difference whether the kernel is 64 or 32 bit. We also tested Open MPI 1.4.1; it cut communication times in half (which is still too high), but the improvement decreased with increasing kernel version.

The communication time is meant to be the time the master process spends distributing the data portions for calculation and collecting the results from the slave processes. The value also includes the time a slave has to wait to communicate with the master while the master is busy. This explains the extended communication time of node #3, whose calculation time is reduced (due to the nature of its data).

The command to start the calculation:

mpirun -np 2 -host cluster-17 invert-master -b -s -p inv_grav.inp : -np 4 -host cluster-18,cluster-19

Using top (with 'f' and 'j' to display the P column) we could track which process runs on which core. We found that processes stayed on their initial core with kernel 2.6.23, but started to move around with 2.6.24. Using the --bind-to-core option of Open MPI 1.4.1 kept the processes on their cores again, but that did not influence the overall outcome; it did not fix the issue.

We also found top showing ~25% CPU wait time, with processes in state 'D', even on slave-only nodes. According to our programmer, communication only takes place between the master process and its slaves, not among the slaves. On kernel 2.6.23 and lower, CPU usage is 100% user time, with no wait or system percentage.
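To make clear what the communication figure actually measures: the slave side of the exchange is roughly of the following form. This is only a simplified sketch, not our actual code; it assumes the master is rank 0, and slave_loop, chunk, res and the tag values are placeholder names.

! Simplified sketch of a slave's work loop (illustrative only).
subroutine slave_loop(tcomm, tcomp)
   implicit none
   include 'mpif.h'
   double precision, intent(out) :: tcomm, tcomp
   double precision :: t0
   double precision :: chunk(1000), res(1000)
   integer :: ierr, status(MPI_STATUS_SIZE)
   integer, parameter :: WORK_TAG = 1, STOP_TAG = 2, RESULT_TAG = 3

   tcomm = 0.0d0
   tcomp = 0.0d0
   do
      t0 = MPI_Wtime()
      ! Block until the master is free and sends the next data portion.
      ! Time spent waiting here for a busy master is counted as
      ! communication time.
      call mpi_probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
      if (status(MPI_TAG) == STOP_TAG) then
         call mpi_recv(chunk, 0, MPI_DOUBLE_PRECISION, 0, STOP_TAG, &
                       MPI_COMM_WORLD, status, ierr)
         exit
      end if
      call mpi_recv(chunk, size(chunk), MPI_DOUBLE_PRECISION, 0, &
                    WORK_TAG, MPI_COMM_WORLD, status, ierr)
      tcomm = tcomm + (MPI_Wtime() - t0)

      t0 = MPI_Wtime()
      res = 2.0d0 * chunk              ! stands in for the real computation
      tcomp = tcomp + (MPI_Wtime() - t0)

      t0 = MPI_Wtime()
      call mpi_send(res, size(res), MPI_DOUBLE_PRECISION, 0, RESULT_TAG, &
                    MPI_COMM_WORLD, ierr)
      tcomm = tcomm + (MPI_Wtime() - t0)
   end do
end subroutine slave_loop

Because the timer is already running while mpi_probe blocks, a slave that finishes its portion early (like node #3) accumulates the time it waits for the busy master as communication time.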
Example from top:

Cpu(s): 75.3%us, 0.6%sy, 0.0%ni, 0.0%id, 23.1%wa, 0.7%hi, 0.3%si, 0.0%st
Mem: 8181236k total, 131224k used, 8050012k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 49868k cached

 PID USER PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ P COMMAND
3386 oli  20  0 90512  20m 3988 R   74  0.3 12:31.80 0 invert-
3387 oli  20  0 85072  15m 3780 D   67  0.2 11:59.30 1 invert-
3388 oli  20  0 85064  14m 3588 D   77  0.2 12:56.90 2 invert-
3389 oli  20  0 84936  14m 3436 R   85  0.2 13:28.30 3 invert-

Some system information that might be helpful:

Node hardware:
1. "older": Intel Core2 Duo, (2x1) GB RAM
2. "newer": Intel(R) Core(TM) i5 CPU, mainboard ASUS RS100-E6, (4x2) GB RAM

Debian stable (lenny) distribution with:
ii libc6 2.7-18lenny2
ii libopenmpi1 1.2.7~rc2-2
ii openmpi-bin 1.2.7~rc2-2
ii openmpi-common 1.2.7~rc2-2

Nodes boot diskless with an NFS root and a kernel with all required drivers compiled in.

Information about the program using Open MPI and the tools used to compile it:

mpirun --version: mpirun (Open MPI) 1.2.7rc2
libopenmpi-dev 1.2.7~rc2-2 depends on:
  libc6 (2.7-18lenny2)
  libopenmpi1 (1.2.7~rc2-2)
  openmpi-common (1.2.7~rc2-2)

Compilation command: mpif90
Fortran compiler (FC): gfortran --version: GNU Fortran (Debian 4.3.2-1.1) 4.3.2

Open MPI functions called (Fortran bindings):
mpi_comm_rank
mpi_comm_size
mpi_bcast
mpi_reduce
mpi_isend
mpi_wait
mpi_send
mpi_probe
mpi_recv
MPI_Wtime

Additionally linked ncurses library: libncurses5-dev (5.7+20081213-1). On remote nodes no calls are ever made to this library. On local nodes such calls (coded in C) are only optional, and usually they are skipped as well (i.e. not even initscr() is called).

A signal handler (coded in C) is integrated that reacts specifically to SIGTERM and SIGUSR1 signals.

If you need more information (e.g. the kernel config), please ask. I hope you can provide some ideas to test and resolve the issue.

Thanks anyway.

Oli