When I see such issues, I immediately start to think about binding patterns. 
How are these jobs being launched - with mpirun or srun? What do you see if you 
set OMPI_MCA_hwloc_base_report_bindings=1 in your environment?
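For example, something along these lines (a sketch assuming a bash shell and a
launch via mpirun; if you launch directly with srun instead, SLURM's own
--cpu_bind=verbose option may be the more relevant knob):

  # report each rank's binding via the MCA parameter...
  export OMPI_MCA_hwloc_base_report_bindings=1
  mpirun lmp_ompi_g++ < in.snr

  # ...or equivalently via mpirun's command-line flag
  mpirun --report-bindings lmp_ompi_g++ < in.snr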

> On Dec 16, 2015, at 11:15 AM, Jingchao Zhang <zh...@unl.edu> wrote:
> 
> Hi Gilles,
> 
> The LAMMPS jobs for both versions are pure MPI. In the SLURM script, 64 cores 
> are requested from 4 nodes, so it's 64 MPI tasks, not necessarily evenly 
> distributed across the nodes. (Each node is equipped with 64 cores.)
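> As an aside, an even layout could presumably be forced with something like
> 
>   #SBATCH -N 4                  # 4 nodes
>   #SBATCH --ntasks-per-node=16  # 16 MPI tasks per node, 64 total
> 
> but we are not doing that here.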
> 
> I can reproduce the performance issue with the LAMMPS example 
> "VISCOSITY/in.wall.2d". The run time difference is a jaw-dropping 20 seconds 
> (v-1.8.4) vs. 45 minutes (v-1.10.1). Among the multiple tests, I do have one 
> v-1.10.1 job that finished in 20 seconds; again, unstable performance. We also 
> tested other software packages such as cp2k, VASP, and Quantum Espresso, and 
> they all show similar issues.
> 
> Here is the decomposed MPI time in the LAMMPS job outputs.
> v-1.8.4 (Job execution time: 00:00:20)
> Loop time of 8.94962 on 64 procs for 50000 steps with 1020 atoms
> Pair  time (%) = 0.270092 (3.01791)
> Neigh time (%) = 0.0842548 (0.941435)
> Comm  time (%) = 3.3474 (37.4027)
> Outpt time (%) = 0.00901061 (0.100682)
> Other time (%) = 5.23886 (58.5373)
> 
> v-1.10.1 (Job execution time: 00:45:50)
> Loop time of 2003.07 on 64 procs for 50000 steps with 1020 atoms
> Pair  time (%) = 0.346776 (0.0173122)
> Neigh time (%) = 0.18047 (0.00900966)
> Comm  time (%) = 535.836 (26.7508)
> Outpt time (%) = 1.68608 (0.0841748)
> Other time (%) = 1465.02 (73.1387)
> 
> I wonder if you could share the config.log and ompi_info output from your 
> v-1.10.1 build. Hopefully we can find a solution by comparing the configuration 
> differences. We have been playing with the cma and vader parameters, but with 
> no luck so far.
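> Roughly the kind of thing we have been trying (an illustrative sketch, not our 
> exact command history):
> 
>   # list the vader shared-memory BTL parameters known to this build
>   ompi_info --param btl vader --level 9
>   ompi_info --all | grep -i cma
> 
>   # restrict which BTLs are used at run time
>   mpirun --mca btl self,vader,openib lmp_ompi_g++ < in.snr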
> 
> Thanks,
> Jingchao
> 
> Dr. Jingchao Zhang
> Holland Computing Center
> University of Nebraska-Lincoln
> 402-472-6400
> 
> 
> From: users <users-boun...@open-mpi.org> on behalf of Gilles Gouaillardet 
> <gil...@rist.or.jp>
> Sent: Tuesday, December 15, 2015 12:11 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] performance issue with OpenMPI 1.10.1
>  
> Hi,
> 
> First, can you check how many MPI tasks and OpenMP threads are used with both 
> ompi versions?
> /* it should be 16 MPI tasks and no OpenMP threads */
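> For instance, a quick check along these lines (illustrative only; 
> OMP_NUM_THREADS may simply be unset in your environment):
> 
>   # in the batch script, before mpirun
>   echo "SLURM_NTASKS=${SLURM_NTASKS:-unset}  OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"
> 
> LAMMPS also reports the process count at the top of its timing summary 
> ("Loop time of ... on 64 procs ...").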
> 
> Can you also post the MPI task timing breakdown (from the output) for both 
> versions?
> 
> I tried a simple test with the VISCOSITY/in.wall.2d example and did not 
> observe any performance difference.
> 
> Can you reproduce the performance drop with an input file from the examples 
> directory?
> If not, can you post your in.snr input file?
> 
> Cheers,
> 
> Gilles
> 
> On 12/15/2015 7:18 AM, Jingchao Zhang wrote:
>> Hi all, 
>> 
>> We installed the latest release of OpenMPI, 1.10.1, on our Linux cluster and 
>> found that it has some performance issues. We tested the OpenMPI performance 
>> with the MD simulation package LAMMPS (http://lammps.sandia.gov/). Compared 
>> to our previous installation of version 1.8.4, 1.10.1 is nearly three times 
>> slower when running on multiple nodes. Run times across four compute nodes 
>> are as follows:
>> Run   1.10.1    1.8.4
>> 1     0:09:39   0:09:21
>> 2     0:50:29   0:09:23
>> 3     0:50:29   0:09:28
>> 4     0:13:38   0:09:27
>> 5     0:10:43   0:09:34
>> Ave   0:27:00   0:09:27
>> 
>> Units are hour:minute:second. Five tests were run for each case, and the 
>> average run time is listed in the last row. Single-node tests give the same 
>> run times for both 1.10.1 and 1.8.4.
>> 
>> We use SLURM as our job scheduler, and the submit script for the LAMMPS job 
>> is as follows:
>> "#!/bin/sh
>> #SBATCH -N 4
>> #SBATCH -n 64
>> #SBATCH --mem=2g
>> #SBATCH --time=00:50:00
>> #SBATCH --error=job.%J.err
>> #SBATCH --output=job.%J.out
>> 
>> module load compiler/gcc/4.7
>> export PATH=$PATH:/util/opt/openmpi/1.10.1/gcc/4.7/bin
>> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/util/opt/openmpi/1.10.1/gcc/4.7/lib
>> export INCLUDE=$INCLUDE:/util/opt/openmpi/1.10.1/gcc/4.7/include
>> 
>> mpirun lmp_ompi_g++ < in.snr"
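>> 
>> (For what it's worth, mpirun should pick up the task count and node list from 
>> the SLURM allocation automatically here; an equivalent explicit form would be 
>> something like
>> 
>>   mpirun -np $SLURM_NTASKS lmp_ompi_g++ < in.snr
>> 
>> though we have not needed that.)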
>> 
>> The "lmp_ompi_g++" binary is compiled against gcc/4.7 and openmpi/1.10.1. 
>> The compiler flags and MPI information can be found in the attachments. The 
>> problem here as you can see is the unstable performance for v-1.10.1. I 
>> wonder if this is a configuration issue at the compilation stage. 
>> 
>> Below is some information I gathered according to the "Getting Help" page.
>> Version of Open MPI that we are using:
>> Open MPI version: 1.10.1
>> Open MPI repo revision: v1.10.0-178-gb80f802
>> Open MPI release date: Nov 03, 2015
>> 
>> "config.log" and "ompi_info --all" information are enclosed in the 
>> attachment.
>> 
>> Network information:
>> 1. OpenFabrics version
>> Mellanox/vendor 2.4-1.0.4 
>> Download: <http://www.mellanox.com/page/mlnx_ofed_eula?mtag=linux_sw_drivers&mrequest=downloads&mtype=ofed&mver=MLNX_OFED-2.4-1.0.4&mname=MLNX_OFED_LINUX-2.4-1.0.4-rhel6.6-x86_64.tgz>
>> 
>> 2. Linux version
>> Scientific Linux release 6.6
>> 2.6.32-504.23.4.el6.x86_64
>> 
>> 3. subnet manager
>> OpenSM
>> 
>> 4. ibv_devinfo
>> hca_id: mlx4_0
>>         transport:                      InfiniBand (0)
>>         fw_ver:                         2.9.1000
>>         node_guid:                      0002:c903:0050:6190
>>         sys_image_guid:                 0002:c903:0050:6193
>>         vendor_id:                      0x02c9
>>         vendor_part_id:                 26428
>>         hw_ver:                         0xB0
>>         board_id:                       MT_0D90110009
>>         phys_port_cnt:                  1
>>                 port:   1
>>                         state:                  PORT_ACTIVE (4)
>>                         max_mtu:                4096 (5)
>>                         active_mtu:             4096 (5)
>>                         sm_lid:                 1
>>                         port_lid:               34
>>                         port_lmc:               0x00
>>                         link_layer:             InfiniBand
>> 
>> 5. ifconfig
>> em1       Link encap:Ethernet  HWaddr D0:67:E5:F9:20:76
>>           inet addr:10.138.25.3  Bcast:10.138.255.255  Mask:255.255.0.0
>>           inet6 addr: fe80::d267:e5ff:fef9:2076/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:28977969 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:67069501 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:3588666680 (3.3 GiB)  TX bytes:8145183622 (7.5 GiB)
>> 
>> Ifconfig uses the ioctl access method to get the full address information, 
>> which limits hardware addresses to 8 bytes.
>> Because Infiniband address has 20 bytes, only the first 8 bytes are 
>> displayed correctly.
>> Ifconfig is obsolete! For replacement check ip.
>> ib0       Link encap:InfiniBand  HWaddr 
>> A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>>           inet addr:10.137.25.3  Bcast:10.137.255.255  Mask:255.255.0.0
>>           inet6 addr: fe80::202:c903:50:6191/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>>           RX packets:1776 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:418 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1024
>>           RX bytes:131571 (128.4 KiB)  TX bytes:81418 (79.5 KiB)
>> 
>> lo        Link encap:Local Loopback
>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>           inet6 addr: ::1/128 Scope:Host
>>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>>           RX packets:40310687 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:40310687 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:0
>>           RX bytes:45601859442 (42.4 GiB)  TX bytes:45601859442 (42.4 GiB)
>> 
>> 6. ulimit -l
>> unlimited
>> 
>> Please kindly let me know if more information is needed.
>> 
>> Thanks,
>> Jingchao
>> 
>> Dr. Jingchao Zhang
>> Holland Computing Center
>> University of Nebraska-Lincoln
>> 402-472-6400
>> 
>> 
