I'm no expert, but this one is pretty obvious. The error message says exactly 
what you should change:

 Equivalent MCA parameter:
    Deprecated:  rmaps_base_bycore
    Replacement: rmaps_base_mapping_policy=core
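
In other words, the three lines quoted further down in this thread for the 
"openmpi-mca-params.conf" file would become something like the sketch below, 
keeping the two parameters that are not deprecated:

orte_hetero_nodes=1
hwloc_base_binding_policy=core
# replaces the deprecated rmaps_base_bycore=1
rmaps_base_mapping_policy=core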

--
____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
 || \\UTGERS      |---------------------*O*---------------------
 ||_// Biomedical | Ryan Novosielski - Senior Technologist
 || \\ and Health | novos...@rutgers.edu - 973/972.0922 (2x0922)
 ||  \\  Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
      `'
________________________________________
From: users [users-boun...@open-mpi.org] On Behalf Of Jingchao Zhang 
[zh...@unl.edu]
Sent: Thursday, December 17, 2015 8:01 PM
To: Open MPI Users
Subject: Re: [OMPI users] performance issue with OpenMPI 1.10.1

The "mpirun --hetero-nodes -bind-to core -map-by core" resolves the performance 
issue!


I reran my test in the *same* job.

SLURM resource request:

#!/bin/sh
#SBATCH -N 4
#SBATCH -n 64
#SBATCH --mem=2g
#SBATCH --time=02:00:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out


env | grep SLURM:

SLURM_CHECKPOINT_IMAGE_DIR=/lustre/work/swanson/jingchao/mpitest/examples/1.10.1/3
SLURM_NODELIST=c[3005,3011,3019,3105]
SLURM_JOB_NAME=submit
SLURMD_NODENAME=c3005
SLURM_TOPOLOGY_ADDR=s0.s5.c3005
SLURM_PRIO_PROCESS=0
SLURM_NODE_ALIASES=(null)
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.node
SLURM_NNODES=4
SLURM_JOBID=5462202
SLURM_NTASKS=64
SLURM_TASKS_PER_NODE=34,26,2(x2)
SLURM_JOB_ID=5462202
SLURM_JOB_USER=jingchao
SLURM_JOB_UID=3663
SLURM_NODEID=0
SLURM_SUBMIT_DIR=/lustre/work/swanson/jingchao/mpitest/examples/1.10.1/3
SLURM_TASK_PID=53822
SLURM_NPROCS=64
SLURM_CPUS_ON_NODE=36
SLURM_PROCID=0
SLURM_JOB_NODELIST=c[3005,3011,3019,3105]
SLURM_LOCALID=0
SLURM_JOB_CPUS_PER_NODE=36,26,2(x2)
SLURM_CLUSTER_NAME=tusker
SLURM_GTIDS=0
SLURM_SUBMIT_HOST=login.tusker.hcc.unl.edu
SLURM_JOB_PARTITION=batch
SLURM_JOB_NUM_NODES=4
SLURM_MEM_PER_NODE=2048
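
(For reference, SLURM's "2(x2)" notation means 2 tasks on each of the last two 
nodes, so SLURM_TASKS_PER_NODE=34,26,2(x2) describes a layout of 
34 + 26 + 2 + 2 = 64 tasks across the four nodes.)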

v-1.8.4 "mpirun" and v-1.10.1 "mpirun --hetero-nodes -bind-to core -map-by 
core" now give comparable results.

v-1.10.1 "mpirun" still have unstable performance.



I tried adding the following three lines to the "openmpi-mca-params.conf" file:

orte_hetero_nodes=1
hwloc_base_binding_policy=core
rmaps_base_bycore=1

and ran "mpirun lmp_ompi_g++ < in.wall.2d" with v-1.10.1.


This works for most tests, but some jobs hang with this message:

--------------------------------------------------------------------------
The following command line options and corresponding MCA parameter have
been deprecated and replaced as follows:

  Command line options:
    Deprecated:  --bycore, -bycore
    Replacement: --map-by core

  Equivalent MCA parameter:
    Deprecated:  rmaps_base_bycore
    Replacement: rmaps_base_mapping_policy=core

The deprecated forms *will* disappear in a future version of Open MPI.
Please update to the new syntax.
--------------------------------------------------------------------------

Did I miss something in the "openmpi-mca-params.conf" file?


Thanks,


Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400


________________________________
From: users <users-boun...@open-mpi.org> on behalf of Gilles Gouaillardet 
<gil...@rist.or.jp>
Sent: Wednesday, December 16, 2015 6:11 PM
To: Open MPI Users
Subject: Re: [OMPI users] performance issue with OpenMPI 1.10.1

Binding is somehow involved in this; I do not believe vader or openib is 
involved here.

Could you please run again with the two ompi versions, but in the *same* job?
Before invoking mpirun, could you also do:
env | grep SLURM

Per your SLURM request, you are running 64 tasks on 4 nodes.
With 1.8.4, you end up running 14+14+14+22 tasks (not ideal, but quite balanced).
With 1.10.1, you end up running 2+2+12+48 tasks (very unbalanced).
So it is quite unfair to compare these two runs.

Also, still in the same job, can you add a third run with 1.10.1 and the 
following options, and see if it helps:
mpirun --hetero-nodes -bind-to core -map-by core ...
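
For example, in the same batch script this would look something like the 
sketch below (a sketch only, reusing the module, paths, binary and input file 
that appear elsewhere in this thread):

module load compiler/gcc/4.7
export PATH=$PATH:/util/opt/openmpi/1.10.1/gcc/4.7/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/util/opt/openmpi/1.10.1/gcc/4.7/lib
# --hetero-nodes: do not assume every node has the same topology/allocation;
# -bind-to core / -map-by core: pin each rank to its own core
mpirun --hetero-nodes -bind-to core -map-by core lmp_ompi_g++ < in.wall.2d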

Cheers,

Gilles




On 12/17/2015 6:47 AM, Jingchao Zhang wrote:

Those jobs were launched with mpirun. Please see the attached files for the 
binding report with OMPI_MCA_hwloc_base_report_bindings=1.


Here is a snapshot for v-1.10.1:

[c2613.tusker.hcc.unl.edu:12049] MCW rank 0 is not bound (or bound to all 
available processors)
[c2613.tusker.hcc.unl.edu:12049] MCW rank 1 is not bound (or bound to all 
available processors)
[c2615.tusker.hcc.unl.edu:11136] MCW rank 2 is not bound (or bound to all 
available processors)
[c2615.tusker.hcc.unl.edu:11136] MCW rank 3 is not bound (or bound to all 
available processors)
[c2907.tusker.hcc.unl.edu:64131] MCW rank 9 is not bound (or bound to all 
available processors)
[c2907.tusker.hcc.unl.edu:64131] MCW rank 10 is not bound (or bound to all 
available processors)
[c2907.tusker.hcc.unl.edu:64131] MCW rank 11 is not bound (or bound to all 
available processors)
[c2907.tusker.hcc.unl.edu:64131] MCW rank 12 is not bound (or bound to all 
available processors)
[c2907.tusker.hcc.unl.edu:64131] MCW rank 13 is not bound (or bound to all 
available processors)
[c2907.tusker.hcc.unl.edu:64131] MCW rank 14 is not bound (or bound to all 
available processors)
[c2907.tusker.hcc.unl.edu:64131] MCW rank 15 is not bound (or bound to all 
available processors)
[c2907.tusker.hcc.unl.edu:64131] MCW rank 4 is not bound (or bound to all 
available processors)
[c2907.tusker.hcc.unl.edu:64131] MCW rank 5 is not bound (or bound to all 
available processors)
[c2907.tusker.hcc.unl.edu:64131] MCW rank 6 is not bound (or bound to all 
available processors)
[c2907.tusker.hcc.unl.edu:64131] MCW rank 7 is not bound (or bound to all 
available processors)
[c2907.tusker.hcc.unl.edu:64131] MCW rank 8 is not bound (or bound to all 
available processors)

The report for 1.8.4 doesn't have this issue. Any suggestions to resolve it?


Thanks,

Jingchao


Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400


________________________________
From: users <users-boun...@open-mpi.org> on behalf of Ralph Castain 
<r...@open-mpi.org>
Sent: Wednesday, December 16, 2015 1:52 PM
To: Open MPI Users
Subject: Re: [OMPI users] performance issue with OpenMPI 1.10.1

When I see such issues, I immediately start to think about binding patterns. 
How are these jobs being launched - with mpirun or srun? What do you see if you 
set OMPI_MCA_hwloc_base_report_bindings=1 in your environment?
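
For instance, in the batch script that could look like the sketch below (the 
launch line is the one from the submit script later in this thread; any MCA 
parameter can be set this way through an OMPI_MCA_-prefixed environment 
variable):

# ask Open MPI to report each rank's binding when the job is launched
export OMPI_MCA_hwloc_base_report_bindings=1
mpirun lmp_ompi_g++ < in.snr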

On Dec 16, 2015, at 11:15 AM, Jingchao Zhang <zh...@unl.edu> wrote:

Hi Gilles,

The LAMMPS jobs for both versions are pure MPI. In the SLURM script, 64 cores 
are requested from 4 nodes, so it is 64 MPI tasks, not necessarily evenly 
distributed across the nodes (each node is equipped with 64 cores).

I can reproduce the performance issue using the LAMMPS example 
"VISCOSITY/in.wall.2d". The run time difference is a jaw-dropping 20 seconds 
(v-1.8.4) vs. 45 minutes (v-1.10.1). Among the multiple tests, I do have one 
job using v-1.10.1 that finished in 20 seconds. Again, unstable performance. 
We also tested other software packages such as cp2k, VASP and Quantum 
Espresso, and they all have similar issues.

Here are the decomposed MPI timings from the LAMMPS job outputs.
v-1.8.4 (Job execution time: 00:00:20)
Loop time of 8.94962 on 64 procs for 50000 steps with 1020 atoms
Pair  time (%) = 0.270092 (3.01791)
Neigh time (%) = 0.0842548 (0.941435)
Comm  time (%) = 3.3474 (37.4027)
Outpt time (%) = 0.00901061 (0.100682)
Other time (%) = 5.23886 (58.5373)

v-1.10.1 (Job execution time: 00:45:50)
Loop time of 2003.07 on 64 procs for 50000 steps with 1020 atoms
Pair  time (%) = 0.346776 (0.0173122)
Neigh time (%) = 0.18047 (0.00900966)
Comm  time (%) = 535.836 (26.7508)
Outpt time (%) = 1.68608 (0.0841748)
Other time (%) = 1465.02 (73.1387)

I wonder if you can share the config.log and ompi_info output from your 
v-1.10.1 build. Hopefully we can find a solution by comparing the 
configuration differences. We have been playing with the cma and vader 
parameters, but with no luck.
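
One rough way to compare the two builds is sketched below; the 1.8.4 install 
path and the output file names are guesses patterned after the 1.10.1 path 
used elsewhere in this thread:

# dump the full build/runtime configuration of each version and diff them
/util/opt/openmpi/1.8.4/gcc/4.7/bin/ompi_info --all  > ompi_info.1.8.4.txt
/util/opt/openmpi/1.10.1/gcc/4.7/bin/ompi_info --all > ompi_info.1.10.1.txt
diff ompi_info.1.8.4.txt ompi_info.1.10.1.txt
# the configure line itself is also reported in the summary output
ompi_info | grep -i "configure command"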

Thanks,
Jingchao

Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400


________________________________
From: users <users-boun...@open-mpi.org> on behalf of Gilles Gouaillardet 
<gil...@rist.or.jp>
Sent: Tuesday, December 15, 2015 12:11 AM
To: Open MPI Users
Subject: Re: [OMPI users] performance issue with OpenMPI 1.10.1

Hi,

First, can you check how many MPI tasks and OpenMP threads are used with both 
ompi versions?
/* it should be 16 MPI tasks x no OpenMP threads */
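
A rough way to check the task layout from inside the job allocation, as a 
sketch:

# how many processes mpirun launches on each node under the current mapping
mpirun hostname | sort | uniq -c
# the OpenMP thread count each task would inherit (unset means the OpenMP
# runtime picks its own default)
echo ${OMP_NUM_THREADS:-unset}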

Can you also post both MPI task timing breakdowns (from the output)?

I tried a simple test with VISCOSITY/in.wall.2d and did not observe any 
performance difference.

Can you reproduce the performance drop with an input file from the examples 
directory?
If not, can you post your in.snr input file?

Cheers,

Gilles

On 12/15/2015 7:18 AM, Jingchao Zhang wrote:
Hi all,

We installed the latest release of OpenMPI, 1.10.1, on our Linux cluster and 
found that it has some performance issues. We tested the OpenMPI performance 
against the MD simulation package LAMMPS (http://lammps.sandia.gov/). Compared 
to our previous installation of version 1.8.4, 1.10.1 is nearly three times 
slower when running on multiple nodes. Run times across four compute nodes are 
as follows:
Run     1.10.1  1.8.4
1       0:09:39 0:09:21
2       0:50:29 0:09:23
3       0:50:29 0:09:28
4       0:13:38 0:09:27
5       0:10:43 0:09:34
Ave     0:27:00 0:09:27

Times are in hours:minutes:seconds. Five tests were run for each case, and the 
average run time is listed in the last row. Tests on a single node give the 
same run times for both 1.10.1 and 1.8.4.

We use SLURM as our job scheduler, and the submit script for the LAMMPS job is 
as below:

#!/bin/sh
#SBATCH -N 4
#SBATCH -n 64
#SBATCH --mem=2g
#SBATCH --time=00:50:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

module load compiler/gcc/4.7
export PATH=$PATH:/util/opt/openmpi/1.10.1/gcc/4.7/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/util/opt/openmpi/1.10.1/gcc/4.7/lib
export INCLUDE=$INCLUDE:/util/opt/openmpi/1.10.1/gcc/4.7/include

mpirun lmp_ompi_g++ < in.snr

The "lmp_ompi_g++" binary is compiled against gcc/4.7 and openmpi/1.10.1. The 
compiler flags and MPI information can be found in the attachments. The problem 
here as you can see is the unstable performance for v-1.10.1. I wonder if this 
is a configuration issue at the compilation stage.

Below is some information I gathered according to the "Getting Help" page.
Version of Open MPI that we are using:
Open MPI version: 1.10.1
Open MPI repo revision: v1.10.0-178-gb80f802
Open MPI release date: Nov 03, 2015

"config.log" and "ompi_info --all" information are enclosed in the attachment.

Network information:
1. OpenFabrics version
Mellanox/vendor 2.4-1.0.4
Download: http://www.mellanox.com/page/mlnx_ofed_eula?mtag=linux_sw_drivers&mrequest=downloads&mtype=ofed&mver=MLNX_OFED-2.4-1.0.4&mname=MLNX_OFED_LINUX-2.4-1.0.4-rhel6.6-x86_64.tgz

2. Linux version
Scientific Linux release 6.6
2.6.32-504.23.4.el6.x86_64

3. subnet manager
OpenSM

4. ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.9.1000
        node_guid:                      0002:c903:0050:6190
        sys_image_guid:                 0002:c903:0050:6193
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       MT_0D90110009
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               34
                        port_lmc:               0x00
                        link_layer:             InfiniBand

5. ifconfig
em1       Link encap:Ethernet  HWaddr D0:67:E5:F9:20:76
          inet addr:10.138.25.3  Bcast:10.138.255.255  Mask:255.255.0.0
          inet6 addr: fe80::d267:e5ff:fef9:2076/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:28977969 errors:0 dropped:0 overruns:0 frame:0
          TX packets:67069501 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3588666680 (3.3 GiB)  TX bytes:8145183622 (7.5 GiB)

Ifconfig uses the ioctl access method to get the full address information, 
which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed 
correctly.
Ifconfig is obsolete! For replacement check ip.
ib0       Link encap:InfiniBand  HWaddr 
A0:00:02:20:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:10.137.25.3  Bcast:10.137.255.255  Mask:255.255.0.0
          inet6 addr: fe80::202:c903:50:6191/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:1776 errors:0 dropped:0 overruns:0 frame:0
          TX packets:418 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1024
          RX bytes:131571 (128.4 KiB)  TX bytes:81418 (79.5 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:40310687 errors:0 dropped:0 overruns:0 frame:0
          TX packets:40310687 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:45601859442 (42.4 GiB)  TX bytes:45601859442 (42.4 GiB)

6. ulimit -l
unlimited

Please kindly let me know if more information is needed.

Thanks,
Jingchao

Dr. Jingchao Zhang
Holland Computing Center
University of Nebraska-Lincoln
402-472-6400




