“d.) the workstation is hyper-threaded and the cluster is not”

You might turn off hyper-threading (HT) on the workstation and re-run. I’ve seen 
some OSes on some systems get confused and assign multiple OS “cpus” to the same 
HW core/thread.

In any case, if you turn HT off and top shows you that tasks are running on 
different ‘cpus’, you can be sure they are running on different cores and are 
less likely to interfere with each other.
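
If it helps, here is a minimal sketch of how to check and (temporarily) disable HT 
from Linux; the sysfs paths are the usual ones, but cpu36 is only an example of a 
sibling thread, so adjust for your box (or just disable HT in the BIOS):

  lscpu | grep -i 'thread(s) per core'     # "2" means HT is enabled
  cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list    # e.g. "0,36"
  echo 0 | sudo tee /sys/devices/system/cpu/cpu36/online    # offline the HT sibling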

-Tom

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Andy Witzig
Sent: Monday, February 06, 2017 8:25 AM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] Performance Issues on SMP Workstation

Hi all,

My apologies for not replying sooner on this issue - I’ve been swamped with 
other tasks.  Here’s my latest:

1.) I have looked deeply into bindings on both systems (using the --report-bindings 
option) and nothing came to light.  I’ve tried multiple variations of binding 
settings (see the sketch just after this list) and saw only minor improvements on 
the workstation.

2.) I used the mpirun --tag-output grep Cpus_allowed_list /proc/self/status 
command and everything was in order on both systems.

3.) I used ompi_info -c (per recommendation of Penguin Computing support staff) 
and looked at the differences in configuration.  I’m pasting the output below 
for reference.  The only settings in the cluster configuration that were not 
present in the workstation configuration were: --enable-__cxa_atexit, 
--disable-libunwind-exceptions, and --disable-dssi.  There were several 
settings present in the workstation configuration that were not set in the 
cluster configuration.  Any reason why the same version of OpenMPI would have 
such different settings?

4.) I used hwloc and lstopo to compare system hardware and confirmed that the 
workstation has either equivalent or superior specs to the cluster node setup.

5.) Primary differences I can see right now are:
        a.) OpenMPI 1.6.4 was compiled using gcc 4.4.7 on the cluster, and I am 
compiling with gcc 5.4.0 on the workstation;
        b.) the OpenMPI compile configurations are different;
        c.) the cluster uses Torque/PBS to submit the jobs;
        d.) the workstation is hyper-threaded and the cluster is not;
        e.) the workstation runs Ubuntu while the cluster runs CentOS.
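
For reference, the binding variations mentioned in item 1 looked roughly like the 
following (these are the Open MPI 1.6-era option names; ./my_app just stands in 
for my actual executable):

  mpirun -np 20 --report-bindings --bind-to-core ./my_app
  mpirun -np 20 --report-bindings --bind-to-core --bycore ./my_app
  mpirun -np 20 --report-bindings --bind-to-socket --bysocket ./my_app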

My next steps will be to compile/install gcc 4.4.7 on the workstation and 
recompile OpenMPI 1.6.4 to ensure the software configuration is equivalent, and to 
do my best to replicate the cluster configuration settings.  I will also look 
into the profiling tools that Christoph mentioned and see if any details come 
to light.
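
Roughly what I have in mind for the rebuild (a sketch only; the install paths are 
placeholders, assuming gcc 4.4.7 ends up under /opt/gcc-4.4.7):

  cd openmpi-1.6.4
  ./configure --prefix=/opt/openmpi-1.6.4-gcc447 \
      CC=/opt/gcc-4.4.7/bin/gcc CXX=/opt/gcc-4.4.7/bin/g++ \
      F77=/opt/gcc-4.4.7/bin/gfortran FC=/opt/gcc-4.4.7/bin/gfortran
  make -j8 && make install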

Thanks much,
Andy

---------------------------WORKSTATION OMPI_INFO -C OUTPUT---------------------------
Using built-in specs.
COLLECT_GCC=/usr/bin/gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v
--with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.4'
--with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++
--prefix=/usr
--program-suffix=-5
--enable-shared
--enable-linker-build-id
--libexecdir=/usr/lib
--without-included-gettext
--enable-threads=posix
--libdir=/usr/lib
--enable-nls
--with-sysroot=/
--enable-clocale=gnu
--enable-libstdcxx-debug
--enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new
--enable-gnu-unique-object
--disable-vtable-verify
--enable-libmpx
--enable-plugin
--with-system-zlib
--disable-browser-plugin
--enable-java-awt=gtk
--enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre
--enable-java-home
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64
--with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar
--enable-objc-gc
--enable-multiarch
--disable-werror
--with-arch-32=i686
--with-abi=m64
--with-multilib-list=m32,m64,mx32
--enable-multilib
--with-tune=generic
--enable-checking=release
--build=x86_64-linux-gnu
--host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

---------------------------CLUSTER OMPI_INFO -C OUTPUT---------------------------
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ./configure

--prefix=/public/apps/gcc/4.4.7
--enable-shared
--enable-threads=posix
--enable-checking=release
--with-system-zlib
--enable-__cxa_atexit
--disable-libunwind-exceptions
--enable-gnu-unique-object
--disable-dssi
--with-arch_32=i686
--build=x86_64-redhat-linux build_alias=x86_64-redhat-linux
--enable-languages=c,c++,fortran,objc,obj-c++
Thread model: posix
gcc version 4.4.7 (GCC)

On Feb 2, 2017, at 5:28 AM, Gilles Gouaillardet 
<gilles.gouaillar...@gmail.com> wrote:

I cannot remember what the default binding (if any) is in Open MPI 1.6, 
nor whether the default is the same with or without PBS.

you can simply run
mpirun --tag-output grep Cpus_allowed_list /proc/self/status
and see if you notice any discrepancy between your systems
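
for reference, the interesting part of each tagged output line is the allowed-CPU 
list taken from /proc/self/status, e.g. something like

  Cpus_allowed_list:   0-19

what matters is whether that list looks sane and consistent on both systems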

you might also consider upgrading to the latest Open MPI 2.0.2, and see how 
things go.

Cheers,

Gilles

On Thursday, February 2, 2017, <nietham...@hlrs.de> wrote:
Hello Andy,

You can also use the --report-bindings option of mpirun to check which cores
your program will use and to which cores the processes are bound.


Are you using the same backend compiler on both systems?

Do you have performance tools available on the systems where you can see in
which part of the program the time is lost? Common tools would be Score-P/
Vampir/CUBE, TAU, and Extrae/Paraver.
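
As a rough sketch of the Score-P route (assuming Score-P is installed and your 
application builds with mpicc; the file names are placeholders):

  scorep mpicc -O2 my_app.c -o my_app     # rebuild through the Score-P wrapper
  mpirun -np 20 ./my_app input.dat        # run as usual; writes a scorep-* directory
  cube scorep-*/profile.cubex             # inspect the profile in the CUBE GUI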

Best
Christoph

On Wednesday, 1 February 2017 21:09:28 CET Andy Witzig wrote:
> Thank you, Bennet.  From my testing, I've seen that the application usually
> performs better at much smaller rank counts on the workstation.  I've tested on
> the cluster and do not see the same response (i.e. I see better performance
> with ranks of -np 15 or 20).   The workstation is not shared and is not
> doing any other work.  I ran the application on the workstation with top
> and confirmed that 20 procs were fully loaded.
>
> I'll look into the diagnostics you mentioned and get back with you.
>
> Best regards,
> Andy
>
> On Feb 1, 2017, at 6:15 PM, Bennet Fauber <ben...@umich.edu> 
> wrote:
>
> How do they compare if you run a much smaller number of ranks, say -np 2 or
> 4?
>
> Is the workstation shared and doing any other work?
>
> You could insert some diagnostics into your script, for example,
> uptime and free, both before and after running your MPI program and
> compare.
>
> You could also run top in batch mode in the background for your own
> username, then run your MPI program, and compare the results from top.
> We've seen instances where the MPI ranks only get distributed to a
> small number of processors, which you see if they all have small
> percentages of CPU.
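> 
> A minimal sketch of that, assuming a 20-rank run; the log file and application 
> names are just placeholders:
> 
>   top -b -d 10 -u $USER > top_during_run.log &    # sample top every 10 s in batch mode
>   mpirun -np 20 ./my_app input.dat
>   kill %1    # stop the background top afterwards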
>
> Just flailing in the dark...
>
> -- bennet
>
> On Wed, Feb 1, 2017 at 6:36 PM, Andy Witzig <cap1...@icloud.com> wrote:
> > Thanks for the idea.  I did the test and only got a single host.
> >
> > Thanks,
> > Andy
> >
> > On Feb 1, 2017, at 5:04 PM, r...@open-mpi.org wrote:
> >
> > Simple test: replace your executable with 'hostname'. If you see multiple
> > hosts come out on your cluster, then you know why the performance is
> > different.
> >
> > On Feb 1, 2017, at 2:46 PM, Andy Witzig <cap1...@icloud.com> 
> > wrote:
> >
> > Honestly, I'm not exactly sure what scheme is being used.  I am using the
> > default template from Penguin Computing for job submission.  It looks
> > like:
> >
> > #PBS -S /bin/bash
> > #PBS -q T30
> > #PBS -l walltime=24:00:00,nodes=1:ppn=20
> > #PBS -j oe
> > #PBS -N test
> > #PBS -r n
> >
> > mpirun $EXECUTABLE $INPUT_FILE
> >
> > I'm not configuring OpenMPI anywhere else. It is possible the Penguin
> > Computing folks have pre-configured my MPI environment.  I'll see what I
> > can find.
> >
> > Best regards,
> > Andy
> >
> > On Feb 1, 2017, at 4:32 PM, Douglas L Reeder 
> > <d...@centurylink.net> wrote:
> >
> > Andy,
> >
> > What allocation scheme are you using on the cluster? For some codes we see
> > noticeable differences using fill-up vs. round-robin, though not 4x. Fill-up
> > makes more use of shared memory, while round-robin uses more InfiniBand.
> >
> > Doug
> >
> > On Feb 1, 2017, at 3:25 PM, Andy Witzig <cap1...@icloud.com> 
> > wrote:
> >
> > Hi Tom,
> >
> > The cluster uses an InfiniBand interconnect.  On the cluster I'm
> > requesting: #PBS -l walltime=24:00:00,nodes=1:ppn=20.  So technically,
> > the run on the cluster should be SMP on the node, since there are 20
> > cores/node.  On the workstation I'm just using the command: mpirun -np 20
> > ….  I haven't finished setting Torque/PBS up yet.
> >
> > Best regards,
> > Andy
> >
> > On Feb 1, 2017, at 4:10 PM, Elken, Tom <tom.el...@intel.com> 
> > wrote:
> >
> > For this case:  " a cluster system with 2.6GHz Intel Haswell with 20 cores
> > / node and 128GB RAM/node.  "
> >
> > are you running 5 ranks per node on 4 nodes?
> > What interconnect are you using for the cluster?
> >
> > -Tom
> >
> > -----Original Message-----
> > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Andrew
> > Witzig
> > Sent: Wednesday, February 01, 2017 1:37 PM
> > To: Open MPI Users
> > Subject: Re: [OMPI users] Performance Issues on SMP Workstation
> >
> > By the way, the workstation has a total of 36 cores / 72 threads, so using
> > mpirun -np 20 is possible (and should be equivalent) on both platforms.
> >
> > Thanks,
> > cap79
> >
> > On Feb 1, 2017, at 2:52 PM, Andy Witzig <cap1...@icloud.com> 
> > wrote:
> >
> > Hi all,
> >
> > I'm testing my application on an SMP workstation (dual Intel Xeon E5-2697 V4
> > 2.3 GHz Intel Broadwell (boost 2.8-3.1 GHz) processors, 128 GB RAM) and am
> > seeing a 4x performance drop compared to a cluster system with 2.6 GHz Intel
> > Haswell with 20 cores/node and 128 GB RAM/node.  Both applications have
> > been compiled using OpenMPI 1.6.4.  I have tried running:
> >
> >
> > mpirun -np 20 $EXECUTABLE $INPUT_FILE
> > mpirun -np 20 --mca btl self,sm $EXECUTABLE $INPUT_FILE
> >
> > and others, but cannot achieve the same performance on the workstation as is
> > seen on the cluster.  The workstation outperforms on other non-MPI but
> > multi-threaded applications, so I don't think it's a hardware issue.
> >
> >
> > Any help you can provide would be appreciated.
> >
> > Thanks,
> > cap79

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
