I don't think it's a bug in OMPI; more likely it reflects improvements in the default collective algorithms. If you want to improve performance further, you should bind your processes to a core (if your application isn't threaded) or to a socket (if it is threaded).
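For instance, a minimal sketch using the 1.6-series mpirun binding options (the process count and executable name here are just placeholders for your own job):

  mpirun -np 16 --bind-to-core --report-bindings ./your_app      # application not threaded
  mpirun -np 16 --bind-to-socket --report-bindings ./your_app    # application threaded

The --report-bindings option makes each rank report where it was bound, so you can confirm the placement before committing to a long run.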
As someone previously noted, apps will always run slower on multiple nodes vs. everything on a single node due to the shared memory vs. IB differences. Nothing you can do about that one.

On Oct 28, 2013, at 10:36 PM, San B <forum....@gmail.com> wrote:

> As discussed earlier, the executable which was compiled with OpenMPI-1.4.5 gave very low performance of 12338.809 seconds when the job was executed on two nodes (8 cores per node). The same job run on a single node (all 16 cores) got executed in just 3692.403 seconds. Now I compiled the application with OpenMPI-1.6.5 and it got executed in 5527.320 seconds on two nodes.
>
> Is this a performance gain with OMPI-1.6.5 over OMPI-1.4.5, or an issue with OpenMPI itself?
>
>
> On Tue, Oct 15, 2013 at 5:32 PM, San B <forum....@gmail.com> wrote:
> Hi,
>
> As per your instruction, I did the profiling of the application with mpiP. Following is the difference between the two runs:
>
> Run 1: 16 MPI processes on a single node
>
> @--- MPI Time (seconds) ---------------------------------------------------
> ---------------------------------------------------------------------------
> Task    AppTime    MPITime     MPI%
>    0   3.61e+03        661    18.32
>    1   3.61e+03        627    17.37
>    2   3.61e+03        700    19.39
>    3   3.61e+03        665    18.41
>    4   3.61e+03        702    19.45
>    5   3.61e+03        703    19.48
>    6   3.61e+03        740    20.50
>    7   3.61e+03        763    21.14
> ...
> ...
>
> Run 2: 16 MPI processes on two nodes - 8 MPI processes per node
>
> @--- MPI Time (seconds) ---------------------------------------------------
> ---------------------------------------------------------------------------
> Task    AppTime    MPITime     MPI%
>    0   1.27e+04   1.06e+04    84.14
>    1   1.27e+04   1.07e+04    84.34
>    2   1.27e+04   1.07e+04    84.20
>    3   1.27e+04   1.07e+04    84.20
>    4   1.27e+04   1.07e+04    84.22
>    5   1.27e+04   1.07e+04    84.25
>    6   1.27e+04   1.06e+04    84.02
>    7   1.27e+04   1.07e+04    84.35
>    8   1.27e+04   1.07e+04    84.29
>
>
> The time spent in MPI functions in run 1 is less than 20%, whereas it is more than 80% in run 2. For more details, I've attached both output files. Please go through these files and suggest what optimization we can do with OpenMPI or Intel MKL.
>
> Thanks
>
>
> On Mon, Oct 7, 2013 at 12:15 PM, San B <forum....@gmail.com> wrote:
> Hi,
>
> I'm facing a performance issue with a scientific application (Fortran). The issue is that it runs fast on a single node but very slowly on multiple nodes. For example, a 16-core job on a single node finishes in 1 hr 2 mins, but the same job on two nodes (i.e. 8 cores per node, with the remaining 8 cores kept free) takes 3 hrs 20 mins. The code is compiled with ifort-13.1.1, openmpi-1.4.5 and Intel MKL libraries - lapack, blas, scalapack, blacs & fftw. What could be the problem here?
>
> Is it possible to do any tuning in OpenMPI? FYI, more info: the cluster has Intel Sandy Bridge processors (E5-2670) and InfiniBand, and Hyper-Threading is enabled. Jobs are submitted through the LSF scheduler.
>
> Does Hyper-Threading cause any problem here?
>
>
> Thanks
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users