Re: [OMPI users] OpenMPI at scale on Cray XK7
Hi,

Nathan, could you please advise what is the expected startup time for an Open MPI
job at such scale (128K ranks)? I'm interested in
1) time from mpirun start to completion of MPI_Init()
2) time from MPI_Init() start to completion of MPI_Init()

From my experience, for a 52800-rank job
1) took around 20 min
2) took around 12 min
that actually looks like a hang.

Any advice on how to improve startup times of large-scale jobs would be very much
appreciated.

Best regards,
Andrey

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
Sent: Tuesday, April 23, 2013 2:47 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI at scale on Cray XK7

On Mon, Apr 22, 2013 at 03:17:16PM -0700, Mike Clark wrote:
> Hi,
>
> I am trying to run OpenMPI on the Cray XK7 system at Oak Ridge National Lab
> (Titan), and am running into an issue whereby MPI_Init seems to hang
> indefinitely, but this issue only arises at large scale, e.g., when running on
> 18560 compute nodes (with two MPI processes per node). The application runs
> successfully on 4600 nodes, and we are currently trying to test a 9000 node
> job to see if this fails or runs.
>
> We are launching our job using something like the following
>
> # mpirun command
> mpicmd="$OMP_DIR/bin/mpirun --prefix $OMP_DIR -np 37120 --npernode 2 --bind-to core --bind-to numa $app $args"
> # Print and Run the Command
> echo $mpicmd
> $mpicmd >& $output
>
> Are there any issues that I should be aware of when running OpenMPI on 37120
> processes or when running on the Cray Gemini Interconnect?

We have only tested Open MPI up to 131072 ranks on 8192 nodes.

Have you tried running DDT on the process to see where it is hung up? I have a
Titan account so I can help with debugging. I would like to get this issue fixed
in 1.7.2.

-Nathan
Re: [OMPI users] Copying installed runtimes to another machine and using it
Hi,

On 23.04.2013 at 03:39, Manee wrote:

> When I copy my OpenMPI installed directory to another computer (the runtime
> files), and point PATH and LD_LIBRARY_PATH to this installed folder (to make
> mpirun point to the copied folder's bin), it does not seem to run (it's not
> supposed to run because I compiled it on a different machine with a different
> prefix and just copied the runtimes).
>
> Is there a way to compile the libraries such that they could be copied to a
> different machine and be used?

It's necessary to set OPAL_PREFIX to the path of the Open MPI copy before
executing `mpiexec`:

http://www.open-mpi.org/faq/?category=building#installdirs

-- Reuti

> Thanks
> MM
Re: [OMPI users] Using Boost::Thread for multithreading within OpenMPI processes
Hi Jacky,

I'm a regular reader of this list but seldom a poster. In this case, however, I
might actually be qualified to answer some questions or provide some insight,
given that I'm not sure how many other folks here use Boost.Thread.

The first question is really what sort of threading model you want to use with
MPI, which others here are probably more qualified to advise you on. In our
applications we're using Boost.Thread with MPI_THREAD_MULTIPLE, which is not an
altogether enjoyable experience because the openib BTL lacks support for thread
multiple (at least as of the last time I checked). That being said, Boost.Thread
behaves just like any pthread code on the Linux clusters we run on, as well as
on one BlueGene/P.

With MPI_THREAD_SERIALIZED, writing hybrid-parallel code is pretty painless.
Most of the work required involved adding two-stage collectives, such that
threads first perform collectives locally and then a single thread participates
in the MPI collective operation. If you end up using Boost.MPI you could
probably even write your own wrappers to encapsulate the local computation
required for MPI collective operations. Unfortunately Boost.MPI currently lacks
full support for even MPI-2, but if it includes the subset of functionality you
need it may be worthwhile. Extensions are fairly straightforward to implement as
well.

I've implemented a few different approaches to MPI + threading in the context of
Boost, from explicit thread management to thread pools, and currently a complete
runtime system. Most of it is research code, though there's no reason it
couldn't be released, and some of it probably will be eventually.

If you'd like to describe your intended use case I'm happy to offer any advice I
can based on what I've learned.

Cheers,
Nick

On Apr 22, 2013, at 3:25 PM, Thomas Watson wrote:

> Hi,
>
> I would like to create a pool of threads (using Boost::Thread) within each
> OpenMPI process to accelerate my application on multicore CPUs. My application
> is already built on OpenMPI, but it currently exploits parallelism only at the
> process level.
>
> I am wondering if anyone can point me to some good
> tutorials/documents/examples on how to integrate Boost multithreading with
> OpenMPI applications?
>
> Thanks!
>
> Jacky
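To make the "two-stage collective" idea above concrete, the following is a
hedged sketch (not code from Nick's runtime; the thread count, data size, and
use of a plain sum are arbitrary illustrative choices): worker threads each
reduce their own slice with Boost.Thread, and a single thread then performs the
MPI collective, which is all that MPI_THREAD_SERIALIZED requires.

  #include <mpi.h>
  #include <boost/thread.hpp>
  #include <boost/bind.hpp>
  #include <vector>
  #include <cstdio>

  // Stage 1 worker: each thread sums its own slice and writes its own slot,
  // so no locking is needed and no worker thread ever touches MPI.
  static void partial_sum(const double *data, std::size_t begin,
                          std::size_t end, double *out)
  {
      double s = 0.0;
      for (std::size_t i = begin; i < end; ++i)
          s += data[i];
      *out = s;
  }

  int main(int argc, char **argv)
  {
      int provided = 0;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
      if (provided < MPI_THREAD_SERIALIZED) {
          std::fprintf(stderr, "MPI_THREAD_SERIALIZED not available\n");
          MPI_Abort(MPI_COMM_WORLD, 1);
      }

      const std::size_t n = 1 << 20;   // elements per rank (arbitrary)
      const int nthreads = 4;          // thread pool size (arbitrary)
      std::vector<double> data(n, 1.0);
      std::vector<double> partials(nthreads, 0.0);

      // Stage 1: local reduction across the thread pool.
      boost::thread_group pool;
      for (int t = 0; t < nthreads; ++t) {
          std::size_t begin = t * n / nthreads;
          std::size_t end   = (t + 1) * n / nthreads;
          pool.create_thread(boost::bind(partial_sum, &data[0], begin, end,
                                         &partials[t]));
      }
      pool.join_all();

      double local = 0.0, global = 0.0;
      for (int t = 0; t < nthreads; ++t)
          local += partials[t];

      // Stage 2: only this (single) thread participates in the MPI collective.
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      int rank = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0)
          std::printf("global sum = %.0f\n", global);

      MPI_Finalize();
      return 0;
  }

The same pattern works for other collectives: replace the local sum with
whatever per-thread combining step the operation needs, and keep all MPI calls
on one thread.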
Re: [OMPI users] OpenMPI at scale on Cray XK7
On Tue, Apr 23, 2013 at 12:21:49PM +0400, wrote:
> Hi,
>
> Nathan, could you please advise what is the expected startup time for an Open
> MPI job at such scale (128K ranks)? I'm interested in
> 1) time from mpirun start to completion of MPI_Init()

It takes less than a minute to run:

  mpirun -n 131072 /bin/true

> 2) time from MPI_Init() start to completion of MPI_Init()

A simple MPI application took about 1.25 minutes to run. If you want to see our
setup you can take a look at contrib/platform/lanl/cray_xe6.

> From my experience, for a 52800-rank job
> 1) took around 20 min
> 2) took around 12 min
> that actually looks like a hang.

How many nodes? I have never seen launch times that bad on Cielo. You could try
adding -mca routed debruijn -novm and see if that helps. It will reduce the
amount of communication between compute nodes and the login node.

> Any advice on how to improve startup times of large-scale jobs would be very
> much appreciated.

The bottleneck for launching at scale is the initial communication between the
orteds and mpirun. At this time I don't know what can be done to improve that (I
have some ideas but nothing has been implemented yet). At 8192 nodes this takes
less than a minute. Everything else should be fairly quick.

-Nathan Hjelm
HPC-3, LANL
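For readers who want to reproduce this kind of measurement, a "simple MPI
application" of the sort timed above only needs to report how long MPI_Init()
itself takes. The sketch below is illustrative (the program structure and output
format are assumptions of this note, not taken from the thread); it uses
gettimeofday() so the timestamp taken before MPI_Init() does not rely on MPI
being initialized. The time from mpirun start to process entry must additionally
be captured outside the program, e.g. by timestamping the job script.

  #include <mpi.h>
  #include <sys/time.h>
  #include <cstdio>

  static double now_seconds()
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + 1e-6 * tv.tv_usec;
  }

  int main(int argc, char **argv)
  {
      double t0 = now_seconds();        // process entry, after launch/exec
      MPI_Init(&argc, &argv);
      double t1 = now_seconds();        // MPI_Init complete

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0)
          std::printf("MPI_Init took %.2f s\n", t1 - t0);

      MPI_Finalize();
      return 0;
  }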
Re: [OMPI users] OpenMPI at scale on Cray XK7
On Apr 23, 2013, at 10:09 AM, Nathan Hjelm wrote:

> On Tue, Apr 23, 2013 at 12:21:49PM +0400, wrote:
>> Hi,
>>
>> Nathan, could you please advise what is the expected startup time for an
>> Open MPI job at such scale (128K ranks)? I'm interested in
>> 1) time from mpirun start to completion of MPI_Init()
>
> It takes less than a minute to run:
>
>   mpirun -n 131072 /bin/true
>
>> 2) time from MPI_Init() start to completion of MPI_Init()
>
> A simple MPI application took about 1.25 minutes to run. If you want to see
> our setup you can take a look at contrib/platform/lanl/cray_xe6.
>
>> From my experience, for a 52800-rank job
>> 1) took around 20 min
>> 2) took around 12 min
>> that actually looks like a hang.
>
> How many nodes? I have never seen launch times that bad on Cielo. You could
> try adding -mca routed debruijn -novm and see if that helps. It will reduce
> the amount of communication between compute nodes and the login node.

I believe the debruijn module was turned off a while ago due to a bug that
wasn't fixed. However, try using "-mca routed radix -mca routed_radix 64"

>> Any advice on how to improve startup times of large-scale jobs would be very
>> much appreciated.
>
> The bottleneck for launching at scale is the initial communication between
> the orteds and mpirun. At this time I don't know what can be done to improve
> that (I have some ideas but nothing has been implemented yet). At 8192 nodes
> this takes less than a minute. Everything else should be fairly quick.

I mentioned this to Pasha on the phone call today. We had previously
collaborated to get a pretty fast startup time on this machine - I believe we
used static ports to reduce the initial comm scaling. You might want to check
with him.

> -Nathan Hjelm
> HPC-3, LANL
Re: [OMPI users] OpenMPI at scale on Cray XK7
On Tue, Apr 23, 2013 at 10:17:46AM -0700, Ralph Castain wrote:
>
> On Apr 23, 2013, at 10:09 AM, Nathan Hjelm wrote:
>
> > On Tue, Apr 23, 2013 at 12:21:49PM +0400, wrote:
> >> Hi,
> >>
> >> Nathan, could you please advise what is the expected startup time for an
> >> Open MPI job at such scale (128K ranks)? I'm interested in
> >> 1) time from mpirun start to completion of MPI_Init()
> >
> > It takes less than a minute to run:
> >
> >   mpirun -n 131072 /bin/true
> >
> >> 2) time from MPI_Init() start to completion of MPI_Init()
> >
> > A simple MPI application took about 1.25 minutes to run. If you want to
> > see our setup you can take a look at contrib/platform/lanl/cray_xe6.
> >
> >> From my experience, for a 52800-rank job
> >> 1) took around 20 min
> >> 2) took around 12 min
> >> that actually looks like a hang.
> >
> > How many nodes? I have never seen launch times that bad on Cielo. You
> > could try adding -mca routed debruijn -novm and see if that helps. It will
> > reduce the amount of communication between compute nodes and the login
> > node.
>
> I believe the debruijn module was turned off a while ago due to a bug that
> wasn't fixed. However, try using

Was it turned off or was the priority lowered? If it was lowered then -mca
routed debruijn should work. The -novm is to avoid the bug (as I understand it).
I am working on fixing the bug now in the hope it will be ready for 1.7.2.

-Nathan
Re: [OMPI users] OpenMPI at scale on Cray XK7
On Apr 23, 2013, at 10:45 AM, Nathan Hjelm wrote:

> On Tue, Apr 23, 2013 at 10:17:46AM -0700, Ralph Castain wrote:
>>
>> On Apr 23, 2013, at 10:09 AM, Nathan Hjelm wrote:
>>
>>> On Tue, Apr 23, 2013 at 12:21:49PM +0400, wrote:
>>>> Hi,
>>>>
>>>> Nathan, could you please advise what is the expected startup time for an
>>>> Open MPI job at such scale (128K ranks)? I'm interested in
>>>> 1) time from mpirun start to completion of MPI_Init()
>>>
>>> It takes less than a minute to run:
>>>
>>>   mpirun -n 131072 /bin/true
>>>
>>>> 2) time from MPI_Init() start to completion of MPI_Init()
>>>
>>> A simple MPI application took about 1.25 minutes to run. If you want to
>>> see our setup you can take a look at contrib/platform/lanl/cray_xe6.
>>>
>>>> From my experience, for a 52800-rank job
>>>> 1) took around 20 min
>>>> 2) took around 12 min
>>>> that actually looks like a hang.
>>>
>>> How many nodes? I have never seen launch times that bad on Cielo. You
>>> could try adding -mca routed debruijn -novm and see if that helps. It will
>>> reduce the amount of communication between compute nodes and the login
>>> node.
>>
>> I believe the debruijn module was turned off a while ago due to a bug that
>> wasn't fixed. However, try using
>
> Was it turned off or was the priority lowered? If it was lowered then -mca
> routed debruijn should work. The -novm is to avoid the bug (as I understand
> it). I am working on fixing the bug now in the hope it will be ready for
> 1.7.2.

Pretty sure it is ompi_ignored and thus not in the tarball.

> -Nathan
Re: [OMPI users] OpenMPI at scale on Cray XK7
Hi,

Just to follow up on this. We have managed to get OpenMPI to run at large scale;
to do so we had to use aprun instead of OpenMPI's mpirun command.

While this has allowed us to now run at the full scale of Titan, we have found a
huge drop in MPI_Alltoall performance when running at 18K processors. E.g.,
performance per node has decreased by a factor of 200x versus running at 4.6K
nodes. Is there any obvious explanation for this that I could have overlooked,
such as a buffer size or option that needs to be set (configure option or
environment variable) when running at such large scale? We are running
inter-communicator one-way sending, if this makes any difference.

Yours optimistically,

Mike.

On 4/22/13 3:17 PM, "Mike Clark" wrote:

>Hi,
>
>I am trying to run OpenMPI on the Cray XK7 system at Oak Ridge National
>Lab (Titan), and am running into an issue whereby MPI_Init seems to hang
>indefinitely, but this issue only arises at large scale, e.g., when
>running on 18560 compute nodes (with two MPI processes per node). The
>application runs successfully on 4600 nodes, and we are currently trying
>to test a 9000 node job to see if this fails or runs.
>
>We are launching our job using something like the following
>
># mpirun command
>mpicmd="$OMP_DIR/bin/mpirun --prefix $OMP_DIR -np 37120 --npernode 2
>--bind-to core --bind-to numa $app $args"
>
># Print and Run the Command
>echo $mpicmd
>$mpicmd >& $output
>
>Are there any issues that I should be aware of when running OpenMPI on
>37120 processes or when running on the Cray Gemini Interconnect?
>
>We are using OpenMPI 1.7.1 (1.7.x is required for Cray Gemini support)
>and gcc 4.7.2.
>
>Thanks,
>
>Mike.
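One way to quantify a drop like the one Mike describes is to run the same
self-contained MPI_Alltoall timing loop at both node counts and compare the
per-iteration times. The sketch below is a hedged illustration only; the
per-peer message count and iteration count are arbitrary assumptions, not values
taken from this thread, and a real investigation would sweep message sizes.

  #include <mpi.h>
  #include <cstdio>
  #include <vector>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      const int count = 128;   // doubles sent to each peer (arbitrary)
      const int iters = 20;    // timed iterations (arbitrary)
      std::vector<double> sendbuf(static_cast<std::size_t>(count) * size, 1.0);
      std::vector<double> recvbuf(static_cast<std::size_t>(count) * size, 0.0);

      // Warm-up iteration so connection setup is not included in the timing.
      MPI_Alltoall(&sendbuf[0], count, MPI_DOUBLE,
                   &recvbuf[0], count, MPI_DOUBLE, MPI_COMM_WORLD);

      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (int i = 0; i < iters; ++i)
          MPI_Alltoall(&sendbuf[0], count, MPI_DOUBLE,
                       &recvbuf[0], count, MPI_DOUBLE, MPI_COMM_WORLD);
      double t1 = MPI_Wtime();

      if (rank == 0)
          std::printf("ranks=%d  avg MPI_Alltoall time = %.4f s\n",
                      size, (t1 - t0) / iters);

      MPI_Finalize();
      return 0;
  }

Comparing the averaged time at 4.6K nodes and at 18K nodes, for the same
per-peer message size, shows whether the slowdown comes from the collective
itself or from something specific to the application's communication pattern.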