Re: [OMPI users] OpenMPI at scale on Cray XK7

2013-04-23 Thread Дербунович Андрей
Hi,

Nathan, could you please advise on the expected startup time for an OpenMPI
job at such a scale (128K ranks)? I'm interested in:
1) time from mpirun start to completion of MPI_Init()
2) time from MPI_Init() start to completion of MPI_Init()

From my experience, for a 52800-rank job:
1) took around 20 min
2) took around 12 min
and that actually looks like a hang.

Any advice how to improve startup times of large scale jobs would be very
much appreciated.

Best regards,
Andrey


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Nathan Hjelm
Sent: Tuesday, April 23, 2013 2:47 AM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI at scale on Cray XK7

On Mon, Apr 22, 2013 at 03:17:16PM -0700, Mike Clark wrote:
> Hi,
>
> I am trying to run OpenMPI on the Cray XK7 system at Oak Ridge National
> Lab (Titan), and am running into an issue whereby MPI_Init seems to hang
> indefinitely, but this issue only arises at large scale, e.g., when
> running on 18560 compute nodes (with two MPI processes per node).  The
> application runs successfully on 4600 nodes, and we are currently trying
> to test a 9000 node job to see if this fails or runs.
>
> We are launching our job using something like the following
>
> # mpirun command

> mpicmd="$OMP_DIR/bin/mpirun --prefix $OMP_DIR -np 37120 --npernode 2
> --bind-to core --bind-to numa $app $args"
> # Print  and Run the Command

> echo $mpicmd
> $mpicmd >& $output
>
> Are there any issues that I should be aware of when running OpenMPI on
> 37120 processes or when running on the Cray Gemini Interconnect?

We have only tested Open MPI up to 131072 ranks on 8192 nodes. Have you
tried running DDT on the process to see where it is hung up?

I have a Titan account so I can help with debugging. I would like to get
this issue fixed in 1.7.2.

-Nathan


Re: [OMPI users] Copying installed runtimes to another machine and using it

2013-04-23 Thread Reuti
Hi,

On 23.04.2013, at 03:39, Manee wrote:

> When I copy my OpenMPI installed directory to another computer (the runtime 
> files), and point PATH and LD_LIBRARY_PATH to this installed folder (to make 
> mpirun point to the copied folder's bin), it does not seem to run (it's not 
> supposed to run because I compiled it on a different machine with a different 
> prefix and just copied the runtimes). 
> 
> Is there a way to compile the libraries such that they can be copied to a 
> different machine and used? 

It's necessary to set OPAL_PREFIX to the path of the Open MPI copy before 
executing `mpiexec`.

http://www.open-mpi.org/faq/?category=building#installdirs
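
For example, something along these lines should work (the destination path below
is just a placeholder for wherever the copied tree lives on the new machine):

# hypothetical location of the copied Open MPI tree
export OPAL_PREFIX=/opt/openmpi-copy
export PATH=$OPAL_PREFIX/bin:$PATH
export LD_LIBRARY_PATH=$OPAL_PREFIX/lib:$LD_LIBRARY_PATH

# then launch as usual
mpiexec -np 4 ./a.out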

-- Reuti


> 
> Thanks
> MM




Re: [OMPI users] Using Boost::Thread for multithreading within OpenMPI processes

2013-04-23 Thread Nick Edmonds
Hi Jacky,

I'm a regular reader of this list but seldom a poster.  In this case, however, I 
might actually be qualified to answer some questions or provide some insight, 
given that I'm not sure how many other folks here use Boost.Thread.  The first 
question is really what sort of threading model you want to use with MPI, which 
others here are probably more qualified to advise you on.  

In our applications we're using Boost.Thread with MPI_THREAD_MULTIPLE, which is 
a not altogether enjoyable experience because the openib BTL lacks support 
for thread multiple (at least as of the last time I checked).  That being said, 
Boost.Thread behaves just like any pthread code on the Linux clusters we run 
on, as well as on a BlueGene/P.  With MPI_THREAD_SERIALIZED, writing 
hybrid-parallel code is pretty painless.  Most of the work required involved 
adding two-stage collectives such that threads first perform collectives 
locally and then a single thread participates in the MPI collective operation.  

If you end up using Boost.MPI you could probably even write your own wrappers 
to encapsulate the local computation required for MPI collective operations.  
Unfortunately, Boost.MPI currently lacks full support for even MPI-2, but if it 
includes the subset of functionality you need, it may be worthwhile.  Extensions 
are fairly straightforward to implement as well.

I've implemented a few different approaches to MPI + threading in the context 
of Boost, from explicit thread management to thread pools, and currently a 
complete runtime system.  Most of it is research code, though there's no reason 
it couldn't be released, and some of it probably will be eventually.  If you'd 
like to describe your intended use case I'm happy to offer any advice I can 
based on what I've learned.

Cheers,
Nick

On Apr 22, 2013, at 3:25 PM, Thomas Watson wrote:

> Hi,
> 
> I would like to create a pool of threads (using Boost::Thread) within each 
> OpenMPI process to accelerate my application on multicore CPUs. My 
> application is already built on OpenMPI, but it currently exploits 
> parallelism only at the process level. 
> 
> I am wondering if anyone can point me to some good 
> tutorials/documents/examples on how to integrate Boost multithreading with 
> OpenMPI applications?
> 
> Thanks!
> 
> Jacky




Re: [OMPI users] OpenMPI at scale on Cray XK7

2013-04-23 Thread Nathan Hjelm
On Tue, Apr 23, 2013 at 12:21:49PM +0400, Дербунович Андрей wrote:
> Hi,
> 
> Nathan, could you please advise on the expected startup time for an OpenMPI
> job at such a scale (128K ranks)? I'm interested in:
> 1) time from mpirun start to completion of MPI_Init()

It takes less than a minute to run:

mpirun -n 131072 /bin/true


> 2) time from MPI_Init() start to completion of MPI_Init()

A simple MPI application took about 1.25 mins to run. If you want to see 
our setup you can take a look at contrib/platform/lanl/cray_xe6.

> From my experience, for a 52800-rank job:
> 1) took around 20 min
> 2) took around 12 min
> and that actually looks like a hang.

How many nodes? I have never seen launch times that bad on Cielo. You could try 
adding -mca routed debruijn -novm and see if that helps. It will reduce the 
amount of communication between compute nodes and the login node.
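
For instance (a sketch only; the rank count and application name are placeholders):

mpirun -mca routed debruijn -novm -np 52800 ./your_app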

> Any advice how to improve startup times of large scale jobs would be very
> much appreciated.

The bottleneck for launching at scale is the initial communication between the 
orteds and mpirun. At this time I don't know what can be done to improve that 
(I have some ideas but nothing has been implemented yet). At 8192 nodes this 
takes less than a minute. Everything else should be fairly quick.

-Nathan Hjelm
HPC-3, LANL



Re: [OMPI users] OpenMPI at scale on Cray XK7

2013-04-23 Thread Ralph Castain

On Apr 23, 2013, at 10:09 AM, Nathan Hjelm  wrote:

> On Tue, Apr 23, 2013 at 12:21:49PM +0400, Дербунович Андрей wrote:
>> Hi,
>> 
>> Nathan, could you please advise on the expected startup time for an OpenMPI
>> job at such a scale (128K ranks)? I'm interested in:
>> 1) time from mpirun start to completion of MPI_Init()
> 
> It takes less than a minute to run:
> 
> mpirun -n 131072 /bin/true
> 
> 
>> 2) time from MPI_Init() start to completion of MPI_Init()
> 
> A simple MPI application took about 1.25 mins to run. If you want to 
> see our setup you can take a look at contrib/platform/lanl/cray_xe6.
> 
>> From my experience, for a 52800-rank job:
>> 1) took around 20 min
>> 2) took around 12 min
>> and that actually looks like a hang.
> 
> How many nodes? I have never seen launch times that bad on Cielo. You could 
> try adding -mca routed debruijn -novm and see if that helps. It will reduce 
> the amount of communication between compute nodes and the login node.

I believe the debruijn module was turned off a while ago due to a bug that 
wasn't fixed. However, try using

"-mca routed radix -mca routed_radix 64"


> 
>> Any advice how to improve startup times of large scale jobs would be very
>> much appreciated.
> 
> The bottleneck for launching at scale is the initial communication between 
> the orteds and mpirun. At this time I don't know what can be done to improve 
> that (I have some ideas but nothing has been implemented yet). At 8192 nodes 
> this takes less than a minute. Everything else should be fairly quick.

I mentioned this to Pasha on the phone call today. We had previously 
collaborated to get a pretty fast startup time on this machine - I believe we 
used static ports to reduce the initial comm scaling. You might want to check 
with him.

> 
> -Nathan Hjelm
> HPC-3, LANL
> 




Re: [OMPI users] OpenMPI at scale on Cray XK7

2013-04-23 Thread Nathan Hjelm
On Tue, Apr 23, 2013 at 10:17:46AM -0700, Ralph Castain wrote:
> 
> On Apr 23, 2013, at 10:09 AM, Nathan Hjelm  wrote:
> 
> > On Tue, Apr 23, 2013 at 12:21:49PM +0400, Дербунович Андрей wrote:
> >> Hi,
> >> 
> >> Nathan, could you please advise on the expected startup time for an OpenMPI
> >> job at such a scale (128K ranks)? I'm interested in:
> >> 1) time from mpirun start to completion of MPI_Init()
> > 
> > It takes less than a minute to run:
> > 
> > mpirun -n 131072 /bin/true
> > 
> > 
> >> 2) time from MPI_Init() start to completion of MPI_Init()
> > 
> > A simple MPI application took about 1.25 mins to run. If you want to 
> > see our setup you can take a look at contrib/platform/lanl/cray_xe6.
> > 
> >> From my experience, for a 52800-rank job:
> >> 1) took around 20 min
> >> 2) took around 12 min
> >> and that actually looks like a hang.
> > 
> > How many nodes? I have never seen launch times that bad on Cielo. You could 
> > try adding -mca routed debruijn -novm and see if that helps. It will reduce 
> > the amount of communication between compute nodes and the login node.
> 
> I believe the debruijn module was turned off a while ago due to a bug that 
> wasn't fixed. However, try using

Was it turned off or was the priority lowered? If it was lowered, then -mca 
routed debruijn should work. The -novm is to avoid the bug (as I understand 
it). I am working on fixing the bug now in the hope that it will be ready for 1.7.2.

-Nathan


Re: [OMPI users] OpenMPI at scale on Cray XK7

2013-04-23 Thread Ralph Castain

On Apr 23, 2013, at 10:45 AM, Nathan Hjelm  wrote:

> On Tue, Apr 23, 2013 at 10:17:46AM -0700, Ralph Castain wrote:
>> 
>> On Apr 23, 2013, at 10:09 AM, Nathan Hjelm  wrote:
>> 
>>> On Tue, Apr 23, 2013 at 12:21:49PM +0400, Дербунович Андрей wrote:
>>>> Hi,
>>>> 
>>>> Nathan, could you please advise on the expected startup time for an OpenMPI
>>>> job at such a scale (128K ranks)? I'm interested in:
>>>> 1) time from mpirun start to completion of MPI_Init()
>>> 
>>> It takes less than a minute to run:
>>> 
>>> mpirun -n 131072 /bin/true
>>> 
>>> 
>>>> 2) time from MPI_Init() start to completion of MPI_Init()
>>> 
>>> A simple MPI application took about 1.25 mins to run. If you want to 
>>> see our setup you can take a look at contrib/platform/lanl/cray_xe6.
>>> 
>>>> From my experience, for a 52800-rank job:
>>>> 1) took around 20 min
>>>> 2) took around 12 min
>>>> and that actually looks like a hang.
>>> 
>>> How many nodes? I have never seen launch times that bad on Cielo. You could 
>>> try adding -mca routed debruijn -novm and see if that helps. It will reduce 
>>> the amount of communication between compute nodes and the login node.
>> 
>> I believe the debrujin module was turned off a while ago due to a bug that 
>> wasn't fixed. However, try using
> 
> Was it turned off or was the priority lowered? If it was lowered then -mca 
> routed debruijn should work. The -novm is to avoid the bug (as I understand 
> it). I am working on fixing the bug now in hope it will be ready for 1.7.2.

Pretty sure it is ompi_ignored and thus not in the tarball.

> 
> -Nathan




Re: [OMPI users] OpenMPI at scale on Cray XK7

2013-04-23 Thread Mike Clark
Hi,

Just to follow up on this.  We have managed to get OpenMPI to run at large
scale; to do so, we had to use aprun instead of OpenMPI's mpirun command.

While this has allowed us to now run at the full scale of Titan, we have
found a huge drop in MPI_Alltoall performance when running at 18K
processors.  For example, performance per node has decreased by a factor of 200
versus running at 4.6K nodes.  Is there any obvious explanation for this
that I could have overlooked, such as a buffer size or option that needs to
be set (a configure option or environment variable) when running at such
large scale?  We are doing inter-communicator one-way sends, if this
makes any difference.

Yours optimistically,

Mike.


On 4/22/13 3:17 PM, "Mike Clark"  wrote:

>Hi,
>
>I am trying to run OpenMPI on the Cray XK7 system at Oak Ridge National
>Lab (Titan), and am running into an issue whereby MPI_Init seems to hang
>indefinitely, but this issue only arises at large scale, e.g., when
>running on 18560 compute nodes (with two MPI processes per node).  The
>application runs successfully on 4600 nodes, and we are currently trying
>to test a 9000 node job to see if this fails or runs.
>
>We are launching our job using something like the following
>
># mpirun command  
>  
>mpicmd="$OMP_DIR/bin/mpirun --prefix $OMP_DIR -np 37120 --npernode 2
>--bind-to core --bind-to numa $app $args"
># Print  and Run the Command
>  
>echo $mpicmd
>$mpicmd >& $output
>
>Are there any issues that I should be aware of when running OpenMPI on
>37120 processes or when running on the Cray Gemini Interconnect?
>
>We are using OpenMPI 1.7.1 (1.7.x is required for Cray Gemini support)
>and gcc 4.7.2.
>
>Thanks,
>
>Mike.
