On Sep 5, 2014, at 10:44 AM, McGrattan, Kevin B. Dr. <kevin.mcgrat...@nist.gov> wrote:
> I am testing a new cluster that we just bought, which is why I am loading things this way. I am deliberately increasing network traffic. But in general, we submit jobs intermittently with various numbers of MPI processes. I have read that a good strategy is to map by socket, which in our case means that we assign 2 MPI processes to node1, which has two sockets, 2 MPI processes to node2, and so on. For my test cases, each has 16 MPI processes, which means that each job is spread out over 8 nodes. Yes, if I were to always load up the entire cluster, I could map the way you suggest, but I am looking for a strategy that gives me optimum performance for small cluster loads as well as large ones.
>
> Can anyone confirm whether or not it is best to map by socket in cases where you have a light load on your cluster?

It would be about the worst thing you can do, to be honest. The reason is that each socket is typically a separate NUMA region, and so shared-memory communication would be sub-optimal in that configuration. It would be much better to map by core to avoid the NUMA issues.

If you want multiple cores per process (say, for threading purposes), then you can use the pe option to assign them - something like this:

--map-by core:pe=2

would map procs by core, with each process being bound to 2 cores. You'd want to make the pe count work out so that no process is bound across a socket boundary, as that is really bad.

HTH
Ralph
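To make the two layouts concrete, here is roughly what they look like on the command line for one of the 16-process jobs; ./my_app stands in for the real executable, and --report-bindings prints where each rank actually lands so the binding can be verified:

  # map and bind by core, as suggested above
  mpirun -np 16 --map-by core --bind-to core --report-bindings ./my_app

  # same, but with 2 cores per rank (e.g. for threaded ranks); pick the pe
  # count so that no rank straddles a socket boundary
  mpirun -np 16 --map-by core:pe=2 --report-bindings ./my_app

  # the map-by-socket layout being asked about, for comparison
  mpirun -np 16 --map-by socket --bind-to core --report-bindings ./my_app

Comparing the --report-bindings output of the first and last commands makes the NUMA point easy to see: with map-by core the ranks are packed onto consecutive cores, while map-by socket alternates them across sockets.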
>
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres)
> Sent: Friday, September 05, 2014 10:37 AM
> To: Open MPI User's List
> Subject: Re: [OMPI users] How does binding option affect network traffic?
>
> I'm confused, then: why wouldn't you want to minimize the number of servers that a single job runs on?
>
> I ask because it sounds to me like you're running 12 jobs, each with 1 process per server. And therefore all 12 jobs are running on each server, like this:
>
> <image001.jpg>
>
> With this layout, you're thrashing the server networking resources -- you're forcing the maximum use of the network.
>
> Why don't you pack the jobs into as few servers as possible, and therefore use shared memory as much as possible, and as little network as possible? This is the conventional wisdom. ...perhaps I'm missing something in your setup?
>
> <image002.jpg>
>
> On Sep 3, 2014, at 10:02 AM, McGrattan, Kevin B. Dr. <kevin.mcgrat...@nist.gov> wrote:
>
> > No, there are 12 cores per node, and 12 MPI processes are assigned to each node. The total RAM usage is about 10% of available. We suspect that the problem might be the combination of MPI message passing and disk I/O to the master node, both of which are handled by InfiniBand. But I do not know how to monitor the traffic, and I do not know how much is too much. Ganglia reports Gigabit Ethernet usage, but we're primarily using IB.
> >
> > -----Original Message-----
> > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres)
> > Sent: Tuesday, September 02, 2014 5:41 PM
> > To: Open MPI User's List
> > Subject: Re: [OMPI users] How does binding option affect network traffic?
> >
> > Ah, ok -- I think I missed this part of the thread: each of your individual MPI processes sucks up huge gobs of memory.
> >
> > So just to be clear, in general: you don't intend to run more MPI processes than cores per server, *and* you intend to run fewer MPI processes per server than would consume the entire amount of RAM.
> >
> > Are both of those always correct (at the same time)?
> >
> > If so, it sounds like the first runs that you posted about were heavily overloading the servers in terms of RAM usage. Specifically: if you were running out of (registered) RAM, I can understand why Open MPI would hang. We have a few known issues where the openib BTL will hang if it runs out of registered memory -- but this is such a small corner case (because no one runs that way) that we've honestly never bothered to fix the issue (it's actually a really complicated resource exhaustion issue -- it's kinda hard to know what the Right Thing is to do when you've run out of memory...).
> >
> > On Sep 2, 2014, at 9:37 AM, McGrattan, Kevin B. Dr. <kevin.mcgrat...@nist.gov> wrote:
> >
> > > Thanks for the advice. Our jobs vary in size, from just a few MPI processes to about 64. Jobs are submitted at random, which is why I want to map by socket. If the cluster is empty, and someone submits a job with 16 MPI processes, I would think it would run most efficiently if it used 8 nodes, 2 processes per node. If we just fill up two nodes as you suggest, we overload the RAM on those two nodes.
> > >
> > > -----Original Message-----
> > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of tmish...@jcity.maeda.co.jp
> > > Sent: Friday, August 29, 2014 5:24 PM
> > > To: Open MPI Users
> > > Subject: Re: [OMPI users] How does binding option affect network traffic?
> > >
> > > Hi,
> > >
> > > Your cluster is very similar to ours, where Torque and Open MPI are installed.
> > >
> > > I would use this cmd line:
> > >
> > > #PBS -l nodes=2:ppn=12
> > > mpirun --report-bindings -np 16 <executable file name>
> > >
> > > Here --map-by socket:pe=1 and --bind-to core are assumed as the default settings. Then you can run 10 jobs independently and simultaneously because you have 20 nodes in total.
> > >
> > > While each node in your cluster has 12 cores, the number of MPI processes running on a node is 8, which means 66.7% usage, not 100%. I think this loss cannot be avoided as long as you use 16*N MPI processes per job. It's a kind of mismatch with your cluster, which has 12 cores per node. If you can use 12*N MPI processes per job, that would be most efficient. Is there any reason why you use 16*N MPI processes per job?
> > >
> > > Tetsuya
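Written out as a complete Torque submission script, the suggestion above might look something like this sketch; the job name, working-directory handling, and executable name are placeholders, and the command itself is the one quoted in the thread:

  #!/bin/bash
  #PBS -N bindtest
  #PBS -l nodes=2:ppn=12

  cd $PBS_O_WORKDIR

  # 16 ranks on a 2-node x 12-core (24-slot) allocation; with no explicit
  # mapping options this relies on the defaults cited above
  # (--map-by socket:pe=1, --bind-to core)
  mpirun --report-bindings -np 16 ./my_app

The --report-bindings output in the job log then shows exactly which cores each of the 16 ranks was bound to on the two nodes.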
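On the separate question raised earlier in the thread of how to monitor the InfiniBand traffic itself, one simple option is to sample the HCA port counters before and after a run; the device name (mlx4_0) and port number below are only examples and will differ from system to system:

  # extended 64-bit port counters, via the infiniband-diags package
  perfquery -x

  # or read the raw counters from sysfs; port_xmit_data and port_rcv_data
  # count in units of 4 bytes
  cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_data
  cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_data

Sampling these counters at fixed intervals gives an approximate bandwidth figure that can be compared against the link rate to judge how busy the fabric actually is.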