On Sep 5, 2014, at 10:44 AM, McGrattan, Kevin B. Dr. <kevin.mcgrat...@nist.gov> 
wrote:

> I am testing a new cluster that we just bought, which is why I am loading 
> things this way. I am deliberately increasing network traffic. But in 
> general, we submit jobs intermittently with various numbers of MPI processes. 
> I have read that a good strategy is to map by socket, which in our case means 
> that we assign 2 MPI processes to node1, which has two sockets, 2 MPI 
> processes to node2, and so on. For my test cases, each has 16 MPI processes, 
> which means that each job is spread out over 8 nodes. Yes, if I were to 
> always load up the entire cluster, I could map the way you suggest, but I am 
> looking for a strategy that gives me optimum performance for both small and 
> large cluster loads.
>  
> Can anyone confirm whether or not it is best to map by socket in cases where 
> you have a light load on your cluster?

It would be about the worst thing you can do, to be honest. The reason is that each 
socket is typically a separate NUMA region, and so the shared memory system would 
perform sub-optimally in that configuration. It would be much better to map by 
core to avoid the NUMA issues.
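
For instance, a minimal sketch (assuming Open MPI 1.8-style options on your 
12-core, two-socket nodes):

mpirun --map-by core --bind-to core --report-bindings -np 16 <executable file name>

--report-bindings prints each rank's binding at launch, so you can verify that no 
process straddles a socket.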

If you want multiple cores per process (say for threading purposes), then you 
can use the pe option to assign them - something like this:

--map-by core:pe=2

would map procs by core, with each process being bound to 2 cores. You'd want 
to make the pe count work out so that no process was bound across a socket 
boundary as that is really bad.
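
On your nodes, for example, 12 cores across 2 sockets means 6 cores per socket, so 
pe counts of 1, 2, 3, or 6 pack evenly into a socket; pe=4 does not divide 6, so 
some process would end up needing cores from both sockets.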

HTH
Ralph

>  
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres 
> (jsquyres)
> Sent: Friday, September 05, 2014 10:37 AM
> To: Open MPI User's List
> Subject: Re: [OMPI users] How does binding option affect network traffic?
>  
> I'm confused, then: why wouldn't you want to minimize the number of servers 
> that a single job runs on?
>  
> I ask because it sounds to me like you're running 12 jobs, each with 1 
> process per server.  And therefore all 12 jobs are running on each server, 
> like this:
>  
> [image001.jpg: diagram of the 12 jobs each spread across every server, 1 process per server]
> With this layout, you're thrashing the server networking resources -- you're 
> forcing the maximum use of the network.
>  
> Why don't you pack the jobs into as few servers as possible, and therefore 
> use shared memory as much as possible, and as little network as possible?  
> This is the conventional wisdom.  ...perhaps I'm missing something in your 
> setup?
>  
> [image002.jpg: the same jobs packed onto as few servers as possible]
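>  
> As a rough back-of-the-envelope illustration on 12-core servers (assuming every 
> pair of ranks exchanges roughly the same traffic): a 16-process job packed onto 
> 2 servers (12 + 4 processes) keeps 12*11/2 + 4*3/2 = 72 of its 120 process pairs 
> on-node, with 48 pairs crossing the network; spread 2 processes per server 
> across 8 servers, only 8 pairs are on-node and 112 cross the network.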
>  
>  
> 
> On Sep 3, 2014, at 10:02 AM, McGrattan, Kevin B. Dr. 
> <kevin.mcgrat...@nist.gov> wrote:
> 
> 
> No, there are 12 cores per node, and 12 MPI processes are assigned to each 
> node. The total RAM usage is about 10% of available. We suspect that the 
> problem might be the combination of MPI message passing and disk I/O to the 
> master node, both of which are handled by InfiniBand. But I do not know how 
> to monitor the traffic, and I do not know how much is too much. Ganglia 
> reports Gigabit Ethernet usage, but we're primarily using IB. 
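> 
> (One low-level thing we could try, assuming Mellanox HCAs and the standard 
> infiniband-diags tools, is sampling the HCA port counters before and after a 
> run, e.g.:
> 
> perfquery -x        # extended PortXmitData / PortRcvData counters
> cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_data
> 
> Both report data in 4-byte words, and the device name mlx4_0 is only a guess for 
> our hardware -- but even then I would not know how much traffic is too much.)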
> 
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres 
> (jsquyres)
> Sent: Tuesday, September 02, 2014 5:41 PM
> To: Open MPI User's List
> Subject: Re: [OMPI users] How does binding option affect network traffic?
> 
> Ah, ok -- I think I missed this part of the thread: each of your individual 
> MPI processes sucks up huge gobs of memory.
> 
> So just to be clear, in general: you don't intend to run more MPI processes 
> than cores per server, *and* you intend to run fewer MPI processes per server 
> than would consume the entire amount of RAM.
> 
> Are both of those always correct (at the same time)?
> 
> If so, it sounds like the first runs that you posted about were heavily 
> overloading the servers in terms of RAM usage.  Specifically: if you were 
> running out of (registered) RAM, I can understand why Open MPI would hang.  
> We have a few known issues where the openib BTL will hang if it runs out of 
> registered memory -- but this is such a small corner case (because no one 
> runs that way) that we've honestly never bothered to fix the issue (it's 
> actually a really complicated resource exhaustion issue -- it's kinda hard to 
> know what the Right Thing is to do when you've run out of memory...).
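> 
> (If you wanted to rule that out, a hedged first check is the locked-memory limit 
> that registered memory is drawn from, on every compute node:
> 
> ulimit -l        # should normally report "unlimited" on IB compute nodes
> 
> and, if it is small, raising the memlock limits in /etc/security/limits.conf for 
> the user running the jobs.  The exact settings depend on your distro and HCA 
> driver, so treat this as a pointer rather than a recipe.)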
> 
> 
> 
> On Sep 2, 2014, at 9:37 AM, McGrattan, Kevin B. Dr. 
> <kevin.mcgrat...@nist.gov> wrote:
> 
> 
> Thanks for the advice. Our jobs vary in size, from just a few MPI processes 
> to about 64. Jobs are submitted at random times, which is why I want to map by 
> socket. If the cluster is empty, and someone submits a job with 16 MPI 
> processes, I would think it would run most efficiently if it used 8 nodes, 2 
> processes per node. If we just fill up two nodes as you suggest, we overload 
> the RAM on those two nodes.
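> 
> (A hedged sketch of what that spread might look like on the mpirun line, 
> assuming 1.8-style options and an allocation of 8 two-socket nodes:
> 
> mpirun --map-by ppr:1:socket --report-bindings -np 16 <executable file name>
> 
> i.e. one process per socket, two per node, eight nodes for a 16-process job.)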
> 
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of 
> tmish...@jcity.maeda.co.jp
> Sent: Friday, August 29, 2014 5:24 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] How does binding option affect network traffic?
> 
> Hi,
> 
> Your cluster is very similar to ours, where Torque and Open MPI are installed.
> 
> I would use this cmd line:
> 
> #PBS -l nodes=2:ppn=12
> mpirun --report-bindings -np 16 <executable file name>
> 
> Here --map-by socket:pe=1 and --bind-to core are assumed as the default settings.
> Then you can run 10 such jobs independently and simultaneously, because you have 
> 20 nodes in total.
> 
> While each node in your cluster has 12 cores, only 8 processes run on each node, 
> which means 66.7% utilization, not 100%.
> I think this loss cannot be avoided as long as you use 16*N MPI processes per 
> job; it's a mismatch with your cluster, which has 12 cores per node.
> If you can use 12*N MPI processes per job, that would be most efficient.
> Is there any reason why you use 16*N MPI processes per job?
> 
> Tetsuya
> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
>  
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
>  
