Should be okay. I suspect you are correct in that something isn't right in the fabric.

On Fri, Aug 29, 2014 at 1:06 PM, McGrattan, Kevin B. Dr. <kevin.mcgrat...@nist.gov> wrote:

> I am able to run all 15 of my jobs simultaneously; 16 MPI processes per job; mapping by socket and binding to socket. On a given socket, 6 MPI processes from 6 separate mpiruns share the 6 cores, or at least I assume they are sharing. The load for all CPUs and all processes is 100%. I understand that I am loading the system to its limits, but is what I am doing OK? My jobs are running, and the only problem seems to be that some jobs are hanging at random times. This is a new cluster I am shaking down, and I am guessing that the message passing traffic is causing packet losses. I am working with the vendor to sort this out, but I am curious whether or not I am using OpenMPI appropriately.
>
> #PBS -l nodes=8:ppn=2
> mpirun --report-bindings --bind-to socket --map-by socket:PE=1 -np 16 <executable file name>
>
> The bindings are:
>
> [burn004:07244] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
> [burn004:07244] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
> [burn008:07256] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
> [burn008:07256] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
>
> and so on.
>
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Friday, August 29, 2014 3:26 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] How does binding option affect network traffic?
>
> On Aug 29, 2014, at 10:51 AM, McGrattan, Kevin B. Dr. <kevin.mcgrat...@nist.gov> wrote:
>
> Thanks for the tip. I understand how using the --cpuset option would help me in the example I described. However, suppose I have multiple users submitting MPI jobs of various sizes? I wouldn't know a priori which cores were in use and which weren't. I always assumed that this is what these various schedulers did. Is there a way to map-by socket but not allow a single core to be used by more than one process? At first glance, I thought that --map-by socket and --bind-to core would do this. Would one of these "NOOVERSUBSCRIBE" options help?
>
> I'm afraid not - the issue here is that the mpiruns don't know about each other. What you'd need to do is have your scheduler assign cores for our use - we'll pick that up and stay inside that envelope. The exact scheduler command depends on the scheduler, of course, but the scheduler would then have the more global picture and could keep things separated.
>
> Also, in my test case, I have exactly the right number of cores (240) to run 15 jobs using 16 MPI processes. I am shaking down a new cluster we just bought. This is an extreme case, but not atypical of the way we use our clusters.
>
> Well, you do, but not exactly the way you showed you were trying to use this. If you try to run as you described, with 2 ppn for each mpirun and 12 cores/node, you can run a maximum of 6 mpiruns at a time across a given set of nodes. So you'd need to stage your allocations correctly to make it work.
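
To make that six-at-a-time limit concrete, here is the arithmetic using only the numbers quoted in this thread (a back-of-the-envelope check, nothing more):

    12 cores/node / 2 ranks/node per job  =   6 jobs per node before any core is shared
    15 jobs x 16 ranks                    = 240 ranks
    20 nodes x 12 cores                   = 240 cores

The totals match exactly, but only if no node ends up hosting more than 6 of the 15 jobs - that is, the 8-node allocations have to be spread (or staged) across all 20 nodes rather than stacked onto the same 8.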
> ------------------------------
>
> Date: Thu, 28 Aug 2014 13:27:12 -0700
> From: Ralph Castain <r...@open-mpi.org>
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] How does binding option affect network traffic?
>
> On Aug 28, 2014, at 11:50 AM, McGrattan, Kevin B. Dr. <kevin.mcgrat...@nist.gov> wrote:
>
> My institute recently purchased a Linux cluster with 20 nodes; 2 sockets per node; 6 cores per socket. OpenMPI v 1.8.1 is installed. I want to run 15 jobs. Each job requires 16 MPI processes. For each job, I want to use two cores on each node, mapping by socket. If I use these options:
>
> #PBS -l nodes=8:ppn=2
> mpirun --report-bindings --bind-to core --map-by socket:PE=1 -np 16 <executable file name>
>
> The reported bindings are:
>
> [burn001:09186] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
> [burn001:09186] MCW rank 1 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
> [burn004:07113] MCW rank 6 bound to socket 0[core 0[hwt 0]]: [B/././././.][./././././.]
> [burn004:07113] MCW rank 7 bound to socket 1[core 6[hwt 0]]: [./././././.][B/././././.]
>
> and so on ...
>
> These bindings appear to be OK, but when I do a "top -H" on each node, I see that all 15 jobs use core 0 and core 6 on each node. This means, I believe, that I am only using 1/6 of my resources.
>
> That is correct. The problem is that each mpirun execution has no idea what the others are doing, or even that they exist. Thus, they will each independently bind to core zero and core 6, as you observe. You can get around this by submitting each with a separate --cpuset argument telling it which cpus it is allowed to use - something like this (note that there is no value to having pe=1 as that is automatically what happens with bind-to core):
>
> mpirun --cpuset 0,6 --bind-to core ...
> mpirun --cpuset 1,7 --bind-to core ...
>
> etc. You specified only two procs/node with your PBS request, so we'll only map two on each node. This command line tells the first mpirun to only use cores 0 and 6, and to bind each proc to one of those cores. The second uses only cores 1 and 7, and thus is separated from the first command.
>
> However, you should note that you can't run 15 jobs at the same time in the manner you describe without overloading some cores, as you only have 12 cores/node. This will create a poor-performance situation.
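
Ralph's two example launches extend naturally to six, one per job script, covering all twelve cores of a node. The following is only a sketch of that pattern: check the exact option spelling (--cpuset, as written above) against "mpirun --help" for the installed 1.8.1 before relying on it, and the executable name is a placeholder.

    mpirun --cpuset 0,6  --bind-to core -np 16 <executable file name>   # job 1
    mpirun --cpuset 1,7  --bind-to core -np 16 <executable file name>   # job 2
    mpirun --cpuset 2,8  --bind-to core -np 16 <executable file name>   # job 3
    mpirun --cpuset 3,9  --bind-to core -np 16 <executable file name>   # job 4
    mpirun --cpuset 4,10 --bind-to core -np 16 <executable file name>   # job 5
    mpirun --cpuset 5,11 --bind-to core -np 16 <executable file name>   # job 6

Each pair uses one core on socket 0 (cores 0-5) and one on socket 1 (cores 6-11), matching the 2-ppn PBS request, and the six pairs cover a node's 12 cores without overlap - which is why at most six such jobs can share a given set of nodes at once.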
> I want to use 100%. So I try this:
>
> #PBS -l nodes=8:ppn=2
> mpirun --report-bindings --bind-to socket --map-by socket:PE=1 -np 16 <executable file name>
>
> Now it appears that I am getting 100% usage of all cores on all nodes. The bindings are:
>
> [burn004:07244] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
> [burn004:07244] MCW rank 1 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
> [burn008:07256] MCW rank 3 bound to socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]], socket 1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././.][B/B/B/B/B/B]
> [burn008:07256] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.]
>
> and so on ...
>
> The problem now is that some of my jobs are hanging. They all start running fine, and produce output. But at some point I lose about 4 out of 15 jobs due to hanging. I suspect that an MPI message is passed and not received. The number of jobs that hang and the time when they hang varies from test to test. We have run these cases successfully on our old cluster dozens of times - they are part of our benchmark suite.
>
> Did you have more cores on your old cluster? I suspect the problem here is resource exhaustion, especially if you are using InfiniBand, as you are overloading some of the cores, as mentioned above.
>
> When I run these jobs using a map-by-core strategy (that is, the MPI processes are just mapped by core, and each job only uses 16 cores on two nodes), I do not see as much hanging. It still occurs, but less often. This leads me to suspect that there is something about the increased network traffic due to the map-by-socket approach that is the cause of the problem. But I do not know what to do about it. I think that the map-by-socket approach is the right one, but I do not know if I have my OpenMPI options just right.
>
> Can you tell me what OpenMPI options to use, and can you tell me how I might debug the hanging issue?
>
> Kevin McGrattan
> National Institute of Standards and Technology
> 100 Bureau Drive, Mail Stop 8664
> Gaithersburg, Maryland 20899
> 301 975 2712
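
Pulling the quoted pieces together, a minimal Torque submission script for one of these jobs might look as follows. This is a sketch only: the shebang and the cd to $PBS_O_WORKDIR are assumptions added here, the executable name is the same placeholder used in the thread, and the #PBS and mpirun lines are the bind-to-core variant from the first message (the bind-to-socket variant differs only in the mpirun options).

    #!/bin/bash
    #PBS -l nodes=8:ppn=2
    # Torque starts the job in the home directory; switch to the submission directory.
    cd $PBS_O_WORKDIR
    # 16 ranks over 8 nodes, two per node (one per socket), each bound to a single core.
    mpirun --report-bindings --bind-to core --map-by socket:PE=1 -np 16 <executable file name>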