Re: [OMPI users] Problem with mpiexec --cpus-per-proc in multiple nodes in OMPI 1.6.4
Just an update: I have this fixed in the OMPI trunk. It didn't make 1.7.0, but will be in 1.7.1 and beyond.

On Mar 21, 2013, at 2:09 PM, Gus Correa wrote:

> Thank you, Ralph.
>
> I will try to use a rankfile.
>
> In any case, the --cpus-per-proc option is a very useful feature:
> for hybrid MPI+OpenMP programs, for these processors with one FPU
> shared by two cores, etc.
> If it gets fixed in a later release of OMPI, that would be great.
>
> Thank you,
> Gus Correa
>
> On 03/21/2013 04:03 PM, Ralph Castain wrote:
>> I've heard this from a couple of other sources - it looks like there is
>> a problem on the daemons when they compute the location for
>> --cpus-per-proc. I'm not entirely sure why that would be, as the code is
>> supposed to be common with mpirun, but there are a few differences.
>>
>> I will take a look at it - I don't know of any workaround, I'm afraid.
>>
>> On Mar 21, 2013, at 12:01 PM, Gus Correa wrote:
>>
>>> Dear Open MPI Pros,
>>>
>>> I am having trouble using mpiexec with --cpus-per-proc
>>> on multiple nodes in OMPI 1.6.4.
>>>
>>> I know there is an ongoing thread on similar runtime issues
>>> in OMPI 1.7. By no means am I trying to hijack T. Mishima's
>>> questions. My question is genuine, though, and perhaps related to his.
>>>
>>> I am testing a new cluster remotely, with monster dual-socket
>>> 16-core AMD Bulldozer processors (32 cores per node).
>>> I am using OMPI 1.6.4 built with Torque 4.2.1 support.
>>>
>>> I read that on these processors each pair of cores shares an FPU.
>>> Hence, I am trying to run *one MPI process* on each
>>> *pair of successive cores*.
>>> This trick seems to yield better performance
>>> (at least for HPL/Linpack) than using all cores.
>>> I.e., the goal is to use "every other core", or perhaps
>>> to allow each process to wobble across two successive cores only,
>>> hence granting exclusive use of one FPU per process.
>>> [BTW, this is *not* an attempt to do hybrid MPI+OpenMP.
>>> The code is HPL with MPI+BLAS/LAPACK and NO OpenMP.]
>>>
>>> To achieve this, I am using the mpiexec --cpus-per-proc option.
>>> It works on one node, which is great.
>>> However, unless I made a silly syntax or arithmetic mistake,
>>> it doesn't seem to work on more than one node.
>>>
>>> For instance, this works:
>>>
>>> #PBS -l nodes=1:ppn=32
>>> ...
>>> mpiexec -np 16 \
>>>    --cpus-per-proc 2 \
>>>    --bind-to-core \
>>>    --report-bindings \
>>>    --tag-output \
>>>
>>> I get a pretty nice process-to-cores distribution, with 16 processes
>>> and each process bound to a pair of successive cores, as expected:
>>>
>>> [1,7]:[node33:04744] MCW rank 7 bound to socket 0[core 14-15]: [. . . . . . . . . . . . . . B B][. . . . . . . . . . . . . . . .]
>>> [1,8]:[node33:04744] MCW rank 8 bound to socket 1[core 0-1]: [. . . . . . . . . . . . . . . .][B B . . . . . . . . . . . . . .]
>>> [1,9]:[node33:04744] MCW rank 9 bound to socket 1[core 2-3]: [. . . . . . . . . . . . . . . .][. . B B . . . . . . . . . . . .]
>>> [1,10]:[node33:04744] MCW rank 10 bound to socket 1[core 4-5]: [. . . . . . . . . . . . . . . .][. . . . B B . . . . . . . . . .]
>>> [1,11]:[node33:04744] MCW rank 11 bound to socket 1[core 6-7]: [. . . . . . . . . . . . . . . .][. . . . . . B B . . . . . . . .]
>>> [1,12]:[node33:04744] MCW rank 12 bound to socket 1[core 8-9]: [. . . . . . . . . . . . . . . .][. . . . . . . . B B . . . . . .]
>>> [1,13]:[node33:04744] MCW rank 13 bound to socket 1[core 10-11]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . B B . . . .]
>>> [1,14]:[node33:04744] MCW rank 14 bound to socket 1[core 12-13]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . . . B B . .]
>>> [1,15]:[node33:04744] MCW rank 15 bound to socket 1[core 14-15]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . B B]
>>> [1,0]:[node33:04744] MCW rank 0 bound to socket 0[core 0-1]: [B B . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
>>> [1,1]:[node33:04744] MCW rank 1 bound to socket 0[core 2-3]: [. . B B . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
>>> [1,2]:[node33:04744] MCW rank 2 bound to socket 0[core 4-5]: [. . . . B B . . . . . . . . . .][. . . . . . . . . . . . . . . .]
>>> [1,3]:[node33:04744] MCW rank 3 bound to socket 0[core 6-7]: [. . . . . . B B . . . . . . . .][. . . . . . . . . . . . . . . .]
>>> [1,4]:[node33:04744] MCW rank 4 bound to socket 0[core 8-9]: [. . . . . . . . B B . . . . . .][. . . . . . . . . . . . . . . .]
>>> [1,5]:[node33:04744] MCW rank 5 bound to socket 0[core 10-11]: [. . . . . . . . . . B B . . . .][. . . . . . . . . . . . . . . .]
>>> [1,6]:[node33:04744] MCW rank 6 bound to socket 0[core 12-13]: [. . . . . . . . . . . . B B . .][. . . . . . . . . . . . . . . .]
>>>
>>> ***
>>>
>>> However, when I try to use eight nodes, the job fails and I get
>>> the error message below (repeatedly from several nodes):
>>>
>>> #PBS -l nodes=8:ppn=32
>>> ...
>>> mpiexec -np 128 \
>>>    --cpus-per-proc 2 \
>>>    --bind-to-core \
>>>    --report-bindings \
>>>    --tag-output \
>>>
>>> Error message:
>>> --
>>> An invalid physical processor ID was returned when atte
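For readers who want to reproduce the single-node case, a complete Torque job script built from the options quoted above might look like the sketch below. This is only a sketch: the executable name (./xhpl) is a placeholder, since the original command was cut off before the program argument, and the HPL input setup is not shown.

    #!/bin/bash
    #PBS -l nodes=1:ppn=32
    cd $PBS_O_WORKDIR

    # One rank per pair of adjacent cores: 16 ranks, 2 cores each, bound to cores.
    # "./xhpl" is a placeholder for the actual HPL binary.
    mpiexec -np 16 \
            --cpus-per-proc 2 \
            --bind-to-core \
            --report-bindings \
            --tag-output \
            ./xhpl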
Re: [OMPI users] Problem with mpiexec --cpus-per-proc in multiple nodes in OMPI 1.6.4
Thank you, Ralph!

Gus Correa

On 03/29/2013 09:33 AM, Ralph Castain wrote:
> Just an update: I have this fixed in the OMPI trunk. It didn't make 1.7.0, but will be in 1.7.1 and beyond.
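Until the fix appears in 1.7.1, the rankfile approach Gus mentions is one way to express the same one-rank-per-core-pair layout explicitly across several nodes. The sketch below is only illustrative, not a confirmed solution from this thread: it assumes two 16-core sockets per node with logical socket:core numbering in the rankfile, takes the node list from Torque's $PBS_NODEFILE, and again uses ./xhpl as a placeholder binary. Whether logical or physical core IDs are required may depend on how the BIOS and hwloc number the Bulldozer cores.

    # Build a rankfile giving each rank a pair of adjacent cores
    # (2 sockets x 16 cores per node -> 16 ranks per node, 128 ranks on 8 nodes).
    rm -f my_rankfile
    rank=0
    for node in $(sort -u $PBS_NODEFILE); do
        for socket in 0 1; do
            for pair in 0 1 2 3 4 5 6 7; do
                first=$((2 * pair))
                second=$((2 * pair + 1))
                echo "rank $rank=$node slot=$socket:$first-$second" >> my_rankfile
                rank=$((rank + 1))
            done
        done
    done

    # Let the rankfile drive placement; "./xhpl" is a placeholder binary.
    mpiexec -np 128 --rankfile my_rankfile --report-bindings --tag-output ./xhpl

With a rankfile, the --cpus-per-proc and --bind-to-core options should not be needed, since the slot specification itself determines both placement and binding; --report-bindings remains useful to verify the result.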