Just an update: I have this fixed in the OMPI trunk. It didn't make 1.7.0, but will be in 1.7.1 and beyond.
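
In the meantime, for anyone stuck on 1.6.x or 1.7.0, the rankfile
workaround Gus mentions below would look something like the sketch
here. Caveats: I haven't actually run this, the hostnames are just
placeholders for whatever Torque hands you, and "my_rankfile" /
"./a.out" stand in for your own file and binary. The rankfile syntax
is "rank N=host slot=socket:core-range", so each rank gets pinned to
a pair of successive cores, 8 ranks per socket, 16 per node:

   rank 0=node33 slot=0:0-1
   rank 1=node33 slot=0:2-3
   ...
   rank 7=node33 slot=0:14-15
   rank 8=node33 slot=1:0-1
   ...
   rank 15=node33 slot=1:14-15
   rank 16=node34 slot=0:0-1
   ...

   mpiexec -np 128 -rf my_rankfile --report-bindings ./a.out

Tedious to write out for 128 ranks, but a few lines of scripting
over $PBS_NODEFILE should generate it.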
On Mar 21, 2013, at 2:09 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

> Thank you, Ralph.
>
> I will try to use a rankfile.
>
> In any case, the --cpus-per-proc option is a very useful feature:
> for hybrid MPI+OpenMP programs, for these processors with one FPU
> shared by two cores, etc.
> If it gets fixed in a later release of OMPI, that would be great.
>
> Thank you,
> Gus Correa
>
> On 03/21/2013 04:03 PM, Ralph Castain wrote:
>> I've heard this from a couple of other sources - it looks like
>> there is a problem on the daemons when they compute the location
>> for --cpus-per-proc. I'm not entirely sure why that would be, as
>> the code is supposed to be common with mpirun, but there are a
>> few differences.
>>
>> I will take a look at it - I don't know of any workaround,
>> I'm afraid.
>>
>> On Mar 21, 2013, at 12:01 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>>
>>> Dear Open MPI Pros,
>>>
>>> I am having trouble using mpiexec with --cpus-per-proc
>>> on multiple nodes in OMPI 1.6.4.
>>>
>>> I know there is an ongoing thread on similar runtime issues
>>> in OMPI 1.7. By no means am I trying to hijack T. Mishima's
>>> questions. My question is genuine, though, and perhaps
>>> related to his.
>>>
>>> I am testing a new cluster remotely, with monster dual-socket
>>> nodes carrying 16-core AMD Bulldozer processors (32 cores per
>>> node). I am using OMPI 1.6.4 built with Torque 4.2.1 support.
>>>
>>> I read that on these processors each pair of cores shares an FPU.
>>> Hence, I am trying to run *one MPI process* on each
>>> *pair of successive cores*.
>>> This trick seems to yield better performance
>>> (at least for HPL/Linpack) than using all cores.
>>> I.e., the goal is to use every other core, or rather
>>> to allow each process to wobble across two successive cores only,
>>> hence granting each process exclusive use of one FPU.
>>> [BTW, this is *not* an attempt to do hybrid MPI+OpenMP.
>>> The code is HPL, with MPI+BLAS/LAPACK and NO OpenMP.]
>>>
>>> To achieve this, I am using the mpiexec --cpus-per-proc option.
>>> It works on one node, which is great.
>>> However, unless I made a silly syntax or arithmetic mistake,
>>> it doesn't seem to work on more than one node.
>>>
>>> For instance, this works:
>>>
>>> #PBS -l nodes=1:ppn=32
>>> ...
>>> mpiexec -np 16 \
>>>    --cpus-per-proc 2 \
>>>    --bind-to-core \
>>>    --report-bindings \
>>>    --tag-output \
>>>
>>> I get a pretty nice process-to-cores distribution, with 16
>>> processes, each bound to a pair of successive cores, as expected:
>>>
>>> [1,7]<stderr>:[node33:04744] MCW rank 7 bound to socket 0[core 14-15]: [. . . . . . . . . . . . . . B B][. . . . . . . . . . . . . . . .]
>>> [1,8]<stderr>:[node33:04744] MCW rank 8 bound to socket 1[core 0-1]: [. . . . . . . . . . . . . . . .][B B . . . . . . . . . . . . . .]
>>> [1,9]<stderr>:[node33:04744] MCW rank 9 bound to socket 1[core 2-3]: [. . . . . . . . . . . . . . . .][. . B B . . . . . . . . . . . .]
>>> [1,10]<stderr>:[node33:04744] MCW rank 10 bound to socket 1[core 4-5]: [. . . . . . . . . . . . . . . .][. . . . B B . . . . . . . . . .]
>>> [1,11]<stderr>:[node33:04744] MCW rank 11 bound to socket 1[core 6-7]: [. . . . . . . . . . . . . . . .][. . . . . . B B . . . . . . . .]
>>> [1,12]<stderr>:[node33:04744] MCW rank 12 bound to socket 1[core 8-9]: [. . . . . . . . . . . . . . . .][. . . . . . . . B B . . . . . .]
>>> [1,13]<stderr>:[node33:04744] MCW rank 13 bound to socket 1[core 10-11]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . B B . . . .]
>>> [1,14]<stderr>:[node33:04744] MCW rank 14 bound to socket 1[core 12-13]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . . . B B . .]
>>> [1,15]<stderr>:[node33:04744] MCW rank 15 bound to socket 1[core 14-15]: [. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . B B]
>>> [1,0]<stderr>:[node33:04744] MCW rank 0 bound to socket 0[core 0-1]: [B B . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
>>> [1,1]<stderr>:[node33:04744] MCW rank 1 bound to socket 0[core 2-3]: [. . B B . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
>>> [1,2]<stderr>:[node33:04744] MCW rank 2 bound to socket 0[core 4-5]: [. . . . B B . . . . . . . . . .][. . . . . . . . . . . . . . . .]
>>> [1,3]<stderr>:[node33:04744] MCW rank 3 bound to socket 0[core 6-7]: [. . . . . . B B . . . . . . . .][. . . . . . . . . . . . . . . .]
>>> [1,4]<stderr>:[node33:04744] MCW rank 4 bound to socket 0[core 8-9]: [. . . . . . . . B B . . . . . .][. . . . . . . . . . . . . . . .]
>>> [1,5]<stderr>:[node33:04744] MCW rank 5 bound to socket 0[core 10-11]: [. . . . . . . . . . B B . . . .][. . . . . . . . . . . . . . . .]
>>> [1,6]<stderr>:[node33:04744] MCW rank 6 bound to socket 0[core 12-13]: [. . . . . . . . . . . . B B . .][. . . . . . . . . . . . . . . .]
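>>>
>>> [Side note, in case it helps anyone searching the archives:
>>> if hwloc's standalone tools happen to be installed on the
>>> compute nodes - I am only assuming they are, since OMPI bundles
>>> hwloc internally but does not necessarily install its
>>> command-line tools - then running
>>>
>>>    lstopo
>>>
>>> on a node prints its socket/core layout, which is a handy way
>>> to double-check that the core numbering in the binding map
>>> above matches the actual hardware.]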
>>>
>>> ***************
>>>
>>> However, when I try to use eight nodes,
>>> the job fails, and I get the error message below (repeatedly,
>>> from several nodes):
>>>
>>> #PBS -l nodes=8:ppn=32
>>> ...
>>> mpiexec -np 128 \
>>>    --cpus-per-proc 2 \
>>>    --bind-to-core \
>>>    --report-bindings \
>>>    --tag-output \
>>>
>>> Error message:
>>>
>>> --------------------------------------------------------------------------
>>> An invalid physical processor ID was returned when attempting to bind
>>> an MPI process to a unique processor on node:
>>>
>>>   Node: node18
>>>
>>> This usually means that you requested binding to more processors than
>>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>>> M), or that the node has an unexpectedly different topology.
>>>
>>> Double check that you have enough unique processors for all the
>>> MPI processes that you are launching on this host, and that all nodes
>>> have identical topologies.
>>>
>>> Your job will now abort.
>>> --------------------------------------------------------------------------
>>>
>>> Oddly enough, the binding map *is* shown on STDERR,
>>> and it looks *correct* - pretty much the same binding map above
>>> that I get for a single node.
>>>
>>> *****************
>>>
>>> Finally, replacing "--cpus-per-proc 2" with "--npernode 16"
>>> works to some extent, but doesn't reach my goal.
>>> I.e., the job doesn't fail, and each node does get 16 MPI
>>> processes. However, they are not bound the way I want:
>>> regardless of whether I keep "--bind-to-core"
>>> or replace it with "--bind-to-socket",
>>> all 16 processes on each node bind to socket 0,
>>> and nothing goes to socket 1.
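>>>
>>> [I should add that I have not tried combining "--npernode 16"
>>> with "--bysocket" - this is just my reading of the mpiexec man
>>> page - but since --bysocket round-robins ranks across sockets,
>>> something like
>>>
>>>    mpiexec -np 128 --npernode 16 --bysocket --bind-to-socket ...
>>>
>>> might at least spread the processes over both sockets. It still
>>> would not give me the two-successive-cores binding, though.]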
>>>
>>> ************
>>>
>>> Is there any simple workaround for this
>>> (other than using a rankfile)
>>> to make --cpus-per-proc work with multiple nodes,
>>> using every other core?
>>>
>>> [Only if it is a simple workaround - I must finish this
>>> remote test soon. Otherwise I can revisit this issue later.]
>>>
>>> Thank you,
>>> Gus Correa
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users