On Oct 3, 2012, at 8:40 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
>
>> As I said, in the absence of a hostfile, -host assigns ONE slot for
>> each time a host is named. So the equivalent hostfile would have
>> "slots=1" to create the same pattern as your -host cmd line.
>
> That would mean that a hostfile has nothing to do with the underlying
> hardware and that it would be a mystery to find out how to set it up.

That's correct - an unfortunate aspect of using hostfiles. This is one of
the big motivations for the changes in 1.7 and beyond.
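
For reference, the hostfile equivalent to your "-host sunpc0,sunpc1" command
line would be just this (a minimal sketch using the same two host names):

  sunpc0 slots=1
  sunpc1 slots=1
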
> Now I found a different solution so that I'm a little bit satisfied that
> I don't need a different hostfile for every "mpiexec" command. I
> sorted the output and removed the output from "hostname" so that
> everything is more readable. Is the keyword "sockets" available in
> openmpi-1.7 and openmpi-1.9 as well?

No - it is no longer required with 1.7 and beyond because we now have the
ability to directly sense the hardware, so we no longer need users to tell us.
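
As an illustration only (an untested sketch - in the 1.7 series the mapping
and binding switches move to --map-by and --bind-to, and this is not an exact
replacement for -cpus-per-proc 2), a 1.7-style invocation would look roughly
like:

  mpiexec --report-bindings -host sunpc0,sunpc1 -np 4 \
          --map-by socket --bind-to core hostname
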
> tyr fd1026 252 cat host_sunpc0_1
> sunpc0 sockets=2 slots=4
> sunpc1 sockets=2 slots=4
>
> tyr fd1026 253 mpiexec -report-bindings -hostfile host_sunpc0_1 \
>   -np 4 -npersocket 1 -cpus-per-proc 2 -bynode -bind-to-core hostname
> [sunpc0:12641] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc1:01402] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc0:12641] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
> [sunpc1:01402] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]
>
> tyr fd1026 254 mpiexec -report-bindings -host sunpc0,sunpc1 \
>   -np 4 -cpus-per-proc 2 -bind-to-core -bysocket hostname
> [sunpc0:12676] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc1:01437] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc0:12676] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
> [sunpc1:01437] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]
>
> tyr fd1026 258 mpiexec -report-bindings -hostfile host_sunpc0_1 \
>   -np 2 -npernode 1 -cpus-per-proc 4 -bind-to-core hostname
> [sunpc0:12833] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B]
> [sunpc1:01561] MCW rank 1 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B]
>
> tyr fd1026 259 mpiexec -report-bindings -host sunpc0,sunpc1 \
>   -np 2 -cpus-per-proc 4 -bind-to-core hostname
> [sunpc0:12869] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B]
> [sunpc1:01600] MCW rank 1 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B]
>
> Thank you very much for your answers and your time. I have learned
> a lot about process bindings through our discussion. Now I'm waiting
> for a bug fix for my problem with rankfiles. :-))
>
> Kind regards
>
> Siegmar
>
>> On Oct 3, 2012, at 7:12 AM, Siegmar Gross
>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>
>>> Hi,
>>>
>>> I thought that "slot" is the smallest manageable entity, so that I
>>> must set "slot=4" for a dual-processor dual-core machine with one
>>> hardware thread per core. Today I learned about the new keyword
>>> "sockets" for a hostfile (I didn't find it in "man orte_hosts").
>>> How would I specify a system with two dual-core processors so that
>>> "mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4
>>>   -cpus-per-proc 2 -bind-to-core hostname" or even
>>> "mpiexec -report-bindings -hostfile host_sunpc0_1 -np 2
>>>   -cpus-per-proc 4 -bind-to-core hostname" would work in the same way
>>> as the commands below?
>>>
>>> tyr fd1026 217 mpiexec -report-bindings -host sunpc0,sunpc1 -np 2 \
>>>   -cpus-per-proc 4 -bind-to-core hostname
>>> [sunpc0:11658] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B]
>>> sunpc0
>>> [sunpc1:00553] MCW rank 1 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B]
>>> sunpc1
>>>
>>> Thank you very much for your help in advance.
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>
>>>>> I recognized another problem with process bindings. The command
>>>>> works if I use "-host" and breaks if I use "-hostfile" with the
>>>>> same machines.
>>>>>
>>>>> tyr fd1026 178 mpiexec -report-bindings -host sunpc0,sunpc1 -np 4 \
>>>>>   -cpus-per-proc 2 -bind-to-core hostname
>>>>> sunpc1
>>>>> [sunpc1:00086] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
>>>>> [sunpc1:00086] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]
>>>>> sunpc0
>>>>> [sunpc0:10929] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
>>>>> sunpc0
>>>>> [sunpc0:10929] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
>>>>> sunpc1
>>>>
>>>> Yes, this works because you told us there is only ONE slot on each
>>>> host. As a result, we split the 4 processes across the two hosts
>>>> (both of which are now oversubscribed), resulting in TWO processes
>>>> running on each host. Since there are 4 cores on each host, and
>>>> you asked for 2 cores/process, we can make this work.
>>>>
>>>>> tyr fd1026 179 cat host_sunpc0_1
>>>>> sunpc0 slots=4
>>>>> sunpc1 slots=4
>>>>>
>>>>> tyr fd1026 180 mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4 \
>>>>>   -cpus-per-proc 2 -bind-to-core hostname
>>>>
>>>> And this will of course not work. In your hostfile, you told us there
>>>> are FOUR slots on each host. Since the default is to map by slot, we
>>>> correctly mapped all four processes to the first node. We then tried
>>>> to bind 2 cores for each process, resulting in 8 cores - which is
>>>> more than you have.
>>>>
>>>>> --------------------------------------------------------------------------
>>>>> An invalid physical processor ID was returned when attempting to bind
>>>>> an MPI process to a unique processor.
>>>>>
>>>>> This usually means that you requested binding to more processors than
>>>>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>>>>> M). Double check that you have enough unique processors for all the
>>>>> MPI processes that you are launching on this host.
>>>>>
>>>>> You job will now abort.
>>>>> --------------------------------------------------------------------------
>>>>> sunpc0
>>>>> [sunpc0:10964] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
>>>>> sunpc0
>>>>> [sunpc0:10964] MCW rank 1 bound to socket 1[core 0-1]: [. .][B B]
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec was unable to start the specified application as it encountered
>>>>> an error on node sunpc0. More information may be available above.
>>>>> --------------------------------------------------------------------------
>>>>> 4 total processes failed to start
>>>>>
>>>>> Perhaps this error is related to the other errors. Thank you very
>>>>> much for any help in advance.
>>>>>
>>>>> Kind regards
>>>>>
>>>>> Siegmar
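
P.S. For that failing prompt-180 case (slots=4 on each host, mapped by slot by
default), one mapping that might spread the four ranks across both nodes and
avoid the oversubscription - an unverified sketch along the lines of the
-bynode run at prompt 253 near the top of this message - would be:

  mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4 \
          -bynode -cpus-per-proc 2 -bind-to-core hostname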