On Oct 3, 2012, at 8:40 AM, Siegmar Gross 
<siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
> 
>> As I said, in the absence of a hostfile, -host assigns ONE slot for
>> each time a host is named. So the equivalent hostfile would have
>> "slots=1" to create the same pattern as your -host cmd line.
> 
> That would mean that a hostfile has nothing to do with the underlying
> hardware, and that working out how to set one up correctly would be a mystery.

That's correct - an unfortunate aspect of using hostfiles. This is one of the 
big motivations for the changes in 1.7 and beyond.
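
For reference, the hostfile equivalent of "-host sunpc0,sunpc1" (one slot per
naming) would look like this - just a sketch using the host names from your
examples:

  sunpc0 slots=1
  sunpc1 slots=1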

> Now I have found a different solution, so I am at least somewhat satisfied
> that I don't need a different hostfile for every "mpiexec" command. I
> sorted the output and removed the output from "hostname" so that
> everything is more readable. Is the keyword "sockets" available in
> openmpi-1.7 and openmpi-1.9 as well?

No - it is no longer required with 1.7 and beyond, because we can now sense 
the hardware directly, so users no longer need to tell us.
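
In other words, with 1.7 the hostfile below could - as a sketch - drop the
"sockets" entry and keep only the slot counts:

  sunpc0 slots=4
  sunpc1 slots=4

because the socket/core layout is detected automatically.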


> 
> tyr fd1026 252 cat host_sunpc0_1
> sunpc0 sockets=2 slots=4
> sunpc1 sockets=2 slots=4
> 
> tyr fd1026 253 mpiexec -report-bindings -hostfile host_sunpc0_1 \
>  -np 4 -npersocket 1 -cpus-per-proc 2 -bynode -bind-to-core hostname
> [sunpc0:12641] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc1:01402] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc0:12641] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
> [sunpc1:01402] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]
> 
> tyr fd1026 254 mpiexec -report-bindings -host sunpc0,sunpc1 \
>  -np 4 -cpus-per-proc 2 -bind-to-core -bysocket hostname
> [sunpc0:12676] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc1:01437] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
> [sunpc0:12676] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
> [sunpc1:01437] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]
> 
> tyr fd1026 258 mpiexec -report-bindings -hostfile host_sunpc0_1 \
>  -np 2 -npernode 1 -cpus-per-proc 4 -bind-to-core hostname
> [sunpc0:12833] MCW rank 0 bound to socket 0[core 0-1]
>                                   socket 1[core 0-1]: [B B][B B]
> [sunpc1:01561] MCW rank 1 bound to socket 0[core 0-1]
>                                   socket 1[core 0-1]: [B B][B B]
> 
> tyr fd1026 259 mpiexec -report-bindings -host sunpc0,sunpc1 \
>  -np 2 -cpus-per-proc 4 -bind-to-core hostname
> [sunpc0:12869] MCW rank 0 bound to socket 0[core 0-1]
>                                   socket 1[core 0-1]: [B B][B B]
> [sunpc1:01600] MCW rank 1 bound to socket 0[core 0-1]
>                                   socket 1[core 0-1]: [B B][B B]
> 
> 
> Thank you very much for your answers and your time. I have learned
> a lot about process bindings through our discussion. Now I'm waiting
> for a bug fix for my problem with rankfiles. :-))
> 
> 
> Kind regards
> 
> Siegmar
> 
> 
> 
>> On Oct 3, 2012, at 7:12 AM, Siegmar Gross
>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>> 
>>> Hi,
>>> 
>>> I thought that a "slot" is the smallest manageable entity, so that I
>>> would have to set "slots=4" for a dual-processor dual-core machine with
>>> one hardware thread per core. Today I learned about the new keyword
>>> "sockets" for a hostfile (I didn't find it in "man orte_hosts").
>>> How would I specify a system with two dual-core processors so that
>>> "mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4
>>> -cpus-per-proc 2 -bind-to-core hostname" or even
>>> "mpiexec -report-bindings -hostfile host_sunpc0_1 -np 2
>>> -cpus-per-proc 4 -bind-to-core hostname" would work in the same way
>>> as the commands below?
>>> 
>>> tyr fd1026 217 mpiexec -report-bindings -host sunpc0,sunpc1 -np 2 \
>>> -cpus-per-proc 4 -bind-to-core hostname
>>> [sunpc0:11658] MCW rank 0 bound to socket 0[core 0-1]
>>> socket 1[core 0-1]: [B B][B B]
>>> sunpc0
>>> [sunpc1:00553] MCW rank 1 bound to socket 0[core 0-1]
>>> socket 1[core 0-1]: [B B][B B]
>>> sunpc1
>>> 
>>> 
>>> Thank you very much for your help in advance.
>>> 
>>> 
>>> Kind regards
>>> 
>>> Siegmar
>>> 
>>> 
>>> 
>>>>> I noticed another problem with process bindings. The command
>>>>> works if I use "-host" and breaks if I use "-hostfile" with
>>>>> the same machines.
>>>>> 
>>>>> tyr fd1026 178 mpiexec -report-bindings -host sunpc0,sunpc1 -np 4 \
>>>>> -cpus-per-proc 2 -bind-to-core hostname
>>>>> sunpc1
>>>>> [sunpc1:00086] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .]
>>>>> [sunpc1:00086] MCW rank 3 bound to socket 1[core 0-1]: [. .][B B]
>>>>> sunpc0
>>>>> [sunpc0:10929] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
>>>>> sunpc0
>>>>> [sunpc0:10929] MCW rank 2 bound to socket 1[core 0-1]: [. .][B B]
>>>>> sunpc1
>>>>> 
>>>>> 
>>>> 
>>>> Yes, this works because you told us there is only ONE slot on each
>>>> host. As a result, we split the 4 processes across the two hosts
>>>> (both of which are now oversubscribed), resulting in TWO processes
>>>> running on each host. Since there are 4 cores on each host, and
>>>> you asked for 2 cores/process, we can make this work.
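>>>>
>>>> (Worked out: 2 processes per host x 2 cores per process = 4 cores, which
>>>> exactly matches the 4 cores available on each host.)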
>>>> 
>>>> 
>>>>> tyr fd1026 179 cat host_sunpc0_1 
>>>>> sunpc0 slots=4
>>>>> sunpc1 slots=4
>>>>> 
>>>>> 
>>>>> tyr fd1026 180 mpiexec -report-bindings -hostfile host_sunpc0_1 -np 4 \
>>>>> -cpus-per-proc 2 -bind-to-core hostname
>>>> 
>>>> And this will of course not work. In your hostfile, you told us there
>>>> are FOUR slots on each host. Since the default is to map by slot, we
>>>> correctly mapped all four processes to the first node. We then tried
>>>> to bind 2 cores for each process, resulting in 8 cores - which is
>>>> more than you have.
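>>>>
>>>> (That is: 4 processes x 2 cores per process = 8 cores needed, but each
>>>> host has only 2 sockets x 2 cores = 4 cores.)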
>>>> 
>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> An invalid physical processor ID was returned when attempting to bind
>>>>> an MPI process to a unique processor.
>>>>> 
>>>>> This usually means that you requested binding to more processors than
>>>>> exist (e.g., trying to bind N MPI processes to M processors, where N >
>>>>> M).  Double check that you have enough unique processors for all the
>>>>> MPI processes that you are launching on this host.
>>>>> 
>>>>> Your job will now abort.
>>>>> --------------------------------------------------------------------------
>>>>> sunpc0
>>>>> [sunpc0:10964] MCW rank 0 bound to socket 0[core 0-1]: [B B][. .]
>>>>> sunpc0
>>>>> [sunpc0:10964] MCW rank 1 bound to socket 1[core 0-1]: [. .][B B]
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec was unable to start the specified application as it encountered
>>>>> an error
>>>>> on node sunpc0. More information may be available above.
>>>>> --------------------------------------------------------------------------
>>>>> 4 total processes failed to start
>>>>> 
>>>>> 
>>>>> Perhaps this error is related to the other errors. Thank you very
>>>>> much for any help in advance.
>>>>> 
>>>>> 
>>>>> Kind regards
>>>>> 
>>>>> Siegmar
>>>>> 
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users