I am afraid --map-by ppr:32:socket --bind-to core --cpu-list 0,2,4,6...
somehow conflicts internally with other policies. I have also tried
--cpu-set, with identical results. A rankfile is probably my only option.
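
In case it helps anyone else, here is a minimal sketch of generating such a
rankfile instead of writing it by hand. It assumes the "rank N=host slot=M"
syntax and the host names quoted further down in this thread; the function
name make_rankfile and its parameters are my own and purely illustrative.

```python
def make_rankfile(hosts, ranks_per_host, stride=2):
    """Return rankfile text placing consecutive ranks on every `stride`-th core.

    hosts          -- list of host names, in the order ranks should fill them
    ranks_per_host -- number of MPI ranks to place on each host
    stride         -- distance between the cores used by neighbouring ranks
    """
    lines = []
    rank = 0
    for host in hosts:
        for i in range(ranks_per_host):
            # One line per rank, e.g. "rank 5=epsilon102 slot=10"
            lines.append(f"rank {rank}={host} slot={i * stride}")
            rank += 1
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Host names and counts are the ones from this thread (assumption:
    # 2 nodes, 64 ranks per node, every second core). Adjust as needed.
    print(make_rankfile(["epsilon102", "epsilon103"], ranks_per_host=64), end="")
```

Redirect the output to a file and pass that file to mpirun with --rankfile,
together with --report-bindings to check the result.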

On 28/02/2021 22:44, Ralph Castain via users wrote:
> The only way I know of to do what you want is
>
> --map-by ppr:32:socket --bind-to core --cpu-list 0,2,4,6,...
>
> where you list out the exact cpus you want to use.
>
>
>> On Feb 28, 2021, at 9:58 AM, Luis Cebamanos via users
>> <users@lists.open-mpi.org> wrote:
>>
>> I could do --map-by ppr:32:socket:PE=1 --bind-to core (output below)
>> but I cannot see a way to map to every second core (0,2,4,...).
>>
>>  [epsilon110:1489563] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
>> [BB/../../..
>> /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>> ../../../../../../../../../../../../../../../../../..]
>> [epsilon110:1489563] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]:
>> [../BB/../..
>> /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>> ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>> ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>> ../../../../../../../../../../../../../../../../../..]
>>
>> On 28/02/2021 16:24, Ralph Castain via users wrote:
>>> Did you read the documentation on rankfile? The "slot=N" directive
>>> says to "put this proc on core N". In your file, you stipulate that
>>>
>>> rank 0 is to be placed solely on core 0
>>> rank 1 is to be placed solely on core 2
>>> etc.
>>>
>>> That is not what you asked for in your mpirun cmd. You asked that
>>> each proc be mapped to TWO cores (PE=2) or FOUR threads (PE=4 with
>>> bind-to HWT). If you wanted that same thing in a rankfile, it should
>>> have said
>>>
>>> rank 0 slots=0-1
>>> rank 1 slots=2-3
>>> etc.
>>>
>>> Hence the difference. I was simply correcting your mpirun cmd line
>>> as you said you wanted two CORES, and that isn't guaranteed if you
>>> are stipulating things in terms of HWTs as not every machine has two
>>> HWTs/core.
>>>
>>>
>>>
>>>> On Feb 28, 2021, at 7:43 AM, Luis Cebamanos via users
>>>> <users@lists.open-mpi.org> wrote:
>>>>
>>>> Hi Ralph,
>>>>
>>>> Thanks for this. However, --map-by ppr:32:socket:PE=2 --bind-to core
>>>> reports the same binding as --map-by ppr:32:socket:PE=4 --bind-to
>>>> hwthread:
>>>>
>>>> [epsilon104:2861230] MCW rank 0 bound to socket 0[core 0[hwt 0-1]],
>>>> socket 0[core 1[hwt 0-1]]: [BB/BB/../../../../
>>>> ../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>> ../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
>>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
>>>> /../../../../../../../..]
>>>> [epsilon104:2861230] MCW rank 1 bound to socket 0[core 2[hwt 0-1]],
>>>> socket 0[core 3[hwt 0-1]]: [../../BB/BB/../../
>>>> ../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>> ../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
>>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
>>>> /../../../../../../../..]
>>>> [epsilon104:2861230] MCW rank 2 bound to socket 0[core 4[hwt 0-1]],
>>>> socket 0[core 5[hwt 0-1]]: [../../../../BB/BB/
>>>> ../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>> ../../../../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..
>>>> /../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..
>>>> /../../../../../../../..]
>>>>
>>>> And this is still different from the output produced using the rankfile.
>>>>
>>>> Cheers,
>>>> Luis
>>>>
>>>> On 28/02/2021 14:06, Ralph Castain via users wrote:
>>>>> Your command line is incorrect:
>>>>>
>>>>> --map-by ppr:32:socket:PE=4 --bind-to hwthread
>>>>>
>>>>> should be
>>>>>
>>>>> --map-by ppr:32:socket:PE=2 --bind-to core
>>>>>
>>>>>
>>>>>
>>>>>> On Feb 28, 2021, at 5:57 AM, Luis Cebamanos via users
>>>>>> <users@lists.open-mpi.org> wrote:
>>>>>>
>>>>>> I should have said, "I would like to run 128 MPI processes on 2
>>>>>> nodes" and not 64 like I initially said...
>>>>>>
>>>>>> On Sat, 27 Feb 2021, 15:03 Luis Cebamanos, <luic...@gmail.com> wrote:
>>>>>>
>>>>>>     Hello OMPI users,
>>>>>>
>>>>>>     On 128-core nodes, 2 sockets x 64 cores/socket (2 hwthreads/core),
>>>>>>     I am trying to match the behavior of running with a rankfile using
>>>>>>     manual mapping/ranking/binding.
>>>>>>
>>>>>>     I would like to run 64 MPI processes on 2 nodes, 1 MPI process
>>>>>>     every 2 cores. That is, I want to run 32 MPI processes per socket
>>>>>>     on 2 128-core nodes. My mapping should be something like:
>>>>>>
>>>>>>     Node 0
>>>>>>     =====
>>>>>>     rank 0  -  core 0
>>>>>>     rank 1  -  core 2
>>>>>>     rank 2  -  core 4
>>>>>>     ...
>>>>>>     rank 63 - core 126
>>>>>>
>>>>>>
>>>>>>     Node 1
>>>>>>     ====
>>>>>>     rank 64  -  core 0
>>>>>>     rank 65  -  core 2
>>>>>>     rank 66 -   core 4
>>>>>>     ...
>>>>>>     rank 127- core 126
>>>>>>
>>>>>>     If I use a rankfile:
>>>>>>     rank 0=epsilon102 slot=0
>>>>>>     rank 1=epsilon102 slot=2
>>>>>>     rank 2=epsilon102 slot=4
>>>>>>     rank 3=epsilon102 slot=6
>>>>>>     rank 4=epsilon102 slot=8
>>>>>>     rank 5=epsilon102 slot=10
>>>>>>     ....
>>>>>>     rank 123=epsilon103 slot=118
>>>>>>     rank 124=epsilon103 slot=120
>>>>>>     rank 125=epsilon103 slot=122
>>>>>>     rank 126=epsilon103 slot=124
>>>>>>     rank 127=epsilon103 slot=126
>>>>>>
>>>>>>     My --report-bindings output looks like:
>>>>>>
>>>>>>     [epsilon102:2635370] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
>>>>>>     [BB/../../..
>>>>>>     /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>     ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>     ../../../../../../../../../../../../../../../../../..]
>>>>>>     [epsilon102:2635370] MCW rank 1 bound to socket 0[core 2[hwt 0-1]]:
>>>>>>     [../../BB/..
>>>>>>     /../../../../../../../../../../../../../../../../../../../../../../../../../../.
>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>     ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>     ../../../../../../../../../../../../../../../../../..]
>>>>>>     [epsilon102:2635370] MCW rank 2 bound to socket 0[core 4[hwt 0-1]]:
>>>>>>     [../../../..
>>>>>>     /BB/../../../../../../../../../../../../../../../../../../../../../../../../../.
>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>     ../../../../../../..][../../../../../../../../../../../../../../../../../../../.
>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>     ../../../../../../../../../../../../../../../../../..]
>>>>>>
>>>>>>
>>>>>>     However, I cannot match this --report-bindings output by manually
>>>>>>     using --map-by and --bind-to. I had the impression that this
>>>>>>     would give the same result:
>>>>>>
>>>>>>     mpirun -np $SLURM_NTASKS --report-bindings --map-by ppr:32:socket:PE=4 --bind-to hwthread
>>>>>>
>>>>>>     But this output is not quite the same:
>>>>>>
>>>>>>     [epsilon102:2631529] MCW rank 0 bound to socket 0[core 0[hwt 0-1]],
>>>>>>     socket 0[core 1[hwt 0-1]]:
>>>>>>     [BB/BB/../../../../../../../../../../../../../../../../../../../.
>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>     ../../../../../../../../../../../../../../../..][../../../../../../../../../../.
>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>     ../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>>>>>>     [epsilon102:2631529] MCW rank 1 bound to socket 0[core 2[hwt 0-1]],
>>>>>>     socket 0[core 3[hwt 0-1]]:
>>>>>>     [../../BB/BB/../../../../../../../../../../../../../../../../../.
>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>     ../../../../../../../../../../../../../../../..][../../../../../../../../../../.
>>>>>>     ./../../../../../../../../../../../../../../../../../../../../../../../../../../
>>>>>>     ../../../../../../../../../../../../../../../../../../../../../../../../../../..]
>>>>>>
>>>>>>     What am I missing to match the rankfile behavior? Regarding
>>>>>>     performance, what difference is there between the first and
>>>>>>     the second outputs?
>>>>>>
>>>>>>     Thanks for your help!
>>>>>>     Luis
>>>>>>
>>>>>
>>>>
>>>
>>
>
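
For comparing the --report-bindings outputs quoted above, a small helper can
turn a binding mask into a list of core indices. This is only a sketch: it
assumes the mask format seen in this thread (one [...] group per socket,
cores separated by "/", "B" for a bound hwthread, "." for an unbound one),
and that a mask wrapped across several lines by the mailer has been joined
back into one string first. The name bound_cores is mine, not part of
Open MPI.

```python
def bound_cores(mask):
    """Return global indices of cores that have at least one bound hwthread.

    `mask` is a single-line string such as "[BB/../..][../../..]".
    Cores are numbered consecutively across sockets, starting at 0.
    """
    cores = []
    index = 0
    # Drop the outer brackets, then split into one chunk per socket.
    for socket in mask.strip().strip("[]").split("]["):
        for core in socket.split("/"):
            if "B" in core:
                cores.append(index)
            index += 1
    return cores


if __name__ == "__main__":
    # Shortened masks (4 cores per socket) in the style of the outputs above.
    print(bound_cores("[../../BB/..][../../../..]"))   # rankfile-style binding
    print(bound_cores("[../../BB/BB][../../../..]"))   # PE=2 style binding
```

With the rankfile each rank ends up with exactly one core in that list, while
the PE=2 / PE=4 command lines give each rank two adjacent cores.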
