Thank you, Christoph. I had not considered the --cpu-list option.
However, this works only if a single script launches all the jobs (note
that each job may use a different number of CPUs). In my case the same
script (which launches mpirun) is called many times: it is invoked
periodically until all 48 CPUs are busy, and again whenever other jobs
finish...

If the script is always the same, I'm afraid that every job after the
first will be bound to the same cpulist. Am I wrong? Won't
mpirun -n xx --cpu-list "$(seq -s, 0 47)" --bind-to cpu-list:ordered $app
(with xx < 48) bind all the processes to the first xx CPUs of the list?
I need a way to manage the cpulist dynamically, but I was hoping that
mpirun could handle this itself.
If not, I'm afraid the simplest solution is to use --bind-to none.
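For now I may try managing the cpulist myself in the launch script. A rough
sketch of what I have in mind (just an idea, not tested in production: STATE,
TOTAL and NPROCS are placeholder names I made up, and nothing below is an
Open MPI feature):

```shell
# Sketch: hand out a distinct block of CPUs to each invocation of the
# launch script, so every job gets its own --cpu-list.
STATE=${STATE:-$(mktemp)}   # file holding the next free CPU index (placeholder)
TOTAL=48                    # CPUs on the machine
NPROCS=${1:-6}              # CPUs requested by this job

# flock serializes concurrent invocations, so two jobs never get the same range
exec 9>"$STATE.lock"
flock 9
START=$(cat "$STATE" 2>/dev/null)
START=${START:-0}
END=$((START + NPROCS - 1))
if [ "$END" -ge "$TOTAL" ]; then
    echo "no free CPUs, try again later" >&2
    flock -u 9
    exit 1
fi
echo $((END + 1)) > "$STATE"
flock -u 9

# First call gets 0-5, the next 6-11, and so on; here the mpirun command
# is only printed, the real script would execute it.
CPULIST=$(seq -s, "$START" "$END")
echo mpirun -n "$NPROCS" --cpu-list "$CPULIST" --bind-to cpu-list:ordered '$app'
```

Each call would then get a distinct range, like your explicit list but
computed at run time.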
Thank you,
Carlo


Il giorno gio 20 ago 2020 alle ore 17:24 Christoph Niethammer <
nietham...@hlrs.de> ha scritto:

> Hello Carlo,
>
> If you execute multiple mpirun commands, they will not know about each
> other's resource bindings.
> E.g. if you bind to cores, each mpirun will start assigning from the same
> first core again.
> This then results in oversubscription of the cores, which slows down your
> programs - as you noticed.
>
>
> You can use "--cpu-list" together with "--bind-to cpu-list:ordered".
> So if you start all your simulations from a single script, it would look
> like:
>
> mpirun -n 6 --cpu-list "$(seq -s, 0 5)" --bind-to cpu-list:ordered  $app
> mpirun -n 6 --cpu-list "$(seq -s, 6 11)" --bind-to cpu-list:ordered  $app
> ...
> mpirun -n 6 --cpu-list "$(seq -s, 42 47)" --bind-to cpu-list:ordered  $app
>
>
> Best
> Christoph
>
>
> ----- Original Message -----
> From: "Open MPI Users" <users@lists.open-mpi.org>
> To: "Open MPI Users" <users@lists.open-mpi.org>
> Cc: "Carlo Nervi" <carlo.ne...@unito.it>
> Sent: Thursday, 20 August, 2020 12:17:21
> Subject: [OMPI users] OMPI 4.0.4 how to use mpirun properly in numa
> architecture
>
> Dear OMPI community,
> I'm a simple end-user with no particular experience.
> I compile quantum chemical programs and use them in parallel.
>
> My system is a 4-socket, 12-core-per-socket Opteron 6168, for a total of
> 48 cores and 64 GB of RAM. It has 8 NUMA nodes:
>
> openmpi $ hwloc-info
> depth 0:           1 Machine (type #0)
>  depth 1:          4 Package (type #1)
>   depth 2:         8 L3Cache (type #6)
>    depth 3:        48 L2Cache (type #5)
>     depth 4:       48 L1dCache (type #4)
>      depth 5:      48 L1iCache (type #9)
>       depth 6:     48 Core (type #2)
>        depth 7:    48 PU (type #3)
> Special depth -3:  8 NUMANode (type #13)
> Special depth -4:  3 Bridge (type #14)
> Special depth -5:  5 PCIDev (type #15)
> Special depth -6:  5 OSDev (type #16)
>
> lstopo:
>
> openmpi $ lstopo
> Machine (63GB total)
>   Package L#0
>     L3 L#0 (5118KB)
>       NUMANode L#0 (P#0 7971MB)
>       L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0
> (P#0)
>       L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1
> (P#1)
>       L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2
> (P#2)
>       L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3
> (P#3)
>       L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4
> (P#4)
>       L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5
> (P#5)
>       HostBridge
>         PCIBridge
>           PCI 02:00.0 (Ethernet)
>             Net "enp2s0f0"
>           PCI 02:00.1 (Ethernet)
>             Net "enp2s0f1"
>         PCI 00:11.0 (RAID)
>           Block(Disk) "sdb"
>           Block(Disk) "sdc"
>           Block(Disk) "sda"
>         PCI 00:14.1 (IDE)
>         PCIBridge
>           PCI 01:04.0 (VGA)
>     L3 L#1 (5118KB)
>       NUMANode L#1 (P#1 8063MB)
>       L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6
> (P#6)
>       L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7
> (P#7)
>       L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 + PU L#8
> (P#8)
>       L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 + PU L#9
> (P#9)
>       L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core L#10 + PU
> L#10 (P#10)
>       L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core L#11 + PU
> L#11 (P#11)
>   Package L#1
>     L3 L#2 (5118KB)
>       NUMANode L#2 (P#2 8063MB)
>       L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core L#12 + PU
> L#12 (P#12)
>       L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core L#13 + PU
> L#13 (P#13)
>       L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core L#14 + PU
> L#14 (P#14)
>       L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core L#15 + PU
> L#15 (P#15)
>       L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core L#16 + PU
> L#16 (P#16)
>       L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core L#17 + PU
> L#17 (P#17)
>     L3 L#3 (5118KB)
>       NUMANode L#3 (P#3 8063MB)
>       L2 L#18 (512KB) + L1d L#18 (64KB) + L1i L#18 (64KB) + Core L#18 + PU
> L#18 (P#18)
>       L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core L#19 + PU
> L#19 (P#19)
>       L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core L#20 + PU
> L#20 (P#20)
>       L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core L#21 + PU
> L#21 (P#21)
>       L2 L#22 (512KB) + L1d L#22 (64KB) + L1i L#22 (64KB) + Core L#22 + PU
> L#22 (P#22)
>       L2 L#23 (512KB) + L1d L#23 (64KB) + L1i L#23 (64KB) + Core L#23 + PU
> L#23 (P#23)
>   Package L#2
>     L3 L#4 (5118KB)
>       NUMANode L#4 (P#4 8063MB)
>       L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core L#24 + PU
> L#24 (P#24)
>       L2 L#25 (512KB) + L1d L#25 (64KB) + L1i L#25 (64KB) + Core L#25 + PU
> L#25 (P#25)
>       L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core L#26 + PU
> L#26 (P#26)
>       L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core L#27 + PU
> L#27 (P#27)
>       L2 L#28 (512KB) + L1d L#28 (64KB) + L1i L#28 (64KB) + Core L#28 + PU
> L#28 (P#28)
>       L2 L#29 (512KB) + L1d L#29 (64KB) + L1i L#29 (64KB) + Core L#29 + PU
> L#29 (P#29)
>     L3 L#5 (5118KB)
>       NUMANode L#5 (P#5 8063MB)
>       L2 L#30 (512KB) + L1d L#30 (64KB) + L1i L#30 (64KB) + Core L#30 + PU
> L#30 (P#30)
>       L2 L#31 (512KB) + L1d L#31 (64KB) + L1i L#31 (64KB) + Core L#31 + PU
> L#31 (P#31)
>       L2 L#32 (512KB) + L1d L#32 (64KB) + L1i L#32 (64KB) + Core L#32 + PU
> L#32 (P#32)
>       L2 L#33 (512KB) + L1d L#33 (64KB) + L1i L#33 (64KB) + Core L#33 + PU
> L#33 (P#33)
>       L2 L#34 (512KB) + L1d L#34 (64KB) + L1i L#34 (64KB) + Core L#34 + PU
> L#34 (P#34)
>       L2 L#35 (512KB) + L1d L#35 (64KB) + L1i L#35 (64KB) + Core L#35 + PU
> L#35 (P#35)
>   Package L#3
>     L3 L#6 (5118KB)
>       NUMANode L#6 (P#6 8063MB)
>       L2 L#36 (512KB) + L1d L#36 (64KB) + L1i L#36 (64KB) + Core L#36 + PU
> L#36 (P#36)
>       L2 L#37 (512KB) + L1d L#37 (64KB) + L1i L#37 (64KB) + Core L#37 + PU
> L#37 (P#37)
>       L2 L#38 (512KB) + L1d L#38 (64KB) + L1i L#38 (64KB) + Core L#38 + PU
> L#38 (P#38)
>       L2 L#39 (512KB) + L1d L#39 (64KB) + L1i L#39 (64KB) + Core L#39 + PU
> L#39 (P#39)
>       L2 L#40 (512KB) + L1d L#40 (64KB) + L1i L#40 (64KB) + Core L#40 + PU
> L#40 (P#40)
>       L2 L#41 (512KB) + L1d L#41 (64KB) + L1i L#41 (64KB) + Core L#41 + PU
> L#41 (P#41)
>     L3 L#7 (5118KB)
>       NUMANode L#7 (P#7 8062MB)
>       L2 L#42 (512KB) + L1d L#42 (64KB) + L1i L#42 (64KB) + Core L#42 + PU
> L#42 (P#42)
>       L2 L#43 (512KB) + L1d L#43 (64KB) + L1i L#43 (64KB) + Core L#43 + PU
> L#43 (P#43)
>       L2 L#44 (512KB) + L1d L#44 (64KB) + L1i L#44 (64KB) + Core L#44 + PU
> L#44 (P#44)
>       L2 L#45 (512KB) + L1d L#45 (64KB) + L1i L#45 (64KB) + Core L#45 + PU
> L#45 (P#45)
>       L2 L#46 (512KB) + L1d L#46 (64KB) + L1i L#46 (64KB) + Core L#46 + PU
> L#46 (P#46)
>       L2 L#47 (512KB) + L1d L#47 (64KB) + L1i L#47 (64KB) + Core L#47 + PU
> L#47 (P#47)
>
> openmpi $ numactl -H
> available: 8 nodes (0-7)
> node 0 cpus: 0 1 2 3 4 5
> node 0 size: 7971 MB
> node 0 free: 6858 MB
> node 1 cpus: 6 7 8 9 10 11
> node 1 size: 8062 MB
> node 1 free: 6860 MB
> node 2 cpus: 12 13 14 15 16 17
> node 2 size: 8062 MB
> node 2 free: 6979 MB
> node 3 cpus: 18 19 20 21 22 23
> node 3 size: 8062 MB
> node 3 free: 7132 MB
> node 4 cpus: 24 25 26 27 28 29
> node 4 size: 8062 MB
> node 4 free: 6276 MB
> node 5 cpus: 30 31 32 33 34 35
> node 5 size: 8062 MB
> node 5 free: 7190 MB
> node 6 cpus: 36 37 38 39 40 41
> node 6 size: 8062 MB
> node 6 free: 7059 MB
> node 7 cpus: 42 43 44 45 46 47
> node 7 size: 8061 MB
> node 7 free: 7075 MB
> node distances:
> node   0   1   2   3   4   5   6   7
>   0:  10  16  16  22  16  22  16  22
>   1:  16  10  22  16  22  16  22  16
>   2:  16  22  10  16  16  22  16  22
>   3:  22  16  16  10  22  16  22  16
>   4:  16  22  16  22  10  16  16  22
>   5:  22  16  22  16  16  10  22  16
>   6:  16  22  16  22  16  22  10  16
>   7:  22  16  22  16  22  16  16  10
>
>
> I compiled Open MPI 4.0.4, but some bugs probably alter the behavior of
> mpirun, so I'm asking for your suggestions on how to properly run
> parallel code on my system. I'm running Gentoo Linux.
> Questions:
>
> 1) If I recompile Open MPI with a different version, should I also
> recompile all my Open MPI programs? I'm a little confused, since I tried
> downgrading to 4.0.1 but new random errors popped up.
>
> 2) I have many jobs to run, each of which can use from 1 to N CPUs (all
> MPI: for now I tend to avoid OpenMP). Ideally I would like to run the same
> simple mpirun command for each job, so that mpirun distributes jobs and
> processes automatically, taking into account the NUMA architecture (8 NUMA
> nodes, 6 CPUs per node).
> I tried --map-by numa in 4.0.1 (--map-by numa gives errors in 4.0.4), but
> many processes ran at only 10-30% CPU.
> Then I switched back to 4.0.4 (the version I used to compile my programs),
> and the only effective way I found is "mpirun --bind-to none". This way
> all CPUs run at 100%, but I lose efficiency due to NUMA.
>
> What is the correct way (if one exists) to bind (almost) all the processes
> of the same job to a single NUMA node? I'm not looking for a perfect
> solution, but I'd like at least to distribute the first 8 parallel jobs
> across the 8 NUMA nodes.
>
> I hope I made myself clear!
> Thank for your patience and sorry for the long letter,
> Carlo
>
>
>
>
> --
>
> ------------------------------------------------------------
> Prof. Carlo Nervi carlo.ne...@unito.it  Tel:+39 0116707507/8
> Fax: +39 0116707855      -      Dipartimento di Chimica, via
> P. Giuria 7, 10125 Torino, Italy.    http://lem.ch.unito.it/
>
> ICCC 2020 5-10 July 2020, Rimini, Italy: http://www.iccc2020.com
> International Conference on Coordination Chemistry (ICCC 2020)
>
>  <http://www.iccc2020.com/>
>

