https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80822
Bug ID: 80822
Summary: libgomp incorrect affinity when OMP_PLACES=threads
Product: gcc
Version: 6.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: weeks at iastate dot edu
CC: jakub at gcc dot gnu.org
Target Milestone: ---

Created attachment 41385
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41385&action=edit
xthi.c from Cray, Inc. modified to remove MPI code

On the NERSC Cori system, the Haswell nodes have two Intel Xeon E5-2698 v3
processors, each with 16 CPU cores and HyperThreading enabled. With
OMP_PLACES=threads, libgomp from gcc 6.3.0 appears to mistakenly assume that
CPUs (hardware threads) 0 and 1 share the same core, while in reality 0 and
32 are on the same core, and so on.

To illustrate, attached (xthi-omp.c) is a version of xthi.c from the "Cray XC
Series User Application Placement Guide (CLE 6.0.UP01) S-2496"
(https://pubs.cray.com/content/00330629-DC/FA00256413) that has been modified
to remove the MPI code. The output of an Open MPI 1.10.2 "lstopo --of console"
command (lstopo.out) showing the processor topology is at the bottom of this
text.

In the first example (OMP_NUM_THREADS=32 OMP_PLACES=threads
OMP_PROC_BIND=spread), CPU cores 0, 2, 4, ..., 30 each have two OpenMP
threads, while CPU cores 1, 3, ..., 31 have none:

======================================================================
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,32
$ gcc --version
gcc (GCC) 6.3.0 20161221 (Cray Inc.)
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ gcc -fopenmp -o xthi-omp.x xthi-omp.c
$ OMP_NUM_THREADS=32 OMP_PLACES=threads OMP_PROC_BIND=spread ./xthi-omp.x | sort -k 4n,4n
Hello from thread 0, on nid00009. (core affinity = 0)
Hello from thread 1, on nid00009. (core affinity = 2)
Hello from thread 2, on nid00009. (core affinity = 4)
Hello from thread 3, on nid00009. (core affinity = 6)
Hello from thread 4, on nid00009. (core affinity = 8)
Hello from thread 5, on nid00009. (core affinity = 10)
Hello from thread 6, on nid00009. (core affinity = 12)
Hello from thread 7, on nid00009. (core affinity = 14)
Hello from thread 8, on nid00009. (core affinity = 16)
Hello from thread 9, on nid00009. (core affinity = 18)
Hello from thread 10, on nid00009. (core affinity = 20)
Hello from thread 11, on nid00009. (core affinity = 22)
Hello from thread 12, on nid00009. (core affinity = 24)
Hello from thread 13, on nid00009. (core affinity = 26)
Hello from thread 14, on nid00009. (core affinity = 28)
Hello from thread 15, on nid00009. (core affinity = 30)
Hello from thread 16, on nid00009. (core affinity = 32)
Hello from thread 17, on nid00009. (core affinity = 34)
Hello from thread 18, on nid00009. (core affinity = 36)
Hello from thread 19, on nid00009. (core affinity = 38)
Hello from thread 20, on nid00009. (core affinity = 40)
Hello from thread 21, on nid00009. (core affinity = 42)
Hello from thread 22, on nid00009. (core affinity = 44)
Hello from thread 23, on nid00009. (core affinity = 46)
Hello from thread 24, on nid00009. (core affinity = 48)
Hello from thread 25, on nid00009. (core affinity = 50)
Hello from thread 26, on nid00009. (core affinity = 52)
Hello from thread 27, on nid00009. (core affinity = 54)
Hello from thread 28, on nid00009. (core affinity = 56)
Hello from thread 29, on nid00009. (core affinity = 58)
Hello from thread 30, on nid00009. (core affinity = 60)
Hello from thread 31, on nid00009. (core affinity = 62)
======================================================================
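For reference, the attached xthi-omp.c is a small OpenMP-only affinity
reporter. A minimal sketch of such a program (not the attached file verbatim;
the mask formatting is simplified here) looks like this:

======================================================================
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

/* Print the CPUs in each OpenMP thread's affinity mask, in the spirit
   of Cray's xthi.c with the MPI code removed. */
int main(void)
{
  char host[64];
  gethostname(host, sizeof host);
#pragma omp parallel
  {
    cpu_set_t mask;
    char buf[1024] = "";
    int pos = 0;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof mask, &mask);  /* pid 0 = calling thread */
    for (int c = 0; c < CPU_SETSIZE; c++)
      if (CPU_ISSET(c, &mask))
        pos += snprintf(buf + pos, sizeof buf - pos, "%s%d",
                        pos ? "," : "", c);
    printf("Hello from thread %d, on %s. (core affinity = %s)\n",
           omp_get_thread_num(), host, buf);
  }
  return 0;
}
======================================================================

The spread result above is consistent with libgomp enumerating the 64
single-PU places in plain OS CPU-number order (0, 1, 2, ..., 63): spreading
32 threads over 64 such places selects every other place, 0, 2, ..., 62, and
since PU n and PU n+32 are hardware-thread siblings on this node, those
selections land pairwise on the even-numbered cores.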
In the second example, OMP_PROC_BIND=close results in one OpenMP thread per
core, the opposite of the intended effect:

======================================================================
$ OMP_NUM_THREADS=32 OMP_PLACES=threads OMP_PROC_BIND=close ./xthi-omp.x | sort -k 4n,4n
Hello from thread 0, on nid00009. (core affinity = 0)
Hello from thread 1, on nid00009. (core affinity = 1)
Hello from thread 2, on nid00009. (core affinity = 2)
Hello from thread 3, on nid00009. (core affinity = 3)
Hello from thread 4, on nid00009. (core affinity = 4)
Hello from thread 5, on nid00009. (core affinity = 5)
Hello from thread 6, on nid00009. (core affinity = 6)
Hello from thread 7, on nid00009. (core affinity = 7)
Hello from thread 8, on nid00009. (core affinity = 8)
Hello from thread 9, on nid00009. (core affinity = 9)
Hello from thread 10, on nid00009. (core affinity = 10)
Hello from thread 11, on nid00009. (core affinity = 11)
Hello from thread 12, on nid00009. (core affinity = 12)
Hello from thread 13, on nid00009. (core affinity = 13)
Hello from thread 14, on nid00009. (core affinity = 14)
Hello from thread 15, on nid00009. (core affinity = 15)
Hello from thread 16, on nid00009. (core affinity = 16)
Hello from thread 17, on nid00009. (core affinity = 17)
Hello from thread 18, on nid00009. (core affinity = 18)
Hello from thread 19, on nid00009. (core affinity = 19)
Hello from thread 20, on nid00009. (core affinity = 20)
Hello from thread 21, on nid00009. (core affinity = 21)
Hello from thread 22, on nid00009. (core affinity = 22)
Hello from thread 23, on nid00009. (core affinity = 23)
Hello from thread 24, on nid00009. (core affinity = 24)
Hello from thread 25, on nid00009. (core affinity = 25)
Hello from thread 26, on nid00009. (core affinity = 26)
Hello from thread 27, on nid00009. (core affinity = 27)
Hello from thread 28, on nid00009. (core affinity = 28)
Hello from thread 29, on nid00009. (core affinity = 29)
Hello from thread 30, on nid00009. (core affinity = 30)
Hello from thread 31, on nid00009. (core affinity = 31)
======================================================================
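The sibling layout the runtime needs to respect is available directly from
the kernel; a small sketch that dumps it for every PU (the loop bound of 64
is this node's PU count, not a general value):

======================================================================
#include <stdio.h>

/* Print each CPU's hardware-thread siblings as reported by the Linux
   kernel topology files.  On the node above, cpu0 reports "0,32",
   cpu1 reports "1,33", and so on -- never "0,1". */
int main(void)
{
  char path[128], line[64];
  for (int cpu = 0; cpu < 64; cpu++)
    {
      snprintf(path, sizeof path,
               "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
               cpu);
      FILE *f = fopen(path, "r");
      if (!f)
        break;
      if (fgets(line, sizeof line, f))
        printf("cpu%d: %s", cpu, line);  /* line already ends in '\n' */
      fclose(f);
    }
  return 0;
}
======================================================================

For OMP_PROC_BIND=close to pack two threads per core as intended, consecutive
places in the OMP_PLACES=threads list must be hardware-thread siblings
({0}, {32}, {1}, {33}, ...); a list ordered plainly by OS CPU number
({0}, {1}, {2}, ...) instead yields the one-thread-per-core binding shown
above.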
The Intel 17.0.2 OpenMP runtime uses the correct affinity in both cases:

======================================================================
$ icc --version
icc (ICC) 17.0.2 20170213
Copyright (C) 1985-2017 Intel Corporation.  All rights reserved.

$ icc -qopenmp -o ./xthi-omp.x xthi-omp.c
$ OMP_NUM_THREADS=32 OMP_PLACES=threads OMP_PROC_BIND=spread ./xthi-omp.x | sort -k 4n,4n
Hello from thread 0, on nid00009. (core affinity = 0)
Hello from thread 1, on nid00009. (core affinity = 1)
Hello from thread 2, on nid00009. (core affinity = 2)
Hello from thread 3, on nid00009. (core affinity = 3)
Hello from thread 4, on nid00009. (core affinity = 4)
Hello from thread 5, on nid00009. (core affinity = 5)
Hello from thread 6, on nid00009. (core affinity = 6)
Hello from thread 7, on nid00009. (core affinity = 7)
Hello from thread 8, on nid00009. (core affinity = 8)
Hello from thread 9, on nid00009. (core affinity = 9)
Hello from thread 10, on nid00009. (core affinity = 10)
Hello from thread 11, on nid00009. (core affinity = 11)
Hello from thread 12, on nid00009. (core affinity = 12)
Hello from thread 13, on nid00009. (core affinity = 13)
Hello from thread 14, on nid00009. (core affinity = 14)
Hello from thread 15, on nid00009. (core affinity = 15)
Hello from thread 16, on nid00009. (core affinity = 16)
Hello from thread 17, on nid00009. (core affinity = 17)
Hello from thread 18, on nid00009. (core affinity = 18)
Hello from thread 19, on nid00009. (core affinity = 19)
Hello from thread 20, on nid00009. (core affinity = 20)
Hello from thread 21, on nid00009. (core affinity = 21)
Hello from thread 22, on nid00009. (core affinity = 22)
Hello from thread 23, on nid00009. (core affinity = 23)
Hello from thread 24, on nid00009. (core affinity = 24)
Hello from thread 25, on nid00009. (core affinity = 25)
Hello from thread 26, on nid00009. (core affinity = 26)
Hello from thread 27, on nid00009. (core affinity = 27)
Hello from thread 28, on nid00009. (core affinity = 28)
Hello from thread 29, on nid00009. (core affinity = 29)
Hello from thread 30, on nid00009. (core affinity = 30)
Hello from thread 31, on nid00009. (core affinity = 31)
$ OMP_NUM_THREADS=32 OMP_PLACES=threads OMP_PROC_BIND=close ./xthi-omp.x | sort -k 4n,4n
Hello from thread 0, on nid00009. (core affinity = 0)
Hello from thread 1, on nid00009. (core affinity = 32)
Hello from thread 2, on nid00009. (core affinity = 1)
Hello from thread 3, on nid00009. (core affinity = 33)
Hello from thread 4, on nid00009. (core affinity = 2)
Hello from thread 5, on nid00009. (core affinity = 34)
Hello from thread 6, on nid00009. (core affinity = 3)
Hello from thread 7, on nid00009. (core affinity = 35)
Hello from thread 8, on nid00009. (core affinity = 4)
Hello from thread 9, on nid00009. (core affinity = 36)
Hello from thread 10, on nid00009. (core affinity = 5)
Hello from thread 11, on nid00009. (core affinity = 37)
Hello from thread 12, on nid00009. (core affinity = 6)
Hello from thread 13, on nid00009. (core affinity = 38)
Hello from thread 14, on nid00009. (core affinity = 7)
Hello from thread 15, on nid00009. (core affinity = 39)
Hello from thread 16, on nid00009. (core affinity = 8)
Hello from thread 17, on nid00009. (core affinity = 40)
Hello from thread 18, on nid00009. (core affinity = 9)
Hello from thread 19, on nid00009. (core affinity = 41)
Hello from thread 20, on nid00009. (core affinity = 10)
Hello from thread 21, on nid00009. (core affinity = 42)
Hello from thread 22, on nid00009. (core affinity = 11)
Hello from thread 23, on nid00009. (core affinity = 43)
Hello from thread 24, on nid00009. (core affinity = 12)
Hello from thread 25, on nid00009. (core affinity = 44)
Hello from thread 26, on nid00009. (core affinity = 13)
Hello from thread 27, on nid00009. (core affinity = 45)
Hello from thread 28, on nid00009. (core affinity = 14)
Hello from thread 29, on nid00009. (core affinity = 46)
Hello from thread 30, on nid00009. (core affinity = 15)
Hello from thread 31, on nid00009. (core affinity = 47)
======================================================================
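The place-to-CPU ordering each runtime chose can also be inspected from
inside the program with the OpenMP 4.5 place routines; a minimal sketch (the
array size of 64 assumes this node's PU count):

======================================================================
#include <stdio.h>
#include <omp.h>

/* Report the place each OpenMP thread was assigned and the CPU ids the
   runtime associates with that place.  With OMP_PLACES=threads every
   place should contain exactly one PU. */
int main(void)
{
#pragma omp parallel
  {
    int place = omp_get_place_num();
    int nprocs = omp_get_place_num_procs(place);
    int ids[64];                /* enough for any place on this node */
    omp_get_place_proc_ids(place, ids);
#pragma omp critical
    {
      printf("thread %d -> place %d:", omp_get_thread_num(), place);
      for (int i = 0; i < nprocs && i < 64; i++)
        printf(" %d", ids[i]);
      printf("\n");
    }
  }
  return 0;
}
======================================================================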
Output of "lstopo --of console":

======================================================================
Machine (126GB total)
  NUMANode L#0 (P#0 63GB) + Package L#0 + L3 L#0 (40MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#32)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#33)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#34)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#35)
    L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
      PU L#8 (P#4)
      PU L#9 (P#36)
    L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
      PU L#10 (P#5)
      PU L#11 (P#37)
    L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
      PU L#12 (P#6)
      PU L#13 (P#38)
    L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
      PU L#14 (P#7)
      PU L#15 (P#39)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
      PU L#16 (P#8)
      PU L#17 (P#40)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
      PU L#18 (P#9)
      PU L#19 (P#41)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
      PU L#20 (P#10)
      PU L#21 (P#42)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
      PU L#22 (P#11)
      PU L#23 (P#43)
    L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
      PU L#24 (P#12)
      PU L#25 (P#44)
    L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
      PU L#26 (P#13)
      PU L#27 (P#45)
    L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
      PU L#28 (P#14)
      PU L#29 (P#46)
    L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
      PU L#30 (P#15)
      PU L#31 (P#47)
  NUMANode L#1 (P#1 63GB) + Package L#1 + L3 L#1 (40MB)
    L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
      PU L#32 (P#16)
      PU L#33 (P#48)
    L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
      PU L#34 (P#17)
      PU L#35 (P#49)
    L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
      PU L#36 (P#18)
      PU L#37 (P#50)
    L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
      PU L#38 (P#19)
      PU L#39 (P#51)
    L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
      PU L#40 (P#20)
      PU L#41 (P#52)
    L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
      PU L#42 (P#21)
      PU L#43 (P#53)
    L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
      PU L#44 (P#22)
      PU L#45 (P#54)
    L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
      PU L#46 (P#23)
      PU L#47 (P#55)
    L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
      PU L#48 (P#24)
      PU L#49 (P#56)
    L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
      PU L#50 (P#25)
      PU L#51 (P#57)
    L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
      PU L#52 (P#26)
      PU L#53 (P#58)
    L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
      PU L#54 (P#27)
      PU L#55 (P#59)
    L2 L#28 (256KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
      PU L#56 (P#28)
      PU L#57 (P#60)
    L2 L#29 (256KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
      PU L#58 (P#29)
      PU L#59 (P#61)
    L2 L#30 (256KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
      PU L#60 (P#30)
      PU L#61 (P#62)
    L2 L#31 (256KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
      PU L#62 (P#31)
      PU L#63 (P#63)
======================================================================
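From the topology above, the sibling of PU n is PU n+32 for n < 32, so with
core-first places OMP_PROC_BIND=close should map thread t to PU
t/2 + 32*(t%2). This node-specific arithmetic, purely for illustration,
reproduces the Intel close binding exactly:

======================================================================
#include <stdio.h>

/* Expected OMP_PROC_BIND=close binding on this node, assuming places
   ordered core-first: {0},{32},{1},{33},...  Matches the Intel 17.0.2
   close output shown earlier. */
int main(void)
{
  for (int t = 0; t < 32; t++)
    printf("thread %2d -> PU %2d (core %2d)\n",
           t, t / 2 + 32 * (t % 2), t / 2);
  return 0;
}
======================================================================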