Hi,

On 2025-08-07 11:24:18 +0200, Tomas Vondra wrote:
> 2) I'm a bit unsure what "NUMA nodes" actually means. The patch mostly
> assumes each core / piece of RAM is assigned to a particular NUMA node.

There are systems in which some NUMA nodes do *not* contain any CPUs. E.g. if
you attach memory via a CXL/PCIe add-in card, rather than via the CPU's memory
controller. In that case numactl -H (and obviously also the libnuma APIs) will
report that the NUMA node is not associated with any CPU.

I don't currently have live access to such a system, but this Lenovo press
article happens to include numactl -H output:
https://lenovopress.lenovo.com/lp2184-implementing-cxl-memory-on-linux-on-thinksystem-v4-servers
> numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
> 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 96 97 98 
> 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 
> 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 
> 137 138 139 140 141 142 143
> node 0 size: 1031904 MB
> node 0 free: 1025554 MB
> node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 
> 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 
> 95 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 
> 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
> 181 182 183 184 185 186 187 188 189 190 191
> node 1 size: 1032105 MB
> node 1 free: 1024244 MB
> node 2 cpus:
> node 2 size: 262144 MB
> node 2 free: 262143 MB
> node 3 cpus:
> node 3 size: 262144 MB
> node 3 free: 262142 MB
> node distances:
> node   0   1   2   3
>   0:  10  21  14  24
>   1:  21  10  24  14
>   2:  14  24  10  26
>   3:  24  14  26  10

Note that nodes 2 & 3 don't have any associated CPUs (and have higher access
costs).

I don't think this is common enough to worry about from a performance POV, but
we probably shouldn't crash if we encounter it...
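
Something like the following untested sketch (link with -lnuma) should be
enough to detect such memory-only nodes, via numa_node_to_cpus() +
numa_bitmask_weight():

#include <numa.h>
#include <stdio.h>

int
main(void)
{
    struct bitmask *cpus;
    int         node;

    if (numa_available() < 0)
        return 1;

    cpus = numa_allocate_cpumask();

    for (node = 0; node <= numa_max_node(); node++)
    {
        long long   free_mem;
        long long   size = numa_node_size64(node, &free_mem);

        /* fills "cpus" with the CPUs attached to this node */
        if (numa_node_to_cpus(node, cpus) < 0)
            continue;

        printf("node %d: %lld MB, %u CPUs%s\n",
               node, size / (1024 * 1024),
               numa_bitmask_weight(cpus),
               numa_bitmask_weight(cpus) == 0 ? " (memory-only)" : "");
    }

    numa_free_cpumask(cpus);
    return 0;
}

Presumably any node whose cpumask weight is 0 would just need to be skipped
when mapping backends to partitions.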


> But it also cares about the cores (and the node for each core), because
> it uses that to pick the right partition for a backend. And here the
> situation is less clear, because the CPUs don't need to be assigned to a
> particular node, even on a NUMA system. Consider the rpi5 NUMA layout:
>
> $ numactl --hardware
> available: 8 nodes (0-7)
> node 0 cpus: 0 1 2 3
> node 0 size: 992 MB
> node 0 free: 274 MB
> node 1 cpus: 0 1 2 3
> node 1 size: 1019 MB
> node 1 free: 327 MB
> ...
> node   0   1   2   3   4   5   6   7
>   0:  10  10  10  10  10  10  10  10
>   1:  10  10  10  10  10  10  10  10
>   2:  10  10  10  10  10  10  10  10
>   3:  10  10  10  10  10  10  10  10
>   4:  10  10  10  10  10  10  10  10
>   5:  10  10  10  10  10  10  10  10
>   6:  10  10  10  10  10  10  10  10
>   7:  10  10  10  10  10  10  10  10
> This says there are 8 NUMA nodes, each with ~1GB of RAM. But the 4 cores
> are not assigned to particular nodes - each core is mapped to all 8 NUMA
> nodes.

FWIW, you can get a different version of this with AMD Epyc too, if "L3 LLC as
NUMA" is enabled.


> I'm not sure what to do about this (or how getcpu() or libnuma handle this).

I don't immediately see any libnuma functions that would care?

I'm also somewhat curious what getcpu() returns for the current node in a case
like that...
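
On a system like the rpi5 above, an untested snippet like this would show it
(the getcpu() wrapper needs glibc >= 2.29; the numa_node_of_cpu() comparison
needs -lnuma):

#define _GNU_SOURCE
#include <sched.h>              /* getcpu() */
#include <numa.h>
#include <stdio.h>

int
main(void)
{
    unsigned int cpu;
    unsigned int node;

    if (getcpu(&cpu, &node) != 0)
    {
        perror("getcpu");
        return 1;
    }

    /* CPU and node as reported by the kernel for where we're running now */
    printf("running on cpu %u, node %u\n", cpu, node);

    /* compare with the node libnuma maps that CPU to */
    if (numa_available() >= 0)
        printf("numa_node_of_cpu(%u) = %d\n", cpu, numa_node_of_cpu((int) cpu));

    return 0;
}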

Greetings,

Andres Freund

