Igor,
Sorry. Few more questions.

> -----Original Message-----
> From: Igor Mammedov <imamm...@redhat.com>
> Sent: Wednesday, July 29, 2020 9:12 AM
> To: Moger, Babu <babu.mo...@amd.com>
> Cc: pbonz...@redhat.com; r...@twiddle.net; qemu-devel@nongnu.org;
> ehabk...@redhat.com
> Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from
> CpuInstanceProperties
>
> On Mon, 27 Jul 2020 18:59:42 -0500
> Babu Moger <babu.mo...@amd.com> wrote:
>
> > > -----Original Message-----
> > > From: Igor Mammedov <imamm...@redhat.com>
> > > Sent: Monday, July 27, 2020 12:14 PM
> > > To: Moger, Babu <babu.mo...@amd.com>
> > > Cc: qemu-devel@nongnu.org; pbonz...@redhat.com; ehabk...@redhat.com;
> > > r...@twiddle.net
> > > Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from
> > > CpuInstanceProperties
> > >
> > > On Mon, 27 Jul 2020 10:49:08 -0500
> > > Babu Moger <babu.mo...@amd.com> wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Igor Mammedov <imamm...@redhat.com>
> > > > > Sent: Friday, July 24, 2020 12:05 PM
> > > > > To: Moger, Babu <babu.mo...@amd.com>
> > > > > Cc: qemu-devel@nongnu.org; pbonz...@redhat.com; ehabk...@redhat.com;
> > > > > r...@twiddle.net
> > > > > Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from
> > > > > CpuInstanceProperties
> > > > >
> > > > > On Mon, 13 Jul 2020 14:30:29 -0500
> > > > > Babu Moger <babu.mo...@amd.com> wrote:
> > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Igor Mammedov <imamm...@redhat.com>
> > > > > > > Sent: Monday, July 13, 2020 12:32 PM
> > > > > > > To: Moger, Babu <babu.mo...@amd.com>
> > > > > > > Cc: pbonz...@redhat.com; r...@twiddle.net; ehabk...@redhat.com;
> > > > > > > qemu-de...@nongnu.org
> > > > > > > Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids
> > > > > > > from CpuInstanceProperties
> > > > > > >
> > > > > > > On Mon, 13 Jul 2020 11:43:33 -0500
> > > > > > > Babu Moger <babu.mo...@amd.com> wrote:
> > > > > > >
> > > > > > > > On 7/13/20 11:17 AM, Igor Mammedov wrote:
> > > > > > > > > On Mon, 13 Jul 2020 10:02:22 -0500
> > > > > > > > > Babu Moger <babu.mo...@amd.com> wrote:
> > > > > > > > >
> > > > > > > > >>> -----Original Message-----
> > > > > > > > >>> From: Igor Mammedov <imamm...@redhat.com>
> > > > > > > > >>> Sent: Monday, July 13, 2020 4:08 AM
> > > > > > > > >>> To: Moger, Babu <babu.mo...@amd.com>
> > > > > > > > >>> Cc: pbonz...@redhat.com; r...@twiddle.net; ehabk...@redhat.com;
> > > > > > > > >>> qemu-de...@nongnu.org
> > > > > > > > >>> Subject: Re: [PATCH v2 1/3] hw/i386: Initialize
> > > > > > > > >>> topo_ids from CpuInstanceProperties
> > > > > > > > > [...]
> > > > > > > > >>>> +
> > > > > > > > >>>> +/*
> > > > > > > > >>>> + * Initialize topo_ids from CpuInstanceProperties
> > > > > > > > >>>> + * node_id in CpuInstanceProperties (or in CPU device) is a sequential
> > > > > > > > >>>> + * number, but while building the topology we need to separate it for
> > > > > > > > >>>> + * each socket (mod nodes_per_pkg).
> > > > > > > > >>> could you clarify a bit more on why this is necessary?
> > > > > > > > >>
> > > > > > > > >> If you have two sockets and 4 numa nodes, node_id in
> > > > > > > > >> CpuInstanceProperties will be numbered sequentially as 0, 1, 2, 3.
> > > > > > > > >> But in EPYC topology, it will be 0, 1, 0, 1 (basically
> > > > > > > > >> node_id % number of nodes per socket).
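(Side note, for illustration only: a minimal sketch, not part of the patch,
assuming 2 sockets and 4 sequentially numbered numa nodes:

    #include <stdio.h>

    int main(void)
    {
        const unsigned nr_nodes = 4;      /* -numa node ids 0..3 */
        const unsigned nodes_per_pkg = 2; /* 4 nodes across 2 sockets */

        for (unsigned node_id = 0; node_id < nr_nodes; node_id++) {
            /* same derivation as the patch hunk above:
             * props.node_id % MAX(topo_info->nodes_per_pkg, 1) */
            printf("node_id %u -> in-socket node %u\n",
                   node_id, node_id % nodes_per_pkg);
        }
        return 0;
    }

which prints 0, 1, 0, 1 for node_ids 0-3, i.e. the per-socket numbering
described above.)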
> > > > > > > > >
> > > > > > > > > I'm confused, let's suppose we have 2 EPYC sockets with
> > > > > > > > > 2 nodes per socket, so the APIC id would be composed like:
> > > > > > > > >
> > > > > > > > > 1st socket
> > > > > > > > >   pkg_id(0) | node_id(0)
> > > > > > > > >   pkg_id(0) | node_id(1)
> > > > > > > > >
> > > > > > > > > 2nd socket
> > > > > > > > >   pkg_id(1) | node_id(0)
> > > > > > > > >   pkg_id(1) | node_id(1)
> > > > > > > > >
> > > > > > > > > if that's the case, then EPYC's node_id here doesn't
> > > > > > > > > look like a NUMA node in the sense it's usually used
> > > > > > > > > (the above config would have 4 different memory controllers
> > > > > > > > > => 4 conventional NUMA nodes).
> > > > > > > >
> > > > > > > > The EPYC model uses a combination of socket id and node id
> > > > > > > > to identify the numa nodes. So, it internally uses all the
> > > > > > > > information.
> > > > > > >
> > > > > > > well, with the above values EPYC's node_id doesn't look like
> > > > > > > it's specifying a machine numa node, but rather a node index
> > > > > > > within a single socket. In which case, it doesn't make much
> > > > > > > sense calling it NUMA node_id; it's rather some index within
> > > > > > > a socket. (it starts looking like the terminology is all
> > > > > > > mixed up)
> > > > > > >
> > > > > > > If you have access to a multi-socket EPYC machine, can you
> > > > > > > dump and post here its apic ids, please?
> > > > > >
> > > > > > Here is the output from my EPYC machine with 2 sockets and
> > > > > > 8 nodes in total (SMT disabled). The cpus 0-31 are in socket 0
> > > > > > and cpus 32-63 in socket 1.
> > > > > >
> > > > > > # lscpu
> > > > > > Architecture:        x86_64
> > > > > > CPU op-mode(s):      32-bit, 64-bit
> > > > > > Byte Order:          Little Endian
> > > > > > CPU(s):              64
> > > > > > On-line CPU(s) list: 0-63
> > > > > > Thread(s) per core:  1
> > > > > > Core(s) per socket:  32
> > > > > > Socket(s):           2
> > > > > > NUMA node(s):        8
> > > > > > Vendor ID:           AuthenticAMD
> > > > > > CPU family:          23
> > > > > > Model:               1
> > > > > > Model name:          AMD Eng Sample: 1S1901A4VIHF5_30/19_N
> > > > > > Stepping:            2
> > > > > > CPU MHz:             2379.233
> > > > > > CPU max MHz:         1900.0000
> > > > > > CPU min MHz:         1200.0000
> > > > > > BogoMIPS:            3792.81
> > > > > > Virtualization:      AMD-V
> > > > > > L1d cache:           32K
> > > > > > L1i cache:           64K
> > > > > > L2 cache:            512K
> > > > > > L3 cache:            8192K
> > > > > > NUMA node0 CPU(s):   0-7
> > > > > > NUMA node1 CPU(s):   8-15
> > > > > > NUMA node2 CPU(s):   16-23
> > > > > > NUMA node3 CPU(s):   24-31
> > > > > > NUMA node4 CPU(s):   32-39
> > > > > > NUMA node5 CPU(s):   40-47
> > > > > > NUMA node6 CPU(s):   48-55
> > > > > > NUMA node7 CPU(s):   56-63
> > > > > >
> > > > > > Here is the output of #cpuid -l 0x8000001e -r
> > > > > >
> > > > > > (1)
> > > > > > You may want to refer to
> > > > > > https://www.amd.com/system/files/TechDocs/54945_3.03_ppr_ZP_B2_pub.zip
> > > > > > (section 2.1.12.2.1.3 ApicId Enumeration Requirements).
> > > > > > Note that this is a general guideline.
> > > > > > We tried to generalize it in qemu as much as possible. It is a
> > > > > > bit complex.
> > > > > >
> > > > > > CPU 0:
> > > > > >    0x8000001e 0x00: eax=0x00000000 ebx=0x00000100 ecx=0x00000300
> > > > > >    edx=0x00000000
> > > > > [...]
> > > > > > CPU 63:
> > > > > >    0x8000001e 0x00: eax=0x0000007e ebx=0x0000011f ecx=0x00000307
> > > > > >    edx=0x00000000
> > > > >
> > > > > > > > > I wonder if the linux guest actually uses the node_id
> > > > > > > > > encoded in the apic id for configuring/checking numa
> > > > > > > > > structures, or if it just uses whatever the ACPI SRAT
> > > > > > > > > table provided.
> > > > > > > > >
> > > > > > > > >>>> + */
> > > > > > > > >>>> +static inline void x86_init_topo_ids(X86CPUTopoInfo *topo_info,
> > > > > > > > >>>> +                                     CpuInstanceProperties props,
> > > > > > > > >>>> +                                     X86CPUTopoIDs *topo_ids)
> > > > > > > > >>>> +{
> > > > > > > > >>>> +    topo_ids->smt_id = props.has_thread_id ? props.thread_id : 0;
> > > > > > > > >>>> +    topo_ids->core_id = props.has_core_id ? props.core_id : 0;
> > > > > > > > >>>> +    topo_ids->die_id = props.has_die_id ? props.die_id : 0;
> > > > > > > > >>>> +    topo_ids->node_id = props.has_node_id ?
> > > > > > > > >>>> +        props.node_id % MAX(topo_info->nodes_per_pkg, 1) : 0;
> > > > >
> > > > > It looks like I was wrong pushing the system-wide NUMA node-id
> > > > > into the APIC ID (the chosen naming is a bit confusing); per [1]
> > > > > mentioned above, EPYC's node-id is:
> > > > >
> > > > > * ApicId[6]   = Socket ID.
> > > > > * ApicId[5:4] = Node ID.
> > > > > * ApicId[3]   = Logical CCX L3 complex ID
> > > > > * ApicId[2:0] = (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}
> > > > >
> > > > > which can hold only the values 0-3, and is defined as:
> > > > >
> > > > > "A node, is an integrated circuit device that includes one to 8
> > > > > cores (one or two Core Complexes)."
> > > > >
> > > > > The spec also mentions it indirectly as die-id if one looks at
> > > > > CPUID_Fn8000001E_EBX [Core Identifiers] (Core::X86::Cpuid::CoreId) ...
> > > > >   CoreId = ({2'b0, DieId[1:0], LogicalComplexId[0], LogicalThreadId[2:0]} >> SMT
> > > > >
> > > > > and in
> > > > > (2)
> > > > > CPUID_Fn8000001E_ECX [Node Identifiers] (Core::X86::Cpuid::NodeId) ...
> > > > >   {5'b00000, 1'b[SOCKET_ID], 2'b[DIE_ID]}
> > > > >
> > > > > The question is why we did not reuse topo_ids->die_id instead of
> > > > > adding the confusing topo_ids->node_id in the first place?
> > > >
> > > > Initially, I thought about it. But Intel uses die_id differently
> > > > than AMD. So, I did not want to complicate things.
> > > > If we take that route then we need to re-arrange the numa code, as
> > > > we need the numa information to calculate the die id. So, I did
> > > > not want to mix things up.
> > > >
> > > > > Also, looking at the APIC ID and SRAT table provided here,
> > > > > CPUID_Fn8000001E_ECX corresponds to the NUMA node id (i.e. what
> > > > > -numa in QEMU is used for) and the Node ID embedded into
> > > > > ApicId[5:4] is basically the die-id.
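(Side note: to make the bit layout above concrete, here is a small sketch
based purely on the PPR excerpt quoted above, not on the QEMU code, for
the SMT-disabled case:

    #include <stdint.h>

    /* ApicId[6] = socket, ApicId[5:4] = node (die), ApicId[3] = CCX,
     * ApicId[2:0] = logical core, per the family 17h PPR excerpt above. */
    static uint32_t epyc_apic_id(uint32_t socket_id, uint32_t node_id,
                                 uint32_t ccx_id, uint32_t core_id)
    {
        return ((socket_id & 0x1) << 6) |
               ((node_id   & 0x3) << 4) |
               ((ccx_id    & 0x1) << 3) |
                (core_id   & 0x7);
    }

For example, epyc_apic_id(1, 1, 0, 2) gives 0x52: socket 1, node 1 within
that socket, CCX 0, core 2. The 2-bit node field is exactly why this node
id can only hold the values 0-3.)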
> > > > >
> > > > > The difference between the die-id implemented in QEMU and EPYC's
> > > > > die id (topo_ids->node_id) is that the former doesn't require a
> > > > > numa config (maybe it should, but that ship's already sailed) and
> > > > > gets the number of dies from the '-smp dies=X' CLI option, while
> > > > > for EPYC we calculate topo_ids->node_id implicitly from the
> > > > > number of numa nodes and sockets, which implicitly requires that
> > > > > the machine 'must' be configured with -numa options.
> > > > >
> > > > > Maybe we should drop this implicit calculation along with
> > > > > topo_ids->node_id and reuse '-smp dies=X', plus extra checks for
> > > > > EPYC to ask for -numa if there is more than 1 die; and, if we
> > > > > need to be really strict, add checks for the limit of dies/cores
> > > > > within a socket/die according to the spec[2], so the encoded
> > > > > APIC ID and CPUID_8000001E match the spec.
> > > >
> > > > There will be complications when the user configures with both
> > > > die_id and numa_id. It will complicate things further. I will have
> > > > to look closely at the code to see if it is feasible.
> > >
> > > it's worth a try.
> > > conceptually die_id in intel/amd is the same. Most likely intel has
> > > a dedicated memory controller on each die, so it still should form a
> > > NUMA node. But that aspect was probably ignored while implementing
> > > it in QEMU, so the onus of configuring QEMU right is on the user's
> > > shoulders (there are no checks whatsoever that a cpu belonging to a
> > > specific die is on the right NUMA node).
> >
> > So you are suggesting to use die_id to build the topology for EPYC.
> > Also initialize die_id based on the numa information. Re-arrange the
> > numa code to make sure we have all the information before we build
> > the topology. And then remove the node_id inside X86CPUTopoIDs. Is
> > that the plan?
> reusing die_id might simplify the logic, and at the very least we won't
> have 2 very similar fields to deal with. With die_id it should be
> conditional on EPYC.
Not convinced that using die_id will solve the problem here. But I am
going to investigate it a little bit.

> Regardless of using die_id, we should
> (1) error out if the topology will require more than 1 numa node and no
> numa config was provided.

We already have a node_id check in numa_cpu_pre_plug. Do you want me to
bring this check into pc_cpu_pre_plug?

> (2) for 1 NUMA node use autonuma to create 1 node implicitly, which
> requires converting the static MachineClass::auto_enable_numa into an
> instance specific value, i.e. moving it into MachineState, so that we
> can change it at runtime depending on CPU type.

Isn't it already taken care of in numa_complete_configuration when
num_nodes = 0? Where does this change go, if it is required at all?

> (3) use the NUMA id from CPU::node-id for CPUID_8000001E and have
> checks that will ensure that the used value can fit in the CPUID leaf.

The node_id is already part of the apic_id. We can easily extract it from
the apic_id. I have already sent a patch to simplify CPUID_8000001E. I
will make it part of this series.
https://lore.kernel.org/qemu-devel/159164753686.20543.4158548114186964547.st...@naples-babu.amd.com/

> > >
> > > What AMD has implemented on top of that in CPU hw is to expose the
> > > NUMA node id in CPUID_8000001E. I don't know why it was done, as
> > > usually it's the ACPI tables that describe the relations between
> > > nodes, so for the OS this info is almost useless (I'd guess it's
> > > faster to use CPUID instead of fetching a per-cpu variable, but
> > > that's pretty much it from the OS point of view)
> > >
> > > > > > > > >>>> +    topo_ids->pkg_id = props.has_socket_id ? props.socket_id : 0;
> > > > > > > > >>>> +}
> > > > > > > > >>>>  /*
> > > > > > > > >>>>   * Make APIC ID for the CPU 'cpu_index'
> > > > > > > > >>>>   *
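(Side note on extracting the node id from the apic_id, as mentioned for
(3) above: a rough sketch assuming the EPYC bit layout from the PPR
excerpt earlier in the thread; the real code would take the field offsets
from the topology info rather than hard-coding them:

    #include <stdint.h>

    /* ApicId[5:4] = node id, ApicId[6] = socket id (family 17h layout) */
    static uint32_t apicid_node_id(uint32_t apic_id)
    {
        return (apic_id >> 4) & 0x3;
    }

    static uint32_t apicid_pkg_id(uint32_t apic_id)
    {
        return (apic_id >> 6) & 0x1;
    }

These two fields together are what the CPUID_Fn8000001E_ECX
{1'b[SOCKET_ID],2'b[DIE_ID]} encoding needs.)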