On Wed, 5 Feb 2020 10:10:06 -0600 Babu Moger <babu.mo...@amd.com> wrote:

> On 2/5/20 3:38 AM, Igor Mammedov wrote:
> > On Tue, 4 Feb 2020 13:08:58 -0600
> > Babu Moger <babu.mo...@amd.com> wrote:
> >
> >> On 2/4/20 2:02 AM, Igor Mammedov wrote:
> >>> On Mon, 3 Feb 2020 13:31:29 -0600
> >>> Babu Moger <babu.mo...@amd.com> wrote:
> >>>
> >>>> On 2/3/20 8:59 AM, Igor Mammedov wrote:
> >>>>> On Tue, 03 Dec 2019 18:36:54 -0600
> >>>>> Babu Moger <babu.mo...@amd.com> wrote:
> >>>>>
> >>>>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
> >>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1728166
> >>>>>>
> >>>>>> Currently, the APIC ID is decoded based on the sequence
> >>>>>> sockets->dies->cores->threads. This works for most standard AMD and
> >>>>>> other vendors' configurations, but this decoding sequence does not
> >>>>>> strictly follow AMD's APIC ID enumeration. In some cases this can
> >>>>>> cause CPU topology inconsistency. When booting a guest VM, the
> >>>>>> kernel tries to validate the topology and finds it inconsistent
> >>>>>> with the enumeration of EPYC cpu models.
> >>>>>>
> >>>>>> To fix the problem we need to build the topology as per the
> >>>>>> Processor Programming Reference (PPR) for AMD Family 17h Model 01h,
> >>>>>> Revision B1 Processors. It is available at
> >>>>>> https://www.amd.com/system/files/TechDocs/55570-B1_PUB.zip
> >>>>>>
> >>>>>> Here is the text from the PPR.
> >>>>>> Operating systems are expected to use
> >>>>>> Core::X86::Cpuid::SizeId[ApicIdSize], the number of least
> >>>>>> significant bits in the Initial APIC ID that indicate core ID
> >>>>>> within a processor, in constructing per-core CPUID masks.
> >>>>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number
> >>>>>> of cores (MNC) that the processor could theoretically support, not
> >>>>>> the number of cores that are actually implemented or enabled on
> >>>>>> the processor, as indicated by Core::X86::Cpuid::SizeId[NC].
> >>>>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
> >>>>>> • ApicId[6] = Socket ID.
> >>>>>> • ApicId[5:4] = Node ID.
> >>>>>> • ApicId[3] = Logical CCX L3 complex ID
> >>>>>> • ApicId[2:0] = (SMT) ? {LogicalCoreID[1:0],ThreadId}
> >>>>>>                       : {1'b0,LogicalCoreID[1:0]}
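
For reference, written out in C the ApicId layout quoted above amounts to
the small helper below. This is only an illustrative sketch of the PPR
text for this particular model with SMT enabled; the function name is
made up and it is not code from the series.

    #include <stdint.h>

    /* Compose an EPYC initial APIC ID per the PPR layout quoted above:
     * ApicId[6] = socket, ApicId[5:4] = node, ApicId[3] = CCX L3
     * complex, ApicId[2:0] = {LogicalCoreID[1:0], ThreadId} with SMT. */
    static uint32_t epyc_apicid_smt(uint32_t socket_id, uint32_t node_id,
                                    uint32_t ccx_id, uint32_t core_id,
                                    uint32_t thread_id)
    {
        return ((socket_id & 0x1) << 6) |
               ((node_id   & 0x3) << 4) |
               ((ccx_id    & 0x1) << 3) |
               ((core_id   & 0x3) << 1) |
               (thread_id  & 0x1);
    }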
> >>>>>
> >>>>> After checking out all the patches and some pondering, the approach
> >>>>> used here looks too intrusive to me for the task at hand, especially
> >>>>> where it touches generic code.
> >>>>>
> >>>>> (Skip ahead to ==== to see the suggestion for how to simplify,
> >>>>> without reading the reasoning behind it first.)
> >>>>>
> >>>>> Let's look for a way to simplify it a little bit.
> >>>>>
> >>>>> So the problems we are trying to solve are:
> >>>>> 1: calculate APIC IDs based on cpu type (to be more specific: for
> >>>>>    EPYC based CPUs)
> >>>>> 2: it depends on knowing the total number of numa nodes.
> >>>>>
> >>>>> Externally, the workflow looks like the following:
> >>>>> 1. user provides -smp x,sockets,cores,...,maxcpus
> >>>>>    that's used by the possible_cpu_arch_ids() singleton to build
> >>>>>    the list of possible CPUs (which is available to the user via
> >>>>>    the command 'hotpluggable-cpus')
> >>>>>
> >>>>>    The hook could be called very early, and the possible_cpus data
> >>>>>    might not be complete. It builds a list of possible CPUs which
> >>>>>    the user could modify later.
> >>>>>
> >>>>> 2.1 user uses "-numa cpu,node-id=x,..." or the legacy
> >>>>>    "-numa node,node_id=x,cpus=" options to assign cpus to nodes,
> >>>>>    which one way or another ends up calling
> >>>>>    machine_set_cpu_numa_node(). The latter updates the
> >>>>>    'possible_cpus' list with node information. It happens early,
> >>>>>    when the total number of nodes is not yet available.
> >>>>>
> >>>>> 2.2 user does not provide explicit node mappings for CPUs.
> >>>>>    QEMU steps in and assigns possible cpus to nodes in
> >>>>>    machine_numa_finish_cpu_init() (using the same
> >>>>>    machine_set_cpu_numa_node()) right before calling the
> >>>>>    board-specific machine init(). At that time the total number of
> >>>>>    nodes is known.
> >>>>>
> >>>>> In cases 1 and 2.1, 'arch_id' in the 'possible_cpus' list doesn't
> >>>>> have to be defined before the board's init() is run.
> >>
> >> In case of 2.1, we need to have the arch_id already generated. This is
> >> done inside possible_cpu_arch_ids. The arch_id is used by
> >> machine_set_cpu_numa_node to assign the cpus to the correct numa node.
> >
> > I might have missed something, but I don't see arch_id itself being
> > used in machine_set_cpu_numa_node(). It only uses the props part of
> > possible_cpus.
>
> Before calling machine_set_cpu_numa_node, we call
> cpu_index_to_instance_props -> x86_cpu_index_to_props ->
> possible_cpu_arch_ids -> x86_possible_cpu_arch_ids.
>
> This sequence sets up the arch_id (in x86_cpu_apic_id_from_index) for
> all the available cpus. Based on the arch_id, it also sets up the props.

x86_possible_cpu_arch_ids()
    arch_id = x86_cpu_apic_id_from_index(x86ms, i)
    x86_topo_ids_from_apicid(arch_id, x86ms->smp_dies, ms->smp.cores,
                             ms->smp.threads, &topo);
    // assign socket/die/core/thread from topo

So currently it uses an indirect way to convert an index in
possible_cpus->cpus[] into socket/die/core/thread ids. But essentially
it takes the '-smp' options and a [0..max_cpus) number as the original
data, converts them into an intermediate apic_id, and then
reverse-engineers that back into topo info.

Why not use x86_topo_ids_from_idx() directly to get rid of the 'props'
dependency on apic_id?
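
For illustration, a minimal sketch of that suggestion inside the
x86_possible_cpu_arch_ids() loop for each index i, using the
x86_topo_ids_from_idx() helper from include/hw/i386/topology.h (the
exact struct and parameter names may differ slightly from the tree):

    X86CPUTopoInfo topo;

    /* derive topo ids straight from the cpu index, with no
     * intermediate apic_id round trip */
    x86_topo_ids_from_idx(x86ms->smp_dies, ms->smp.cores,
                          ms->smp.threads, i, &topo);

    /* (the matching props.has_*_id flags are elided here) */
    ms->possible_cpus->cpus[i].props.socket_id = topo.pkg_id;
    ms->possible_cpus->cpus[i].props.die_id    = topo.die_id;
    ms->possible_cpus->cpus[i].props.core_id   = topo.core_id;
    ms->possible_cpus->cpus[i].props.thread_id = topo.smt_id;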

> And these props values are used to assign the nodes in
> machine_set_cpu_numa_node.
>
> At this point we are still parsing the numa nodes, so we don't know the
> total number of numa nodes yet. Without that information, the arch_id
> generated here will not be correct for EPYC models.
>
> This is the reason for changing the generic numa code (patch #12 -
> Split the numa initialization).
>
> >> If we want to move the arch_id generation into board init(), then we
> >> need to save the cpu indexes belonging to each node somewhere.
> >
> > When cpus are assigned explicitly, the decision about which cpus go
> > to which nodes is up to the user, and the user-configured mapping is
> > stored in MachineState::possible_cpus, which is accessed via the
> > possible_cpu_arch_ids() callback. Hence I don't see any reason to
> > touch cpu indexes.
>
> Please see my reasoning above.
>
> >>>>>
> >>>>> In the 2.2 case it calls get_default_cpu_node_id() ->
> >>>>> x86_get_default_cpu_node_id(), which uses arch_id to calculate the
> >>>>> numa node. But then the question is: does it have to use the APIC
> >>>>> id, or could it infer the 'pkg_id' it's after from
> >>>>> ms->possible_cpus->cpus[i].props data?
> >>>>
> >>>> Not sure if I got the question right. In this case, because the
> >>>> numa information is not provided, all the cpus are assigned to only
> >>>> one node. The apic id is used here to get the correct pkg_id.
> >>>
> >>> The apicid was composed from a socket/core/thread[/die] tuple,
> >>> which is exactly what cpus[i].props is.
> >>>
> >>> The question is whether we can compose only the pkg_id from the
> >>> same data, without converting it to an apicid and then "reverse
> >>> engineering" it back into the original data.
> >>
> >> Yes. It is possible.
> >>
> >>> Or a more direct question: is socket-id the same as pkg_id?
> >>
> >> Yes. Socket_id and pkg_id are the same.
> >>
> >>>>>
> >>>>> With that out of the way, the APIC ID will be used only during the
> >>>>> board's init(), so the board could update possible_cpus with valid
> >>>>> APIC IDs at the start of x86_cpus_init().
> >>>>>
> >>>>> ====
> >>>>> In a nutshell, it would be much easier to do the following:
> >>>>>
> >>>>> 1. make x86_get_default_cpu_node_id() APIC-ID independent, or if
> >>>>>    that's impossible, as an alternative recompute APIC IDs there
> >>>>>    if the cpu type is EPYC based (since the number of nodes is
> >>>>>    already known)
> >>>>> 2. recompute APIC IDs in x86_cpus_init() if the cpu type is EPYC
> >>>>>    based
> >>>>>
> >>>>> This way one doesn't need to touch generic numa code or introduce
> >>>>> an x86-specific init_apicid_fn() hook into generic code, and the
> >>>>> x86/EPYC nuances stay contained within x86 code only.
> >>>>
> >>>> I was already working in a similar direction for v4.
> >>>> 1. We have already split the numa initialization in patch #12
> >>>> (Split the numa initialization). This way we know exactly how many
> >>>> numa nodes there are beforehand.
> >>>
> >>> I suggest dropping that patch. It's the one that touches generic
> >>> numa code and adds more legacy-based extensions like cpu_indexes,
> >>> which I'd like to get rid of to begin with, so that only -numa cpu
> >>> is left.
> >>>
> >>> I think it's not necessary to touch numa code at all for apicid
> >>> generation purposes, as I tried to explain above. We should be able
> >>> to keep this x86-only business.
> >>
> >> This is going to be difficult without touching the generic numa code
> >> (patch #12 - Split the numa initialization).
> >
> > Looking at the current code, I don't see why one would have to touch
> > the numa code. Care to explain in more detail why you'd have to touch
> > it?
>
> Please see the reasoning above.
>
> >>>> 2. Planning to remove init_apicid_fn
> >>>> 3. Insert the handlers inside X86CPUDefinition.
> >>>
> >>> What handlers do you mean?
> >>
> >> The apicid generation logic can be separated into 3 types of
> >> handlers:
> >> x86_apicid_from_cpu_idx: generate apicid from cpu index.
> >> x86_topo_ids_from_apicid: generate topo ids from apic id.
> >> x86_apicid_from_topo_ids: generate apicid from topo ids.
> >>
> >> We should be able to generate one id from the others (see
> >> topology.h).
> >>
> >> X86CPUDefinition will have the handlers specific to each model, like
> >> the way we have features now. The above 3 handlers will be used as
> >> the default handlers.
> >
> > It probably shouldn't be a part of X86CPUDefinition, as it's the
> > machine's responsibility to generate and set the APIC ID.
> >
> > What you are doing with these topo functions in this version looks
> > more than enough to me.
>
> They are all the exact same topo functions; I'm only making these
> functions the handlers inside the X86CPUDefinition.
>
> >> The EPYC model will have its corresponding handlers:
> >>
> >> x86_apicid_from_cpu_idx_epyc
> >> x86_topo_ids_from_apicid_epyc
> >> x86_apicid_from_topo_ids_epyc
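
For illustration only, handler slots in the model definition could look
roughly like the sketch below. The field names here are invented; only
the X86CPUDefinition struct itself and the topo helper signatures
correspond to existing code, and even those are approximated:

    typedef struct X86CPUDefinition {
        const char *name;
        /* ... existing fields ... */

        /* per-model apicid handlers; NULL means use the defaults */
        apic_id_t (*apicid_from_cpu_idx)(unsigned nr_dies,
                                         unsigned nr_cores,
                                         unsigned nr_threads,
                                         unsigned cpu_index);
        void (*topo_ids_from_apicid)(apic_id_t apicid, unsigned nr_dies,
                                     unsigned nr_cores,
                                     unsigned nr_threads,
                                     X86CPUTopoInfo *topo);
        apic_id_t (*apicid_from_topo_ids)(unsigned nr_dies,
                                          unsigned nr_cores,
                                          unsigned nr_threads,
                                          const X86CPUTopoInfo *topo);
    } X86CPUDefinition;

An EPYC model entry would then point these at the _epyc variants, while
every other model would leave them NULL and fall back to the defaults.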
> >
> > The CPU code might use callbacks, but does it have to? I see that
> > cpu_x86_cpuid() uses these functions to decode the apic_id back into
> > topo info and then compose various leaves based on it. Within the CPU
> > code I'd just use
> >
> >     if (i_am_epyc)
> >         x86_topo_ids_from_apicid_epyc()
> >     else
> >         x86_topo_ids_from_apicid()
> >
> > It's easier to read, and one doesn't have to follow an indirection
> > chain to figure out what code is called.
>
> Eduardo already commented on this idea. Anything specific to cpu models
> should be part of the X86CPUDefinition. We should not compare against
> specific models here; comparing against specific models does not scale.
> We are achieving this by loading the model definition (similar to what
> we do in x86_cpu_load_model).

ok

> >>>> 4. The EPYC model will have its own apic id handlers. Everything
> >>>> else will be initialized with the default handlers (the current
> >>>> defaults).
> >>>> 5. The function pc_possible_cpu_arch_ids will load the model
> >>>> definition and initialize the PCMachineState data structure with
> >>>> the model-specific handlers.
> >>>
> >>> I'm not sure what you mean here.
> >>
> >> PCMachineState will have the function pointers to the above handlers.
> >> I was going to load the correct handler based on the model type.
> >
> > It could be done like this, but considering that within the machine
> > we need to calculate the apic_id only once, the same 'if' trick would
> > be simpler:
> >
> >     x86_cpus_init() {
> >
> >         if (cpu == epyc) {
> >             make_epyc_apic_ids(mc->possible_cpu_arch_ids(ms))
> >         }
>
> Once again, this does not scale. Please see my response above.
>
> >         // go on with creating cpus ...
> >     }
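
Filled out slightly, that 'if' sketch might read as follows. Both
make_epyc_apic_ids() (named in the sketch above) and cpu_type_is_epyc()
are hypothetical helpers, shown only to make the shape of the idea
concrete:

    static void x86_cpus_init(X86MachineState *x86ms)
    {
        MachineState *ms = MACHINE(x86ms);
        MachineClass *mc = MACHINE_GET_CLASS(ms);

        /* the total number of numa nodes is known by this point, so
         * the default apic ids can be recomputed in place for EPYC
         * before any CPU object is created */
        if (cpu_type_is_epyc(ms->cpu_type)) {   /* hypothetical check */
            make_epyc_apic_ids(mc->possible_cpu_arch_ids(ms));
        }

        /* go on with creating cpus ... */
    }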

> >>>> Does that sound similar to what you are thinking? Thoughts?
> >>>
> >>> If you have something to share and can push it to github, I can
> >>> look at whether it has design issues, to spare you a round trip on
> >>> the list. (It won't be a proper review, but at least I can help
> >>> pinpoint the most problematic parts.)
> >>
> >> My code for the current approach is kind of ready (yet to be
> >> tested). I can send it as v3.1 if you want to look. Or we can wait
> >> for our discussion to settle. I will post it after our discussion.
> >
> > ok, let's wait till we finish this discussion
>
> I can post my draft patch to give you a better idea of what I am
> talking about now. Let me know.
>
> >> There is one more problem we need to address. I was going to address
> >> it later in v4 or v5.
> >>
> >> This works:
> >> -numa node,nodeid=0,cpus=0-3 -numa node,nodeid=1,cpus=4-7
> >>
> >> This does not work:
> >> -numa node,nodeid=0,cpus=0-5 -numa node,nodeid=1,cpus=6-7
> >
> > Is it supposed to work (i.e. can real hardware do such a topology)?
>
> Hardware does not support this configuration. That is why I did not
> think it was serious enough to fix right now.
>
> >> This requires the generic code to pass the node information to the
> >> x86 code, which requires some handler changes. I was thinking my
> >> code will simplify the changes needed to address this issue.
> >
> > Without more information, it's hard to comment on the issue and on
> > whether the extra complexity of callbacks is justified.
> >
> > There could be 2 ways here: add the fixes to this series so we can
> > see the reason, or keep this series simple, solving only the apic_id
> > problem, and then send a second series on top of it that solves the
> > other issue.
> >
> > Considering that this series is already big/complicated enough,
> > personally I'd go for the 2nd option, as it's easier to describe what
> > the patches are doing and easier to review => it should result in
> > reaching consensus and merging faster.

> [...]