On Wed, 5 Feb 2020 13:07:31 -0600 Babu Moger <babu.mo...@amd.com> wrote:
> On 2/5/20 10:56 AM, Igor Mammedov wrote:
> > On Wed, 5 Feb 2020 10:10:06 -0600
> > Babu Moger <babu.mo...@amd.com> wrote:
> >
> >> On 2/5/20 3:38 AM, Igor Mammedov wrote:
> >>> On Tue, 4 Feb 2020 13:08:58 -0600
> >>> Babu Moger <babu.mo...@amd.com> wrote:
> >>>
> >>>> On 2/4/20 2:02 AM, Igor Mammedov wrote:
> >>>>> On Mon, 3 Feb 2020 13:31:29 -0600
> >>>>> Babu Moger <babu.mo...@amd.com> wrote:
> >>>>>
> >>>>>> On 2/3/20 8:59 AM, Igor Mammedov wrote:
> >>>>>>> On Tue, 03 Dec 2019 18:36:54 -0600
> >>>>>>> Babu Moger <babu.mo...@amd.com> wrote:
> >>>>>>>
> >>>>>>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
> >>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1728166
> >>>>>>>>
> >>>>>>>> Currently, the APIC ID is decoded based on the sequence
> >>>>>>>> sockets->dies->cores->threads. This works for most standard AMD and
> >>>>>>>> other vendors' configurations, but this decoding sequence does not
> >>>>>>>> follow that of AMD's APIC ID enumeration strictly. In some cases this
> >>>>>>>> can cause CPU topology inconsistency. When booting a guest VM, the
> >>>>>>>> kernel tries to validate the topology, and finds it inconsistent with
> >>>>>>>> the enumeration of EPYC cpu models.
> >>>>>>>>
> >>>>>>>> To fix the problem we need to build the topology as per the Processor
> >>>>>>>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
> >>>>>>>> Processors. It is available at
> >>>>>>>> https://www.amd.com/system/files/TechDocs/55570-B1_PUB.zip
> >>>>>>>>
> >>>>>>>> Here is the text from the PPR.
> >>>>>>>> Operating systems are expected to use
> >>>>>>>> Core::X86::Cpuid::SizeId[ApicIdSize], the number of least significant
> >>>>>>>> bits in the Initial APIC ID that indicate core ID within a processor,
> >>>>>>>> in constructing per-core CPUID masks.
> >>>>>>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of
> >>>>>>>> cores (MNC) that the processor could theoretically support, not the
> >>>>>>>> actual number of cores that are actually implemented or enabled on the
> >>>>>>>> processor, as indicated by Core::X86::Cpuid::SizeId[NC].
> >>>>>>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
> >>>>>>>> • ApicId[6] = Socket ID.
> >>>>>>>> • ApicId[5:4] = Node ID.
> >>>>>>>> • ApicId[3] = Logical CCX L3 complex ID
> >>>>>>>> • ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} :
> >>>>>>>>   {1'b0,LogicalCoreID[1:0]}
> >>>>>>>
> >>>>>>>
> >>>>>>> After checking out all patches and some pondering, used here approach
> >>>>>>> looks to me too intrusive for the task at hand especially where it
> >>>>>>> comes to generic code.
> >>>>>>>
> >>>>>>> (Ignore till ==== to see suggestion how to simplify without reading
> >>>>>>> reasoning behind it first)
> >>>>>>>
> >>>>>>> Lets look for a way to simplify it a little bit.
> >>>>>>>
> >>>>>>> So problem we are trying to solve,
> >>>>>>>  1: calculate APIC IDs based on cpu type (to be more specific: for
> >>>>>>>     EPYC based CPUs)
> >>>>>>>  2: it depends on knowing total number of numa nodes.
> >>>>>>>
> >>>>>>> Externally workflow looks like following:
> >>>>>>>   1. user provides -smp x,sockets,cores,...,maxcpus
> >>>>>>>      that's used by possible_cpu_arch_ids() singleton to build list of
> >>>>>>>      possible CPUs (which is available to user via command
> >>>>>>>      'hotpluggable-cpus')
> >>>>>>>
> >>>>>>>      Hook could be called very early and possible_cpus data might be
> >>>>>>>      not complete. It builds a list of possible CPUs which user could
> >>>>>>>      modify later.
> >>>>>>>
> >>>>>>>   2.1 user uses "-numa cpu,node-id=x,..." or legacy
> >>>>>>>      "-numa node,node_id=x,cpus="
> >>>>>>>      options to assign cpus to nodes, which is one way or another
> >>>>>>>      calling machine_set_cpu_numa_node(). The latter updates
> >>>>>>>      'possible_cpus' list with node information. It happens early when
> >>>>>>>      total number of nodes is not available.
> >>>>>>>
> >>>>>>>   2.2 user does not provide explicit node mappings for CPUs.
> >>>>>>>      QEMU steps in and assigns possible cpus to nodes in
> >>>>>>>      machine_numa_finish_cpu_init()
> >>>>>>>      (using the same machine_set_cpu_numa_node()) right before calling
> >>>>>>>      boards specific machine init(). At that time total number of nodes
> >>>>>>>      is known.
> >>>>>>>
> >>>>>>> In 1 -- 2.1 cases, 'arch_id' in 'possible_cpus' list doesn't have to
> >>>>>>> be defined before boards init() is run.
> >>>>
> >>>> In case of 2.1, we need to have the arch_id already generated. This is
> >>>> done inside possible_cpu_arch_ids. The arch_id is used by
> >>>> machine_set_cpu_numa_node to assign the cpus to correct numa node.
> >>>
> >>> I might have missed something but I don't see arch_id itself being used in
> >>> machine_set_cpu_numa_node(). It only uses props part of possible_cpus
> >>
> >> Before calling machine_set_cpu_numa_node, we call
> >> cpu_index_to_instance_props -> x86_cpu_index_to_props->
> >> possible_cpu_arch_ids->x86_possible_cpu_arch_ids.
> >>
> >> This sequence sets up the arch_id(in x86_cpu_apic_id_from_index) for all
> >> the available cpus. Based on the arch_id, it also sets up the props.
> >
> >
> > x86_possible_cpu_arch_ids()
> >    arch_id = x86_cpu_apic_id_from_index(x86ms, i)
> >    x86_topo_ids_from_apicid(arch_id, x86ms->smp_dies, ms->smp.cores,
> >                             ms->smp.threads, &topo);
> >    // assign socket/die/core/thread from topo
> >
> > so currently it uses indirect way to convert index in possible_cpus->cpus[]
> > to socket/die/core/thread ids.
> > But essentially it take '-smp' options and [0..max_cpus) number as original
> > data converts it into intermediate apic_id and then reverse engineer it
> > back to topo info.
> >
> > Why not use x86_topo_ids_from_idx() directly to get rid of 'props'
> > dependency on apic_id?
>
> It might work. But this feels like a work-around and delaying the problem
> for later. Just re-arranging the numa code little bit we can address this.

The idea behind possible_cpus is to let users query the topo information
the board generates (based on -smp) at configuration time (or later), so
they know which -numa cpu,topo_options [and -device foo-cpu,topo_options]
to use. Initializing apic_id on the first access is secondary; I did it
only because I could do it without additional data.

But the main purpose of possible_cpus is to keep topology information.
That includes the numa node mapping, which should be stored in
possible_cpus along with the rest of the cpu topology.

Looking at the [12/18] numa patch, it makes the legacy -numa node,cpus
option reintroduce data duplication by storing the mapping elsewhere and
then putting that mapping into possible_cpus at numa complete time (that's
what I dislike, and I don't see a valid reason to do so).
That also won't work if the user queries hotpluggable-cpus before that
time, and it doesn't work if the user uses the preferable
-numa cpu,topo_options either, as both would initialize possible_cpus on
the first access.
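
To make the "direct vs. indirect" point concrete, here is a minimal
standalone sketch. It is not QEMU code: TopoIds, bits_for_count(),
topo_from_index(), apicid_from_topo() and topo_from_apicid() are
hypothetical names that only mimic the idea of x86_topo_ids_from_idx()
versus the index -> apic_id -> topo round trip, assuming the plain
sockets->dies->cores->threads layout with power-of-two field widths.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for per-CPU topology IDs; not QEMU's struct. */
typedef struct {
    unsigned pkg_id;
    unsigned die_id;
    unsigned core_id;
    unsigned smt_id;
} TopoIds;

/* Number of bits needed to encode the values 0..count-1. */
static unsigned bits_for_count(unsigned count)
{
    unsigned bits = 0;

    assert(count >= 1);
    while ((1u << bits) < count) {
        bits++;
    }
    return bits;
}

/* Direct path: derive topology IDs from the CPU index and -smp values. */
static TopoIds topo_from_index(unsigned dies, unsigned cores,
                               unsigned threads, unsigned idx)
{
    TopoIds t;

    assert(dies >= 1 && cores >= 1 && threads >= 1);
    t.pkg_id = idx / (dies * cores * threads);
    t.die_id = idx / (cores * threads) % dies;
    t.core_id = idx / threads % cores;
    t.smt_id = idx % threads;
    return t;
}

/* Indirect path, step 1: pack the IDs into an APIC-ID-like bit field. */
static uint32_t apicid_from_topo(unsigned dies, unsigned cores,
                                 unsigned threads, TopoIds t)
{
    unsigned smt_w = bits_for_count(threads);
    unsigned core_w = bits_for_count(cores);
    unsigned die_w = bits_for_count(dies);

    return ((uint32_t)t.pkg_id << (smt_w + core_w + die_w)) |
           ((uint32_t)t.die_id << (smt_w + core_w)) |
           ((uint32_t)t.core_id << smt_w) |
           t.smt_id;
}

/* Indirect path, step 2: unpack the same IDs from the bit field again. */
static TopoIds topo_from_apicid(unsigned dies, unsigned cores,
                                unsigned threads, uint32_t apicid)
{
    unsigned smt_w = bits_for_count(threads);
    unsigned core_w = bits_for_count(cores);
    unsigned die_w = bits_for_count(dies);
    TopoIds t;

    t.smt_id = apicid & ((1u << smt_w) - 1);
    t.core_id = (apicid >> smt_w) & ((1u << core_w) - 1);
    t.die_id = (apicid >> (smt_w + core_w)) & ((1u << die_w) - 1);
    t.pkg_id = apicid >> (smt_w + core_w + die_w);
    return t;
}
```

With 1 die, 3 cores and 2 threads, index 7 maps directly to pkg 1, die 0,
core 0, thread 1. Packing those IDs gives 9 (not 7), and unpacking 9 yields
the same four IDs again, so in this sketch the props never needed the
intermediate value.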
So if you need to do some board-specific post-processing on the topo
information once it's complete and recalculate apic_id, do it at board
init time as was suggested before (x86_cpu_new() looks like a good place
to do it). [...]
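
As an illustration of what such a late recalculation could compute, below
is a minimal sketch that packs an APIC ID using only the fixed bit layout
from the PPR text quoted above. EpycTopoIds and epyc_apicid_from_topo() are
hypothetical names, not functions from QEMU or from the patch series, and
the fixed field widths are an assumption taken from that quote; a real
implementation would presumably size the fields from the configured
topology.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU IDs, limited to the ranges the quoted layout allows. */
typedef struct {
    unsigned socket_id;  /* ApicId[6]:   0..1 */
    unsigned node_id;    /* ApicId[5:4]: 0..3 */
    unsigned ccx_id;     /* ApicId[3]:   0..1 */
    unsigned core_id;    /* LogicalCoreID[1:0]: 0..3 */
    unsigned thread_id;  /* ThreadId: 0..1 with SMT, 0 without */
} EpycTopoIds;

/*
 * Pack the IDs following the quoted preset:
 *   ApicId[2:0] = SMT ? {LogicalCoreID[1:0], ThreadId}
 *                     : {1'b0, LogicalCoreID[1:0]}
 */
static uint32_t epyc_apicid_from_topo(EpycTopoIds t, bool smt)
{
    uint32_t low;

    assert(t.socket_id < 2 && t.node_id < 4);
    assert(t.ccx_id < 2 && t.core_id < 4);
    assert(t.thread_id < (smt ? 2u : 1u));

    low = smt ? ((t.core_id << 1) | t.thread_id) : t.core_id;
    return (t.socket_id << 6) | (t.node_id << 4) | (t.ccx_id << 3) | low;
}
```

For example, socket 1, node 2, CCX 1, core 3, thread 1 with SMT enabled
packs to 111, and socket 0, node 1, CCX 0, core 2 without SMT packs to 18.
The node ID is an input here, which is why this can only run once the numa
node mapping in possible_cpus is complete.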