On 2/6/20 7:08 AM, Igor Mammedov wrote:
> On Wed, 5 Feb 2020 13:07:31 -0600
> Babu Moger <babu.mo...@amd.com> wrote:
>
>> On 2/5/20 10:56 AM, Igor Mammedov wrote:
>>> On Wed, 5 Feb 2020 10:10:06 -0600
>>> Babu Moger <babu.mo...@amd.com> wrote:
>>>
>>>> On 2/5/20 3:38 AM, Igor Mammedov wrote:
>>>>> On Tue, 4 Feb 2020 13:08:58 -0600
>>>>> Babu Moger <babu.mo...@amd.com> wrote:
>>>>>
>>>>>> On 2/4/20 2:02 AM, Igor Mammedov wrote:
>>>>>>> On Mon, 3 Feb 2020 13:31:29 -0600
>>>>>>> Babu Moger <babu.mo...@amd.com> wrote:
>>>>>>>
>>>>>>>> On 2/3/20 8:59 AM, Igor Mammedov wrote:
>>>>>>>>> On Tue, 03 Dec 2019 18:36:54 -0600
>>>>>>>>> Babu Moger <babu.mo...@amd.com> wrote:
>>>>>>>>>
>>>>>>>>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1728166
>>>>>>>>>>
>>>>>>>>>> Currently, the APIC ID is decoded based on the sequence
>>>>>>>>>> sockets->dies->cores->threads. This works for most standard AMD and
>>>>>>>>>> other vendors' configurations, but this decoding sequence does not
>>>>>>>>>> strictly follow AMD's APIC ID enumeration. In some cases this can
>>>>>>>>>> cause CPU topology inconsistency. When booting a guest VM, the kernel
>>>>>>>>>> tries to validate the topology, and finds it inconsistent with the
>>>>>>>>>> enumeration of EPYC CPU models.
>>>>>>>>>>
>>>>>>>>>> To fix the problem we need to build the topology as per the Processor
>>>>>>>>>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
>>>>>>>>>> Processors. It is available at
>>>>>>>>>> https://www.amd.com/system/files/TechDocs/55570-B1_PUB.zip
>>>>>>>>>>
>>>>>>>>>> Here is the text from the PPR.
>>>>>>>>>> Operating systems are expected to use
>>>>>>>>>> Core::X86::Cpuid::SizeId[ApicIdSize], the number of least significant
>>>>>>>>>> bits in the Initial APIC ID that indicate core ID within a processor,
>>>>>>>>>> in constructing per-core CPUID masks.
>>>>>>>>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of
>>>>>>>>>> cores (MNC) that the processor could theoretically support, not the
>>>>>>>>>> actual number of cores that are actually implemented or enabled on the
>>>>>>>>>> processor, as indicated by Core::X86::Cpuid::SizeId[NC].
>>>>>>>>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
>>>>>>>>>> • ApicId[6] = Socket ID.
>>>>>>>>>> • ApicId[5:4] = Node ID.
>>>>>>>>>> • ApicId[3] = Logical CCX L3 complex ID
>>>>>>>>>> • ApicId[2:0] = (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}
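The bit layout quoted from the PPR can be sketched as a small packing helper. This is only an illustration under stated assumptions: `epyc_topo_ids` and `epyc_apic_id()` are hypothetical names, not QEMU code, and the field widths are taken directly from the Family 17h Model 01h text above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fields named in the PPR excerpt. */
typedef struct {
    unsigned socket_id;  /* ApicId[6]                         */
    unsigned node_id;    /* ApicId[5:4]                       */
    unsigned ccx_id;     /* ApicId[3], logical CCX L3 complex */
    unsigned core_id;    /* LogicalCoreID[1:0]                */
    unsigned thread_id;  /* ThreadId, only used when SMT is on */
} epyc_topo_ids;

/* Pack the fields into the 7-bit layout quoted above.
 * Hypothetical helper: range checks reflect the quoted field widths. */
static uint32_t epyc_apic_id(const epyc_topo_ids *t, int smt)
{
    uint32_t low;

    assert(t->socket_id < 2 && t->node_id < 4 && t->ccx_id < 2);
    assert(t->core_id < 4);

    if (smt) {
        assert(t->thread_id < 2);
        low = (t->core_id << 1) | t->thread_id;  /* {LogicalCoreID[1:0],ThreadId} */
    } else {
        low = t->core_id;                        /* {1'b0,LogicalCoreID[1:0]} */
    }

    return (t->socket_id << 6) | (t->node_id << 4) | (t->ccx_id << 3) | low;
}
```

For example, socket 1, node 2, CCX 1, core 3, thread 1 packs to 0x6f with SMT enabled and to 0x6b without it.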
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> After checking out all the patches and some pondering, the approach
>>>>>>>>> used here looks too intrusive to me for the task at hand, especially
>>>>>>>>> where it comes to generic code.
>>>>>>>>>
>>>>>>>>> (Skip ahead to ==== to see the suggestion on how to simplify without
>>>>>>>>> reading the reasoning behind it first)
>>>>>>>>>
>>>>>>>>> Let's look for a way to simplify it a little bit.
>>>>>>>>>
>>>>>>>>> So the problem we are trying to solve is:
>>>>>>>>> 1: calculate APIC IDs based on cpu type (to be more specific: for
>>>>>>>>> EPYC based CPUs)
>>>>>>>>> 2: it depends on knowing the total number of numa nodes.
>>>>>>>>>
>>>>>>>>> Externally the workflow looks like the following:
>>>>>>>>> 1. user provides -smp x,sockets,cores,...,maxcpus
>>>>>>>>> that's used by the possible_cpu_arch_ids() singleton to build the
>>>>>>>>> list of possible CPUs (which is available to the user via the
>>>>>>>>> 'hotpluggable-cpus' command)
>>>>>>>>>
>>>>>>>>> The hook could be called very early and the possible_cpus data might
>>>>>>>>> not be complete. It builds a list of possible CPUs which the user
>>>>>>>>> could modify later.
>>>>>>>>>
>>>>>>>>> 2.1 user uses "-numa cpu,node-id=x,..." or legacy
>>>>>>>>> "-numa node,node_id=x,cpus=" options to assign cpus to nodes, which
>>>>>>>>> one way or another calls machine_set_cpu_numa_node(). The latter
>>>>>>>>> updates the 'possible_cpus' list with node information. It happens
>>>>>>>>> early, when the total number of nodes is not available.
>>>>>>>>>
>>>>>>>>> 2.2 user does not provide explicit node mappings for CPUs.
>>>>>>>>> QEMU steps in and assigns possible cpus to nodes in
>>>>>>>>> machine_numa_finish_cpu_init() (using the same
>>>>>>>>> machine_set_cpu_numa_node()) right before calling the board-specific
>>>>>>>>> machine init(). At that time the total number of nodes is known.
>>>>>>>>>
>>>>>>>>> In the 1 -- 2.1 cases, 'arch_id' in the 'possible_cpus' list doesn't
>>>>>>>>> have to be defined before the board's init() is run.
>>>>>>
>>>>>> In the case of 2.1, we need to have the arch_id already generated. This
>>>>>> is done inside possible_cpu_arch_ids. The arch_id is used by
>>>>>> machine_set_cpu_numa_node to assign the cpus to the correct numa node.
>>>>>
>>>>> I might have missed something but I don't see arch_id itself being used in
>>>>> machine_set_cpu_numa_node(). It only uses the props part of possible_cpus.
>>>>
>>>> Before calling machine_set_cpu_numa_node, we call
>>>> cpu_index_to_instance_props -> x86_cpu_index_to_props ->
>>>> possible_cpu_arch_ids -> x86_possible_cpu_arch_ids.
>>>>
>>>> This sequence sets up the arch_id (in x86_cpu_apic_id_from_index) for all
>>>> the available cpus. Based on the arch_id, it also sets up the props.
>>>
>>>
>>> x86_possible_cpu_arch_ids()
>>>     arch_id = x86_cpu_apic_id_from_index(x86ms, i)
>>>     x86_topo_ids_from_apicid(arch_id, x86ms->smp_dies, ms->smp.cores,
>>>                              ms->smp.threads, &topo);
>>>     // assign socket/die/core/thread from topo
>>>
>>> so currently it uses an indirect way to convert the index in
>>> possible_cpus->cpus[] to socket/die/core/thread ids.
>>> Essentially it takes the '-smp' options and a [0..max_cpus) number as the
>>> original data, converts them into an intermediate apic_id, and then
>>> reverse engineers that back to topo info.
>>>
>>> Why not use x86_topo_ids_from_idx() directly to get rid of 'props'
>>> dependency on apic_id?
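To make the two paths concrete, here is a minimal sketch of the direct index-to-topology conversion next to the round trip through an intermediate APIC ID. It is a simplified model, not QEMU's implementation: the names `topo_from_idx()`, `apicid_from_topo()` and `topo_from_apicid()` are hypothetical, and the APIC ID is assumed to use the plain sockets->dies->cores->threads packing with power-of-two field widths.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { unsigned pkg_id, die_id, core_id, smt_id; } topo_ids;

/* Number of bits needed to hold values in [0, count). */
static unsigned width_for(unsigned count)
{
    unsigned bits = 0;
    assert(count > 0);
    while ((1u << bits) < count) {
        bits++;
    }
    return bits;
}

/* Direct path: derive topology IDs from the index and the -smp counts only. */
static topo_ids topo_from_idx(unsigned dies, unsigned cores, unsigned threads,
                              unsigned idx)
{
    topo_ids t;
    t.pkg_id  = idx / (dies * cores * threads);
    t.die_id  = idx / (cores * threads) % dies;
    t.core_id = idx / threads % cores;
    t.smt_id  = idx % threads;
    return t;
}

/* Indirect path, step 1: pack topology IDs into an APIC-ID-like value. */
static uint32_t apicid_from_topo(unsigned dies, unsigned cores,
                                 unsigned threads, const topo_ids *t)
{
    unsigned smt_w = width_for(threads);
    unsigned core_w = width_for(cores);
    unsigned die_w = width_for(dies);

    return t->smt_id |
           (t->core_id << smt_w) |
           (t->die_id << (smt_w + core_w)) |
           (t->pkg_id << (smt_w + core_w + die_w));
}

/* Indirect path, step 2: unpack the value back into topology IDs. */
static topo_ids topo_from_apicid(unsigned dies, unsigned cores,
                                 unsigned threads, uint32_t apicid)
{
    unsigned smt_w = width_for(threads);
    unsigned core_w = width_for(cores);
    unsigned die_w = width_for(dies);
    topo_ids t;

    t.smt_id  = apicid & ((1u << smt_w) - 1);
    t.core_id = (apicid >> smt_w) & ((1u << core_w) - 1);
    t.die_id  = (apicid >> (smt_w + core_w)) & ((1u << die_w) - 1);
    t.pkg_id  = apicid >> (smt_w + core_w + die_w);
    return t;
}
```

With 1 die, 3 cores and 2 threads, index 7 maps to package 1, core 0, thread 1 on both paths, while the intermediate APIC ID is 9. In this model the direct path yields the same props without ever needing the APIC ID encoding.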
>>
>> It might work. But this feels like a work-around that just delays the
>> problem until later. By re-arranging the numa code a little bit we can
>> address this.
>
> The idea behind possible_cpus is to allow users to query the topo
> information the board generates (based on -smp) at configuration time (or
> later), so users can know which -numa cpu,topo_options [and
> -device foo-cpu,topo_options] to use. Initializing apic_id on the first
> access is secondary, and I did it only because I could do it without
> additional data.
>
> But the main purpose of possible_cpus is to keep topology information.
> That includes the numa node mapping, which should be stored in possible_cpus
> along with the rest of the cpu topology.
>
> Looking at the [12/18] numa patch, it makes the legacy -numa node,cpus
> option reintroduce data duplication, by storing the mapping elsewhere and
> then putting that mapping into possible_cpus at numa complete time
> (that's what I dislike, and I don't see a valid reason to do so).
>
> That also won't work if the user queries hotpluggable-cpus before that time,
> and it also doesn't work if the user uses the preferred
> -numa cpu,topo_options, as both would initialize possible_cpus on the first
> access.
>
> So if you need some board-specific post-processing done on the topo
> information when it's complete, and need to recalculate apic_id, do it at
> board init time as was suggested before (x86_cpu_new() looks like a good
> place to do it).
Ok. Sure. Will start working on it. Thanks
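
A minimal sketch of the late-recalculation idea discussed above, assuming the possible-CPU list keeps only topology properties (including the node mapping) until the configuration is complete, and that the arch_id is filled in by one pass at board init time. Everything here is hypothetical for illustration: `cpu_slot`, `finalize_arch_ids()` and the node-aware bit packing are not QEMU code and do not reproduce the PPR layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical possible-CPU entry: topology props first, arch_id later. */
typedef struct {
    unsigned socket_id;
    unsigned core_id;    /* core index within the socket */
    unsigned thread_id;
    int node_id;         /* -1 until a numa mapping assigns it */
    uint32_t arch_id;    /* valid only after finalize_arch_ids() */
} cpu_slot;

/* Number of bits needed to hold values in [0, count). */
static unsigned width_for(unsigned count)
{
    unsigned bits = 0;
    assert(count > 0);
    while ((1u << bits) < count) {
        bits++;
    }
    return bits;
}

/* Late pass: runs once the node count is known, so the node field
 * width can be chosen from the final configuration. */
static void finalize_arch_ids(cpu_slot *slots, unsigned n,
                              unsigned nodes_per_socket,
                              unsigned cores_per_node, unsigned threads)
{
    unsigned smt_w = width_for(threads);
    unsigned core_w = width_for(cores_per_node);
    unsigned node_w = width_for(nodes_per_socket);

    for (unsigned i = 0; i < n; i++) {
        cpu_slot *s = &slots[i];
        assert(s->node_id >= 0);  /* mapping must be complete by now */

        unsigned node_in_socket = (unsigned)s->node_id % nodes_per_socket;
        unsigned core_in_node = s->core_id % cores_per_node;

        s->arch_id = s->thread_id |
                     (core_in_node << smt_w) |
                     (node_in_socket << (smt_w + core_w)) |
                     (s->socket_id << (smt_w + core_w + node_w));
    }
}
```

The point of the sketch is the ordering: queries and node assignment only touch the props, and the ID that depends on the node count is derived afterwards from that single copy of the topology.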