> -----Original Message-----
> From: Moger, Babu <babu.mo...@amd.com>
> Sent: Thursday, June 25, 2020 5:55 PM
> To: Igor Mammedov <imamm...@redhat.com>
> Cc: ehabk...@redhat.com; m...@redhat.com; qemu-devel@nongnu.org;
> pbonz...@redhat.com; r...@twiddle.net
> Subject: Re: [PATCH 1/2] hw/386: Fix uninitialized memory with -device and
> CPU hotplug
>
> On 6/25/20 1:32 PM, Igor Mammedov wrote:
> > On Thu, 25 Jun 2020 11:41:25 -0500
> > Babu Moger <babu.mo...@amd.com> wrote:
> >
> >> Igor,
> >>
> >>> -----Original Message-----
> >>> From: Igor Mammedov <imamm...@redhat.com>
> >>> Sent: Thursday, June 25, 2020 10:19 AM
> >>> To: Moger, Babu <babu.mo...@amd.com>
> >>> Cc: ehabk...@redhat.com; m...@redhat.com; qemu-devel@nongnu.org;
> >>> pbonz...@redhat.com; r...@twiddle.net
> >>> Subject: Re: [PATCH 1/2] hw/386: Fix uninitialized memory with -device
> >>> and CPU hotplug
> >>>
> >>> On Wed, 24 Jun 2020 12:35:59 -0500
> >>> Babu Moger <babu.mo...@amd.com> wrote:
> >>>
> >>>>> -----Original Message-----
> >>>>> From: Igor Mammedov <imamm...@redhat.com>
> >>>>> Sent: Wednesday, June 24, 2020 8:48 AM
> >>>>> To: Moger, Babu <babu.mo...@amd.com>
> >>>>> Cc: ehabk...@redhat.com; m...@redhat.com; qemu-devel@nongnu.org;
> >>>>> pbonz...@redhat.com; r...@twiddle.net
> >>>>> Subject: Re: [PATCH 1/2] hw/386: Fix uninitialized memory with -device
> >>>>> and CPU hotplug
> >>>>>
> >>>>> On Tue, 16 Jun 2020 12:18:56 -0500
> >>>>> Babu Moger <babu.mo...@amd.com> wrote:
> >>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Igor Mammedov <imamm...@redhat.com>
> >>>>>>> Sent: Tuesday, June 16, 2020 5:59 AM
> >>>>>>> To: Moger, Babu <babu.mo...@amd.com>
> >>>>>>> Cc: pbonz...@redhat.com; r...@twiddle.net; ehabk...@redhat.com;
> >>>>>>> m...@redhat.com; marcel.apfelb...@gmail.com; qemu-de...@nongnu.org
> >>>>>>> Subject: Re: [PATCH 1/2] hw/386: Fix uninitialized memory with
> >>>>>>> -device and CPU hotplug
> >>>>>>>
> >>>>>>> On Mon, 08 Jun 2020 15:18:50 -0500
> >>>>>>> Babu Moger <babu.mo...@amd.com> wrote:
> >>>>>>>
> >>>>>>>> Noticed the following command failure while testing CPU hotplug.
> >>>>>>>>
> >>>>>>>> $ qemu-system-x86_64 -machine q35,accel=kvm -smp 1,maxcpus=2,cores=1,threads=1,sockets=2 -cpu EPYC -device EPYC-x86_64-cpu,core-id=0,socket-id=1,thread-id=0
> >>>>>>>>
> >>>>>>>> qemu-system-x86_64: -device EPYC-x86_64-cpu,core-id=0,socket-id=1,thread-id=0:
> >>>>>>>> Invalid CPU [socket: 21855, die: 0, core: 0, thread: 0]
> >>>>>>>> with APIC ID 21855, valid index range 0:1
> >>>>>>>>
> >>>>>>>> This happens because the APIC ID is calculated using uninitialized
> >>>>>>>> memory. This is happening after the addition of the new field
> >>>>>>>> node_id in the X86CPUTopoIDs structure. The node_id field is
> >>>>>>>> uninitialized while calling apicid_from_topo_ids. The problem is
> >>>>>>>> discussed in the thread below.
> >>>>>>>>
> >>>>>>>> https://lore.kernel.org/qemu-devel/20200602171838.GG577771@habkost.net/
> >>>>>>>>
> >>>>>>>> Fix the problem by initializing the node_id properly.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Babu Moger <babu.mo...@amd.com>
> >>>>>>>> ---
> >>>>>>>>  hw/i386/pc.c               |  2 ++
> >>>>>>>>  include/hw/i386/topology.h | 11 +++++++++++
> >>>>>>>>  2 files changed, 13 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >>>>>>>> index 2128f3d6fe..974cc30891 100644
> >>>>>>>> --- a/hw/i386/pc.c
> >>>>>>>> +++ b/hw/i386/pc.c
> >>>>>>>> @@ -1585,6 +1585,8 @@ static void pc_cpu_pre_plug(HotplugHandler *hotplug_dev,
> >>>>>>>>          topo_ids.die_id = cpu->die_id;
> >>>>>>>>          topo_ids.core_id = cpu->core_id;
> >>>>>>>>          topo_ids.smt_id = cpu->thread_id;
> >>>>>>>> +        topo_ids.node_id = cpu_x86_use_epyc_apic_id_encoding(ms->cpu_type) ?
> >>>>>>>> +                           x86_node_id_for_epyc(&topo_info, &topo_ids) : 0;
> >>>>>>>
> >>>>>>> I'd rather not calculate some default value here; this is the branch
> >>>>>>> where we check user-provided topology info and error out asking to
> >>>>>>> provide the missing bits.
> >>>>>>
> >>>>>> Noticed that cpu->node_id is initialized to 0xFF (NUMA_NODE_UNASSIGNED).
> >>>>>> We can initialize cpu->node_id to the default node like we do it in
> >>>>>> x86_get_default_cpu_node_id. We can use it to initialize
> >>>>>> topo_ids.node_id. This is consistent with the other fields core_id,
> >>>>>> die_id etc.
> >>>>>>
> >>>>>>> I also wonder if we should force the user to specify numa nodes on
> >>>>>>> the CLI if an EPYC cpu is used.
> >>>>>>> (i.e. I'm assuming that EPYC always requires numa)
> >>>>>>
> >>>>>> That is not true. Without numa all the cpus will be configured under
> >>>>>> one default numa node 0, like we do it using
> >>>>>> x86_get_default_cpu_node_id.
> >>>>>
> >>>>> get_default_cpu_node_id(), which is making things up, is going to be
> >>>>> removed eventually in favor of asking the user to provide the numa
> >>>>> mapping explicitly on the CLI.
> >>>>
> >>>> That will be good going forward.
> >>>>
> >>>>> Now, if it's always only node 0, why do we need to calculate it then?
> >>>>> Why not just assign 0 directly?
> >>>>>
> >>>>> What if we have several sockets, would all vCPUs still have node-id = 0?
> >>>>
> >>>> If there are several nodes then the socket id becomes the node id.
> >>>
> >>> I wonder, if node id == socket id, why bother with node_id at all?
> >>> Probably node id is there to allow for a design where several sockets
> >>> are on the same node.
> >>>
> >>>>> Another question: if we have CPUs with only 1 numa node set on all of
> >>>>> them, does it require all the other machinery like the ACPI SRAT table
> >>>>> being defined, or is it just an internal CPU impl. detail?
> >>>>
> >>>> I am not very sure about it. To me it looks like it is just an internal
> >>>> cpu implementation detail.
> >>>
> >>> I'd think it might confuse the guest OS when it decodes more than 1 numa
> >>> node from APIC ID/CPUID but then there are no such nodes described in
> >>> ACPI. While it might work for caches, it would miss any relation of the
> >>> memory mapping to nodes, or get it wrong if one doesn't match the other.
> >>>
> >>>> I think we have two options here.
> >>>>
> >>>> 1. Update the patch to initialize the node_id the way it is done in
> >>>> get_default_cpu_node_id.
> >>>
> >>> If it were only one node for every CPU (incl. multisocket), I'd go with
> >>> enabling autonuma, assigning all CPUs to the default node-id 0, since
> >>> there is no ambiguity where CPUs and RAM are mapped to.
> >>> Is it possible to use node-id=0 for all EPYC CPUs even in a multisocket
> >>> config?
> >>> (It seems the spec allows only one node per socket, but doesn't say
> >>> that node ids must be different.)
> >>> If not, then making up a node-id is not an option.
> >>>
> >>>> 2. Ask the user to pass the node_id while hot plugging the device.
> >>>> This is a clean solution. Will require some data structure changes.
> >>>
> >>> Here is my brain dump of the current, very non-obvious flow:
> >>>
> >>> 1. x86_possible_cpu_arch_ids()
> >>>      ms->possible_cpus->cpus[i].props.* = x86_topo_ids_from_idx()
> >>>
> >>> 2. possibly numa_cpu_set()
> >>>      ms->possible_cpus->cpus[i].props.node_id = user input | 0 in the
> >>>      autonuma case
> >>>
> >>> 3. x86_cpus_init()
> >>>      // generate apic_id AND make up the node_id embedded into it
> >>>      ms->possible_cpus->cpus[i].arch_id = x86_cpu_apic_id_from_index(x86ms, i);
> >>>         -> x86_apicid_from_cpu_idx_epyc() -> x86_topo_ids_from_idx_epyc()
> >>>                 same as x86_topo_ids_from_idx() + node_id
> >>>         or
> >>>         -> x86_apicid_from_cpu_idx() -> x86_topo_ids_from_idx()
> >>>
> >>> 4. pc_cpu_pre_plug()
> >>>      // basically topo ids modulo node-id, which is not set or user provided
> >>>      cpu_slot = pc_find_cpu_slot(MACHINE(pcms), cpu->apic_id, &idx);
> >>>
> >>> 5.
> >>>      // do it again, with the difference that in the EPYC case it may
> >>>      // have a different node-id than cpu_slot
> >>>      x86ms->topo_ids_from_apicid(cpu->apic_id, &topo_info, &topo_ids);
> >>>
> >>>      // i.e. user input of node-id is ignored
> >>>      set socket-id/core-id/... (but not node-id) from topo_info
> >>>
> >>>      numa_cpu_pre_plug(cpu_slot)
> >>>        ^^^^^^
> >>>        if (node_id == CPU_UNSET_NUMA_NODE_ID) {
> >>>            if (slot->props.has_node_id)
> >>>                object_property_set_int(... slot->props.node_id, "node-id", ...);
> >>>        // this applies to CPUs hotplugged without node-id and to initial
> >>>        // CPUs (-smp X), so we may end up with "node-id" being set to a
> >>>        // user-defined value or left unset (no numa enabled), while the
> >>>        // APIC ID will have some node-id encoded in it.
> >>>
> >>> That's quite a mess. Maybe we should unify both and make
> >>> x86_apicid_from_cpu_idx_epyc()/x86_apicid_from_cpu_idx() use
> >>> ms->possible_cpus->cpus[i].props instead of x86_topo_ids_from_idx(), i.e.
> >>>
> >>> x86_apicid_from_cpu_idx_epyc() {
> >>>    topoids = x86_apicid_from_cpu_idx() {
> >>>        return ms->possible_cpus->cpus[i].props
> >>>    }
> >>>    if (ms->possible_cpus->cpus[i].props.has_node_id)
> >>>        topoids.node_id = ms->possible_cpus->cpus[i].props.node_id
> >>>    else
> >>>        error_fatal("EPYC requires use of -numa to define topology if
> >>>                     using more than 1 socket")
> >>> }
> >>>
> >>> That way QEMU makes up only node[0] by enabling autonuma, or whatever
> >>> the user provided explicitly is encoded into the APIC ID, and it will
> >>> always be consistent with the cpu *-id properties in possible_cpus and
> >>> the SRAT table QEMU generates.
> >>>
> >>> As a cleanup we can get rid of the back-and-forth conversion [5] and use
> >>> cpu_slot to set the same ids.
> >>>
> >>> Also maybe we should have a check that node-id is the same within a
> >>> socket in the case of EPYC, if it's guaranteed that EPYC won't support
> >>> multiple nodes per socket.
> >>>
> >>> Hope it makes at least some sense.
> >>
> >> To make things clear, in case of autonuma we don't have to worry about
> >> node_id. We just have to set topo_ids.node_id to 0 in pc_cpu_pre_plug and
> >> everything will work as expected. This will solve our current problem of
> >> the uninitialized variable.
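(For illustration only: fleshing the pseudocode sketch above out in C could
look roughly like the following. The helper name, the explicit MachineState
parameter and the exact signatures are assumptions made for this sketch, not
actual QEMU code.)

============================================
static uint32_t x86_apicid_from_cpu_idx_epyc_sketch(MachineState *ms,
                                                    X86CPUTopoInfo *topo_info,
                                                    unsigned cpu_index)
{
    /* single source of truth: whatever ended up in possible_cpus */
    CpuInstanceProperties props = ms->possible_cpus->cpus[cpu_index].props;
    X86CPUTopoIDs topo_ids;

    /* fill pkg/die/core/smt ids the usual way */
    x86_topo_ids_from_idx(topo_info, cpu_index, &topo_ids);

    if (props.has_node_id) {
        /* user-provided or autonuma-assigned node id, never made up here */
        topo_ids.node_id = props.node_id;
    } else {
        /* the EPYC encoding needs a node id; refuse to invent one */
        error_report("EPYC topology requires -numa to assign CPUs to nodes");
        exit(1);
    }

    return x86_apicid_from_topo_ids_epyc(topo_info, &topo_ids);
}
============================================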
> > I'm proposing to enable autonuma, which in its turn will assign all CPUs
> > to node-id=0 in possible_cpus. And once this information is in
> > possible_cpus, numa_cpu_pre_plug() should take care of setting the correct
> > node-id on the CPU for the case of initial CPUs
> > (node_id == CPU_UNSET_NUMA_NODE_ID), and in the case of hotplug
> > numa_cpu_pre_plug() will complain if the user supplied nonsense with
> > device_add.
> >
> >> Problem here is, when the user has configured numa, then setting
> >> topo_ids.node_id to 0 might not work because it might create duplicate
> >> apicids and device_add will be rejected. As per the comments in
> >
> > I don't see how such a duplicate could be made; even if all CPUs have
> > node-id 0, there are pkg_id/core_id/thread_id which are encoded in the
> > APIC ID, where pkg_id is unique across the machine (at least in QEMU), so
> > I don't see how a duplicate is possible.
> >
> >> numa_cpu_pre_plug, this is already broken. Look at the comments below.
> >> Looks like node_id cannot be passed down.
> >> ============================================
> >>     if (node_id == CPU_UNSET_NUMA_NODE_ID) {
> >>         /* due to bug in libvirt, it doesn't pass node-id from props on
> >>          * device_add as expected, so we have to fix it up here */
> >>         if (slot->props.has_node_id) {
> >>             object_property_set_int(OBJECT(dev), slot->props.node_id,
> >>                                     "node-id", errp);
> >>         }
> >
> >          else if (epyc)
> >              error("incomplete EPYC topology, use -numa cpu,node-id=some-id,socket-id=%d"
> >                    " to configure numa node for socket", cpu_socket_id)
> >
> >>     } else if (node_id != slot->props.node_id) {
> >> ============================================
> >>
> >> I was trying to solve this problem by setting the node_id correctly for
> >> EPYC at least. If you think this is not important, we can ignore it (by
> >> setting topo_ids.node_id to 0) and move forward. I don't see the need for
> >> changing other topology-specific code as we have already made it very
> >> generic.
> >
> > node-id can be passed down (the problem was that libvirt didn't do it back
> > then for -device/device_add, hence the above hack).
> >
> > But that's not the problem; the problem is that
> > x86_apicid_from_cpu_idx_epyc() makes up the node-id on its own, which is
> > not a big deal in the case of autonuma since they happen to match and
> > there isn't any ambiguity, but with more numa nodes the numa config should
> > be defined by the user explicitly, and the current code may end up with an
> > incoherent config, where some parts of QEMU think the CPU has one node-id
> > while the APIC ID is encoded with another.
> >
> > So after some pondering on the subject, to make sure it will look correct
> > from all angles, we need to:
> >
> > 1: use a single source for topo ids, i.e. pull the user-provided node-id
> >    from possible_cpus for both the cpu.node-id property and the APIC ID.
> >    Hence my suggestion to change x86_apicid_from_cpu_idx_epyc() as
> >    described above.
> >
> > 2: verify that the user-provided ids make sense in the EPYC case
> >    (pre_plug).
>
> Basically, you are saying to set up props.has_node_id and props.node_id
> while building the topology, making sure to use the numa information if the
> user has provided it, else build the topology as per the EPYC topology
> model. Right? I will have to think through this a little bit. Will try to
> send the draft patch tomorrow.
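(Likewise, just a sketch of the pre_plug check from point 2 above; the helper
name and signature are invented for illustration and the error handling is
simplified.)

============================================
static void x86_validate_epyc_node_id(const CPUArchId *slot, Error **errp)
{
    /* The EPYC APIC ID encoding embeds a node id, so the slot must already
     * be assigned to a numa node (via -numa cpu,... or autonuma) before the
     * CPU is plugged. */
    if (!slot->props.has_node_id) {
        error_setg(errp,
                   "incomplete EPYC topology: use "
                   "'-numa cpu,node-id=N,socket-id=%" PRId64 "' to assign "
                   "this socket to a numa node",
                   slot->props.socket_id);
    }
}
============================================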
It is getting much more complicated. I cannot change
x86_apicid_from_cpu_idx_epyc to access ms->possible_cpus->cpus[i].props
because we don't have access to the MachineState inside the topology
functions. Even with these changes I don't know if we are going to solve the
uninitialized memory problem. I will continue to investigate. Let me know if
you think of any other options.

The following command hotplugs the device without a node_id. Is there a
command to hot-add the device with a node id? I am trying to understand if we
really need to worry about a non-zero node id.

$ qemu-system-x86_64 -machine q35,accel=kvm -smp 1,maxcpus=2,cores=1,threads=1,sockets=2 -cpu EPYC -device EPYC-x86_64-cpu,core-id=0,socket-id=1,thread-id=0

> >
> >>>> Let me know if you see any other option.
> >>>>
> >>>>>>>
> >>>>>>>>      cpu->apic_id = x86ms->apicid_from_topo_ids(&topo_info, &topo_ids);
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> diff --git a/include/hw/i386/topology.h b/include/hw/i386/topology.h
> >>>>>>>> index 07239f95f4..ee4deb84c4 100644
> >>>>>>>> --- a/include/hw/i386/topology.h
> >>>>>>>> +++ b/include/hw/i386/topology.h
> >>>>>>>> @@ -140,6 +140,17 @@ static inline unsigned apicid_pkg_offset_epyc(X86CPUTopoInfo *topo_info)
> >>>>>>>>             apicid_node_width_epyc(topo_info);
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +static inline unsigned x86_node_id_for_epyc(X86CPUTopoInfo *topo_info,
> >>>>>>>> +                                            const X86CPUTopoIDs *topo_ids)
> >>>>>>>> +{
> >>>>>>>> +    unsigned nr_nodes = MAX(topo_info->nodes_per_pkg, 1);
> >>>>>>>> +    unsigned cores_per_node = DIV_ROUND_UP((topo_info->dies_per_pkg *
> >>>>>>>> +                                            topo_info->cores_per_die *
> >>>>>>>> +                                            topo_info->threads_per_core),
> >>>>>>>> +                                           nr_nodes);
> >>>>>>>> +
> >>>>>>>> +    return (topo_ids->core_id / cores_per_node) % nr_nodes;
> >>>>>>>
> >>>>>>> what if nr_nodes == 0?
> >>>>>>>
> >>>>>>>> +}
> >>>>>>>>  /*
> >>>>>>>>   * Make APIC ID for the CPU based on Pkg_ID, Core_ID, SMT_ID
> >>>>>>>>   *
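(For what it's worth, the node-id arithmetic in the hunk above is easy to
check in isolation. The standalone program below copies that computation,
with MAX/DIV_ROUND_UP expanded locally and a minimal stand-in for the
topology struct; note that the MAX() already clamps nr_nodes to at least 1,
so nodes_per_pkg == 0 falls back to a single node.)

============================================
#include <stdio.h>

/* local stand-ins for the QEMU macros used by the hunk above */
#define MAX(a, b)           ((a) > (b) ? (a) : (b))
#define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))

struct topo_info {
    unsigned nodes_per_pkg, dies_per_pkg, cores_per_die, threads_per_core;
};

static unsigned node_id_for_epyc(const struct topo_info *ti, unsigned core_id)
{
    unsigned nr_nodes = MAX(ti->nodes_per_pkg, 1);   /* 0 nodes -> treat as 1 */
    unsigned cores_per_node = DIV_ROUND_UP(ti->dies_per_pkg * ti->cores_per_die *
                                           ti->threads_per_core, nr_nodes);

    return (core_id / cores_per_node) % nr_nodes;
}

int main(void)
{
    /* e.g. a package with 2 nodes and 8 cores: cores 0-3 -> node 0, 4-7 -> node 1 */
    struct topo_info ti = { .nodes_per_pkg = 2, .dies_per_pkg = 1,
                            .cores_per_die = 8, .threads_per_core = 1 };

    for (unsigned core = 0; core < 8; core++) {
        printf("core %u -> node %u\n", core, node_id_for_epyc(&ti, core));
    }
    return 0;
}
============================================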