% Heterogeneous Multi Processing Support in Xen
% Revision 1

\clearpage
# Basics

---------------- ------------------------
         Status: **Design Document**

Architecture(s): x86, arm

   Component(s): Hypervisor and toolstack
---------------- ------------------------

# Overview

HMP (Heterogeneous Multi Processing) and AMP (Asymmetric Multi Processing)
refer to systems where the physical CPUs are not all equal. They may have
different processing power or capabilities, or each may be specifically
designed to run a particular system component. Often the CPUs have different
Instruction Set Architectures (ISAs) or Application Binary Interfaces (ABIs),
but they may *just* be different implementations of the same ISA, in which
case they typically differ in speed, power efficiency, or handling of special
things (e.g., errata).

An example is ARM big.LITTLE, which is, in fact, the use case that got the
discussion about HMP started. This document, however, is generic, and does
not target only big.LITTLE.

What needs proper Xen support are systems and use cases where virtual CPUs
cannot be seamlessly moved around all the physical CPUs. In these cases,
there must be a way to:

* decide and specify on what (set of) physical CPU(s) each vCPU can execute;
* enforce that a vCPU that can only run on a certain (set of) pCPUs is never
  actually run anywhere else.

**N.B.:** it is becoming common to also refer to systems which have various
kinds of co-processors (from crypto engines to graphic hardware), integrated
with the CPUs on the same chip, as AMP or HMP. This is not what this design
document is about.

# Classes of CPUs

A *class of CPUs* is defined as follows:

1. each pCPU in the system belongs to a class;
2. a class can consist of one or more pCPUs;
3. each pCPU can only be in one class;
4. CPUs belonging to the same class are homogeneous enough that a virtual
   CPU that blocks/is preempted while running on a pCPU of a class can,
   **seamlessly**, unblock/be scheduled on any pCPU of that same class;
5. when a virtual CPU is associated with a (set of) class(es) of CPUs, it
   means that the vCPU can run on all the pCPUs belonging to the said
   class(es).

So, for instance, say that in architecture Foobar two classes of CPUs exist,
class foo and class bar. If a virtual CPU running on CPU 0, which is of class
foo, blocks (or is preempted), it can, when it unblocks (or is selected by
the scheduler to run again), run on CPU 3, still of class foo, but not on
CPU 6, which is of class bar.

## Defining classes

How a class is defined, i.e., what are the specific characteristics that
determine which CPUs belong to which class, is highly architecture specific.

### x86

There is no HMP platform of relevance, for now, in the x86 world. Therefore,
only one class will exist, and all the CPUs will be set to belong to it.

**TODO X86:** is this correct?

### ARM

**TODO ARM:** I know nothing about what specifically should be used to form
classes, so I'm deferring this to ARM people. So far, in the original thread,
the following ideas came up (well, there's more, but I don't know enough of
ARM to judge what is really relevant about this topic):

* [Julien](https://lists.xenproject.org/archives/html/xen-devel/2016-09/msg02153.html)
  "I don't think an hardcoded list of processor in Xen is the right solution.
  There are many existing processors and combinations for big.LITTLE so it
  will nearly be impossible to keep updated."
* [Julien](https://lists.xenproject.org/archives/html/xen-devel/2016-09/msg02256.html)
  "Well, before trying to do something clever like that (i.e naming "big" and
  "little"), we need to have upstreamed bindings available to acknowledge the
  difference. AFAICT, it is not yet upstreamed for Device Tree and I don't
  know any static ACPI tables providing the similar information."
* [Peng](https://lists.xenproject.org/archives/html/xen-devel/2016-09/msg02194.html)
  "For how to differentiate cpus, I am looking the linaro eas cpu topology
  code"

# User details

## Classes of CPUs for the users

It will be possible, in a VM config file, to specify the (set of) class(es)
of each vCPU. This allows creating HMP VMs. E.g., on ARM, it will be possible
to create big.LITTLE VMs which, if run on big.LITTLE hosts, could leverage
the big.LITTLE support of the guest OS kernel and tools.

For such purpose, a new option will be added to the xl config file:

    vcpus = "8"
    vcpuclass = ["0-2:class0", "3,4:class1,class3", "5:class0, class2", "8:class4"]

with the following meaning:

* vCPUs 0, 1, 2 can only run on pCPUs of class class0;
* vCPUs 3, 4 can run on pCPUs of class class1 **and** on pCPUs of class class3;
* vCPU 5 can run on pCPUs of class class0 **and** on pCPUs of class class2;
* for vCPUs 6 and 7, since they're not mentioned, the default applies;
* vCPU 8 can only run on pCPUs of class class4.

For the vCPUs for which no class is specified, the default behavior applies.

**TODO:** note that I think it must be possible to associate more than one
class to a vCPU. This is expressed in the example above, and assumed to be
true throughout the document. It might be, though, that, at least at early
stages (see implementation phases below), we will enable only 1-to-1 mapping.

**TODO:** the default can be either:

1. the vCPU can run on any CPU of any class,
2. the vCPU can only run on a specific, arbitrarily decided, class (and I'd
   say that should be class 0).

The former seems the better interface. It looks to me like the most natural
and least surprising, from the user's point of view, and the most future
proof (see phase 3 of the implementation below). The latter may be more
practical, though. In fact, with the former, we risk crashing (the guest or
the hypervisor) if one creates a VM and forgets to specify the vCPU classes
--which does not look ideal.

It will be possible to gather information about what classes exist, and what
pCPUs belong to each class, by issuing the `xl info -n` command:

    cpu_topology           :
    cpu:    core    socket     node    class
      0:       0         1        0        0
      1:       0         1        0        1
      2:       1         1        0        2
      3:       1         1        0        3
      4:       9         1        0        3
      5:       9         1        0        0
      6:      10         1        0        1
      7:      10         1        0        2
      8:       0         0        1        3
      9:       0         0        1        3
     10:       1         0        1        1
     11:       1         0        1        0
     12:       9         0        1        1
     13:       9         0        1        0
     14:      10         0        1        2
     15:      10         0        1        2

**TODO:** do we want to keep using `-n`, or add another switch, like `-c` or
something? I'm not sure I like using `-n` as, e.g., on x86, this would most
of the time result in just a column full of `0`, and it may raise confusion
among users about what that actually means. Also, do we want to print the
class ids, or some more abstract class names? (Or support both, and have a
way to decide which one to see?)

# Technical details

## Hypervisor

The hypervisor needs to know within which class each of the present CPUs
falls. At boot (or, in general, CPU bringup) time, while identifying the CPU,
a list of classes is constructed, and the mapping between each CPU and the
class to which it is determined it should belong is established. The list of
classes is kept ordered from the most powerful to the least powerful class.
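As a purely illustrative sketch (not actual Xen code), the bringup path could
record this information roughly as follows. Here `arch_get_cpu_class()` is a
hypothetical per-architecture hook, and the two arrays are simplified stand-ins
for the data structures proposed below (a plain `uint64_t` replaces
`cpumask_t`, so at most 64 pCPUs):

    /*
     * Sketch only, not actual Xen code: recording the class of a pCPU at
     * bringup time.  arch_get_cpu_class() is a hypothetical hook that each
     * architecture implements by inspecting the just-identified CPU.
     */
    #include <stdint.h>

    #define NR_CPUS 64

    uint16_t arch_get_cpu_class(unsigned int cpu);   /* hypothetical */

    static uint16_t cpu_to_class[NR_CPUS];           /* CPU -> class id      */
    static uint64_t class_to_cpumask[NR_CPUS];       /* class -> mask of CPUs */

    static void record_cpu_class(unsigned int cpu)
    {
        uint16_t cls = arch_get_cpu_class(cpu);

        cpu_to_class[cpu] = cls;
        class_to_cpumask[cls] |= UINT64_C(1) << cpu; /* add cpu to its class */
    }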
**TODO:** this ordering has been [proposed by George](https://lists.xenproject.org/archives/html/xen-devel/2016-09/msg02212.html).
I like the idea, what do others think? If we agree on that, note that there
has been no discussion on defining what "more powerful" means, neither on x86
(although not really that interesting, for now, I'd say), nor on ARM.

The mapping between CPUs and classes will be kept in memory in the following
data structures:

    uint16_t cpu_to_class[NR_CPUS] __read_mostly;
    cpumask_t class_to_cpumask[NR_CPUS] __read_mostly;

**TODO:** it's probably better to allocate the cpumask array dynamically, to
avoid wasting too much space.

**TODO:** if we want the ordering, the structure needs to be kept ordered too
(or additional structures should be used for the purpose).

Each virtual CPU must know what class(es) of CPUs it can run on. Since a vCPU
can be associated to more than one class, the best way to keep track of this
information is a bitmap. That will be a new `cpumask`-typed member in
`struct vcpu`, where the i-th bit set means the vCPU can run on CPUs of
class i.

If a vCPU is found running on a pCPU of a class that is not associated to the
vCPU itself, an exception should be raised.

**TODO:** What kind? BUG_ON? Crash the guest? The guest would probably crash
--or become unreliable-- on its own, I guess.

Setting and getting the CPU class of a vCPU will happen via two new
hypercalls:

* `XEN_DOMCTL_setvcpuclass`
* `XEN_DOMCTL_getvcpuclass`

Information about CPU classes will be propagated to the toolstack by adding a
new field in xen_sysctl_cputopo, which will become:

    struct xen_sysctl_cputopo {
        uint32_t core;
        uint32_t socket;
        uint32_t node;
        uint32_t class;
    };

For homogeneous and SMP systems, the value of the new class field will be 0
for all the cores.

## Toolstack

It will be possible for the toolstack to retrieve from Xen the list of
existing CPU classes, their names, and the information about which class each
present CPU belongs to.

**TODO:** [George suggested](https://lists.xenproject.org/archives/html/xen-devel/2016-09/msg02212.html)
allowing a richer set of labels, at the toolstack level, and I like the idea
very much. It's not clear to me, though, in what component this list of
names, and the mapping between them and the classes as they're known inside
Xen, should live.

Libxl and libxc interfaces will be introduced for associating a vCPU to a
(set of) class(es):

* `libxl_set_vcpuclass()`, `libxl_get_vcpuclass()`;
* `xc_vcpu_setclass()`, `xc_vcpu_getclass()`.

In libxl, class information will be added to `struct libxl_cputopology`,
which is filled by `libxl_get_cpu_topology()`.

# Implementation

Implementation can proceed in phases.

## Phase 1

Class definition, identification and mapping of CPUs to classes, inside Xen,
will be implemented. And so will the libxc and libxl interfaces for
retrieving such information.

Parsing of the new `vcpuclass` parameter will be implemented in `xl`. The
result of such parsing will then be used as if it were the hard-affinity of
the various vCPUs. That is, we will set the hard-affinity of each vCPU to the
pCPUs that are part of the class(es) the vCPU itself is being assigned,
according to `vcpuclass` (see the sketch below). This would *Just Work(TM)*,
as long as the user does not try to change the hard-affinity during the VM
lifetime (e.g., with `xl vcpu-pin`).
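Just to make the phase 1 behaviour concrete, here is a minimal sketch (again,
not actual code, and reusing the simplified `class_to_cpumask[]` array and
`NR_CPUS` from the earlier sketch): the hard-affinity of a vCPU is simply the
union of the masks of all the classes it has been assigned.

    /*
     * Sketch only: phase 1 reduces a vCPU's class assignment to a plain
     * hard-affinity mask.  vcpu_classes has bit i set if the vCPU was given
     * class i in `vcpuclass`; the returned mask has bit j set if the vCPU
     * may run on pCPU j (same simplified uint64_t masks as above).
     */
    static uint64_t classes_to_hard_affinity(uint64_t vcpu_classes)
    {
        uint64_t affinity = 0;
        unsigned int cls;

        for ( cls = 0; cls < NR_CPUS; cls++ )
            if ( vcpu_classes & (UINT64_C(1) << cls) )
                affinity |= class_to_cpumask[cls];

        return affinity;
    }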
**TODO:** To prevent the user from changing the hard-affinity in this way, it
may be useful to add another `xl` config option that, if set, disallows
changing the affinity from what it was at VM creation time (something like
`immutable_affinity=1`). Thoughts? I'm leaning toward doing that, as it may
even be something useful to have in other use cases.

### Phase 1.5

The library (libxc and libxl) calls and hypercalls necessary to associate a
class to the vCPUs will be implemented. At which point, when parsing
`vcpuclass` in `xl`, we will call both (with the same bitmap as input):

* `libxl_set_vcpuclass()`
* `libxl_set_vcpuaffinity()`

`libxl__set_vcpuaffinity()` will be modified in such a way that, when setting
hard-affinity for a vCPU:

* it will get the CPU class(es) associated to the vCPU;
* it will check which pCPUs belong to the class(es);
* it will filter out, from the new hard-affinity being set, the pCPUs that
  are not in the vCPU's class(es).

As a safety measure, `vcpu_set_hard_affinity()` in Xen will also be modified
such that, if someone somehow manages to pass down a hard-affinity mask which
contains pCPUs outside the proper classes, it will error out with -EINVAL;
a sketch of that check follows.
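The check could look roughly like this (a sketch only, based on the simplified
`uint64_t` masks used in the previous sketches rather than Xen's real
`cpumask_t`; `vcpu_class_cpumask()` is a hypothetical helper returning the
union of the masks of the vCPU's classes):

    /*
     * Sketch only, not the real vcpu_set_hard_affinity(): reject a
     * hard-affinity mask containing pCPUs outside the classes associated
     * with the vCPU.
     */
    #include <errno.h>
    #include <stdint.h>

    struct vcpu;                                        /* opaque here     */
    uint64_t vcpu_class_cpumask(const struct vcpu *v);  /* hypothetical    */

    int check_hard_affinity(const struct vcpu *v, uint64_t new_affinity)
    {
        uint64_t allowed = vcpu_class_cpumask(v);

        if ( new_affinity & ~allowed )
            return -EINVAL;  /* some pCPU is outside the vCPU's classes */

        return 0;
    }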
### Phase 2

Inside Xen, the various schedulers will be modified to deal internally with
the fact that vCPUs can only run on pCPUs from the class(es) they are
associated with. This allows for a more efficient implementation, and paves
the way for enabling more intelligent logic (e.g., for minimizing power
consumption) in *phase 3*.

Calling `libxl_set_vcpuaffinity()` from `xl` / libxl is therefore no longer
necessary and will be avoided (i.e., only `libxl_set_vcpuclass()` will be
called).

### Phase 3

Moving vCPUs between classes will be implemented. This means that, e.g., on
ARM big.LITTLE, it will be possible for a vCPU to block on a big core and
wake up on a LITTLE core.

**TODO:** About what this takes, see [Julien's email](https://lists.xenproject.org/archives/html/xen-devel/2016-09/msg02345.html).

This means it will no longer be necessary to specify the class of the vCPUs
via `vcpuclass` in `xl`, although that will of course remain supported. So:

1. if one wants (sticking with big.LITTLE as the example) a big.LITTLE VM,
   and wants to make sure that big vCPUs will run on big pCPUs, and that
   LITTLE vCPUs will run on LITTLE pCPUs, she will use:

        vcpus = "8"
        vcpuclass = ["0-3:big", "4-7:little"]

2. if one does not care, and is happy to let the Xen scheduler decide where
   to run the various vCPUs, in order, for instance, to get the best power
   efficiency for the host as a whole, he can just avoid specifying any
   `vcpuclass`, or do something like this:

        vcpuclass = ["all:all"]

# Limitations

* While in *phase 1*, it won't be possible to use vCPU hard-affinity for
  anything other than HMP support;
* until *phase 3*, since HMP support is basically the same as setting
  hard-affinity, performance may not be ideal;
* until *phase 3*, vCPUs can't move between classes. This means, for
  instance, in the big.LITTLE world, Xen's scheduler can't move a vCPU
  running on a big core to a LITTLE core (e.g., to try to save power).

# Testing

Testing requires an actual AMP/HMP system. On such a system, we at least want
to:

* create a VM **without** specifying `vcpuclass` in its config file, and
  check that the default policy is correctly applied to all vCPUs;
* create a VM **specifying** `vcpuclass` in its config file, and check that
  the classes are assigned to vCPUs appropriately;
* create a VM **specifying** `vcpuclass` in its config file, and check that
  the various vCPUs are not running on any pCPU outside of their respective
  classes.

# Areas for improvement

* Make it possible to test even on non-HMP systems. That could be done by
  making it possible to provide Xen with fake CPU classes for the system
  CPUs (e.g., with boot time parameters);
* implement a way to view the class the vCPUs have been assigned (either as
  part of the output of `xl vcpu-list`, or as a dedicated `xl` subcommand);
* make it possible to dynamically change the class of vCPUs at runtime, with
  `xl` (either via a new parameter to the `vcpu-pin` subcommand, or via a new
  subcommand).

# Known issues

*TBD*.

# References

* [Asymmetric Multi Processing](https://en.wikipedia.org/wiki/Asymmetric_multiprocessing)
* [Heterogeneous Multi Processing](https://en.wikipedia.org/wiki/Heterogeneous_computing)
* [ARM big.LITTLE](https://www.arm.com/products/processors/technologies/biglittleprocessing.php)

# History

------------------------------------------------------------------------
Date       Revision Version  Notes
---------- -------- -------- -------------------------------------------
2016-12-02 1                 RFC of design document
---------- -------- -------- -------------------------------------------