Hey everyone,
So, as a follow-up to what we were discussing in this thread:
  [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
  http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html

I started looking in more detail at scheduling domains in the Linux
kernel. Now, that thread was about CPUID and vNUMA, and their weird way
of interacting, while the thing I'm proposing here is completely
independent of them both.

In fact, no matter whether vNUMA is supported and enabled, and no matter
whether CPUID is reporting accurate, random, meaningful or completely
misleading information, I think that we should do something about how
scheduling domains are built.

Fact is, unless we use 1:1 pinning that stays immutable across the whole
guest lifetime, scheduling domains should not be constructed, in Linux,
by looking at *any* topology information, because that just does not
make any sense when vCPUs move around.

Let me state this again (hoping to make myself as clear as possible): no
matter how good a shape we put CPUID support in, and no matter how
beautifully and consistently it will interact with vNUMA, licensing
requirements and whatever else, it will always be possible for vCPU #0
and vCPU #3 to be scheduled on two SMT threads at time t1, and on two
different NUMA nodes at time t2. Hence, the Linux scheduler should
really not skew its load balancing logic toward either of those two
situations, as neither of them could be considered correct (since
nothing is!).

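To make the t1/t2 argument concrete, here is a toy model in Python. The
host layout (2 NUMA nodes, 2 cores per node, 2 SMT threads per core) and
the two placements are assumptions picked purely for illustration; they
are not taken from any of the test hosts, and the helper name is
hypothetical.

```python
# Toy host topology, assumed for illustration only:
# 2 NUMA nodes x 2 cores per node x 2 SMT threads per core = 8 pCPUs,
# numbered so that pCPUs 0-1 share core 0, pCPUs 0-3 share node 0, etc.
THREADS_PER_CORE = 2
CORES_PER_NODE = 2


def relation(pcpu_a, pcpu_b):
    """Describe how two distinct pCPUs are related in the toy topology."""
    core_a = pcpu_a // THREADS_PER_CORE
    core_b = pcpu_b // THREADS_PER_CORE
    node_a = core_a // CORES_PER_NODE
    node_b = core_b // CORES_PER_NODE
    if core_a == core_b:
        return "SMT siblings"
    if node_a == node_b:
        return "same NUMA node"
    return "different NUMA nodes"


# vCPU -> pCPU placements, as the Xen scheduler could pick them at two
# different points in time when the vCPUs are not pinned.
placement_t1 = {0: 0, 3: 1}
placement_t2 = {0: 2, 3: 6}

for label, placement in (("t1", placement_t1), ("t2", placement_t2)):
    print(label, "vCPU #0 and vCPU #3 are:",
          relation(placement[0], placement[3]))
```

Whatever relationship the guest bakes into its scheduling domains at
boot, it matches at most one of the two placements, and only for as long
as that placement lasts.
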
For now, this only covers the PV case. The HVM case shouldn't be any
different, but I haven't looked at how to make the same thing happen
there as well.

OVERALL DESCRIPTION
===================
What this RFC patch does is, in the Xen PV case, configure scheduling
domains in such a way that there is only one of them, spanning all the
vCPUs of the guest.

Note that the patch deals directly with scheduling domains, and there is
no need to alter the masks that will then be used for building and
reporting the topology (via CPUID, /proc/cpuinfo, sysfs, etc.). That is
the main difference between it and the patch proposed by Juergen here:
  http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html

This means that when, in the future, we fix CPUID handling and make it
comply with whatever logic or requirements we want, that won't have any
unexpected side effects on scheduling domains.

Information about how the scheduling domains are being constructed
during boot is available in 'dmesg', if the kernel is booted with the
'sched_debug' parameter. It is also possible to look at
/proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.

With the patch applied, only one scheduling domain is created, called
the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
tell that from the fact that every cpu* folder in
/proc/sys/kernel/sched_domain/ has only one subdirectory ('domain0'),
with all the tweaks and the tunables for our scheduling domain.

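As a quick way of doing that check, something like the following Python
sketch could be used. It only assumes the cpuN/domainM directory layout
described above; the function names are made up for this example, and
the root path can be overridden on the command line if the same layout
lives somewhere else on your kernel.

```python
#!/usr/bin/env python3
"""Count the scheduling domain levels exposed for each CPU."""
import os
import re
import sys

# Path taken from the text above; pass another root as argv[1] if the
# same cpuN/domainM layout is exposed somewhere else.
DEFAULT_ROOT = "/proc/sys/kernel/sched_domain"


def domains_per_cpu(root=DEFAULT_ROOT):
    """Return {cpu_dir_name: sorted list of domainN subdirectory names}."""
    result = {}
    for cpu in os.listdir(root):
        cpu_path = os.path.join(root, cpu)
        if not re.fullmatch(r"cpu\d+", cpu) or not os.path.isdir(cpu_path):
            continue
        result[cpu] = sorted(
            d for d in os.listdir(cpu_path)
            if re.fullmatch(r"domain\d+", d)
            and os.path.isdir(os.path.join(cpu_path, d))
        )
    return result


def is_flat(per_cpu):
    """True if every CPU has exactly one domain level, 'domain0'."""
    return bool(per_cpu) and all(d == ["domain0"] for d in per_cpu.values())


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else DEFAULT_ROOT
    if not os.path.isdir(root):
        print(root + ": not found, nothing to check")
    else:
        per_cpu = domains_per_cpu(root)
        for cpu in sorted(per_cpu, key=lambda name: int(name[3:])):
            print(cpu, " ".join(per_cpu[cpu]))
        print("single flat domain:", is_flat(per_cpu))
```

On a patched kernel, every cpuN line should list just 'domain0'; on a
vanilla one, more than one level is expected wherever the guest sees any
topology at all.
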
EVALUATION
==========
I've tested this with UnixBench, and by looking at Xen build time, on
hosts with 16, 24 and 48 pCPUs. I've run the benchmarks in Dom0 only,
for now, but I plan to re-run them in DomUs soon (Juergen may be doing
something similar to this in DomU already, AFAIU).

I've run the benchmarks with and without the patch applied ('patched'
and 'vanilla', respectively, in the tables below), and with different
numbers of build jobs (in the case of the Xen build) or of parallel
copies of the benchmark (in the case of UnixBench).

What I get from the numbers is that the patch almost always brings
benefits, in some cases even huge ones. There are a couple of cases
where we regress, but only ever slightly, especially when compared with
the magnitude of some of the improvements that we get.

Bear also in mind that these results were gathered in Dom0, and without
any overcommitment at the vCPU level (i.e., nr. vCPUs == nr. pCPUs). If
we move things to DomU and do overcommit at the Xen scheduler level, I
am expecting even better results.