On 19/10/18 09:09, Dario Faggioli wrote:
> On Thu, 2018-10-18 at 18:46 +0100, Andrew Cooper wrote:
>> Hello,
>>
> Hey,
>
> This is very accurate and useful... thanks for it. :-)
>
>> 1) A secrets-free hypervisor.
>>
>> Basically every hypercall can be (ab)used by a guest, and used as an
>> arbitrary cache-load gadget.  Logically, this is the first half of a
>> Spectre SP1 gadget, and is usually the first stepping stone to
>> exploiting one of the speculative sidechannels.
>>
>> Short of compiling Xen with LLVM's Speculative Load Hardening (which
>> is
>> still experimental, and comes with a ~30% perf hit in the common
>> case),
>> this is unavoidable.  Furthermore, throwing a few
>> array_index_nospec()
>> into the code isn't a viable solution to the problem.
>>
>> An alternative option is to have less data mapped into Xen's virtual
>> address space - if a piece of memory isn't mapped, it can't be loaded
>> into the cache.
>>
>> [...]
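
(To make the "cache-load gadget" point concrete, here is a minimal,
self-contained sketch in the spirit of the above.  The names and the
simplified clamp are made up for illustration; this is not Xen code, and
the real array_index_nospec() uses asm to stop the compiler from turning
the mask back into a predictable branch.)

    /* Illustrative only: a Spectre-v1 style pattern and its clamp.
     * handle_guest_index() and guest_table[] are hypothetical names. */
    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_SIZE 256
    static uint8_t guest_table[TABLE_SIZE];

    /* Simplified stand-in for array_index_nospec(): evaluates to 'idx'
     * when it is in bounds and to 0 otherwise, via a data dependency
     * (a mask) rather than a branch the CPU could mispredict. */
    static inline size_t index_nospec(size_t idx, size_t size)
    {
        size_t mask = (size_t)0 - (size_t)(idx < size);
        return idx & mask;
    }

    uint8_t handle_guest_index(size_t guest_idx)
    {
        if ( guest_idx >= TABLE_SIZE )
            return 0;
        /*
         * Without the clamp, the CPU may speculate past the bounds check
         * and load guest_table[guest_idx] for an attacker-controlled
         * guest_idx, pulling an arbitrary cache line in - the first half
         * of an SP1 gadget.
         */
        guest_idx = index_nospec(guest_idx, TABLE_SIZE);
        return guest_table[guest_idx];
    }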
>>
>> 2) Scheduler improvements.
>>
>> (I'm afraid this is rather more sparse because I'm less familiar with
>> the scheduler details.)
>>
>> At the moment, all of Xen's schedulers will happily put two vcpus
>> from
>> different domains on sibling hyperthreads.  There has been a lot of
>> sidechannel research over the past decade demonstrating ways for one
>> thread to infer what is going on in the other, but L1TF is the first
>> vulnerability I'm aware of which allows one thread to directly read
>> data
>> out of the other.
>>
>> Either way, it is now definitely a bad thing to run different guests
>> concurrently on siblings.  
>>
> Well, yes. But, as you say, L1TF, and I'd say TLBleed as well, are the
> first serious issues discovered so far and, for instance, even on x86,
> not all Intel CPUs are affected and, AFAIK, none of the AMD ones are.

TLBleed is an excellent paper and associated research, but is still just
inference - a vast quantity of post-processing is required to extract
the key.

There are plenty of other sidechannels which affect all SMT
implementations, such as the effects of executing an mfence instruction,
execution unit contention, and so on.

> Therefore, although I certainly think we _must_ have the proper
> scheduler enhancements in place (and in fact I'm working on that :-D)
> it should IMO still be possible for the user to decide whether or not
> to use them (either by opting-in or opting-out, I don't care much at
> this stage).

I'm not suggesting that we leave people without a choice, but given an
option which doesn't share siblings between different guests, it should
be the default.

>
>> Fixing this by simply not scheduling vcpus
>> from a different guest on siblings does result in a lower resource
>> utilisation, most notably when there is an odd number of runnable vcpus
>> in
>> a domain, as the other thread is forced to idle.
>>
> Right.
>
>> A step beyond this is core-aware scheduling, where we schedule in
>> units
>> of a virtual core rather than a virtual thread.  This has much better
>> behaviour from the guest's point of view, as the actually-scheduled
>> topology remains consistent, but does potentially come with even
>> lower
>> utilisation if every other thread in the guest is idle.
>>
> Yes, basically, what you describe as 'core-aware scheduling' here can
> be built on top of what you had described above as 'not scheduling
> vcpus from different guests'.
>
> I mean, we can/should put ourselves in a position where the user can
> choose whether he/she wants:
> - just 'plain scheduling', as we have now,
> - "just" that only vcpus of the same domains are scheduled on siblings
> hyperthread,
> - full 'core-aware scheduling', i.e., only vcpus that the guest
> actually sees as virtual hyperthread siblings, are scheduled on
> hardware hyperthread siblings.
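
(For what it's worth, the three levels above boil down to an admission
check along these lines.  This is a standalone sketch with made-up types
and names, not the actual scheduler interfaces:)

    /* Illustrative only: hypothetical types, not Xen's sched code. */
    #include <stdbool.h>
    #include <stddef.h>

    enum smt_policy {
        SMT_POLICY_NONE,        /* plain scheduling, as today */
        SMT_POLICY_SAME_DOMAIN, /* only vcpus of one domain on siblings */
        SMT_POLICY_CORE_AWARE,  /* only virtual siblings on hw siblings */
    };

    struct vcpu_info {
        int domain_id;
        int vcore_id;           /* which virtual core this vcpu belongs to */
    };

    /* May 'candidate' be scheduled on a hyperthread whose sibling is
     * currently running 'running'?  (running == NULL means the sibling
     * is idle, which is always fine.) */
    static bool siblings_compatible(enum smt_policy policy,
                                    const struct vcpu_info *candidate,
                                    const struct vcpu_info *running)
    {
        if ( running == NULL )
            return true;

        switch ( policy )
        {
        case SMT_POLICY_NONE:
            return true;
        case SMT_POLICY_SAME_DOMAIN:
            return candidate->domain_id == running->domain_id;
        case SMT_POLICY_CORE_AWARE:
            return candidate->domain_id == running->domain_id &&
                   candidate->vcore_id == running->vcore_id;
        }

        return false;
    }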
>
> About the performance impact, indeed it's even higher with core-aware
> scheduling. Something we can look into doing is acting on the
> guest scheduler, e.g., telling it to try to "pack the load", and keep
> siblings busy, instead of trying to avoid doing that (which is what
> happens by default in most cases).
>
> In Linux, this can be done by playing with the sched-flags (see, e.g.,
> https://elixir.bootlin.com/linux/v4.18/source/include/linux/sched/topology.h#L20 ,
> and /proc/sys/kernel/sched_domain/cpu*/domain*/flags ).
>
> The idea would be to avoid, as much as possible, the case when "every
> other thread is idle in the guest". I'm not sure about being able to do
> something by default, but we can certainly document things (like "if
> you enable core-scheduling, also do `echo 1234 > /proc/sys/.../flags'
> in your Linux guests").
>
> I haven't checked whether other OSs' schedulers have something similar.
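
(Along the lines of the `echo ... > .../flags' suggestion, a little
read-modify-write helper could look like the sketch below.  The path and
the bit values are up to the user - see the topology.h link above for the
SD_* definitions - and it assumes the flags files accept writes, as
suggested.  Purely illustrative.)

    /* Illustrative only: OR extra bits into a Linux sched_domain flags
     * file, e.g. /proc/sys/kernel/sched_domain/cpu0/domain0/flags. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        unsigned long flags, set_bits;
        const char *path;
        FILE *f;

        if ( argc != 3 )
        {
            fprintf(stderr, "usage: %s <flags-file> <bits-to-set>\n", argv[0]);
            return 1;
        }
        path = argv[1];
        set_bits = strtoul(argv[2], NULL, 0);   /* bits to OR in */

        f = fopen(path, "r");
        if ( !f || fscanf(f, "%lu", &flags) != 1 )
        {
            fprintf(stderr, "failed to read %s\n", path);
            return 1;
        }
        fclose(f);

        f = fopen(path, "w");
        if ( !f || fprintf(f, "%lu\n", flags | set_bits) < 0 )
        {
            fprintf(stderr, "failed to write %s\n", path);
            return 1;
        }
        fclose(f);

        printf("flags: %#lx -> %#lx\n", flags, flags | set_bits);
        return 0;
    }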
>
>> A side requirement for core-aware scheduling is for Xen to have an
>> accurate idea of the topology presented to the guest.  I need to dust
>> off my Toolstack CPUID/MSR improvement series and get that upstream.
>>
> Indeed. Without knowing which of the guest's vcpus are to be
> considered virtual hyperthread siblings, I can only get you as far as
> "only scheduling vcpus of the same domain on sibling hyperthreads". :-)
>
>> One of the most insidious problems with L1TF is that, with
>> hyperthreading enabled, a malicious guest kernel can engineer
>> arbitrary
>> data leakage by having one thread scanning the expected physical
>> address, and the other thread using an arbitrary cache-load gadget in
>> hypervisor context.  This occurs because the L1 data cache is shared
>> by
>> threads.
>>
> Right. So, sorry if this is a stupid question, but how does this relate
> to the "secret-free hypervisor", and to the "if a piece of memory
> isn't mapped, it can't be loaded into the cache" idea?
>
> So, basically, I'm asking whether I am understanding it correctly that
> secret-free Xen + core-aware scheduling would *not* be enough for
> mitigating L1TF properly (and if the answer is no, why... but only if
> you have 5 mins to explain it to me :-P).
>
> In fact, ISTR that core-scheduling plus something that looked to me
> similar enough to "secret-free Xen", is how Microsoft claims to be
> mitigating L1TF on hyper-v...

Correct - that is what HyperV appears to be doing.

It's best to consider the secret-free Xen and scheduler improvements as
orthogonal.  In particular, the secret-free Xen is defence in depth
against SP1, and the risk of future issues, but does have
non-speculative benefits as well.

That said, the only way to use HT and definitely be safe against L1TF without
a secret-free Xen is to have the synchronised entry/exit logic working.

>> A solution to this issue was proposed, whereby Xen synchronises
>> siblings
>> on vmexit/entry, so we are never executing code in two different
>> privilege levels.  Getting this working would make it safe to
>> continue
>> using hyperthreading even in the presence of L1TF.  
>>
> Err... ok, but we still want core-aware scheduling, or at least we want
> to avoid having vcpus from different domains on siblings, don't we? In
> order to avoid leaks between guests, I mean.

Ideally, we'd want all of these.  I expect the only reasonable way to
develop them is one on top of another.
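
(For completeness, the synchronised entry/exit idea is essentially a
per-core rendezvous, something like the rough sketch below.  Names and
structure are hypothetical - the real thing needs IPIs, timeouts and
careful interaction with the scheduler - so treat it as a picture, not a
design:)

    /* Illustrative only: never let one thread run hypervisor code while
     * its sibling is still in guest context, and vice versa. */
    #include <stdatomic.h>

    struct core_sync {
        atomic_int in_hypervisor;   /* sibling threads currently exited */
        int nr_siblings;            /* threads on this core (2 with HT) */
    };

    /* On vmexit: wait until every sibling has also left guest context. */
    static void rendezvous_on_vmexit(struct core_sync *cs)
    {
        atomic_fetch_add(&cs->in_hypervisor, 1);
        while ( atomic_load(&cs->in_hypervisor) < cs->nr_siblings )
            ;   /* in reality, kick the sibling out of the guest with an IPI */
    }

    /* Just before vmentry: re-enter guest context together. */
    static void rendezvous_on_vmentry(struct core_sync *cs)
    {
        atomic_fetch_sub(&cs->in_hypervisor, 1);
        while ( atomic_load(&cs->in_hypervisor) > 0 )
            ;   /* wait until all siblings are ready to re-enter */
    }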

~Andrew
