[Xen-devel] Consensus in Parallel Universe Responses to Spectre/Meltdown

Rich Persaud Wed, 10 Jan 2018 19:40:07 -0800

> On Jan 10, 2018, at 11:39, Ian Jackson <ian.jack...@eu.citrix.com> wrote:
> 
> Jan Beulich writes ("Re: Radical proposal v2: Publish Amazon's verison now, 
> Citrix's version soon"):
>> There are a couple of instances of "a branch", and I'm not really
>> clear on which one that would be, yet in part my opinion depends
>> on that, as this will affect what state certain branches will be in
>> for subsequent work. As I agree with the PVH shim being the
>> better baseline for work going forward, in particular I wouldn't like
>> to see the Vixen series becoming the base of any branch going to
>> be maintained going forward.
> 
> Anthony Liguori writes ("Re: [Xen-devel] Radical proposal v2: Publish 
> Amazon's verison now, Citrix's version soon"):
>> What I would suggest is the following:
>> 1) Merge Vixen into staging
>> 2) Backport Vixen into stable-4.10 and cut a release
> 
> We do not have time any longer (if we had time to start with) to
> reconcile these divergent views.


[ Disclaimer: non-technical, PM-oriented message ahead.  Opinions expressed are 
those of this individual and not former/current employers or clients. ]

Having worked with Xen since 2005, helping to ship XenServer, XenClient and 
OpenXT, I would like to challenge assertions that the community of Xen (or 
$HW_vendor or $OS_vendor or $APP_vendor) developers and users must settle for a 
non-consensus long-term mitigation.

Across the computer industry, it is clear that a small subset of specialists 
have known about this issue for some time:  developers who worked on candidate 
fixes ahead of the public announcement, experts who warned about 
microarchitecture risks years ago, and any adversaries who acted on their 
warnings.  Some people had advance information & time to consider candidate 
solutions; most [1] of the world did not.

As a customer of $HW_vendor / Xen / $OS_vendor / $APP_vendor, the last thing I 
want to hear is that world-class specialists who have had weeks/months to 
evaluate candidate fixes have been unable to reach agreement and propose to 
delegate the decision TO CUSTOMERS (?!)  That would be customers with only days 
of exposure to the CVE details, who still have to keep their regular business 
running, while trying to understand a complex security issue that eluded 
experts for decades.

As a general-purpose open-source hypervisor not tied to one operating system or 
use case, Xen has always been susceptible to fragmentation.  It can seem easier 
to make private modifications vs. upstreaming/revising changes for acceptance 
to a public codebase that serves many stakeholders.  The non-upstreamed Xen 
forks of Amazon EC2, Citrix XenClient and Bromium have made the Xen community 
less strong, by reducing public Xen contributions, including but not limited to 
security.

Yet… a once-in-decades (?) security issue has brought forth a public 
contribution from the private, parallel Xen universe that is Amazon EC2, 
accompanied by engineering resources to review and test a long-term solution 
that could be acceptable to the public Xen codebase.  If merged, this 
contribution could reduce fragmentation, increase dev/test resources and expand 
the risk pool of Xen customers sharing a common, battle-tested mitigation.  
Reconciliation of private/public Xen universes is never easy, but in this 
unique instance it would make the Xen community stronger.

Notes:

1) PVH is widely acknowledged as the long-term future of Xen.  The recent 
security issue makes it even more important that PVH be well designed and 
widely tested with resources from the broader Xen community.  Strategic PVH 
improvements need not be rushed to solve a tactical security issue. There are a 
number of Xen and security companies who have not yet contributed to recent 
public Xen design discussions, including Bromium.  If there is a lack of design 
consensus, we can call upon additional Xen contributors to help achieve 
consensus.

2) Security:  large swathes of customers in many markets have neither the time 
nor expertise to make complex security decisions which involve functional 
tradeoffs and non-obvious interactions among opaque operating systems, 
microcode and hardware.  This category of customer wants to know that (a) they 
are no worse off than "most" other customers, (b) they are on a supported path 
that will lead to a widely deployed long-term solution, and (c) they have clear 
documentation on operational constraints during their journey from temporary 
fix to long-term solution.

3) Community:  given the unprecedented nature of this Amazon code contribution, 
it is in the interest of the Xen community for Amazon EC2 to migrate to a 
solution that is used by upstream Xen customers.  This requires an incremental, 
bisectable path between already-deployed EC2 Vixen and in-development PVH shim.  
The reason for the Xen community to support this approach, however difficult, 
is to un-fork Amazon's version of Xen, in exchange for expanding the pool of 
Xen deployments which share a common risk profile.

Henry Baker posted [2] to the Cryptography mailing list about optimistic 
concurrency control:

"… Speculation is an extremely common, and an entirely human, reaction to 
*latency*. If the latency of some operation is too long, we pretend that the 
most common case is occurring and try to fix things later if/when we find out 
that we've been wrong … Only in the 20th and 21st centuries have we had the 
luxury of speed-of-light communications and sub-second latencies, so that we 
can often replace *optimistic* concurrency control with *pessimistic* 
concurrency control.  When an ancient Roman general took his army over the 
horizon, he might be out of contact for *months*, so pessimistic concurrency 
control simply wasn't an option. If he screwed up, the only option was to send 
*another general and another army* over the horizon to repair the damage that 
the first general and army had caused."

Some teams working with Xen appear to have developed independent (?) 
mitigations for Spectre/Meltdown, in advance of public disclosure.  Some of 
those mitigations have already shipped to commercial customers.  By definition, 
each parallel, private effort addressed a narrower set of requirements than 
those of the broader Xen community.  We have now moved on to real-time 
coordination where early private assumptions can be revisited in public, as we 
seek consensus on a unified, long-term solution and timeline for addressing 
every individual PV feature regression.

Each privately developed mitigation may be useful to a subset of Xen users.  
Each can be hosted in a short-lived branch, with documentation of tradeoffs and 
an index of all mitigation branches.  Organizations with the time and resources 
to evaluate these short-lived branches may adopt one that matches their 
constraints.  But the broader Xen community needs consensus on the path to a 
unified solution that can be merged to release trees *and widely deployed*.

However long it takes to realize broad consensus on a solution for release 
branches, consensus is what customers expect in a critical fix from a trusted 
open-source provider of general-purpose virtualization.  If customers were 
satisfied with less, they would be using a more narrowly-focused hypervisor.  
We can educate users that for many software stacks, Spectre/Meltdown is not a 
"patch and forget" security issue, but one that will have industry-wide 
operational and economic impact for months/years.

Rich

[1] https://techcrunch.com/2018/01/06/how-tier-2-cloud-vendors-banded-together-to-cope-with-spectre-and-meltdown/
[2] http://www.metzdowd.com/pipermail/cryptography/2018-January/033541.html
