On 7/17/2017 6:08 PM, Huang, Kai wrote:
Hi Andrew,
Thank you very much for comments. Sorry for late reply, and please see
my reply below.
On 7/12/2017 2:13 AM, Andrew Cooper wrote:
On 09/07/17 09:03, Kai Huang wrote:
Hi all,
This series is RFC Xen SGX virtualization support design and RFC
draft patches.
Thank you very much for this design doc.
2. SGX Virtualization Design
2.1 High Level Toolstack Changes:
2.1.1 New 'epc' parameter
EPC is a limited resource. In order to use EPC efficiently among all
domains, the administrator should be able to specify a domain's virtual
EPC size when creating the guest. The admin should also be able to get
all domains' virtual EPC sizes.
For this purpose, a new 'epc = <size>' parameter is added to the XL
configuration file. This parameter specifies the guest's virtual EPC
size. The EPC base address will be calculated by the toolstack
internally, according to the guest's memory size, MMIO size, etc. 'epc'
is in units of MB and any 1MB-aligned value will be accepted.
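As a rough illustration only (not taken from the RFC patches), the
toolstack-side check of the 'epc' value and its conversion to 4K EPC
pages might look like the sketch below. The function name
epc_mb_to_pages is hypothetical.

```python
PAGE_SIZE = 4096          # EPC is managed in 4K pages
MB = 1024 * 1024


def epc_mb_to_pages(epc_mb):
    """Convert an 'epc = <size>' value (in MB) into a count of 4K EPC pages.

    Hypothetical helper: only whole, positive MB values are accepted,
    which is one way to read "any 1MB aligned value".
    """
    if isinstance(epc_mb, bool) or not isinstance(epc_mb, int):
        raise ValueError("epc must be a whole number of MB")
    if epc_mb <= 0:
        raise ValueError("epc must be positive")
    return epc_mb * MB // PAGE_SIZE
```

For example, epc = 64 would correspond to 16384 EPC pages.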
How will this interact with multi-package servers? Even though it's
fine to implement the single-package support first, the design should
be extensible to the multi-package case.
First of all, what are the implications of multi-package SGX?
(Somewhere) you mention changes to scheduling. I presume this is
because a guest with EPC mappings in EPT must be scheduled on the same
package, or ENCLU[EENTER] will fail. I presume also that each package
will have separate, unrelated private keys?
ENCLU[EENTER] will continue to work on multi-package servers. Actually
I was told that all existing ISA behavior documented in the SDM won't
change for servers, as otherwise this would be a bad design :)
Unfortunately I was told I cannot talk much about MP server SGX now.
Basically I can only talk about stuff already documented in the SDM
(sorry :( ). But I guess multiple EPC sections in CPUID are designed to
cover MP servers, at least mainly (we can make a reasonable guess).
In terms of the design, I think we can follow the XL config file
parameters for memory. The 'epc' parameter will always specify the total
EPC size that the domain has. And we can use existing NUMA related
parameters, such as setting cpus='...' to physically pin vcpus to
specific pCPUs, so that EPC will be mostly allocated from the related
node. If that node runs out of EPC, we can decide whether to allocate
EPC from another node, or fail to create the domain. I know Linux
supports a NUMA policy which can specify whether to allow allocating
memory from other nodes; does Xen have such a policy? Sorry I haven't
checked this. If Xen has such a policy, we need to choose whether to use
the memory policy, or introduce a new policy for EPC.
If we are going to support vNUMA EPC in the future, we can also use a
similar way to configure vNUMA EPC in the XL config.
Sorry I mentioned scheduling. I should say *potentially* :). My thinking
was that as SGX is per-thread, the SGX info reported by different CPU
packages may be different (e.g. whether SGX2 is supported), so we may
need the scheduler to be aware of SGX. But I think we don't have to
consider this now.
What are your comments?
I presume there is no sensible way (even on native) for a single
logical process to use multiple different enclaves? By extension,
does it make sense to try and offer parts of multiple enclaves to a
single VM?
The native machine allows running multiple enclaves, even signed by
multiple authors. The only limit SGX imposes is that the Launch Enclave
(LE) must be launched before any other enclave. The LE is the only
enclave that doesn't require an EINITTOKEN in EINIT. For the LE, its
signer (SHA256(sigstruct->modulus)) must be equal to the value in the
IA32_SGXLEPUBKEYHASHn MSRs. The LE generates EINITTOKENs for other
enclaves (EINIT for other enclaves requires an EINITTOKEN). For other
enclaves, there's no such limitation that the enclave's signer must
match IA32_SGXLEPUBKEYHASHn, so the signer can be anybody. But before
running EINIT for other enclaves, the LE's signer (which is equal to
IA32_SGXLEPUBKEYHASHn as explained above) needs to be written to
IA32_SGXLEPUBKEYHASHn (the MSRs can be changed, for example, when there
are multiple LEs running in the OS). This is because EINIT needs to
perform the EINITTOKEN integrity check (the EINITTOKEN contains a MAC
calculated by the LE, and EINIT needs the LE's IA32_SGXLEPUBKEYHASHn to
derive the key to verify the MAC).
SGX in a VM doesn't change those behaviors, so in a VM the enclaves can
also be signed by anyone, but Xen needs to emulate IA32_SGXLEPUBKEYHASHn
so that when a VM is running, the correct IA32_SGXLEPUBKEYHASHn values
are already in the physical MSRs.
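To make the signer check concrete, here is a minimal sketch of how the
four IA32_SGXLEPUBKEYHASHn values could be derived from a SIGSTRUCT
modulus. It assumes the modulus is the 384-byte value exactly as stored
in SIGSTRUCT and that the 32-byte digest is split into four
little-endian 64-bit words starting at MSR 0x8C; the function name is
hypothetical and the layout should be checked against the SDM.

```python
import hashlib
import struct

# IA32_SGXLEPUBKEYHASH0 is at MSR address 0x8C; HASH1..3 follow it.
MSR_IA32_SGXLEPUBKEYHASH0 = 0x8C


def lepubkeyhash_msrs(modulus):
    """Return {msr_index: 64-bit value} for SHA256(modulus).

    Assumption: the digest maps onto the MSRs as four little-endian
    64-bit words, lowest word in HASH0.
    """
    if len(modulus) != 384:
        raise ValueError("SIGSTRUCT modulus must be 384 bytes")
    digest = hashlib.sha256(modulus).digest()      # 32 bytes
    words = struct.unpack("<4Q", digest)           # 4 x 64-bit words
    return {MSR_IA32_SGXLEPUBKEYHASH0 + i: w for i, w in enumerate(words)}
```

An LE would pass EINIT only if the values computed this way from its
own SIGSTRUCT match what is currently in the MSRs.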
2.1.3 Notify domain's virtual EPC base and size to Xen
Xen needs to know the guest's EPC base and size in order to populate
EPC pages for it. The toolstack notifies Xen of the EPC base and size
via XEN_DOMCTL_set_cpuid.
I am currently in the process of reworking the Xen/Toolstack interface
when it comes to CPUID handling. The latest design is available here:
https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg00378.html
but the end result will be the toolstack expressing its CPUID policy
in terms of the architectural layout.
Therefore, I would expect that, however the setting is represented in
the configuration file, xl/libxl would configure it with the
hypervisor by setting CPUID.0x12[2] with the appropriate base and size.
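For reference, a small sketch of how a base and size could be packed
into the four registers of CPUID.0x12[2], following the EPC section
layout in the SDM (low nibble of EAX/ECX set to 1 for a valid section,
bits 31:12 in EAX/ECX, bits 51:32 in EBX/EDX). The function name is
hypothetical and this is not the libxl interface.

```python
def encode_epc_cpuid(base, size):
    """Encode an EPC section as (eax, ebx, ecx, edx) for CPUID.0x12[2]."""
    if size == 0 or base % 4096 or size % 4096:
        raise ValueError("EPC base/size must be 4K aligned and size non-zero")
    if base >> 52 or size >> 52:
        raise ValueError("EPC base/size must fit in 52 bits")
    eax = (base & 0xFFFFF000) | 0x1     # base bits 31:12, type 1 = valid section
    ebx = (base >> 32) & 0xFFFFF        # base bits 51:32
    ecx = (size & 0xFFFFF000) | 0x1     # size bits 31:12, section properties = 1
    edx = (size >> 32) & 0xFFFFF        # size bits 51:32
    return eax, ebx, ecx, edx
```

For instance, a 64MB virtual EPC placed at 4GB would be encoded as
EAX=0x1, EBX=0x1, ECX=0x04000001, EDX=0x0.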
I agree. I saw you are planning to introduce new
XEN_DOMCTL_get{set}_cpuid_policy, which will allow toolstack to
query/set cpuid policy in single hypercall (if I understand correctly),
so I think we should definitely use the new hypercalls.
I also saw you are planning to introduce new hypercall to query
raw/host/pv_max/hvm_max cpuid policy (not just featureset), so I think
'xl sgxinfo' (or xl info -sgx) can certainly use that to get physical
SGX info (EPC info). And 'xl sgxlist' (or xl list -sgx) can use
XEN_DOMCTL_get{set}_cpuid_policy to display domain's SGX info (EPC info).
Btw, do you think we need 'xl sgxinfo' and 'xl sgxlist'? If we do, which
is better? New 'xl sgxinfo' and 'xl sgxlist', or extending existing 'xl
info' and 'xl list' to support SGX, such as 'xl info -sgx' and 'xl list
-sgx' above?
2.1.4 Launch Control Support (?)
Xen Launch Control support is about supporting running multiple domains
with each running its own LE signed by different owners (if HW allows,
explained below). As explained in 1.4 SGX Launch Control, EINIT for the
LE (Launch Enclave) only succeeds when SHA256(SIGSTRUCT.modulus) matches
IA32_SGXLEPUBKEYHASHn, and EINIT for other enclaves will derive the
EINITTOKEN key according to IA32_SGXLEPUBKEYHASHn. Therefore, to support
this, the guest's virtual IA32_SGXLEPUBKEYHASHn must be written to the
physical MSRs before EINIT (which also means the physical
IA32_SGXLEPUBKEYHASHn need to be *unlocked* in BIOS before booting to
the OS).
For a physical machine, it is the BIOS writer's decision whether the
BIOS provides an interface for the user to specify a customized
IA32_SGXLEPUBKEYHASHn (it defaults to the digest of Intel's signing key
after reset). In reality, the OS's SGX driver may require the BIOS to
leave the MSRs *unlocked* and actively write the hash value to the MSRs
in order to run EINIT successfully, as in this case the driver will not
depend on the BIOS's capability (whether it allows the user to customize
the IA32_SGXLEPUBKEYHASHn value).
The problem is, for Xen, do we need a new parameter, such as
'lehash=<SHA256>', to specify the default value of the guest's virtual
IA32_SGXLEPUBKEYHASHn? And do we need a new parameter, such as 'lewr',
to specify whether the guest's virtual MSRs are locked or not before
handing over to the guest's OS?
I tend to not introduce 'lehash', as it seems the SGX driver would
actively update the MSRs. And a new parameter would add additional
changes for upper layer software (such as openstack). And 'lewr' is not
needed either, as Xen can always *unlock* the MSRs to the guest.
Please give comments?
Currently in my RFC patches the above two parameters are not
implemented. The Xen hypervisor will always *unlock* the MSRs. Whether
there is a 'lehash' parameter or not doesn't impact the Xen hypervisor's
emulation of IA32_SGXLEPUBKEYHASHn. See the Xen hypervisor changes below
for details.
Reading around, am I correct with the following?
1) Some processors have no launch control. There is no restriction on
which enclaves can boot.
Yes, some processors have no launch control. However it doesn't mean
there's no restriction on which enclaves can boot. On the contrary, on
those machines only Intel's Launch Enclave (LE) can run, as on those
machines IA32_SGXLEPUBKEYHASHn either doesn't exist, or is equal to the
digest of Intel's signing RSA pubkey. However, although only Intel's LE
can be run, we can still run other enclaves from other signers. Please
see my reply above.
2) Some Skylake client processors claim to have launch control, but
the MSRs are unavailable (is this an erratum?). These are limited to
booting enclaves matching the Intel public key.
Sorry I don't know whether this is an erratum. I will get back to you
after confirming internally.
Hi Andrew,
I raised this internally, and it turns out that in the latest SDM Intel
has fixed the statement, so that the IA32_SGXLEPUBKEYHASHn MSRs are only
available when both SGX and SGX_LC are present in CPUID. When I was
writing the design and patches, I was referring to an old SDM, and the
old one doesn't mention SGX_LC in CPUID as a condition. So it is my
fault, and this statement has been fixed in the latest SDM (41.2.2 Intel
SGX Launch Control Configuration):
https://software.intel.com/sites/default/files/managed/7c/f1/332831-sdm-vol-3d.pdf
However in latest SDM volume 4: Model-Specific Registers:
https://software.intel.com/sites/default/files/managed/22/0d/335592-sdm-vol-4.pdf
You can still see that for IA32_SGXLEPUBKEYHASHn (table 2-2, register
address 8CH): "Read permitted If CPUID.(EAX=12H,ECX=0H):EAX[0]=1". So
there's still an error in the SDM.
I don't think this will be an erratum. Intel will fix the error in vol 4
in the next version of the SDM. We should refer to 41.2.2 as it has the
accurate description.
3) Launch control may be locked by the BIOS. There may be a custom
hash, or it might be the Intel default. Xen can't adjust it at all,
but can support running any number of VMs with matching enclaves.
Yes, launch control may be locked by the BIOS, although this depends on
whether the BIOS provides an interface for the user to configure it. I
was told that typically the BIOS will unlock launch control, as the SGX
driver is expecting such behavior. But I am not sure we can always
assume this.
Whether there will be a custom hash also depends on the BIOS. The BIOS
may or may not provide an interface for the user to configure a custom
hash. So on a physical machine, I think we need to consider all the
cases. On a machine with launch control *unlocked*, Xen is able to
dynamically change IA32_SGXLEPUBKEYHASHn so that Xen is able to run
multiple VMs with each running an LE from a different signer. However if
launch control is *locked* in the BIOS, then Xen is still able to run
multiple VMs, but all VMs can only run an LE from the signer that
matches IA32_SGXLEPUBKEYHASHn (which in most cases should be the Intel
default, but can be a custom hash if the BIOS allows the user to
configure it).
Sorry, I am not quite sure about the typical BIOS implementation. I
think I can reach out internally and get back to you if I have
something.
I also reached out internally to find out the typical BIOS
implementation in terms of SGX LC. Typically the BIOS will neither
provide configuration options for the user to set a custom hash, nor to
select whether the MSRs are locked or not. Typically for client
machines, the MSRs are locked with the Intel default, and for server
machines, the MSRs are unlocked. But we cannot rule out 3rd parties
providing a different BIOS that has options for the user to choose
locked/unlocked mode, and/or to specify a custom hash. Custom hash +
locked mode may be useful for some special purposes (e.g. IT management)
as it provides the most secure option: even the kernel/VMM can only
launch an LE signed by a particular signer. In the case of a VM, custom
hash + locked mode may be even more useful than on bare metal, as a VM
is usually supposed to run some particular-purpose appliance.
So I think it is better to keep the 'lehash' and 'lewr' XL parameters.
Both are optional: the former provides a custom hash, and the latter
sets the VM to be in unlocked mode. If neither is specified, then the VM
will be in locked mode, and the VM's virtual IA32_SGXLEPUBKEYHASHn will
either have Intel's default value (when the physical machine is
unlocked), or have the machine's MSR values (when the machine is in
locked mode). And when the physical machine is in locked mode,
specifying either 'lehash' or 'lewr' will cause VM creation to fail.
So we have 3 XL parameters for SGX: 'epc', 'lehash' and 'lewr'. Probably
we should consolidate them into one XL parameter, such as
sgx=['epc=<size>', 'lehash=<sha256>', 'lewr=[on|off]'] ?
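The rules above can be written down as a small decision function. This
is only a sketch of the proposal as I read it: the function name, the
INTEL_DEFAULT placeholder and the handling of 'lewr' without 'lehash' on
an unlocked host (guest unlocked, starting from the Intel default) are
my assumptions, not something the patches implement.

```python
INTEL_DEFAULT = "intel-default"   # placeholder, not the real digest


def resolve_launch_control(host_locked, host_hash, lehash=None, lewr=False):
    """Return (guest_hash, guest_unlocked), or raise if the config is invalid.

    host_locked: whether the physical IA32_SGXLEPUBKEYHASHn are locked.
    host_hash:   the value currently in the physical MSRs.
    """
    if host_locked:
        # A locked host cannot honour a custom hash or an unlocked guest.
        if lehash is not None or lewr:
            raise ValueError("host launch control is locked: "
                             "'lehash' and 'lewr' are not allowed")
        return host_hash, False
    # Host unlocked: custom hash if given, Intel default otherwise.
    guest_hash = lehash if lehash is not None else INTEL_DEFAULT
    return guest_hash, bool(lewr)
```

With neither parameter given, the guest is always locked; only the hash
it sees depends on the state of the physical machine.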
Thanks,
-Kai
4) Launch control may be unlocked by the BIOS. In this case, Xen can
context switch a hash per domain, and run all enclaves.
Yes. I think you meant enclave == LE.
The eventual plans for CPUID and MSR levelling should allow all of
these to be expressed in sensible ways, and I don't foresee any issues
with supporting all of these scenarios.
So do you think we should have 'lehash' and 'lewr' parameters in the XL
config file? The former provides a custom hash, and the latter specifies
whether to unlock the guest's launch control.
My thinking is that the SGX driver needs to *actively* write the LE's
pubkey hash to IA32_SGXLEPUBKEYHASHn in *unlocked* mode, so 'lehash'
alone is not needed. 'lehash' only has meaning together with 'lewr', to
provide a default hash value in locked mode; if we always use *unlocked*
mode for the guest, 'lehash' is not necessary.
2.2 High Level Xen Hypervisor Changes:
2.2.1 EPC Management (?)
The Xen hypervisor needs to detect SGX, discover EPC, and manage EPC
before exposing SGX to guests. EPC is detected via SGX CPUID 0x12.0x2.
It's possible that there are multiple EPC sections (enumerated via
sub-leaves 0x3 and so on, until an invalid EPC section is reported), but
this is only true on multiple-socket server machines. For server
machines there are additional things that also need to be done, such as
NUMA EPC, scheduling, etc. We will support server machines in the future
but currently we only support one EPC section.
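A minimal sketch of the enumeration loop described above, assuming the
register layout of the SGX CPUID leaf from the SDM. The 'cpuid' argument
is a hypothetical callable taking (leaf, subleaf) and returning
(eax, ebx, ecx, edx); it stands in for however the caller executes
CPUID.

```python
def enumerate_epc_sections(cpuid):
    """Return a list of (base, size) for every valid EPC section.

    Walks CPUID leaf 0x12 from sub-leaf 2 upwards and stops at the
    first sub-leaf whose type field (EAX[3:0]) is not 1.
    """
    sections = []
    subleaf = 2
    while True:
        eax, ebx, ecx, edx = cpuid(0x12, subleaf)
        if eax & 0xF != 1:                 # invalid sub-leaf: end of list
            break
        base = ((ebx & 0xFFFFF) << 32) | (eax & 0xFFFFF000)
        size = ((edx & 0xFFFFF) << 32) | (ecx & 0xFFFFF000)
        sections.append((base, size))
        subleaf += 1
    return sections
```

On a single-socket machine this is expected to return exactly one
entry.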
EPC is reported as reserved memory (so it is not reported as normal
memory). EPC must be managed in 4K pages. The CPU hardware uses the EPCM
to track the status of each EPC page. Xen needs to manage EPC and
provide functions to, e.g., alloc and free EPC pages for guests.
There are two ways to manage EPC: manage EPC separately, or integrate it
into the existing memory management framework.
It is easy to manage EPC separately, as currently EPC is pretty small
(~100MB), and we can even put all pages in a single list. However it is
not flexible; for example, you will have to write new algorithms when
EPC becomes larger, e.g. GBs. And you have to write new code to support
NUMA EPC (although this will not come in the short term).
Integrating EPC into the existing memory management framework seems more
reasonable, as in this way we can reuse the memory management data
structures/algorithms, and it will be more flexible to support larger
EPC and potentially NUMA EPC. But modifying the MM framework has a
higher risk of breaking existing memory management code (potentially
more bugs).
In my RFC patches we currently choose to manage EPC separately. A new
structure epc_page is added to represent a single 4K EPC page. A whole
array of struct epc_page will be allocated during EPC initialization, so
that either of the PFN of an EPC page and its 'struct epc_page' can be
obtained from the other by adding an offset.
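The array-plus-offset idea can be illustrated with a short sketch. The
class and field names below are hypothetical stand-ins, not the layout
of 'struct epc_page' in the patches; in C the same mapping would be
plain pointer arithmetic on the array.

```python
class EpcPage:
    """Stand-in for one 'struct epc_page' entry (fields are hypothetical)."""
    __slots__ = ("index", "owner")

    def __init__(self, index):
        self.index = index      # position in the array
        self.owner = None       # domain the page is assigned to, if any


class EpcPool:
    """One contiguous EPC section tracked by a flat array of EpcPage."""

    def __init__(self, base_pfn, nr_pages):
        self.base_pfn = base_pfn
        self.pages = [EpcPage(i) for i in range(nr_pages)]

    def pfn_to_page(self, pfn):
        idx = pfn - self.base_pfn           # offset into the array
        if not 0 <= idx < len(self.pages):
            raise ValueError("PFN is not inside the EPC section")
        return self.pages[idx]

    def page_to_pfn(self, page):
        return self.base_pfn + page.index   # reverse direction
```

Both lookups are O(1), which is the main attraction of managing EPC
this way while it is a single small section.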
But maybe integrating EPC to MM framework is more reasonable. Comments?
2.2.2 EPC Virtualization (?)
It looks like managing the EPC is very similar to managing the NVDIMM
ranges. We have a (set of) physical address ranges which need 4k
ownership granularity to different domains.
I think integrating this into struct page_struct is the better way to go.
Will do. So I assume we will introduce a new MEMF_epc, and use the
existing alloc_domheap/xenheap_pages to allocate EPC? MEMF_epc can also
be used if we need to support ballooning in the future (using the
existing XENMEM_{decrease/increase}_reservation).
This part is how to populate EPC for guests. We have 3 choices:
- Static Partitioning
- Oversubscription
- Ballooning
Static Partitioning means all EPC pages will be allocated and mapped to
the guest when it is created, and there's no runtime change of page
table mappings for EPC pages. Oversubscription means the Xen hypervisor
supports EPC page swapping between domains, meaning Xen is able to evict
an EPC page from another domain and assign it to the domain that needs
the EPC. With oversubscription, EPC can be assigned to a domain on
demand, when an EPT violation happens. Ballooning is similar to memory
ballooning. It is basically "Static Partitioning" + "Balloon driver" in
the guest.
Static Partitioning is the easiest way in terms of implementation, and
there will be no hypervisor overhead (except EPT overhead of course),
because in "Static Partitioning" there is no EPT violation for EPC, and
Xen doesn't need to turn on ENCLS VMEXIT for the guest, as ENCLS runs
perfectly in non-root mode.
Ballooning is "Static Partitioning" + "Balloon driver" in the guest.
Like "Static Partitioning", ballooning doesn't need to turn on ENCLS
VMEXIT, and doesn't have EPT violations for EPC either. To support
ballooning, we need a ballooning driver in the guest to issue hypercalls
to give up or reclaim EPC pages. In terms of the hypercall, we have two
choices: 1) add a new hypercall for EPC ballooning; 2) use the existing
XENMEM_{increase/decrease}_reservation with a new memory flag, i.e.
XENMEMF_epc. I'll discuss adding a dedicated hypercall or not later.
Oversubscription looks nice but it requires a more complicated
implementation. Firstly, as explained in 1.3.3 EPC Eviction & Reload, we
need to follow specific steps to evict EPC pages, and in order to do
that, basically Xen needs to trap ENCLS from the guest and keep track of
EPC page status and enclave info from all guests. This is because:
- To evict a regular EPC page, Xen needs to know the SECS location.
- Xen needs to know the EPC page type: evicting a regular EPC page and
  evicting a SECS or VA page have different steps.
- Xen needs to know the EPC page status: whether the page is blocked
  or not.
That info can only be obtained by trapping ENCLS from the guest and
parsing its parameters (to identify the SECS page, etc). Parsing ENCLS
parameters means we need to know which ENCLS leaf is being trapped, and
we need to translate the guest's virtual address to get the physical
address in order to locate the EPC page. And once ENCLS is trapped, we
have to emulate ENCLS in Xen, which means we need to reconstruct the
ENCLS parameters by remapping all of the guest's virtual addresses to
Xen's virtual addresses (gva->gpa->pa->xen_va), as ENCLS always uses an
*effective address*, which must be translatable by the processor when
running ENCLS.
    --------------------------------------------------------------
    |                           ENCLS                            |
    --------------------------------------------------------------
                    |                       /|\
        ENCLS VMEXIT|                        | VMENTRY
                    |                        |
                   \|/                       |

        1) parse ENCLS parameters
        2) reconstruct(remap) guest's ENCLS parameters
        3) run ENCLS on behalf of guest (and skip ENCLS)
        4) on success, update EPC/enclave info, or inject error
And Xen needs to maintain each EPC page's status (type, blocked or not,
in enclave or not, etc). Xen also needs to maintain all enclaves' info
from all guests, in order to find the correct SECS for a regular EPC
page, and the enclave's linear address as well.
So in general, "Static Partitioning" has the simplest implementation,
but is obviously not the best way to use EPC efficiently; "Ballooning"
has all the pros of Static Partitioning but requires a guest balloon
driver; "Oversubscription" is best in terms of flexibility but requires
a complicated hypervisor implementation.
We have implemented "Static Partitioning" in the RFC patches, but need
your feedback on whether it is enough. If not, which one should we do at
the next stage: Ballooning or Oversubscription? IMO Ballooning may be
good enough, given the fact that currently memory is also "Static
Partitioning" + "Ballooning".
Comments?
Definitely go for static partitioning to begin with. This is far
simpler to implement.
I can't see a pressing usecase for oversubscription or ballooning. Any
datacenter work will be using exclusively static, and I expect static
will be fine for all (or at least, most) client usecases.
Thanks. So for the first stage I will focus on static partitioning.
2.2.3 Populate EPC for Guest
The toolstack notifies Xen about the domain's EPC base and size via
XEN_DOMCTL_set_cpuid, so currently Xen populates all EPC pages for the
guest in XEN_DOMCTL_set_cpuid, specifically when handling
XEN_DOMCTL_set_cpuid for CPUID.0x12.0x2. Once Xen has checked that the
values passed from the toolstack are valid, Xen will allocate all EPC
pages and set up EPT mappings for the guest.
2.2.4 New Dedicated Hypercall (?)
All this information should (eventually) be available via the
appropriate SYSCTL_get_{cpuid,msr}_policy hypercalls. I don't see any
need for dedicated hypercalls.
Yes I agree. Originally I had concern that without dedicated hypercall,
it is hard to implement 'xl sgxinfo' and 'xl sgxlist', but according to
your new CPUID enhancement plan, the two can be done via the new
hypercalls to query Xen's and domain's cpuid policy. See my reply above
regarding to "Notify Xen about guest's EPC info".
2.2.9 Guest Suspend & Resume
On hardware, EPC is destroyed when power goes to S3-S5. So Xen will
destroy the guest's EPC when the guest's power goes into S3-S5.
Currently Xen is notified by Qemu of S state changes via
HVM_PARAM_ACPI_S_STATE, where Xen will destroy EPC if the S state is
S3-S5.
Specifically, Xen will run EREMOVE for each of the guest's EPC pages, as
the guest may not handle EPC suspend & resume correctly, in which case
the guest's EPC pages may physically still be valid, so Xen needs to run
EREMOVE to make sure all EPC pages become invalid. Otherwise further
operations in the guest on EPC may fault, as the guest assumes all EPC
pages are invalid after it is resumed.
For a SECS page, EREMOVE may fail with SGX_CHILD_PRESENT, in which case
Xen will keep this SECS page in a list, and call EREMOVE on the listed
pages again after EREMOVE has been called on all EPC pages. This time
the EREMOVE on the SECS pages will succeed, as all children (regular EPC
pages) have already been removed.
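The two-pass EREMOVE described above can be sketched as follows. This
only models the control flow: 'eremove' is a hypothetical callable that
returns None on success or the string "SGX_CHILD_PRESENT", and the page
handles are opaque. It is not the code from the RFC patches.

```python
SGX_CHILD_PRESENT = "SGX_CHILD_PRESENT"


def destroy_epc(pages, eremove):
    """EREMOVE every page; retry SECS pages that still had children.

    Returns the list of pages that needed the second pass.
    """
    deferred = []
    # First pass: regular pages succeed, busy SECS pages are deferred.
    for page in pages:
        if eremove(page) == SGX_CHILD_PRESENT:
            deferred.append(page)
    # Second pass: all children are gone, so SECS pages should succeed.
    for page in deferred:
        if eremove(page) is not None:
            raise RuntimeError("EREMOVE still failing for page %r" % (page,))
    return deferred
```

The order of pages in the first pass does not matter, since any SECS
page hit too early simply lands in the deferred list.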
2.2.10 Destroying Domain
Normally Xen just frees all EPC pages for a domain when it is destroyed.
But Xen will also do EREMOVE on all of the guest's EPC pages (described
in above 2.2.7) before freeing them, as the guest may shut down
unexpectedly (e.g. the user kills the guest), and in this case the
guest's EPC may still be valid.
2.3 Additional Point: Live Migration, Snapshot Support (?)
How big is the EPC? If we are talking MB rather than GB, movement of
the EPC could be after the pause, which would add some latency to live
migration but should work. I expect that people would prefer to have
the flexibility of migration even at the cost of extra latency.
The EPC is typically ~100MB at maximum (as I observed). The EPC is
typically reserved by the BIOS, together with the EPCM (EPC map, which
is invisible to SW), as processor reserved memory (PRM). On real
machines, for both our internal development machines and some machines
from Dell, HP, Lenovo (that you can buy on the market now), the BIOS
always provides 3 choices in terms of PRM: 32M, 64M, and 128M. And with
128M PRM, EPC is slightly less than 100M.
The problem is that EPC cannot be moved. I think you were suggesting
moving EPC by evicting it at the last stage, copying the evicted content
to the remote, and then reloading. However I don't think this will work,
as EPC eviction itself needs to use a VA slot (which itself is EPC), so
you can imagine that the VA slots cannot be moved to the remote. Even if
they could, they cannot be used to reload EPC on the remote, as the info
in a VA slot is bound to the platform and cannot be used on the remote.
To support live migration, we can only choose to ignore EPC during live
migration and let the guest SGX driver/user SW stack handle restoring
enclaves (which is actually a lot simpler in the hypervisor/toolstack
implementation). The guest SGX driver needs to handle loss of EPC
anyway, as EPC is destroyed in S3-S5. The only difference is that to
support live migration, the guest SGX driver needs to support *sudden*
loss of EPC, which is not HW behavior. I was told that currently both
the Windows & Linux SGX drivers already support *sudden* loss of EPC,
which leaves us the question of whether we need to support SGX live
migration (and snapshot).
~Andrew
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel