Re: [Xen-devel] [RFC PATCH 00/15] RFC: SGX virtualization design and draft patches

Huang, Kai Fri, 21 Jul 2017 02:06:07 -0700


On 7/17/2017 6:08 PM, Huang, Kai wrote:

Hi Andrew,
Thank you very much for the comments. Sorry for the late reply; please see my replies below.
On 7/12/2017 2:13 AM, Andrew Cooper wrote:
On 09/07/17 09:03, Kai Huang wrote:
Hi all,
This series is the RFC Xen SGX virtualization support design and RFC draft patches.
Thank you very much for this design doc.
2. SGX Virtualization Design

2.1 High Level Toolstack Changes:

2.1.1 New 'epc' parameter
EPC is a limited resource. In order to use EPC efficiently among all domains, when creating a guest, the administrator should be able to specify the domain's virtual EPC size. The admin should also be able to get every domain's virtual EPC size.
For this purpose, a new 'epc = <size>' parameter is added to the XL configuration file. This parameter specifies the guest's virtual EPC size. The EPC base address will be calculated by the toolstack internally, according to the guest's memory size, MMIO size, etc. 'epc' is in units of MB, and any 1MB-aligned value will be accepted.
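
As a rough illustration of the unit handling described above (not taken from the RFC patches), the sketch below converts an 'epc' value in MB into a count of 4K EPC pages. The function name epc_mb_to_pages is hypothetical, and the validation rules are an assumption about what "1MB aligned" implies for a value already expressed in whole MB.

```python
PAGE_SIZE = 4096        # EPC is managed in 4K pages
MB = 1024 * 1024


def epc_mb_to_pages(epc_mb):
    """Convert an 'epc = <size>' value (in MB) to a number of 4K EPC pages.

    Hypothetical helper: any positive whole number of MB is accepted,
    which makes the resulting size 1MB aligned by construction.
    """
    if isinstance(epc_mb, bool) or not isinstance(epc_mb, int) or epc_mb <= 0:
        raise ValueError("epc must be a positive whole number of MB")
    return epc_mb * MB // PAGE_SIZE
```

For example, a guest configured with 'epc = 64' would need 64 * 256 = 16384 EPC pages.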
How will this interact with multi-package servers? Even though it's fine to implement single-package support first, the design should be extensible to the multi-package case.
First of all, what are the implications of multi-package SGX?
(Somewhere) you mention changes to scheduling. I presume this is because a guest with EPC mappings in EPT must be scheduled on the same package, or ENCLU[EENTER] will fail. I presume also that each package will have separate, unrelated private keys?
ENCLU[EENTER] will continue to work on multi-package servers. Actually I was told that all existing ISA behavior documented in the SDM won't change for servers, as otherwise this would be a bad design :)
Unfortunately I was told I cannot talk much about MP server SGX for now. Basically I can only talk about stuff already documented in the SDM (sorry :( ). But I guess the multiple EPC sections in CPUID are designed to cover MP servers, at least mainly (we can make a reasonable guess).
In terms of the design, I think we can follow the XL config file parameters for memory. The 'epc' parameter will always specify the total EPC size that the domain has. And we can use the existing NUMA-related parameters, such as setting cpus='...' to physically pin vcpus to specific pCPUs, so that EPC will mostly be allocated from the related node. If that node runs out of EPC, we can decide whether to allocate EPC from another node or fail to create the domain. I know Linux supports a NUMA policy which can specify whether to allow allocating memory from other nodes; does Xen have such a policy? Sorry, I haven't checked this. If Xen has such a policy, we need to choose whether to use the memory policy or introduce a new policy for EPC.
If we are going to support vNUMA EPC in the future, we can also use a similar way to configure vNUMA EPC in the XL config.
Sorry I mentioned scheduling. I should have said *potentially* :). My thinking was that, as SGX is per-thread, the SGX info reported by different CPU packages may differ (e.g., whether SGX2 is supported), so we may need the scheduler to be aware of SGX. But I think we don't have to consider this now.
What are your comments?
I presume there is no sensible way (even on native) for a single logical process to use multiple different enclaves? By extension, does it make sense to try and offer parts of multiple enclaves to a single VM?
The native machine allows running multiple enclaves, even ones signed by multiple authors. The only limit SGX imposes is that the Launch Enclave (LE) must be launched before any other enclave. The LE is the only enclave that doesn't require an EINITTOKEN in EINIT. For the LE, its signer (SHA256(sigstruct->modulus)) must be equal to the value in the IA32_SGXLEPUBKEYHASHn MSRs. The LE generates the EINITTOKEN for other enclaves (EINIT for other enclaves requires an EINITTOKEN). For other enclaves there is no requirement that the enclave's signer match IA32_SGXLEPUBKEYHASHn, so the signer can be anybody. But before running EINIT for another enclave, the LE's signer hash (which is equal to IA32_SGXLEPUBKEYHASHn, as explained above) needs to be written to IA32_SGXLEPUBKEYHASHn (the MSRs can change, for example, when there are multiple LEs running in the OS). This is because EINIT needs to perform an EINITTOKEN integrity check (the EINITTOKEN contains a MAC calculated by the LE, and EINIT needs the LE's IA32_SGXLEPUBKEYHASHn to derive the key to verify the MAC).
SGX in a VM doesn't change those behaviors, so in a VM the enclaves can also be signed by anyone, but Xen needs to emulate IA32_SGXLEPUBKEYHASHn so that when a VM is running, the correct IA32_SGXLEPUBKEYHASHn values are already in the physical MSRs.
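
To make the signer check concrete, here is a minimal sketch of how the LE signer hash, SHA256(sigstruct->modulus), could be laid out across the four IA32_SGXLEPUBKEYHASHn MSRs. The function name is hypothetical, and two details are assumptions rather than statements from this thread: that the four MSRs are consecutive starting at 8CH, and that the 32-byte digest is split into little-endian 64-bit words. Please check both against the SDM before relying on them.

```python
import hashlib
import struct

# Assumption: IA32_SGXLEPUBKEYHASH0..3 are four consecutive 64-bit MSRs
# starting at 8CH.
IA32_SGXLEPUBKEYHASH0 = 0x8C


def le_pubkey_hash_msrs(modulus: bytes) -> dict:
    """Map MSR index -> 64-bit value for the LE signer hash.

    'modulus' is the raw SIGSTRUCT modulus as stored in the structure.
    Assumption: the digest is split into four little-endian 64-bit words.
    """
    digest = hashlib.sha256(modulus).digest()   # 32 bytes
    words = struct.unpack("<4Q", digest)        # four 64-bit values
    return {IA32_SGXLEPUBKEYHASH0 + i: w for i, w in enumerate(words)}
```

With something like this, Xen's emulation would only need to compare (or load) four 64-bit values per domain when deciding whether the physical MSRs already hold the right hash.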
2.1.3 Notify domain's virtual EPC base and size to Xen
Xen needs to know the guest's EPC base and size in order to populate EPC pages for it. The toolstack notifies Xen of the EPC base and size via XEN_DOMCTL_set_cpuid.
I am currently in the process of reworking the Xen/Toolstack interface when it comes to CPUID handling. The latest design is available here: https://lists.xenproject.org/archives/html/xen-devel/2017-07/msg00378.html but the end result will be the toolstack expressing its CPUID policy in terms of the architectural layout.
Therefore, I would expect that, however the setting is represented in the configuration file, xl/libxl would configure it with the hypervisor by setting CPUID.0x12[2] with the appropriate base and size.
I agree. I saw you are planning to introduce new XEN_DOMCTL_get{set}_cpuid_policy, which will allow the toolstack to query/set the cpuid policy in a single hypercall (if I understand correctly), so I think we should definitely use the new hypercalls.
I also saw you are planning to introduce a new hypercall to query the raw/host/pv_max/hvm_max cpuid policy (not just the featureset), so I think 'xl sgxinfo' (or xl info -sgx) can certainly use that to get physical SGX info (EPC info). And 'xl sgxlist' (or xl list -sgx) can use XEN_DOMCTL_get{set}_cpuid_policy to display a domain's SGX info (EPC info).
Btw, do you think we need 'xl sgxinfo' and 'xl sgxlist'? If we do, which is better: new 'xl sgxinfo' and 'xl sgxlist' commands, or extending the existing 'xl info' and 'xl list' to support SGX, such as 'xl info -sgx' and 'xl list -sgx' above?
2.1.4 Launch Control Support (?)
Xen Launch Control support is about supporting running multiple domains, each running its own LE signed by a different owner (if HW allows, explained below). As explained in 1.4 SGX Launch Control, EINIT for the LE (Launch Enclave) only succeeds when SHA256(SIGSTRUCT.modulus) matches IA32_SGXLEPUBKEYHASHn, and EINIT for other enclaves will derive the EINITTOKEN key according to IA32_SGXLEPUBKEYHASHn. Therefore, to support this, the guest's virtual IA32_SGXLEPUBKEYHASHn must be written to the physical MSRs before EINIT (which also means the physical IA32_SGXLEPUBKEYHASHn need to be *unlocked* in BIOS before booting to the OS).
For a physical machine, it is the BIOS writer's decision whether the BIOS provides an interface for the user to specify a customized IA32_SGXLEPUBKEYHASHn (it defaults to the digest of Intel's signing key after reset). In reality, the OS's SGX driver may require the BIOS to leave the MSRs *unlocked* and actively write the hash value to the MSRs in order to run EINIT successfully, as in this case the driver will not depend on the BIOS's capability (whether it allows the user to customize the IA32_SGXLEPUBKEYHASHn value).
The question for Xen is: do we need a new parameter, such as 'lehash=<SHA256>', to specify the default value of the guest's virtual IA32_SGXLEPUBKEYHASHn? And do we need a new parameter, such as 'lewr', to specify whether the guest's virtual MSRs are locked or not before handing over to the guest's OS?
I tend not to introduce 'lehash', as it seems the SGX driver would actively update the MSRs, and a new parameter would require additional changes in upper-layer software (such as OpenStack). And 'lewr' is not needed either, as Xen can always *unlock* the MSRs for the guest.

Comments, please?

Currently, in my RFC patches, the above two parameters are not implemented. The Xen hypervisor will always *unlock* the MSRs. Whether there is a 'lehash' parameter or not doesn't affect the Xen hypervisor's emulation of IA32_SGXLEPUBKEYHASHn. See the Xen hypervisor changes below for details.
Reading around, am I correct with the following?
1) Some processors have no launch control. There is no restriction on which enclaves can boot.
Yes, some processors have no launch control. However, that doesn't mean there's no restriction on which enclaves can boot. On the contrary, on those machines only Intel's Launch Enclave (LE) can run, as on those machines IA32_SGXLEPUBKEYHASHn either doesn't exist or is equal to the digest of Intel's signing RSA pubkey. However, although only Intel's LE can be run, we can still run other enclaves from other signers. Please see my reply above.
2) Some Skylake client processors claim to have launch control, but the MSRs are unavailable (is this an erratum?). These are limited to booting enclaves matching the Intel public key.
Sorry, I don't know whether this is an erratum. I will get back to you after confirming internally.


Hi Andrew,

I raised this internally, and it turns out that in the latest SDM Intel has fixed the statement, so that the IA32_SGXLEPUBKEYHASHn MSRs are only available when both SGX and SGX_LC are present in CPUID. When I was writing the design and patches, I was referring to an old SDM, and the old one doesn't mention SGX_LC in CPUID as a condition. So it is my fault, and this statement has been fixed in the latest SDM (41.2.2 Intel SGX Launch Control Configuration):


https://software.intel.com/sites/default/files/managed/7c/f1/332831-sdm-vol-3d.pdf

However in latest SDM volume 4: Model-Specific Registers:

https://software.intel.com/sites/default/files/managed/22/0d/335592-sdm-vol-4.pdf

You can still see that for IA32_SGXLEPUBKEYHASHn (table 2-2, register address 8CH): "Read permitted If CPUID.(EAX=12H,ECX=0H):EAX[0]=1". So there's still an error in the SDM.

I don't think this will be an erratum. Intel will fix the error in vol 4 in the next version of the SDM. We should refer to 41.2.2, as it has the accurate description.
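
The corrected condition (both SGX and SGX_LC present in CPUID) can be sketched as a simple predicate. This is only an illustration: the function name is hypothetical, and the bit positions used (SGX in CPUID.(EAX=07H,ECX=0):EBX[2], SGX_LC in CPUID.(EAX=07H,ECX=0):ECX[30]) are taken from the SDM's feature enumeration rather than from this thread.

```python
SGX_BIT = 1 << 2       # CPUID.(EAX=07H,ECX=0):EBX[2]
SGX_LC_BIT = 1 << 30   # CPUID.(EAX=07H,ECX=0):ECX[30]


def lepubkeyhash_msrs_available(leaf7_ebx: int, leaf7_ecx: int) -> bool:
    """True only when both SGX and SGX_LC are enumerated in CPUID leaf 7.

    The caller supplies the raw EBX/ECX register values of
    CPUID.(EAX=07H,ECX=0); how they are obtained is out of scope here.
    """
    return bool(leaf7_ebx & SGX_BIT) and bool(leaf7_ecx & SGX_LC_BIT)
```

A part that reports SGX without SGX_LC would therefore be treated as having no IA32_SGXLEPUBKEYHASHn MSRs at all.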

3) Launch control may be locked by the BIOS. There may be a custom hash, or it might be the Intel default. Xen can't adjust it at all, but can support running any number of VMs with matching enclaves.
Yes, launch control may be locked by the BIOS, although this depends on whether the BIOS provides an interface for the user to configure it. I was told that typically the BIOS will unlock launch control, as the SGX driver expects such behavior. But I am not sure we can always assume this.
Whether there will be a custom hash also depends on the BIOS. The BIOS may or may not provide an interface for the user to configure a custom hash. So on a physical machine, I think we need to consider all the cases. On a machine with launch control *unlocked*, Xen is able to dynamically change IA32_SGXLEPUBKEYHASHn, so Xen is able to run multiple VMs, each running an LE from a different signer. However, if launch control is *locked* in the BIOS, then Xen is still able to run multiple VMs, but all VMs can only run an LE from the signer that matches IA32_SGXLEPUBKEYHASHn (which in most cases should be the Intel default, but can be a custom hash if the BIOS allows the user to configure it).
Sorry, I am not quite sure about the typical BIOS implementation. I think I can reach out internally and get back to you if I have something.

I also reached out internally to find out the typical BIOS implementation in terms of SGX LC. Typically the BIOS will neither provide configuration options for the user to set a custom hash, nor let the user select whether the MSRs are locked or not. Typically, for client machines the MSRs are locked with the Intel default, and for server machines the MSRs are unlocked. But we cannot rule out a 3rd party providing a different BIOS that offers options for the user to choose locked/unlocked mode, and/or to specify a custom hash. Custom hash + locked mode may be useful for some special purposes (e.g., IT management), as it provides the most secure option -- even the kernel/VMM can only launch an LE signed by a particular signer. In the case of a VM, custom hash + locked mode may be even more useful than on bare metal, as a VM is usually supposed to run some particular-purpose appliance.

So I think it is better to keep the 'lehash' and 'lewr' XL parameters. Both are optional: the former provides a custom hash, and the latter sets the VM to be in unlocked mode. If neither is specified, the VM will be in locked mode, and the VM's virtual IA32_SGXLEPUBKEYHASHn will either have Intel's default value (when the physical machine is unlocked) or have the machine's MSR values (when the machine is in locked mode). And when the physical machine is in locked mode, specifying either 'lehash' or 'lewr' will cause VM creation to fail.
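
The rules in the paragraph above can be written down as a small decision function. This is a sketch under stated assumptions, not the behaviour of the RFC patches: the function name and the INTEL_DEFAULT placeholder are hypothetical, and the case of 'lewr' without 'lehash' on an unlocked host (guest starts unlocked with Intel's default hash) is a guess, since the text above does not spell it out.

```python
INTEL_DEFAULT = "intel-default"   # placeholder, not the real digest


def resolve_guest_lc(host_locked, host_hash, lehash=None, lewr=False):
    """Return (guest_virtual_hash, guest_locked) for a new domain.

    host_locked : whether the physical IA32_SGXLEPUBKEYHASHn are locked
    host_hash   : the value currently in the physical MSRs
    lehash      : optional custom hash from the XL config
    lewr        : True if the guest's virtual MSRs should be unlocked
    """
    if host_locked:
        # Locked host: Xen cannot change the MSRs, so neither option
        # can be honoured and domain creation must fail.
        if lehash is not None or lewr:
            raise ValueError("'lehash'/'lewr' need an unlocked host")
        return host_hash, True
    # Unlocked host: custom hash if given, else Intel's default
    # (assumed to apply to the 'lewr'-only case as well).
    guest_hash = lehash if lehash is not None else INTEL_DEFAULT
    return guest_hash, not lewr
```

Keeping the policy in one place like this would make it easy for the toolstack to reject an impossible configuration before any hypercall is issued.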

So we have 3 XL parameters for SGX: 'epc', 'lehash' and 'lewr'. Probably we should consolidate them into one XL parameter, such as sgx=['epc=<size>', 'lehash=<sha256>', 'lewr=[on|off]']?
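
For discussion, a consolidated option of that shape could be parsed roughly as follows. This is a hypothetical sketch, not xl/libxl code: the function name, the defaults, and the choice to require 'lehash' as 64 hex digits are all assumptions.

```python
def parse_sgx_option(items):
    """Parse a list such as ['epc=64', 'lehash=<sha256>', 'lewr=on'].

    Returns a dict with keys 'epc' (MB, or None), 'lehash' (lowercase
    hex string, or None) and 'lewr' (bool). Unknown keys are rejected.
    """
    cfg = {"epc": None, "lehash": None, "lewr": False}
    for item in items:
        key, sep, value = item.partition("=")
        if not sep or key not in cfg:
            raise ValueError("bad sgx sub-option: %r" % item)
        if key == "epc":
            cfg["epc"] = int(value)             # size in MB
        elif key == "lehash":
            # Assumption: a SHA-256 digest given as 64 hex digits.
            if len(value) != 64 or len(bytes.fromhex(value)) != 32:
                raise ValueError("lehash must be 64 hex digits")
            cfg["lehash"] = value.lower()
        else:  # lewr
            if value not in ("on", "off"):
                raise ValueError("lewr must be 'on' or 'off'")
            cfg["lewr"] = value == "on"
    return cfg
```

One attraction of a single 'sgx' list is that all three settings can be validated together before the domain is built.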


Thanks,
-Kai

4) Launch control may be unlocked by the BIOS. In this case, Xen can context switch a hash per domain, and run all enclaves.
Yes. With enclave == LE, I think you meant.
The eventual plans for CPUID and MSR levelling should allow all of these to be expressed in sensible ways, and I don't foresee any issues with supporting all of these scenarios.
So do you think we should have 'lehash' and 'lewr' parameters in the XL config file? The former provides a custom hash, and the latter controls whether to unlock the guest's launch control.
My thinking is that the SGX driver needs to *actively* write the LE's pubkey hash to IA32_SGXLEPUBKEYHASHn in *unlocked* mode, so 'lehash' alone is not needed. 'lehash' only has meaning alongside 'lewr', to provide a default hash value in locked mode; if we always use *unlocked* mode for the guest, 'lehash' is not necessary.
2.2 High Level Xen Hypervisor Changes:

2.2.1 EPC Management (?)

Xen hypervisor needs to detect SGX, discover EPC, and manage EPC before exposing SGX to guests. EPC is detected via SGX CPUID 0x12.0x2. It's possible that there are multiple EPC sections (enumerated via sub-leaves 0x3 and so on, until an invalid EPC section is reported), but this is only true on multiple-socket server machines. For server machines there are additional things that also need to be done, such as NUMA EPC, scheduling, etc. We will support server machines in the future, but currently we only support one EPC section.
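
The enumeration described above can be sketched as a loop over CPUID leaf 0x12 sub-leaves starting at 2. The code is illustrative only: 'cpuid' is a caller-supplied callable (hypothetical) returning (eax, ebx, ecx, edx), and the field layout used (EAX[3:0] = section type with 1 meaning a valid EPC section, base in EAX[31:12] plus EBX[19:0], size in ECX[31:12] plus EDX[19:0]) follows the SDM's description of this leaf as I understand it.

```python
def decode_epc_subleaf(eax, ebx, ecx, edx):
    """Decode one CPUID.0x12 sub-leaf (>= 2) into (base, size) in bytes.

    Returns None when the sub-leaf does not describe a valid EPC section.
    """
    if (eax & 0xF) != 1:
        return None
    base = (eax & 0xFFFFF000) | ((ebx & 0xFFFFF) << 32)
    size = (ecx & 0xFFFFF000) | ((edx & 0xFFFFF) << 32)
    return base, size


def enumerate_epc(cpuid, first_subleaf=2):
    """Collect EPC sections until an invalid sub-leaf is reported.

    'cpuid' is a hypothetical callable: cpuid(leaf, subleaf) ->
    (eax, ebx, ecx, edx).
    """
    sections = []
    subleaf = first_subleaf
    while True:
        section = decode_epc_subleaf(*cpuid(0x12, subleaf))
        if section is None:
            break
        sections.append(section)
        subleaf += 1
    return sections
```

With only one EPC section supported for now, the hypervisor would simply use the first entry of the returned list.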
EPC is reported as reserved memory (so it is not reported as normal memory). EPC must be managed in 4K pages. CPU hardware uses the EPCM to track the status of each EPC page. Xen needs to manage EPC and provide functions to, e.g., alloc and free EPC pages for guests.
There are two ways to manage EPC: manage EPC separately, or integrate it into the existing memory management framework.
It is easy to manage EPC separately, as currently EPC is pretty small (~100MB), and we can even put the pages in a single list. However, it is not flexible; for example, you will have to write new algorithms when EPC becomes larger, e.g., GBs. And you have to write new code to support NUMA EPC (although this will not come any time soon).
Integrating EPC into the existing memory management framework seems more reasonable, as in this way we can reuse memory management data structures/algorithms, and it will be more flexible to support larger EPC and potentially NUMA EPC. But modifying the MM framework has a higher risk of breaking existing memory management code (potentially more bugs).

In my RFC patches we currently choose to manage EPC separately. A new structure epc_page is added to represent a single 4K EPC page. A whole array of struct epc_page will be allocated during EPC initialization, so that, given either the PFN of an EPC page or its 'struct epc_page', the other can be obtained by adding an offset.
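
The array-plus-offset idea can be modelled in a few lines. This is a toy model in Python rather than the C used in the patches; the class and method names are hypothetical, and only the index arithmetic is meant to mirror the description above.

```python
class EpcPage:
    """Stand-in for 'struct epc_page': one entry per 4K EPC page."""

    def __init__(self, index):
        self.index = index    # position in the array
        self.owner = None     # domain using the page, if any


class EpcPool:
    """One contiguous EPC section tracked by a flat array of EpcPage."""

    def __init__(self, base_pfn, nr_pages):
        self.base_pfn = base_pfn
        self.pages = [EpcPage(i) for i in range(nr_pages)]

    def pfn_to_page(self, pfn):
        # PFN -> array entry: subtract the section's first PFN.
        idx = pfn - self.base_pfn
        if not 0 <= idx < len(self.pages):
            raise ValueError("PFN is outside the EPC section")
        return self.pages[idx]

    def page_to_pfn(self, page):
        # Array entry -> PFN: add the section's first PFN.
        return self.base_pfn + page.index
```

Both lookups are O(1), which is what makes the separate-management approach cheap while EPC remains a single small section.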

But maybe integrating EPC to MM framework is more reasonable. Comments?

2.2.2 EPC Virtualization (?)
It looks like managing the EPC is very similar to managing the NVDIMM ranges. We have a (set of) physical address ranges which need 4k ownership granularity to different domains.
I think integrating this into struct page_struct is the better way to go.
Will do. So I assume we will introduce a new MEMF_epc, and use the existing alloc_domheap/xenheap_pages to allocate EPC? MEMF_epc can also be used if we need to support ballooning in the future (using the existing XENMEM_{decrease/increase}_reservation).
This part is how to populate EPC for guests. We have 3 choices:
     - Static Partitioning
     - Oversubscription
     - Ballooning
Static Partitioning means all EPC pages will be allocated and mapped to the guest when it is created, and there's no runtime change of page table mappings for EPC pages. Oversubscription means the Xen hypervisor supports EPC page swapping between domains, meaning Xen is able to evict an EPC page from another domain and assign it to the domain that needs the EPC. With oversubscription, EPC can be assigned to a domain on demand, when an EPT violation happens. Ballooning is similar to memory ballooning. It is basically "Static Partitioning" + "Balloon driver" in the guest.
Static Partitioning is the easiest way in terms of implementation, and there will be no hypervisor overhead (except EPT overhead of course), because in "Static Partitioning" there is no EPT violation for EPC, and Xen doesn't need to turn on ENCLS VMEXIT for the guest, as ENCLS runs perfectly in non-root mode.
Ballooning is "Static Partitioning" + "Balloon driver" in the guest. Like "Static Partitioning", ballooning doesn't need to turn on ENCLS VMEXIT, and doesn't have EPT violations for EPC either. To support ballooning, we need a ballooning driver in the guest to issue hypercalls to give up or reclaim EPC pages. In terms of the hypercall, we have two choices: 1) add a new hypercall for EPC ballooning; 2) use the existing XENMEM_{increase/decrease}_reservation with a new memory flag, i.e., XENMEMF_epc. I'll discuss adding a dedicated hypercall or not in more detail later.
Oversubscription looks nice but it requires a more complicated implementation. Firstly, as explained in 1.3.3 EPC Eviction & Reload, we need to follow specific steps to evict EPC pages, and in order to do that, Xen basically needs to trap ENCLS from the guest and keep track of EPC page status and enclave info from all guests. This is because:
     - To evict a regular EPC page, Xen needs to know the SECS location.
     - Xen needs to know the EPC page type: evicting a regular EPC page and
       evicting a SECS or VA page take different steps.
     - Xen needs to know the EPC page status: whether the page is blocked or not.
That info can only be obtained by trapping ENCLS from the guest and parsing its parameters (to identify the SECS page, etc). Parsing ENCLS parameters means we need to know which ENCLS leaf is being trapped, and we need to translate the guest's virtual address to get the physical address in order to locate the EPC page. And once ENCLS is trapped, we have to emulate ENCLS in Xen, which means we need to reconstruct the ENCLS parameters by remapping all of the guest's virtual addresses to Xen's virtual addresses (gva->gpa->pa->xen_va), as ENCLS always uses an *effective address* which must be translatable by the processor when running ENCLS.

     --------------------------------------------------------------
                 |   ENCLS   |
     --------------------------------------------------------------
                 |          /|\
     ENCLS VMEXIT|           | VMENTRY
                 |           |
                \|/          |

        1) parse ENCLS parameters
        2) reconstruct(remap) guest's ENCLS parameters
        3) run ENCLS on behalf of guest (and skip ENCLS)
        4) on success, update EPC/enclave info, or inject error
And Xen needs to maintain each EPC page's status (type, blocked or not, in an enclave or not, etc). Xen also needs to maintain info on all enclaves from all guests, in order to find the correct SECS for a regular EPC page, and the enclave's linear address as well.
So in general, "Static Partitioning" has the simplest implementation, but is obviously not the best way to use EPC efficiently; "Ballooning" has all the pros of Static Partitioning but requires a guest balloon driver; "Oversubscription" is best in terms of flexibility but requires a complicated hypervisor implementation.

We have implemented "Static Partitioning" in the RFC patches, but need your feedback on whether it is enough. If not, which one should we do at the next stage: Ballooning or Oversubscription? IMO Ballooning may be good enough, given the fact that currently memory is also "Static Partitioning" + "Ballooning".

Comments?
Definitely go for static partitioning to begin with. This is far simpler to implement.
I can't see a pressing usecase for oversubscription or ballooning. Any datacenter work will be using exclusively static, and I expect static will be fine for all (or at least, most) client usecases.
Thanks. So for the first stage I will focus on static partitioning.
2.2.3 Populate EPC for Guest
The toolstack notifies Xen about the domain's EPC base and size via XEN_DOMCTL_set_cpuid, so currently Xen populates all EPC pages for the guest in XEN_DOMCTL_set_cpuid, specifically when handling XEN_DOMCTL_set_cpuid for CPUID.0x12.0x2. Once Xen has checked that the values passed from the toolstack are valid, Xen will allocate all EPC pages and set up EPT mappings for the guest.

2.2.4 New Dedicated Hypercall (?)
All this information should (eventually) be available via the appropriate SYSCTL_get_{cpuid,msr}_policy hypercalls. I don't see any need for dedicated hypercalls.
Yes, I agree. Originally I was concerned that without a dedicated hypercall it would be hard to implement 'xl sgxinfo' and 'xl sgxlist', but according to your new CPUID enhancement plan, the two can be done via the new hypercalls to query Xen's and the domain's cpuid policy. See my reply above regarding "Notify Xen about guest's EPC info".
2.2.9 Guest Suspend & Resume
On hardware, EPC is destroyed when power goes to S3-S5. So Xen will destroy the guest's EPC when the guest's power goes into S3-S5. Currently Xen is notified by Qemu of S-state changes via HVM_PARAM_ACPI_S_STATE, where Xen will destroy EPC if the S state is S3-S5.
Specifically, Xen will run EREMOVE on each of the guest's EPC pages, as the guest may not handle EPC suspend & resume correctly, in which case the guest's EPC pages may physically still be valid, so Xen needs to run EREMOVE to make sure all EPC pages become invalid. Otherwise further operations on EPC in the guest may fault, as the guest assumes all EPC pages are invalid after it is resumed.
For a SECS page, EREMOVE may fail with SGX_CHILD_PRESENT, in which case Xen will keep the SECS page on a list and call EREMOVE on it again after EREMOVE has been called on all EPC pages. This time the EREMOVE on the SECS will succeed, as all children (regular EPC pages) have already been removed.
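
The two-pass EREMOVE scheme can be sketched as follows. This is a simulation of the control flow only, under stated assumptions: 'eremove' is a caller-supplied stand-in for the real instruction wrapper, and the status values are symbolic strings rather than the architectural error codes.

```python
SGX_SUCCESS = "SGX_SUCCESS"               # symbolic, not the real code
SGX_CHILD_PRESENT = "SGX_CHILD_PRESENT"   # symbolic, not the real code


def remove_all_epc(pages, eremove):
    """Run EREMOVE on every page, retrying SECS pages that had children.

    pages   : iterable of opaque page handles, in any order
    eremove : hypothetical callable(page) -> status
    Returns the list of pages that needed a second pass.
    """
    deferred = []
    # Pass 1: remove everything; a SECS with live children is deferred.
    for page in pages:
        if eremove(page) == SGX_CHILD_PRESENT:
            deferred.append(page)
    # Pass 2: every child is gone now, so the SECS pages can be removed.
    for page in deferred:
        if eremove(page) != SGX_SUCCESS:
            raise RuntimeError("EREMOVE still failing on deferred page")
    return deferred
```

The benefit of deferring is that Xen never needs to know in advance which pages are SECS pages; the hardware's own error code identifies them.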

2.2.10 Destroying Domain
Normally Xen just frees all EPC pages for a domain when it is destroyed. But Xen will also do EREMOVE on all of the guest's EPC pages (described in 2.2.7 above) before freeing them, as the guest may shut down unexpectedly (e.g., the user kills the guest), and in this case the guest's EPC may still be valid.

2.3 Additional Point: Live Migration, Snapshot Support (?)
How big is the EPC? If we are talking MB rather than GB, movement of the EPC could be after the pause, which would add some latency to live migration but should work. I expect that people would prefer to have the flexibility of migration even at the cost of extra latency.
The EPC is typically ~100MB at maximum (as I observed). The EPC is typically reserved by the BIOS, together with the EPCM (EPC map, which is invisible to SW), as Processor Reserved Memory (PRM). On real machines, both our internal development machines and some machines from Dell, HP and Lenovo (that you can buy on the market now), the BIOS always provides 3 choices in terms of PRM: 32M, 64M, and 128M. And with 128M PRM, EPC is slightly less than 100M.
The problem is that EPC cannot be moved. I think you were suggesting moving EPC by evicting it at the last stage, copying the evicted content to the remote, and then reloading it. However, I don't think this will work, as EPC eviction itself needs to use a VA slot (which itself is EPC), so you can imagine that the VA slots cannot be moved to the remote. Even if they could, they cannot be used to reload EPC on the remote, as the info in a VA slot is bound to the platform and cannot be used on the remote.
To support live migration, we can only choose to ignore EPC during live migration and let the guest SGX driver/user SW stack handle restoring the enclave (which is actually a lot simpler in the hypervisor/toolstack implementation). The guest SGX driver needs to handle losing EPC anyway, as EPC is destroyed in S3-S5. The only difference is that, to support live migration, the guest SGX driver needs to support *sudden* loss of EPC, which is not HW behavior. I was told that currently both the Windows and Linux SGX drivers already support *sudden* loss of EPC, which leaves us with the question of whether we need to support SGX live migration (and snapshot).
~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel