[Xen-devel] RFC on deprivileged x86 hypervisor device models

Ben Catterall Fri, 17 Jul 2015 03:12:40 -0700

Hi all,

I'm working on an x86 proof-of-concept series to evaluate if it is feasible to move device models currently running in the hypervisor and x86 emulation code for HVM guests into a deprivileged context.

I've put together the following document as I have been considering several different ways this could be achieved and was hoping to get feedback from maintainers before I go ahead.


Many thanks in advance,
Ben

Context
-------

The aim is to run device models, which are already running inside the hypervisor (e.g. x86 emulate), in deprivileged user mode for HVM guests, using suitably mapped page tables. A simple hypercall convention is needed to pass data between these two modes of operation, along with a mechanism to move between them.

This is intended as a proof-of-concept, with the aim of determining if this idea is feasible within performance constraints.


Motivation
----------

The motivation for moving the device models and x86 emulation code into ring 3 is to mitigate a system compromise due to a bug in any of these systems. These systems are currently part of the hypervisor and, consequently, a bug in any of these could allow an attacker to gain control of (or perform a DoS on) Xen and/or guests.



Moving between privilege levels
--------------------------------

The general process is to determine if we need to run a device model (or similar) and then, if so, switch into deprivileged mode. The operation is performed by deprivileged code which calls into the hypervisor as and when needed. After the operation completes, we return to the hypervisor.

If deprivileged mode needs to make any hypervisor requests, it can do these using a syscall interface, possibly placing an operation code into a register to indicate the operation. This would allow it to get data to/from the hypervisor.
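
As a rough illustration of such a convention, the hypervisor side could dispatch on the operation code found in a register at syscall entry. This is only a sketch: the operation names and numbers, the `depriv_regs` layout, the error values and the `hv_state` structure are all invented for this example and are not existing Xen interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical operation codes, invented for this sketch. */
enum depriv_op {
    DEPRIV_OP_READ  = 0,   /* copy hypervisor data out to deprivileged code */
    DEPRIV_OP_WRITE = 1,   /* copy data from deprivileged code into the hypervisor */
    DEPRIV_OP_DONE  = 2    /* operation finished, return to the hypervisor */
};

#define DEPRIV_ENOSYS (-1L)  /* unknown operation code */
#define DEPRIV_EINVAL (-2L)  /* arguments out of range */

#define HV_DATA_SIZE 64

/* Registers as seen at syscall entry (the layout is an assumption). */
struct depriv_regs {
    uint64_t op;    /* operation code */
    uint64_t arg0;  /* offset into hypervisor-owned data */
    uint64_t arg1;  /* length in bytes */
};

/* Stand-in for state that only the hypervisor may touch directly. */
struct hv_state {
    uint8_t data[HV_DATA_SIZE];
    int done;
};

/* Returns the number of bytes copied, 0 for DONE, or a negative error. */
long depriv_dispatch(struct hv_state *hv, const struct depriv_regs *regs,
                     uint8_t *buf, size_t buf_len)
{
    uint64_t off = regs->arg0, len = regs->arg1;

    assert(hv != NULL && buf != NULL);

    switch (regs->op) {
    case DEPRIV_OP_READ:
    case DEPRIV_OP_WRITE:
        /* The caller is untrusted, so check every bound before copying. */
        if (len > buf_len || off > HV_DATA_SIZE || len > HV_DATA_SIZE - off)
            return DEPRIV_EINVAL;
        if (regs->op == DEPRIV_OP_READ)
            memcpy(buf, hv->data + off, (size_t)len);
        else
            memcpy(hv->data + off, buf, (size_t)len);
        return (long)len;
    case DEPRIV_OP_DONE:
        hv->done = 1;
        return 0;
    default:
        return DEPRIV_ENOSYS;
    }
}
```

The point of the sketch is that every argument coming from deprivileged mode is validated before the hypervisor acts on it; an unknown operation code or an out-of-range offset is rejected rather than trusted.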

I am currently considering three different methods regarding the context switch and would be grateful for any feedback.


Method One
----------

This method works by building on top of the QEMU emulation path code. This currently operates as a state machine, using flags to determine the current emulation state. These states are used to determine which code paths to take when calling out of and into the hypervisor before and after emulation.

The intention would be to add new states and then follow the same process as existing code does except, rather than blocking the vcpu, we switch into deprivileged mode and process the request on the current vcpu. This is different to the QEMU path, which blocks the current vcpu, so additional code is needed to support this context switch. There may be other code paths which have not been written in this way which would require rewriting.

When moving into deprivileged mode, we need to be careful to ensure that when we leave, we can redo the call into the hypervisor after the device model completes without causing problems. Thus, we need to be _certain_ that the same call path is followed on the re-entry and that the system's state can handle this. This may mean undoing operations such as memory allocations.
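
The re-entry pattern described above can be sketched as a small state machine. This is a toy model under stated assumptions, not Xen code: the state names, the `vcpu_emul` structure, `handle_io()` and `run_deprivileged()` are all hypothetical, and the "allocation" is just a counter standing in for work that has to be undone before unwinding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical emulation states added for the deprivileged path. */
enum emul_state {
    EMUL_STATE_NONE,            /* no request in flight */
    EMUL_STATE_DEPRIV_PENDING,  /* waiting for deprivileged code to run */
    EMUL_STATE_DEPRIV_DONE      /* result available, call must be redone */
};

enum emul_rc { EMUL_RC_OKAY, EMUL_RC_RETRY };

struct vcpu_emul {
    enum emul_state state;
    uint32_t request;     /* what the deprivileged code was asked to do */
    uint32_t result;      /* filled in by the deprivileged code */
    int live_allocs;      /* stand-in for resources held on the call path */
};

/* Stand-in for the call path from guest exit down to the device model. */
enum emul_rc handle_io(struct vcpu_emul *v, uint32_t request, uint32_t *out)
{
    enum emul_rc rc = EMUL_RC_RETRY;

    v->live_allocs++;  /* work done on the way down the call path */

    switch (v->state) {
    case EMUL_STATE_NONE:
        /* First entry: record the request and ask for a retry. */
        v->request = request;
        v->state = EMUL_STATE_DEPRIV_PENDING;
        break;
    case EMUL_STATE_DEPRIV_DONE:
        /* Re-entry: must be the same request on the same call path. */
        assert(v->request == request);
        *out = v->result;
        v->state = EMUL_STATE_NONE;
        rc = EMUL_RC_OKAY;
        break;
    case EMUL_STATE_DEPRIV_PENDING:
        /* Re-entered too early: nothing to hand back yet. */
        break;
    }

    v->live_allocs--;  /* undo the work before unwinding */
    return rc;
}

/* Placeholder for the work that would run in deprivileged mode. */
void run_deprivileged(struct vcpu_emul *v)
{
    assert(v->state == EMUL_STATE_DEPRIV_PENDING);
    v->result = v->request + 1;
    v->state = EMUL_STATE_DEPRIV_DONE;
}
```

In this model the caller invokes `handle_io()`, sees `EMUL_RC_RETRY`, runs `run_deprivileged()` and then calls `handle_io()` again with the same arguments. Both passes take and release the same resources, which is the cost paid twice in this method.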



Method Two
----------

At the point of detecting the need to perform a deprivileged operation, we take a copy of the current stack from the current stack position up to the point where the guest entered Xen and save it. Subsequently, we move the stack pointer back.

This effectively gives us a clean stack as though we had just entered Xen. We then put the deprivileged context onto this new stack and enter deprivileged mode.

Upon returning, we restore the previous stack with the guest's and Xen's context then jump to the saved rip and continue execution. Xen will then perform the necessary processing, determining if the operation was successful or not.

We are effectively context switching out Xen for deprivileged code and then bringing Xen back in once we're done.
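
The save/reset/restore sequence can be modelled on a simulated stack. This is a sketch only: a real implementation would operate on the actual per-pcpu stack and would also need to save rip and the other registers, which is not shown here. The `sim_stack` and `saved_stack` types and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_WORDS 32

/* Simulated stack that grows downwards, like the x86 stack. */
struct sim_stack {
    uint64_t words[STACK_WORDS];
    size_t sp;     /* index of the lowest in-use word */
    size_t entry;  /* sp value at the point the guest entered Xen */
};

/* Per-vcpu storage for the copied frames. */
struct saved_stack {
    uint64_t words[STACK_WORDS];
    size_t sp;     /* stack position at the time of the copy */
    size_t count;  /* number of words copied */
};

void stack_push(struct sim_stack *s, uint64_t value)
{
    assert(s->sp > 0);
    s->words[--s->sp] = value;
}

/* Copy everything between sp and the entry point, then rewind sp. */
void stack_save_and_reset(struct sim_stack *s, struct saved_stack *save)
{
    assert(s->sp <= s->entry && s->entry <= STACK_WORDS);
    save->sp = s->sp;
    save->count = s->entry - s->sp;
    memcpy(save->words, &s->words[s->sp], save->count * sizeof(uint64_t));
    s->sp = s->entry;  /* stack now looks as if Xen was just entered */
}

/* Put the saved frames back at their original addresses. */
void stack_restore(struct sim_stack *s, const struct saved_stack *save)
{
    assert(save->sp + save->count == s->entry);
    memcpy(&s->words[save->sp], save->words, save->count * sizeof(uint64_t));
    s->sp = save->sp;
}
```

Because the frames are restored at the same positions they were copied from, any pointers into the stack held by those frames remain valid in this model. The storage cost is one `saved_stack` per vcpu.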

As Xen is non-preemptive, the Xen stack won't be updated whilst we're in deprivileged mode. If it may be updated (I'm speculating here), e.g. by an interrupt, then we can pause deprivileged mode by hooking the interrupt and restoring the Xen stack, then handle the interrupt and finally go back to deprivileged mode.

Problem: If the device model or emulator edits the saved guest registers and these are touched by Xen on the return path after it finishes servicing the deprivileged operation, then the guest will use these values, not those the deprivileged mode provided.

This is not a problem if the code doesn't do this. If it does, we could give higher precedence to deprivileged changes. So, deprivileged mode pushes the changes into the hypervisor which caches them and then, just before guest context is restored, makes those changes, thus discarding any changes Xen made.
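
One way to picture the caching idea is a small write-back cache with a dirty mask, flushed just before guest context is restored. This is a sketch with a made-up four-register file; the structures and function names are hypothetical and do not correspond to Xen's real register handling.

```c
#include <assert.h>
#include <stdint.h>

/* Tiny illustrative register file (not the real x86 layout). */
enum { REG_RAX, REG_RBX, REG_RCX, REG_RDX, NR_REGS };

struct guest_regs {
    uint64_t r[NR_REGS];
};

/* Register updates requested by deprivileged code, held by the hypervisor. */
struct depriv_reg_cache {
    uint64_t value[NR_REGS];
    uint32_t dirty;  /* bit i set: value[i] must override guest register i */
};

/* Called when deprivileged mode pushes a register change. */
void depriv_cache_write(struct depriv_reg_cache *c, unsigned int reg,
                        uint64_t value)
{
    assert(reg < NR_REGS);
    c->value[reg] = value;
    c->dirty |= 1u << reg;
}

/* Called just before guest context is restored. */
void depriv_cache_flush(struct depriv_reg_cache *c, struct guest_regs *g)
{
    unsigned int i;

    for (i = 0; i < NR_REGS; i++)
        if (c->dirty & (1u << i))
            g->r[i] = c->value[i];  /* deprivileged value wins */
    c->dirty = 0;
}
```

Only registers that deprivileged code actually wrote are overridden, so changes Xen makes to other registers on the return path survive the flush.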




Method Three
------------

A per vcpu stack is maintained for user mode and supervisor mode. We then don't need to do any copying, just switch to user mode at the point when deprivileged code needs to run.

When deprivileged mode is done, we move back to supervisor mode, restore the previous context and continue execution of the code path that followed the call to move into deprivileged mode.
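
For comparison with the copying approach, a toy model of two stacks per vcpu might look like the following. It is a sketch only: the types and function names are hypothetical, and a real switch would change the hardware stack pointer and privilege level rather than an index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 32

enum cpu_mode { MODE_SUPERVISOR = 0, MODE_USER = 1, NR_MODES };

struct mode_stack {
    uint64_t words[STACK_WORDS];
    size_t sp;  /* index of the lowest in-use word; grows downwards */
};

/* One stack per mode, per vcpu. */
struct vcpu_stacks {
    struct mode_stack stack[NR_MODES];
    enum cpu_mode active;
};

void vcpu_stacks_init(struct vcpu_stacks *v)
{
    int m;

    for (m = 0; m < NR_MODES; m++)
        v->stack[m].sp = STACK_WORDS;  /* both stacks start empty */
    v->active = MODE_SUPERVISOR;
}

/* Push onto whichever stack belongs to the current mode. */
void vcpu_push(struct vcpu_stacks *v, uint64_t value)
{
    struct mode_stack *s = &v->stack[v->active];

    assert(s->sp > 0);
    s->words[--s->sp] = value;
}

/* Switching mode selects the other stack; nothing is copied. */
void vcpu_switch_mode(struct vcpu_stacks *v, enum cpu_mode to)
{
    assert(to < NR_MODES);
    v->active = to;
}
```

The supervisor stack is left exactly as it was while user mode runs, so returning only requires selecting it again.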




Method Evaluation
-----------------

In method one, similarly to the QEMU path, we need to move up and down the call stack twice. We pay the cost of running the entry and exit code, which all methods will. Then we pay the cost of the code paths for moving into deprivileged mode from the call site and for moving from deprivileged mode back to the call site to handle the result. This means that we also destroy and then rebuild the stack. We also pay any allocation and deallocation costs twice, unless we can re-write the code paths so that these can be avoided. A potential issue would be if any changes are made to Xen's state on the first entry which mean that on the second entry (returning from deprivileged mode), we take a different call path.

As mentioned, QEMU appears to do something similar so we can reuse much of this. The call tree is quite deep and broad so great care will need to be taken when making these changes to examine state-changing calls. Furthermore, such a change will be needed for each device, although this will be simpler after the first device is added.

The second method requires copying the stack and then restoring it. It doesn't pay the costs of following a return path into deprivileged mode or moving back to the call site as it, effectively, skips all of this. Memory accesses on the stack are roughly the same as the first method, but we do need enough storage to hold a copy of the stack for each vcpu. The edits to intermediate callers are likely to be simpler than method one, as we don't need to worry about there being two different return paths. Adding a new device model would most likely be easier than method one.

Method two appears to require fewer edits to the original source code and I suspect would be more efficient computationally than moving up and down the stack twice with multiple flag tests breaking code up. However, this has already been done for QEMU call paths so this may prove less troublesome/expensive than expected.

The third method _may_ require significant code refactoring as, currently, there is only one stack per pcpu, so this may be a large change.



Summary
-------

Just to reiterate, this is intended as a proof-of-concept to measure how feasible such a feature is.


I'm currently on the fence between method one and method two.

Method one will require more attention to existing code paths and is less like a context-switch approach.

Method two will require less attention to existing code paths and is more like a context-switching approach.


I am unsure of method three as I suspect it would be a significant change.

Are there any potential issues or things which I have overlooked? Additionally, which (if any) of the above would you recommend pursuing or do you have any ideas regarding alternatives?



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
