Hi all,
I'm working on an x86 proof-of-concept series to evaluate whether it is
feasible to move device models currently running in the hypervisor, and
the x86 emulation code for HVM guests, into a deprivileged context.
I've put together the following document as I have been considering
several different ways this could be achieved and was hoping to get
feedback from maintainers before I go ahead.
Many thanks in advance,
Ben
Context
-------
The aim is to run device models, which are already running inside the
hypervisor (e.g. x86 emulate), in deprivileged user mode for HVM guests,
using suitably mapped page tables. A simple hypercall convention is
needed to pass data between these two modes of operation, along with a
mechanism to move between them.
This is intended as a proof-of-concept, with the aim of determining if
this idea is feasible within performance constraints.
Motivation
----------
The motivation for moving the device models and x86 emulation code into
ring 3 is to mitigate a system compromise due to a bug in any of these
systems. These systems are currently part of the hypervisor and,
consequently, a bug in any of them could allow an attacker to gain
control of (or perform a DoS on) Xen and/or guests.
Moving between privilege levels
--------------------------------
The general process is to determine if we need to run a device model (or
similar) and then, if so, switch into deprivileged mode. The operation
is performed by deprivileged code which calls into the hypervisor as and
when needed. After the operation completes, we return to the hypervisor.
If deprivileged mode needs to make any hypervisor requests, it can do
these using a syscall interface, possibly placing an operation code into
a register to indicate the operation. This would allow it to get data
to/from the hypervisor.
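As a rough illustration of the operation-code idea, the hypervisor side
could dispatch on the value found in a register. The sketch below is
plain C which models this in isolation. All names (depriv_op,
depriv_regs, depriv_dispatch) and the operation numbers are hypothetical
and are not existing Xen interfaces; fake_guest_word merely stands in
for hypervisor-owned data.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation codes; not an existing Xen ABI. */
enum depriv_op {
    DEPRIV_OP_READ  = 0,   /* get data from the hypervisor */
    DEPRIV_OP_WRITE = 1,   /* pass data to the hypervisor */
    DEPRIV_OP_DONE  = 2,   /* operation finished, return to Xen */
    DEPRIV_OP_MAX
};

/* Registers as they might look at the syscall entry point. */
struct depriv_regs {
    uint64_t op;     /* operation code placed in a register */
    uint64_t arg0;   /* single in/out argument */
};

typedef long (*depriv_handler_t)(struct depriv_regs *regs);

/* Stand-in for data owned by the hypervisor. */
static uint64_t fake_guest_word;

static long op_read(struct depriv_regs *regs)
{
    regs->arg0 = fake_guest_word;
    return 0;
}

static long op_write(struct depriv_regs *regs)
{
    fake_guest_word = regs->arg0;
    return 0;
}

static long op_done(struct depriv_regs *regs)
{
    (void)regs;
    return 1;   /* tells the caller to leave deprivileged mode */
}

static const depriv_handler_t depriv_table[DEPRIV_OP_MAX] = {
    [DEPRIV_OP_READ]  = op_read,
    [DEPRIV_OP_WRITE] = op_write,
    [DEPRIV_OP_DONE]  = op_done,
};

/* Validate the operation code before indexing the table. */
long depriv_dispatch(struct depriv_regs *regs)
{
    assert(regs != NULL);
    if ( regs->op >= DEPRIV_OP_MAX )
        return -ENOSYS;
    return depriv_table[regs->op](regs);
}
```

Bounds-checking the operation code before using it as an index matters
here because the value comes from the less trusted side.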
I am currently considering three different methods regarding the context
switch and would be grateful for any feedback.
Method One
----------
This method works by building on top of the QEMU emulation path code.
This currently operates as a state machine, using flags to determine the
current emulation state. These states are used to determine which code
paths to take when calling out of and into the hypervisor before and
after emulation.
The intention would be to add new states and then follow the same
process as the existing code does except that, rather than blocking the
vcpu, we switch into deprivileged mode and process the request on the
current vcpu. This differs from the QEMU path, which blocks the current
vcpu, so additional code is needed to support this context switch. There
may be other code paths which have not been written in this way and
which would require rewriting.
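To make the "new states" idea concrete, here is a minimal sketch of how
such a state machine might look. It is a toy model only: the state and
action names (EMU_STATE_*, EMU_ACT_*) and the functions emu_step() and
emu_depriv_finished() are hypothetical and do not correspond to the
existing emulation code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical emulation states for the deprivileged path. */
enum emu_state {
    EMU_STATE_NONE,             /* no request in flight */
    EMU_STATE_AWAITING_DEPRIV,  /* request handed to deprivileged mode */
    EMU_STATE_DEPRIV_DONE,      /* deprivileged mode has produced a result */
};

/* What the caller should do after consulting the state machine. */
enum emu_action {
    EMU_ACT_ENTER_DEPRIV,  /* switch to deprivileged mode on this vcpu */
    EMU_ACT_COMPLETE,      /* result available, finish the emulation */
    EMU_ACT_ERROR,         /* unexpected re-entry */
};

struct emu_ctx {
    enum emu_state state;
    unsigned long result;
};

/* Called on each pass through the emulation path. */
enum emu_action emu_step(struct emu_ctx *ctx)
{
    assert(ctx != NULL);
    switch ( ctx->state )
    {
    case EMU_STATE_NONE:
        /* First entry: record that we are leaving for deprivileged mode. */
        ctx->state = EMU_STATE_AWAITING_DEPRIV;
        return EMU_ACT_ENTER_DEPRIV;

    case EMU_STATE_DEPRIV_DONE:
        /* Re-entry: consume the result and reset for the next request. */
        ctx->state = EMU_STATE_NONE;
        return EMU_ACT_COMPLETE;

    default:
        /* Re-entered before deprivileged mode reported completion. */
        return EMU_ACT_ERROR;
    }
}

/* Called when deprivileged mode returns to the hypervisor. */
void emu_depriv_finished(struct emu_ctx *ctx, unsigned long result)
{
    assert(ctx != NULL && ctx->state == EMU_STATE_AWAITING_DEPRIV);
    ctx->result = result;
    ctx->state = EMU_STATE_DEPRIV_DONE;
}
```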
When moving into deprivileged mode, we need to be careful to ensure that
when we leave, we can redo the call into the hypervisor after the device
model completes without causing problems. Thus, we need to be _certain_
that the same call path is followed on the re-entry and that the
system's state can handle this. This may mean undoing operations such as
memory allocations.
Method Two
----------
At the point of detecting the need to perform a deprivileged operation,
we take a copy of the current stack from the current stack position up
to the point where the guest entered Xen and save it. Subsequently, we
move the stack pointer back.
This effectively gives us a clean stack as though we had just entered
Xen. We then put the deprivileged context onto this new stack and enter
deprivileged mode.
Upon returning, we restore the previous stack with the guest's and Xen's
context then jump to the saved rip and continue execution. Xen will then
perform the necessary processing, determining if the operation was
successful or not.
We are effectively context switching out Xen for deprivileged code and
then bringing Xen back in once we're done.
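The copy and restore steps can be modelled in a few lines of C. This is
only a sketch of the bookkeeping, using a byte array as a toy stack
which grows downwards; the types and functions (toy_stack, saved_stack,
stack_save, stack_restore) are hypothetical. A real implementation would
also need assembly to switch the stack pointer and to save and restore
rip, which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STACK_SIZE 4096

/* Toy stack growing downwards; offsets index into mem[]. */
struct toy_stack {
    unsigned char mem[STACK_SIZE];
    size_t sp;      /* current stack pointer offset */
    size_t entry;   /* offset of sp at the point the guest entered Xen */
};

/* Per-vcpu storage for the saved portion of the stack. */
struct saved_stack {
    unsigned char data[STACK_SIZE];
    size_t len;     /* number of bytes saved */
    size_t sp;      /* stack pointer offset at the time of saving */
};

/* Copy [sp, entry) aside, then move the stack pointer back to entry. */
void stack_save(struct toy_stack *s, struct saved_stack *save)
{
    assert(s->sp <= s->entry && s->entry <= STACK_SIZE);
    save->len = s->entry - s->sp;
    save->sp = s->sp;
    memcpy(save->data, &s->mem[s->sp], save->len);
    s->sp = s->entry;   /* stack now looks as if we had just entered Xen */
}

/* Put the saved frames back and restore the old stack pointer. */
void stack_restore(struct toy_stack *s, const struct saved_stack *save)
{
    assert(save->sp + save->len == s->entry);
    memcpy(&s->mem[save->sp], save->data, save->len);
    s->sp = save->sp;
}
```

Anything the deprivileged context writes to the reused region between
stack_save() and stack_restore() is overwritten by the restore.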
As Xen is non-preemptive, the Xen stack won't be updated whilst we're in
deprivileged mode. If it can be updated (I'm speculating here), e.g. by
an interrupt, then we can pause deprivileged mode by hooking the
interrupt and restoring the Xen stack, then handle the interrupt and
finally go back to deprivileged mode.
Problem: If the device model or emulator edits the saved guest registers
and Xen then modifies the same registers on the return path, after it has
finished servicing the deprivileged operation, the guest will see Xen's
values rather than those which deprivileged mode provided.
This is not a problem if the code doesn't do this. If it does, we could
give higher precedence to deprivileged changes. So, deprivileged mode
pushes the changes into the hypervisor, which caches them and then, just
before guest context is restored, applies them, thus discarding any
changes Xen made to those registers.
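One possible shape for this caching is a per-vcpu set of pending
register writes with a dirty mask, applied as the very last step before
returning to the guest. The sketch below assumes a simplified register
file; the register list and the names pending_writes, depriv_set_reg and
apply_pending are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified, hypothetical guest register file. */
enum { REG_RAX, REG_RBX, REG_RCX, REG_RIP, REG_COUNT };

struct guest_regs {
    uint64_t r[REG_COUNT];
};

/* Register changes requested by deprivileged mode, cached by Xen. */
struct pending_writes {
    uint64_t value[REG_COUNT];
    uint32_t dirty;   /* bit n set => value[n] overrides register n */
};

/* Called when deprivileged mode pushes a register change to Xen. */
void depriv_set_reg(struct pending_writes *p, unsigned int reg, uint64_t v)
{
    assert(reg < REG_COUNT);
    p->value[reg] = v;
    p->dirty |= 1u << reg;
}

/*
 * Called just before guest context is restored. Cached deprivileged
 * values win over whatever Xen wrote to the same registers; registers
 * which deprivileged mode did not touch are left alone.
 */
void apply_pending(struct guest_regs *g, struct pending_writes *p)
{
    for ( unsigned int i = 0; i < REG_COUNT; i++ )
        if ( p->dirty & (1u << i) )
            g->r[i] = p->value[i];
    p->dirty = 0;
}
```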
Method Three
------------
A per vcpu stack is maintained for user mode and supervisor mode. We
then don't need to do any copying, just switch to user mode at the point
when deprivileged code needs to run.
When deprivileged mode is done, we move back to supervisor mode, restore
the previous context and continue execution of the code path that
followed the call to move into deprivileged mode.
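For comparison with the copying approach, a toy model of the two-stack
arrangement might look as follows. Nothing is copied; we only remember
where the supervisor stack pointer was and switch which stack is active.
The structure and function names (vcpu_stacks, enter_user, leave_user,
active_stack) are hypothetical, and the real switch would of course be
done in assembly.

```c
#include <assert.h>
#include <stddef.h>

#define VCPU_STACK_SIZE 4096

enum stack_mode { MODE_SUPERVISOR, MODE_USER };

/* Hypothetical per-vcpu pair of stacks. */
struct vcpu_stacks {
    unsigned char supervisor[VCPU_STACK_SIZE];
    unsigned char user[VCPU_STACK_SIZE];
    enum stack_mode mode;   /* which stack is currently active */
    size_t sp;              /* stack pointer offset in the active stack */
    size_t supervisor_sp;   /* saved supervisor sp while in user mode */
};

/* Base of whichever stack is currently in use. */
unsigned char *active_stack(struct vcpu_stacks *v)
{
    return v->mode == MODE_USER ? v->user : v->supervisor;
}

/* Switch to the user stack; the supervisor stack is left untouched. */
void enter_user(struct vcpu_stacks *v)
{
    assert(v->mode == MODE_SUPERVISOR);
    v->supervisor_sp = v->sp;
    v->sp = VCPU_STACK_SIZE;    /* user stack starts empty */
    v->mode = MODE_USER;
}

/* Return to the supervisor stack exactly where we left it. */
void leave_user(struct vcpu_stacks *v)
{
    assert(v->mode == MODE_USER);
    v->sp = v->supervisor_sp;
    v->mode = MODE_SUPERVISOR;
}
```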
Method Evaluation
-----------------
In method one, similarly to the QEMU path, we need to move up and down
the call stack twice. We pay the cost of running the entry and exit
code, which all methods will. Then we pay the cost of the code paths for
moving into deprivileged mode from the call site and for moving from
deprivileged mode back to the call site to handle the result. This means
that we also destroy and then rebuild the stack. We also pay any
allocation and deallocation costs twice, unless we can re-write the code
paths so that these can be avoided. A potential issue would be if any
changes are made to Xen's state on the first entry which mean that on
the second entry (returning from deprivileged mode), we take a different
call path.
As mentioned, QEMU appears to do something similar so we can reuse much
of this. The call tree is quite deep and broad so great care will need
to be taken when making these changes to examine state-changing calls.
Furthermore, such a change will be needed for each device, although this
will be simpler after the first device is added.
The second method requires copying the stack and then restoring it. It
doesn't pay the costs of following a return path into deprivileged mode
or moving back to the call site as it, effectively, skips all of this.
Memory accesses on the stack are roughly the same as in the first method,
but we do need enough storage to hold a copy of the stack for each
vcpu. The edits to intermediate callers are likely to be simpler than
method one, as we don't need to worry about there being two different
return paths. Adding a new device model would most likely be easier than
method one.
Method two appears to require fewer edits to the original source code
and I suspect would be more efficient computationally than moving up and
down the stack twice with multiple flag tests breaking code up. However,
this has already been done for the QEMU call paths, so method one may
prove less troublesome/expensive than expected.
The third method _may_ require significant code refactoring as there is
currently only one stack per pcpu, so this may be a large change.
Summary
-------
Just to reiterate, this is intended as a proof-of-concept to measure how
feasible such a feature is.
I'm currently on the fence between method one and method two.
Method one will require more attention to existing code paths and is
less like a context-switch approach.
Method two will require less attention to existing code paths and is
more like a context-switching approach.
I am unsure of method three as I suspect it would be a significant change.
Are there any potential issues or things which I have overlooked?
Additionally, which (if any) of the above would you recommend pursuing
or do you have any ideas regarding alternatives?