On 12/2/20, Demi M. Obenour wrote:
>
> That’s understandable.  From my perspective, it appears that the only
> change required would be to swap out the VMM, since the same kernel
> capabilities would be required either way.  The only difference would
> be that a nested instance of UX/RT would need to get untyped memory
> objects somehow, which seems simple.  Of course, I could very well
> be missing something here ― if this would meaningfully increase
> the complexity of the system, it probably isn’t worth it.
>

In addition to the aforementioned issues with binary compatibility, it
would also complicate booting such systems. UX/RT will have a uniform
boot process in which the kernel, root server, and all other files
required for early boot are contained within a "supervisor image" in
the root directory of the root container being booted, rather than
there being multiple images on disk, and the root server will depend
on having the Multiboot2 info passed through to it (so that it doesn't
have to understand the filesystem format of the supervisor image).
There would also have to be some kind of API for communication with
alternate personalities running on the same kernel (although of course
on a hypervisor there would have to be an API for communicating with
other VMs as well).
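
To illustrate what I mean (this is just a rough Rust sketch, not
actual UX/RT code; the tag layout comes from the Multiboot2 spec, but
the function and its role here are hypothetical), the root server
could locate the supervisor image by walking the Multiboot2 module
tags it was handed, without knowing anything about the image's
internal filesystem format:

// Rough sketch only: the Multiboot2 tag layout below follows the spec,
// but find_supervisor_image() and its role here are hypothetical.
#[repr(C)]
struct Mb2Tag {
    typ: u32,  // tag type
    size: u32, // tag size in bytes, not including alignment padding
}

const MB2_TAG_END: u32 = 0;
const MB2_TAG_MODULE: u32 = 3;

// Returns the physical (start, end) range of the first boot module,
// which in this scheme would be the supervisor image.
unsafe fn find_supervisor_image(mb2_info: *const u8) -> Option<(u32, u32)> {
    // The boot information block starts with total_size and a reserved field.
    let total_size = *(mb2_info as *const u32) as usize;
    let mut offset = 8;
    while offset < total_size {
        let tag = &*(mb2_info.add(offset) as *const Mb2Tag);
        match tag.typ {
            MB2_TAG_END => break,
            MB2_TAG_MODULE => {
                // Module tag body: mod_start and mod_end, then a cmdline string.
                let body = mb2_info.add(offset + 8) as *const u32;
                return Some((*body, *body.add(1)));
            }
            _ => {}
        }
        // Tags are padded to 8-byte alignment.
        offset += (tag.size as usize + 7) & !7;
    }
    None
}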

Also, it would be somewhat redundant, because UX/RT will already have
extensive containerization support at the VFS level (and since
literally all IPC and user memory will be file-based, this means
complete control over what is accessible within containers). It would
really just be pushing containerization down to the kernel level,
which would be of little benefit.
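
To make that concrete (purely a hypothetical sketch; none of these
types are real UX/RT APIs), a container at the VFS level is basically
described by which parts of the namespace are bound into it, and
because every IPC endpoint and memory region is a file, that binding
list is the whole security policy:

// Hypothetical illustration only; none of these types exist.
struct Bind {
    source: String, // path in the parent namespace
    target: String, // where it appears inside the container
    read_only: bool,
}

struct ContainerSpec {
    binds: Vec<Bind>,
}

fn example_container() -> ContainerSpec {
    ContainerSpec {
        // Only these subtrees (and therefore only the IPC endpoints and
        // memory files under them) are reachable from inside the container.
        binds: vec![
            Bind { source: "/bin".into(), target: "/bin".into(), read_only: true },
            Bind { source: "/srv/net".into(), target: "/srv/net".into(), read_only: false },
        ],
    }
}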

>
> That said, from the way you phrased your message, I thought you were
> referring to a type-1 hypervisor that would run below UX/RT.  IMO, that
> is where such a tool really belongs ― it can provide strong isolation
> guarantees between multiple seL4-based systems, and still allow each
> of those systems to use hardware virtualization if they so desire.
> For instance, an embedded system might have high-assurance components
> running on the seL4 Core Platform, while UX/RT is used as a replacement
> for VMs running Linux.  Similarly, a hypothetical seL4-based QubesOS
> might use this type-1 hypervisor to isolate qubes from each other.

Yes, that's exactly what I was planning to do. I want to write a
Type 1 hypervisor based on seL4 or a fork of it, which will be
independent of UX/RT (although it will be distributed along with it).
It will have a somewhat Xen-ish architecture in that all services will
be provided by backend drivers running in VMs (the VMM will run as the
root server and will be the only process-like thing running on the
hypervisor microkernel).
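
As a rough illustration of the split (hypothetical names, not a real
seL4 or UX/RT API), each backend driver in a service VM would
implement something along these lines, with the VMM/root server only
wiring frontends and backends together:

// Hypothetical sketch of the backend-driver side of the split-driver
// model; none of these names come from an actual API.
pub struct VmId(pub u32);
pub struct SharedRing; // shared-memory request/response ring with a frontend
pub struct NotifyCap;  // capability used to signal the frontend

pub trait DeviceBackend {
    // Called when a frontend in another VM attaches, handing over the
    // shared ring and a notification capability.
    fn attach(&mut self, frontend: VmId, ring: SharedRing, notify: NotifyCap);

    // Called when the frontend signals that new requests are pending.
    fn handle_requests(&mut self, frontend: VmId);

    // Called when the frontend VM detaches or shuts down.
    fn detach(&mut self, frontend: VmId);
}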

>
> FYI, since you plan on a Linux compatibility layer, you might want to
> contact the illumos developers.  illumos is, of course, a completely
> different OS design, but they do have a full Linux compatibility
> layer and might be able to give some implementation advice.
>

Their Linux compatibility layer probably wouldn't be all that relevant
to UX/RT's. UX/RT's Linux system call handling will be purely
library-based, which is rather different from a kernel-based
compatibility layer.
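
Roughly speaking (hypothetical sketch; the uxrt_* functions are
placeholders, though the syscall numbers are the real Linux x86-64
ones), the Linux personality's libc or trap handler would end up
calling something like this in-process instead of entering a kernel:

// Illustrative sketch only: the uxrt_* functions are hypothetical
// placeholders; the syscall numbers are the real Linux x86-64 ones.
const SYS_WRITE: usize = 1;
const SYS_GETPID: usize = 39;

// Hypothetical UX/RT-native, file-based primitives the shim would call.
fn uxrt_vfs_write(_fd: i32, _buf: &[u8]) -> isize { unimplemented!() }
fn uxrt_self_pid() -> isize { unimplemented!() }

// Entry point that the Linux libc (or a handler for raw syscall
// instructions) would call in-process instead of trapping to a kernel.
pub fn linux_syscall(nr: usize, a0: usize, a1: usize, a2: usize) -> isize {
    match nr {
        SYS_WRITE => {
            // Turn the raw pointer/length pair into a slice and hand it
            // to the file-based write primitive.
            let buf = unsafe { core::slice::from_raw_parts(a1 as *const u8, a2) };
            uxrt_vfs_write(a0 as i32, buf)
        }
        SYS_GETPID => uxrt_self_pid(),
        // Anything unimplemented reports ENOSYS, as Linux itself would.
        _ => -38, // -ENOSYS
    }
}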

>
> An officially-supported, API- and ABI- stable C library is planned,
> so this may not be a roadblock for much longer.
>

I was under the impression that it wasn't going to have a stable ABI
since it is intended more for static systems.

>
> 1. The need for an emulator removes many of the assurance guarantees
>    provided by seL4, since one must rely on the correctness of the
>    emulator to prevent in-VM privilege escalation vulnerabilities.
>    Such vulnerabilities are not uncommon in existing hypervisors.
>

There isn't necessarily a need for an emulator as such for PVH-type
VMs, although there will of course be device backends.

Also, since this is meant for dynamic systems, verification of the
microkernel is somewhat less relevant: the TCB of any process or VM
will always include highly dynamic user-mode subsystems that will be
difficult or impossible to verify (both the UX/RT root server and that
of the hypervisor will be written in Rust to reduce the potential for
significant vulnerabilities).

For a system with a static enclave running on the same processor as a
dynamic OS, a verified kernel could still be used (presumably without
any direct hypercall support unless that could be verified, since it
would be less likely that high-throughput IPC between the two sides
would be needed).

>
> 2. Nested hardware virtualization is quite difficult to implement, and
>    has significant overhead.  On the other hand, nested virtualization
>    based on seL4 capabilities is free.

UX/RT itself probably won't do anything with hardware virtualization
extensions directly. It will support hosting I/O emulators and device
backends for the underlying Type 1 hypervisor, and possibly
Skybridge-type acceleration, but those will only use hypercalls.

>
> 3. I doubt seL4 supports issuing any specialized hypercall
>    instructions, so you might need to fall back to emulation.
>

That's why I was wondering about adding kernel-level support for
direct hypercalls.

On 12/3/20, Indan Zupancic wrote:
>
> That seems possible to implement. The challenge is to implement
> this without increasing seL4's complexity and without slowing down
> all notification and IPC system calls, as it would be a couple of
> special checks somewhere. It would require new system calls, e.g:
> seL4_Notification_SetIRQHandler() and seL4_IPC_SetIRQHandler().
> Or it can be a TCB/VCPU binding instead.
>
> It does not solve blocking outgoing IPC calls, but I assume you
> don't mind those. Interrupts must be kept disabled for the VM
> during such calls.
>
>> It also has to work on x86 as well.
>
> The VM receive side needs special seL4 kernel support to generate
> interrupts when there are pending notifications or messages. This
> is architecture independent.
>
> The VM send side needs a way to directly make calls into the seL4
> kernel instead of going via the VMM. This needs VM kernel support,
> because such instructions are usually privileged. All VM <-> VMM
> communication happens via the seL4 kernel already, so all that is
> needed is for seL4 to know whether it is a VMM call or a system
> call. On ARM this happens by looking at whether it was a svc or a
> hvc call. It seems likely that x86 can do something similar.
>
> For people who are wondering what use this has, the advantage of
> taking the VMM out of the loop for seL4 <-> VM communication is not
> only possible performance increases, but it also simplifies the VMM
> and makes it more agnostic about the system as a whole. The system
> design becomes simpler because you can use the normal seL4 API
> everywhere, including VM user space. This makes it easier to move
> code between VM user space and native seL4.
>

I'm thinking of an interface where hypercalls are made by writing to
virtual registers rather than by executing explicit trapping
instructions, with the ability to create multiple hypercall interfaces
per VM, each with its own set of endpoint capabilities (although
interrupt delivery would probably still be best off centralized, with
a single interface per VCPU for use only by the kernel). This would
make it easy for user processes within a VM to make hypercalls, since
the guest system could create a hypercall interface for each process.

Hypercall support for user processes in VMs would also pretty much
require moving to a semi-asynchronous model where only the hypercall
interface itself gets blocked, rather than the VCPU: attempting to do
anything with a hypercall interface that already has a call in
progress would fail, but the VCPU would continue to run and all other
hypercall interfaces would remain available. There would presumably
have to be a dummy TCB (with only a CSpace but no VSpace or CPU state
defined) for each hypercall interface, separate from the TCB of the
VM. IPC for normal threads would remain synchronous.
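
To make the idea a bit more concrete, here's a purely hypothetical
sketch of the guest-side view of one such interface (nothing here
exists in seL4 today; the layout, states, and names are all made up
for illustration):

// Purely hypothetical; nothing like this exists in seL4 today. Each
// interface would be a region of memory shared with the hypervisor, so
// a guest kernel could hand one to each user process that should be
// able to make hypercalls, with no trapping instruction involved.
#[repr(C)]
pub struct HypercallInterface {
    // Index of the endpoint capability (in this interface's own CSpace,
    // backed by its dummy TCB) that the call is directed at.
    pub endpoint_slot: u32,
    // Written by the guest to submit a call, by the kernel to report
    // progress and completion.
    pub status: u32,
    // Message registers for arguments and the eventual reply.
    pub mrs: [u64; 8],
}

pub const IDLE: u32 = 0;
pub const SUBMITTED: u32 = 1;
pub const IN_PROGRESS: u32 = 2;
pub const REPLY_READY: u32 = 3;

// Guest-side helper: submit a call on an idle interface. Only this
// interface "blocks"; the VCPU keeps running and all other interfaces
// stay usable, matching the semi-asynchronous model described above.
pub fn submit(iface: &mut HypercallInterface, slot: u32, args: &[u64]) -> Result<(), ()> {
    if iface.status != IDLE {
        // A call is already outstanding on this interface, so just fail.
        return Err(());
    }
    assert!(args.len() <= iface.mrs.len());
    iface.endpoint_slot = slot;
    iface.mrs[..args.len()].copy_from_slice(args);
    // In a real design this last store would need to be volatile/ordered
    // so the hypervisor sees the arguments before the status change.
    iface.status = SUBMITTED;
    Ok(())
}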