On 12/4/20 11:37 PM, Andrew Warkentin wrote:
> On 12/2/20, Demi M. Obenour wrote:
>>
>> That’s understandable.  From my perspective, it appears that the only
>> change required would be to swap out the VMM, since the same kernel
>> capabilities would be required either way.  The only difference would
>> be that a nested instance of UX/RT would need to get untyped memory
>> objects somehow, which seems simple.  Of course, I could very well
>> be missing something here ― if this would meaningfully increase
>> the complexity of the system, it probably isn’t worth it.
>>
> 
> In addition to the aforementioned issues with binary compatibility, it
> would also complicate booting such systems. UX/RT will have a uniform
> boot process in which the kernel, root server, and all other files
> required for early boot are contained within a "supervisor image" in
> the root directory of the root container being booted, rather than in
> multiple images on disk. The root server will also depend on having
> the Multiboot2 info passed through to it (so that it doesn't have to
> understand the filesystem format of the supervisor image). There
> would also have to be some kind of API for communication with
> alternate personalities running on the same kernel (although of course
> on a hypervisor there would have to be an API for communicating with
> other VMs as well).

Could the same API be used for both?

> Also, it's kind of redundant because UX/RT will already have extensive
> containerization support at the VFS level (and since literally all IPC
> and user memory will be file-based this means complete control over
> what is accessible in containers). This would really just be pushing
> containerization down to the kernel level, which would be of little
> benefit.
> 
>>
>> That said, from the way you phrased your message, I thought you were
>> referring to a type-1 hypervisor that would run below UX/RT.  IMO, that
>> is where such a tool really belongs ― it can provide strong isolation
>> guarantees between multiple seL4-based systems, and still allow each
>> of those systems to use hardware virtualization if they so desire.
>> For instance, an embedded system might have high-assurance components
>> running on the seL4 Core Platform, while UX/RT is used as a replacement
>> for VMs running Linux.  Similarly, a hypothetical seL4-based QubesOS
>> might use this type-1 hypervisor to isolate qubes from each other.
> 
> Yes, that's exactly what I was planning to do. I want to write a
> Type 1 hypervisor based on seL4 or a fork, which will be independent
> of UX/RT (although it will be distributed along with it). It will have
> a somewhat Xen-ish architecture in that all services will be provided
> by backend drivers running in VMs (the VMM will run as the root server
> and will be the only process-like thing running on the hypervisor
> microkernel).
> 
>>
>> FYI, since you plan on a Linux compatibility layer, you might want to
>> contact the illumos developers.  illumos is, of course, a completely
>> different OS design, but they do have a full Linux compatibility
>> layer and might be able to give some implementation advice.
>>
> 
> Their Linux compatibility layer probably wouldn't really be all that
> relevant to that of UX/RT. UX/RT's Linux system call handling will be
> purely library-based, which is rather different from a kernel-based
> compatibility layer.

Still, they might have useful advice on what is most important to
support, and what can be left unimplemented.  For instance, I believe
they left out signal-driven I/O, since hardly anyone used it.

>> An officially-supported, API- and ABI- stable C library is planned,
>> so this may not be a roadblock for much longer.
>>
> 
> I was under the impression that it wasn't going to have a stable ABI
> since it is intended more for static systems.

The system call API and ABI won’t be stable, but the API and ABI
of sel4corelib will.

>> 1. The need for an emulator removes many of the assurance guarantees
>>    provided by seL4, since one must rely on the correctness of the
>>    emulator to prevent in-VM privilege escalation vulnerabilities.
>>    Such vulnerabilities are not uncommon in existing hypervisors.
>>
> 
> There isn't necessarily a need for an emulator as such for PVH-type
> VMs, although there will of course be device backends.
> 
> Also, since this is meant for dynamic systems, verification of the
> microkernel is somewhat less relevant, since the TCB of any process or
> VM will always include highly dynamic user-mode subsystems that will
> be difficult or impossible to verify (the UX/RT root server as well as
> that of the hypervisor will be written in Rust to reduce the potential
> for significant vulnerabilities).

What I meant is that when one is interacting with hardware
directly, logic errors are far more likely than usual to lead to
vulnerabilities.  For instance, incorrectly emulating a faulting
instruction is memory safe from the VMM perspective, but could
easily lead to a privilege escalation vulnerability in the guest.
That’s where running natively on seL4 is a huge security win.
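
To make this concrete, here is a contrived sketch (hypothetical code,
not taken from any real VMM) of the kind of bug I mean:

    #include <stdint.h>

    struct desc_ptr { uint16_t limit; uint64_t base; };
    struct vcpu     { struct desc_ptr guest_idtr; int cpl; };

    /* Emulating a trapped LIDT.  This is perfectly memory-safe from
     * the VMM's perspective, yet because it never checks the guest's
     * privilege level, guest *user* code can load an attacker-
     * controlled IDT and take over the guest kernel.  A correct
     * emulator would inject #GP unless vcpu->cpl == 0. */
    static int emulate_lidt(struct vcpu *vcpu, const struct desc_ptr *src)
    {
        vcpu->guest_idtr = *src;
        return 0;
    }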

> For a system with a static enclave running on the same processor as a
> dynamic OS, a verified kernel could still be used (presumably without
> any direct hypercall support, unless that could be verified, since
> there would be less need for high-throughput IPC between the two
> sides).

Enclaves are indeed the case I was thinking of, although I do not
believe they would normally be referred to as such.  Specifically,
it would be nice for UX/RT to be able to run on the same kernel as
the rest of an otherwise static system.

>> 2. Nested hardware virtualization is quite difficult to implement, and
>>    has significant overhead.  On the other hand, nested virtualization
>>    based on seL4 capabilities is free.
> 
> UX/RT itself probably won't do anything with hardware virtualization
> extensions directly. It will support hosting I/O emulators and device
> backends for the underlying Type 1 hypervisor, and possibly
> Skybridge-type acceleration, but those will only use hypercalls.
> 
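
On the capability side, nesting really is cheap: the parent system
just copies some of its untyped capabilities into the nested system's
CSpace, and the child can then retype them into whatever kernel
objects it needs. A minimal sketch, assuming the slot numbers come
from the parent's bootinfo and CSpace allocator:

    #include <sel4/sel4.h>

    /* Hypothetical slot numbers; a real system would obtain these
     * from its bootinfo and allocator. */
    #define CHILD_CNODE   10  /* cap to the child's root CNode  */
    #define PARENT_UT     20  /* an untyped cap owned by parent */
    #define CHILD_UT_SLOT  1  /* destination slot in the child  */

    /* Hand one untyped object to the nested system. */
    static seL4_Error grant_untyped(void)
    {
        return seL4_CNode_Copy(
            CHILD_CNODE, CHILD_UT_SLOT, seL4_WordBits,
            seL4_CapInitThreadCNode, PARENT_UT, seL4_WordBits,
            seL4_AllRights);
    }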
>>
>> 3. I doubt seL4 supports issuing any specialized hypercall
>>    instructions, so you might need to fall back to emulation.
>>
> 
> That's why I was wondering about adding kernel-level support for
> direct hypercalls.
> 
> On 12/3/20, Indan Zupancic wrote:
>>
>> That seems possible to implement. The challenge is to do so
>> without increasing seL4's complexity and without slowing down
>> all notification and IPC system calls, since it would add a couple
>> of special checks somewhere. It would require new system calls,
>> e.g. seL4_Notification_SetIRQHandler() and seL4_IPC_SetIRQHandler().
>> Or it can be a TCB/VCPU binding instead.
>>
>> It does not solve blocking outgoing IPC calls, but I assume you
>> don't mind those. Interrupts must be kept disabled for the VM
>> during such calls.
>>
>>> It also has to work on x86.
>>
>> The VM receive side needs special seL4 kernel support to generate
>> interrupts when there are pending notifications or messages. This
>> is architecture independent.
>>
>> The VM send side needs a way to directly make calls into the seL4
>> kernel instead of going via the VMM. This needs VM kernel support,
>> because such instructions are usually privileged. All VM <-> VMM
>> communication happens via the seL4 kernel already, so all that is
>> needed is for seL4 to know whether it is a VMM call or a system
>> call. On ARM this happens by looking at whether it was an svc or
>> an hvc call. It seems likely that x86 can do something similar.
>>
>> For people who are wondering what use this has: taking the VMM out
>> of the loop for seL4 <-> VM communication not only promises
>> performance increases, but also simplifies the VMM and makes it
>> more agnostic about the system as a whole. The system
>> design becomes simpler because you can use the normal seL4 API
>> everywhere, including VM user space. This makes it easier to move
>> code between VM user space and native seL4.
>>
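
For concreteness, the calls Indan sketches might look something like
this (both signatures are hypothetical; neither exists in seL4 today,
and as he notes the same thing could be expressed as a TCB/VCPU
binding instead):

    #include <sel4/sel4.h>

    /* Hypothetical: ask the kernel to raise virtual IRQ `virq` on
     * `vcpu` whenever the notification (or endpoint) has something
     * pending, instead of requiring the VMM to forward it. */
    seL4_Error seL4_Notification_SetIRQHandler(seL4_CPtr ntfn,
                                               seL4_CPtr vcpu,
                                               seL4_Word virq);
    seL4_Error seL4_IPC_SetIRQHandler(seL4_CPtr ep,
                                      seL4_CPtr vcpu,
                                      seL4_Word virq);

    /* VMM-side setup: deliver signals on `ntfn` as guest IRQ 5. */
    static void bind_ntfn_to_guest(seL4_CPtr ntfn, seL4_CPtr vcpu)
    {
        seL4_Notification_SetIRQHandler(ntfn, vcpu, 5);
    }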
> 
> I'm thinking of an interface where hypercalls are made by writing to
> virtual registers rather than using explicit trapping instructions,
> with the possibility of creating multiple hypercall interfaces per VM,
> each with its own set of endpoint capabilities (although interrupt
> delivery would probably still be best off being centralized with a
> single interface per VCPU, for use only by the kernel). This would
> make it easy for user processes within a VM to make hypercalls since
> the guest system could create a hypercall interface for each process.
> Hypercall support for user processes in VMs would also pretty much
> require moving to a semi-asynchronous model where only the hypercall
> interface itself gets blocked without blocking the VCPU (attempting to
> do anything with a hypercall interface that has a call already in
> progress would fail, but the VCPU would continue to run and all other
> hypercall interfaces would remain available). There would presumably
> have to be a dummy TCB (with only a CSpace but no VSpace or CPU state
> defined) for each hypercall interface separate from the TCB of the VM.
> IPC for normal threads would remain synchronous.

Virtual registers may very well incur a significant performance hit.
If I recall correctly, dedicated system call instructions are heavily
optimized at the hardware level, and are much faster than a generic
trap.  I believe the same is true of HVC on ARM.  Some of the overhead
can be made up by means of a ring buffer, but some will remain.
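
To illustrate the ring-buffer mitigation, here is a minimal sketch
(a hypothetical layout, not any existing interface): the guest queues
hypercall descriptors in shared memory and touches the trapping
virtual register only when the ring was empty, so a single exit can
drain many queued calls:

    #include <stdint.h>

    #define RING_SLOTS 64

    struct hypercall {
        uint64_t op;
        uint64_t args[4];
    };

    struct hypercall_ring {
        volatile uint32_t head;   /* written by the guest */
        volatile uint32_t tail;   /* written by the host  */
        struct hypercall  slot[RING_SLOTS];
    };

    /* `doorbell` is the interface's trapping virtual register;
     * the full-ring check is omitted for brevity. */
    static void hypercall_submit(struct hypercall_ring *r,
                                 volatile uint32_t *doorbell,
                                 const struct hypercall *hc)
    {
        uint32_t head = r->head;
        r->slot[head % RING_SLOTS] = *hc;
        __atomic_store_n(&r->head, head + 1, __ATOMIC_RELEASE);
        if (head == r->tail)  /* ring was empty; one exit wakes the host */
            *doorbell = 1;
    }

Even then, the exits that do occur take the slower generic trap path
rather than the optimized system call path.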

Sincerely,

Demi

P.S.: I hope that UX/RT is wildly successful!


