Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends
Hello,

On 9/13/2021 4:51 PM, Stefano Stabellini via Stratos-dev wrote:
> On Mon, 6 Sep 2021, AKASHI Takahiro wrote:
>>> the second is how many context switches are involved in a transaction. Of course with all things there is a trade off. Things involving the very tightest latency would probably opt for a bare metal backend which I think would imply hypervisor knowledge in the backend binary.
>>
>> In the configuration phase of a virtio device, latency won't be a big matter. In device operations (i.e. read/write to block devices), if we can resolve the 'mmap' issue, as Oleksandr is proposing right now, the only remaining issue is how efficiently we can deliver notifications to the opposite side. Right? And this is a very common problem whatever approach we take.
>>
>> Anyhow, if we do care about latency in my approach, most of the virtio-proxy-related code can be re-implemented as just a stub (or shim?) library, since the protocols are defined as RPCs. In that case, however, we would lose the benefit of providing a "single binary" BE. (I know this is an arguable requirement, though.)
>
> In my experience, latency, performance, and security are far more important than providing a single binary. In my opinion, we should optimize for the best performance and security, then be practical on the topic of hypervisor agnosticism. For instance, a shared source with a small hypervisor-specific component, with one implementation of the small component for each hypervisor, would provide a good enough hypervisor abstraction. It is good to be hypervisor agnostic, but I wouldn't go to extra lengths to have a single binary. I cannot picture a case where a BE binary needs to be moved between different hypervisors and a recompilation is impossible (BE, not FE). Instead, I can definitely imagine detailed requirements on IRQ latency having to be lower than 10us, or bandwidth higher than 500 MB/sec.
> Instead of virtio-proxy, my suggestion is to work together on a common project and common source with others interested in the same problem. I would pick something like kvmtool as a basis. It doesn't have to be kvmtool, and kvmtool specifically is GPL-licensed, which is unfortunate because it would help if the license were BSD-style for ease of integration with Zephyr and other RTOSes. As long as the project is open to working together on multiple hypervisors and deployment models, then it is fine. For instance, the shared source could be based on OpenAMP kvmtool [1] (the original kvmtool likely prefers to stay small and narrowly focused on KVM). OpenAMP kvmtool was created to add support for hypervisor-less virtio, but they are very open to hypervisors too. It could be a good place to add a Xen implementation, a KVM fatqueue implementation, a Jailhouse implementation, etc. -- work together toward the common goal of a single BE source (not binary) supporting multiple different deployment models.

I have my reservations on using "kvmtool" for any development here. "kvmtool" can't be used in products; it is just a tool for developers. The benefit of solving the problem w/ rust-vmm is that some of the crates from that project can be utilized in a real product. Alex has mentioned that "rust-vmm" today has some KVM-specific bits, but the rust-vmm community is already discussing removing or reorganizing them in such a way that other hypervisors can fit in. Microsoft has a Hyper-V implementation w/ cloud-hypervisor which uses some of the rust-vmm components as well, and they have shown interest in adding Hyper-V support to the "rust-vmm" project too. I don't know the current progress, but they have proven it in the "cloud-hypervisor" project. The "rust-vmm" project's license will work for most of the project developments as well, and I see that "CrosVM" is shipping in products too.

---Trilok Soni
Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends
Hi Stefano,

On 9/14/2021 8:29 PM, Stefano Stabellini wrote:
> On Tue, 14 Sep 2021, Trilok Soni wrote:
> [ snip ]
> Most things in open source start as a developers tool before they become part of a product :)

Agree, but I had offline discussions with one of the active developers of kvmtool, and the confidence in using it in a product was nowhere near what we expected during our evaluation. The same goes for QEMU, where one of the biggest problems was the number of security issues against its huge codebase.

> I am concerned about how "embeddable" rust-vmm is going to be. Do you think it would be possible to run it against an RTOS together with other apps written in C?

I don't see any limitations in rust-vmm. For example, I am confident that we can port a rust-vmm based backend to QNX as the host OS, and the same goes w/ Zephyr as well. Some work is needed, but nothing fundamentally blocks it. We should be able to run it w/ Fuchsia as well with some effort.

---Trilok Soni
Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity
On 5/5/2023 8:20 AM, Mickaël Salaün wrote:
> Hi,
>
> This patch series is a proof-of-concept that implements new KVM features (extended page tracking, MBEC support, CR pinning) and defines a new API to protect guest VMs. No VMM (e.g., Qemu) modification is required. The main idea being that kernel self-protection mechanisms should be delegated to a more privileged part of the system, hence the hypervisor. It is still the role of the guest kernel to request such restrictions according to its

Only for the guest kernel images here? Why not for the host OS kernel? The embedded devices w/ Android that you have mentioned below seem to support the host OS as well, right?

Do we suggest that all the functionality should be implemented in the hypervisor (NS-EL2 for ARM), or even at a Secure EL like Secure-EL1 (ARM)? I am hoping that whatever interface we suggest here from the guest to the hypervisor becomes the ABI, right?

> # Current limitations
>
> The main limitation of this patch series is the statically enforced permissions. This is not an issue for kernels without modules, but this needs to be addressed. Mechanisms that dynamically impact kernel executable memory are not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), and such code will need to be authenticated. Because the hypervisor is highly privileged and critical to the security of all the VMs, we don't want to implement a code authentication mechanism in the hypervisor itself but delegate this verification to something much less privileged. We are thinking of two ways to solve this: implement this verification in the VMM or spawn a dedicated special VM (similar to Windows's VBS). There are pros and cons to each approach: complexity, verification code ownership (guest's or VMM's), and access to guest memory (i.e., confidential computing).

Do you foresee performance regressions due to the amount of tracking here?
Production kernels do have a lot of tracepoints, and we use them as a feature in the GKI kernel for the vendor-hooks implementation; in those cases every vendor driver is a module. Doesn't a separate VM further fragment this design and delegate more of it to proprietary solutions?

Do you have any performance numbers w/ the current RFC?

---Trilok Soni
Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity
On 5/24/2023 3:20 PM, Edgecombe, Rick P wrote:
> On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote:
>> # How does it work?
>>
>> This implementation mainly leverages KVM capabilities to control the Second Layer Address Translation (or the Two Dimensional Paging, e.g., Intel's EPT or AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC) introduced with the Kaby Lake (7th generation) architecture. This allows setting permissions on memory pages in a way complementary to the guest kernel's managed memory permissions. Once these permissions are set, they are locked and there is no way back.
>>
>> A first KVM_HC_LOCK_MEM_PAGE_RANGES hypercall enables the guest kernel to lock a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE or the HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to a specific set of pages (allow-list approach), and the second only allows kernel execution for a set of pages (deny-list approach). The current implementation sets the whole kernel's .rodata (i.e., any const or __ro_after_init variables, which include critical security data such as LSM parameters) and .text sections as non-writable, and the .text section is the only one where kernel execution is allowed. This is possible thanks to the new MBEC support also brought by this series (otherwise the vDSO would have to be executable). Thanks to this hardware support (VT-x, EPT and MBEC), the performance impact of such guest protection is negligible.
>>
>> The second KVM_HC_LOCK_CR_UPDATE hypercall enables guests to pin some of their CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP), which is another complementary hardening mechanism.
>>
>> Heki can be enabled with the heki=1 boot command argument.
>
> Can the guest kernel ask the host VMM's emulated devices to DMA into the protected data? It should go through the host userspace mappings I think, which don't care about EPT permissions. Or did I miss where you are protecting that another way?
> There are a lot of easy ways to ask the host to write to guest memory that don't involve the EPT. You probably need to protect the host userspace mappings, and also the places in KVM that kmap a GPA provided by the guest.
>
> [ snip ]
>
>> [ snip: "# Current limitations", quoted in full earlier in the thread ]
>
> The kernel often creates writable aliases in order to write to protected data (kernel text, etc). Some of this is done right as text is being first written out (alternatives, for example), and some happens way later (jump labels, etc). So for verification, I wonder at what stage you would be verifying? If you want to verify the end state, you would have to maintain knowledge in the verifier of all the touch-ups the kernel does. I think it would get very tricky.

Right, and on ARM (from what I know) erratas can be applied using the alternatives framework when you hotplug a CPU post boot.

---Trilok Soni
Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity
On 5/25/2023 6:25 AM, Mickaël Salaün wrote: On 24/05/2023 23:04, Trilok Soni wrote: On 5/5/2023 8:20 AM, Mickaël Salaün wrote: Hi, This patch series is a proof-of-concept that implements new KVM features (extended page tracking, MBEC support, CR pinning) and defines a new API to protect guest VMs. No VMM (e.g., Qemu) modification is required. The main idea being that kernel self-protection mechanisms should be delegated to a more privileged part of the system, hence the hypervisor. It is still the role of the guest kernel to request such restrictions according to its Only for the guest kernel images here? Why not for the host OS kernel? As explained in the Future work section, protecting the host would be useful, but that doesn't really fit with the KVM model. The Protected KVM project is a first step to help in this direction [11]. In a nutshell, KVM is close to a type-2 hypervisor, and the host kernel is also part of the hypervisor. Embedded devices w/ Android you have mentioned below supports the host OS as well it seems, right? What do you mean? I think you have answered this above w/ pKVM and I was referring the host protection as well w/ Heki. The link/references below refers to the Android OS it seems and not guest VM. Do we suggest that all the functionalities should be implemented in the Hypervisor (NS-EL2 for ARM) or even at Secure EL like Secure-EL1 (ARM). KVM runs in EL2. TrustZone is mainly used to enforce DRM, which means that we may not control the related code. This patch series is dedicated to hypervisor-enforced kernel integrity, then KVM. I am hoping that whatever we suggest the interface here from the Guest to the Hypervisor becomes the ABI right? Yes, hypercalls are part of the KVM ABI. Sure. I just hope that they are extensible enough to support for other Hypervisors too. I am not sure if they are on this list like ACRN / Xen and see if it fits their need too. Is there any other Hypervisor you plan to test this feature as well? 
>>> [ snip: "# Current limitations", quoted in full earlier in the thread ]
>>
>> Do you foresee performance regressions due to the amount of tracking here?
>
> The performance impact of execution prevention should be negligible because, once configured, the hypervisor does nothing except catch illegitimate access attempts.

Yes, if you are using a static kernel only and not considering the other dynamic-patching features explained above. They need to be thought about differently to reduce the likely impact.

>> Production kernels do have a lot of tracepoints, and we use them as a feature in the GKI kernel for the vendor-hooks implementation; in those cases every vendor driver is a module.
>
> As explained in this section, dynamic kernel modifications such as tracepoints or modules are not currently supported by this patch series. Handling tracepoints is possible but requires more work to define and check legitimate changes. This proposal is still useful for static kernels, though.
>
>> Doesn't a separate VM further fragment this design and delegate more of it to proprietary solutions?
>
> What do you mean? KVM is not a proprietary solution.

Ah, I was referring to the VBS Windows VM mentioned in the text above.
Is it open-source? The reference to a VM (or dedicated VM) didn't mention that the VM itself would be open-source, running a Linux kernel.

> For dynamic checks, this would require code not run by KVM itself, but either by the VMM or a dedicated VM. In this case, the dynamic authentication code could come from the guest VM or from the VMM itself. In the former case, it is more challenging from a security point of view but doesn't rely on an external (proprietary) solution. In the latter case, open-source VMMs should implement the specification to provide the required service (e.g., checking kernel module signatures). The goal of the common API layer provided by this RFC is to share code as much as possible between different hypervisor backends.
>
>> Do you have any performance numbers w/ the current RFC?
>
> No, but the only hypervisor performance impact is at boot time and