Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends

2021-09-15 Thread Trilok Soni



Hello,

On 9/13/2021 4:51 PM, Stefano Stabellini via Stratos-dev wrote:

On Mon, 6 Sep 2021, AKASHI Takahiro wrote:

the second is how many context switches are involved in a transaction.
Of course with all things there is a trade off. Things involving the
very tightest latency would probably opt for a bare metal backend which
I think would imply hypervisor knowledge in the backend binary.


In the configuration phase of a virtio device, latency won't be a big
issue. In device operations (i.e. read/write to block devices), if we can
resolve the 'mmap' issue, as Oleksandr is proposing right now, the only
remaining question is how efficiently we can deliver notifications to the
opposite side. Right? And this is a very common problem whatever approach
we take.

Anyhow, if we do care about latency in my approach, most of the
virtio-proxy-related code can be re-implemented just as a stub (or shim?)
library, since the protocols are defined as RPCs.
In this case, however, we would lose the benefit of providing a "single
binary" BE.
(I know this is an arguable requirement, though.)


In my experience, latency, performance, and security are far more
important than providing a single binary.

In my opinion, we should optimize for the best performance and security,
then be practical on the topic of hypervisor agnosticism. For instance,
a shared source with a small hypervisor-specific component, with one
implementation of the small component for each hypervisor, would provide
a good enough hypervisor abstraction. It is good to be hypervisor
agnostic, but I wouldn't go to extra lengths to have a single binary. I
cannot picture a case where a BE binary needs to be moved between
different hypervisors and a recompilation is impossible (BE, not FE).
Instead, I can definitely imagine detailed requirements such as IRQ latency
having to be lower than 10us or bandwidth higher than 500 MB/sec.
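
To make that concrete, a minimal sketch of such a split could look like the
C below. The names (hv_backend_ops and friends) are invented for
illustration and are not from any existing project; the point is only that
the generic BE code is written against a tiny ops table and never includes
a hypervisor header, while each hypervisor ships one small implementation
of that table.

#include <stddef.h>
#include <stdint.h>

/* Small hypervisor-specific surface: one implementation per hypervisor
 * (Xen, KVM, Jailhouse, ...), selected at build or run time. */
struct hv_backend_ops {
    /* Map a guest memory range into the backend's address space. */
    void *(*map_guest)(uint64_t guest_addr, size_t len);
    void  (*unmap_guest)(void *va, size_t len);
    /* Deliver a notification (interrupt/event) to the frontend. */
    int   (*notify_frontend)(uint32_t queue_idx);
    /* Block until the frontend kicks the given queue. */
    int   (*wait_for_kick)(uint32_t queue_idx);
};

/* Generic, hypervisor-agnostic virtqueue service loop. */
static void be_run(const struct hv_backend_ops *ops, uint32_t queue_idx)
{
    while (ops->wait_for_kick(queue_idx) == 0) {
        /* ... pop descriptors, perform the I/O, push used entries ... */
        ops->notify_frontend(queue_idx);
    }
}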

Instead of virtio-proxy, my suggestion is to work together on a common
project and common source with others interested in the same problem.

I would pick something like kvmtool as a basis. It doesn't have to be
kvmtool, and kvmtool specifically is GPL-licensed, which is
unfortunate because it would help if the license was BSD-style for ease
of integration with Zephyr and other RTOSes.

As long as the project is open to working together on multiple
hypervisors and deployment models then it is fine. For instance, the
shared source could be based on OpenAMP kvmtool [1] (the original
kvmtool likely prefers to stay small and narrow-focused on KVM). OpenAMP
kvmtool was created to add support for hypervisor-less virtio but they
are very open to hypervisors too. It could be a good place to add a Xen
implementation, a KVM fatqueue implementation, a Jailhouse
implementation, etc. -- work together toward the common goal of a single
BE source (not binary) supporting multiple different deployment models.


I have my reservations about using "kvmtool" for any development here. 
"kvmtool" can't be used in products; it is just a tool for 
developers.


The benefit of solving the problem w/ rust-vmm is that some of the 
crates from this project can be utilized in a real product. Alex has 
mentioned that "rust-vmm" today has some KVM-specific bits, but the 
rust-vmm community is already discussing how to remove or reorganize them 
so that other hypervisors can fit in.


Microsoft has a Hyper-V implementation w/ cloud-hypervisor which uses some 
of the rust-vmm components as well, and they have shown interest in adding 
Hyper-V support to the "rust-vmm" project too. I don't know the 
current progress, but they have proven it in the "cloud-hypervisor" project.


The "rust-vmm" project's license will work well for most product 
development, and I see that "CrosVM" is shipping in products as well.



---Trilok Soni




Re: [Stratos-dev] Enabling hypervisor agnosticism for VirtIO backends

2021-09-15 Thread Trilok Soni

Hi Stefano,

On 9/14/2021 8:29 PM, Stefano Stabellini wrote:

On Tue, 14 Sep 2021, Trilok Soni wrote:

[ snip ]


I have my reservations about using "kvmtool" for any development here.
"kvmtool" can't be used in products; it is just a tool for
developers.

The benefit of solving the problem w/ rust-vmm is that some of the crates from
this project can be utilized in a real product. Alex has mentioned that
"rust-vmm" today has some KVM-specific bits, but the rust-vmm community is
already discussing how to remove or reorganize them so that other
hypervisors can fit in.

Microsoft has a Hyper-V implementation w/ cloud-hypervisor which uses some of
the rust-vmm components as well, and they have shown interest in adding Hyper-V
support to the "rust-vmm" project too. I don't know the current progress,
but they have proven it in the "cloud-hypervisor" project.

The "rust-vmm" project's license will work well for most product
development, and I see that "CrosVM" is shipping in products as well.


Most things in open source start as a developer's tool before they become
part of a product :)


Agreed, but I had an offline discussion with one of the active developers of 
kvmtool, and the confidence in using it in a product was nowhere near what 
we expected during our evaluation. The same goes for QEMU, where one of the 
biggest problems was the number of security issues filed against its huge 
codebase.




I am concerned about how "embeddable" rust-vmm is going to be. Do you
think it would be possible to run it against an RTOS together with other
apps written in C?


I don't see any such limitations with rust-vmm. For example, I am confident 
that we can port a rust-vmm based backend to QNX as the host OS, and the same 
goes w/ Zephyr. Some work is needed, but nothing fundamentally 
blocks it. We should be able to run it w/ Fuchsia as well with some 
effort.
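
For what it's worth, "embeddable" in that sense mostly means exposing a
C-callable surface from the Rust code. A purely hypothetical sketch of the
header an RTOS application written in C might consume is below; none of
these symbols exist in rust-vmm today, they only illustrate the shape of a
backend built as a Rust static library with extern "C" entry points.

#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

/* Bind a block-device backend instance to a shared-memory window and a
 * notification handle provided by the host environment (all illustrative). */
int  vblk_be_init(uint64_t shm_base, uint64_t shm_size, int notify_handle);

/* Called from the RTOS event loop whenever the frontend kicks a queue. */
int  vblk_be_handle_kick(uint32_t queue_idx);

void vblk_be_shutdown(void);

#ifdef __cplusplus
}
#endif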


---Trilok Soni



Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-24 Thread Trilok Soni

On 5/5/2023 8:20 AM, Mickaël Salaün wrote:

Hi,

This patch series is a proof-of-concept that implements new KVM features
(extended page tracking, MBEC support, CR pinning) and defines a new API to
protect guest VMs. No VMM (e.g., Qemu) modification is required.

The main idea being that kernel self-protection mechanisms should be delegated
to a more privileged part of the system, hence the hypervisor. It is still the
role of the guest kernel to request such restrictions according to its


Only for guest kernel images here? Why not for the host OS kernel? 
The embedded devices w/ Android that you mention below seem to support the 
host OS as well, right?


Are we suggesting that all of this functionality should be implemented in the 
hypervisor (NS-EL2 for ARM), or even at a Secure EL such as Secure-EL1 (ARM)?


I am hoping that whatever guest-to-hypervisor interface we propose here 
becomes the ABI, right?





# Current limitations

The main limitation of this patch series is the statically enforced
permissions. This is not an issue for kernels without modules, but this needs
to be addressed.  Mechanisms that dynamically impact kernel executable memory
are not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), and such
code will need to be authenticated.  Because the hypervisor is highly
privileged and critical to the security of all the VMs, we don't want to
implement a code authentication mechanism in the hypervisor itself but delegate
this verification to something much less privileged. We are thinking of two
ways to solve this: implement this verification in the VMM or spawn a dedicated
special VM (similar to Windows's VBS). There are pros and cons to each approach:
complexity, verification code ownership (guest's or VMM's), access to guest
memory (i.e., confidential computing).


Do you foresee performance regressions due to all of the tracking here? 
Production kernels do have a lot of tracepoints, and we use them as a feature 
in the GKI kernel for the vendor-hooks implementation, where every vendor 
driver is a module. Doesn't a separate VM further fragment this design and 
delegate more of it to proprietary solutions?


Do you have any performance numbers w/ the current RFC?

---Trilok Soni



Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-24 Thread Trilok Soni

On 5/24/2023 3:20 PM, Edgecombe, Rick P wrote:

On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote:

# How does it work?

This implementation mainly leverages KVM capabilities to control the Second
Layer Address Translation (or Two Dimensional Paging, e.g., Intel's EPT or
AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC), introduced with
the Kaby Lake (7th generation) architecture. This allows setting permissions on
memory pages in a way that complements the guest kernel's own managed memory
permissions. Once these permissions are set, they are locked and there is no
way back.

A first KVM_HC_LOCK_MEM_PAGE_RANGES hypercall enables the guest kernel to lock
a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE or the
HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to a specific
set of pages (allow-list approach), and the second only allows kernel execution
for a set of pages (deny-list approach).

The current implementation sets the whole kernel's .rodata (i.e., any const or
__ro_after_init variables, which includes critical security data such as LSM
parameters) and .text sections as non-writable, and the .text section is the
only one where kernel execution is allowed. This is possible thanks to the new
MBEC support also brought by this series (otherwise the vDSO would have to be
executable). Thanks to this hardware support (VT-x, EPT and MBEC), the
performance impact of such guest protection is negligible.

The second KVM_HC_LOCK_CR_UPDATE hypercall enables guests to pin some of their
CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP, X86_CR4_SMAP),
which is another complementary hardening mechanism.

Heki can be enabled with the heki=1 boot command argument.
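
To make the call flow concrete, a guest-side sketch of the first hypercall
could look roughly like the C below. The hypercall and attribute names come
from the series itself, but the register-level argument encoding (start GFN,
page count, attribute mask) is an assumption made for readability, not the
series' actual ABI, and the attribute values shown are placeholders.

/* Guest kernel side, illustrative only. */
#include <linux/kvm_para.h>

#define HEKI_ATTR_MEM_NOWRITE  (1UL << 0)  /* deny writes to the range    */
#define HEKI_ATTR_MEM_EXEC     (1UL << 1)  /* only this range may execute */

static long heki_lock_range(unsigned long start_gfn, unsigned long nr_pages,
                            unsigned long attr)
{
    /* KVM_HC_LOCK_MEM_PAGE_RANGES is introduced by the patch series. */
    return kvm_hypercall3(KVM_HC_LOCK_MEM_PAGE_RANGES,
                          start_gfn, nr_pages, attr);
}

/* e.g.: lock .rodata as non-writable and make .text the only executable
 * range:
 *   heki_lock_range(rodata_gfn, rodata_pages, HEKI_ATTR_MEM_NOWRITE);
 *   heki_lock_range(text_gfn,   text_pages,   HEKI_ATTR_MEM_EXEC);
 */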




Can the guest kernel ask the host VMM's emulated devices to DMA into
the protected data? It should go through the host userspace mappings I
think, which don't care about EPT permissions. Or did I miss where you
are protecting that another way? There are a lot of easy ways to ask
the host to write to guest memory that don't involve the EPT. You
probably need to protect the host userspace mappings, and also the
places in KVM that kmap a GPA provided by the guest.

[ snip ]



# Current limitations

The main limitation of this patch series is the statically enforced
permissions. This is not an issue for kernels without modules, but this needs
to be addressed.  Mechanisms that dynamically impact kernel executable memory
are not handled for now (e.g., kernel modules, tracepoints, eBPF JIT), and such
code will need to be authenticated.  Because the hypervisor is highly
privileged and critical to the security of all the VMs, we don't want to
implement a code authentication mechanism in the hypervisor itself but delegate
this verification to something much less privileged. We are thinking of two
ways to solve this: implement this verification in the VMM or spawn a dedicated
special VM (similar to Windows's VBS). There are pros and cons to each approach:
complexity, verification code ownership (guest's or VMM's), access to guest
memory (i.e., confidential computing).


The kernel often creates writable aliases in order to write to
protected data (kernel text, etc). Some of this is done right as text
is being first written out (alternatives for example), and some happens
way later (jump labels, etc). So for verification, I wonder what stage
you would be verifying? If you want to verify the end state, you would
have to maintain knowledge in the verifier of all the touch-ups the
kernel does. I think it would get very tricky.


Right, and on ARM (from what I know) errata can be applied using the
alternatives framework when you hotplug a CPU post-boot.

---Trilok Soni



Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-25 Thread Trilok Soni

On 5/25/2023 6:25 AM, Mickaël Salaün wrote:


On 24/05/2023 23:04, Trilok Soni wrote:

[ snip ]


Only for guest kernel images here? Why not for the host OS kernel?


As explained in the Future work section, protecting the host would be 
useful, but that doesn't really fit with the KVM model. The Protected 
KVM project is a first step to help in this direction [11].


In a nutshell, KVM is close to a type-2 hypervisor, and the host kernel 
is also part of the hypervisor.




The embedded devices w/ Android that you mention below seem to support the host
OS as well, right?


What do you mean?


I think you have answered this above w/ pKVM; I was referring to host 
protection w/ Heki as well. The links/references below seem to refer to the 
Android OS and not a guest VM.







Are we suggesting that all of this functionality should be implemented in the
hypervisor (NS-EL2 for ARM), or even at a Secure EL such as Secure-EL1 (ARM)?


KVM runs in EL2. TrustZone is mainly used to enforce DRM, which means 
that we may not control the related code.


This patch series is dedicated to hypervisor-enforced kernel integrity, 
hence KVM.




I am hoping that whatever guest-to-hypervisor interface we propose here
becomes the ABI, right?


Yes, hypercalls are part of the KVM ABI.


Sure. I just hope that they are extensible enough to support other 
hypervisors too. I am not sure whether the other hypervisors, like ACRN / Xen, 
are represented on this list to see if it fits their needs too.


Is there any other hypervisor on which you plan to test this feature?








[ snip ]


Do you foresee performance regressions due to all of the tracking here?


The performance impact of execution prevention should be negligible 
because, once configured, the hypervisor does nothing except catch 
illegitimate access attempts.


Yes, if you are using a static kernel only and not considering the 
other dynamic patching features, as explained. Those need to be thought 
through differently to reduce the likely impact.






Production kernels do have a lot of tracepoints, and we use them as a feature
in the GKI kernel for the vendor-hooks implementation, where every vendor
driver is a module.


As explained in this section, dynamic kernel modifications such as 
tracepoints or modules are not currently supported by this patch series. 
Handling tracepoints is possible but requires more work to define and 
check legitimate changes. This proposal is still useful for static 
kernels though.




Doesn't a separate VM further fragment this
design and delegate more of it to proprietary solutions?


What do you mean? KVM is not a proprietary solution.


Ah, I was referring to the Windows VBS VM mentioned in the text above. Is 
it open source? The reference to a VM (or dedicated VM) didn't mention 
that the VM itself would be open source and running a Linux kernel.




For dynamic checks, this would require code not run by KVM itself, but 
either the VMM or a dedicated VM. In this case, the dynamic 
authentication code could come from the guest VM or from the VMM itself. 
In the former case, it is more challenging from a security point of view 
but doesn't rely on an external (proprietary) solution. In the latter case, 
open-source VMMs should implement the specification to provide the 
required service (e.g., checking kernel module signatures).


The goal of the common API layer provided by this RFC is to share code 
as much as possible between different hypervisor backends.





Do you have any performance numbers w/ the current RFC?


No, but the only hypervisor performance impact is at boot time and