From: Igor Kotrasinski <i.kotrasi...@partner.samsung.com>

Signed-off-by: Igor Kotrasinski <i.kotrasi...@partner.samsung.com>
---
 MAINTAINERS                   |   5 ++
 docs/specs/memexpose-spec.txt | 168 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 173 insertions(+)
 create mode 100644 docs/specs/memexpose-spec.txt
diff --git a/MAINTAINERS b/MAINTAINERS
index 1f0bc72..73dd571 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1639,6 +1639,11 @@ F: hw/virtio/virtio-crypto.c
 F: hw/virtio/virtio-crypto-pci.c
 F: include/hw/virtio/virtio-crypto.h
 
+memexpose
+M: Igor Kotrasinski <i.kotrasi...@partner.samsung.com>
+S: Maintained
+F: docs/specs/memexpose-spec.txt
+
 nvme
 M: Keith Busch <keith.bu...@intel.com>
 L: qemu-bl...@nongnu.org
diff --git a/docs/specs/memexpose-spec.txt b/docs/specs/memexpose-spec.txt
new file mode 100644
index 0000000..6266149
--- /dev/null
+++ b/docs/specs/memexpose-spec.txt
@@ -0,0 +1,168 @@
+= Specification for Inter-VM memory region sharing device =
+
+The inter-VM memory region sharing device (memexpose) is designed to allow two
+QEMU devices to share arbitrary physical memory regions with one another, as
+well as to pass simple interrupts. It attempts to share memory regions directly
+when feasible, falling back to MMIO via socket communication when it's not.
+
+The device is modeled by QEMU as a PCI device, as well as a memory
+region/interrupt directly usable on platforms like ARM, with an entry in the
+device tree.
+
+An example use case for memexpose is forwarding ARM TrustZone functionality
+between two VMs running different architectures - one running a rich OS on an
+x86_64 VM, the other running the trusted OS on an ARM VM. In this scenario,
+sharing arbitrary memory regions allows such forwarding to work with minimal
+changes to the trusted OS.
+
+
+== Configuring the memexpose device ==
+
+The device uses two character devices to communicate with the other VM - one
+for synchronous memory accesses, the other for passing interrupts. A typical
+configuration of the PCI device looks like this:
+
+  -chardev socket,...,path=/tmp/qemu-memexpose-mem,id="mem" \
+  -chardev socket,...,path=/tmp/qemu-memexpose-intr,id="intr" \
+  -device memexpose-pci,mem_chardev="mem",intr_chardev="intr",shm_size=0xN...
+
+while the arm-virt machine device can be enabled like this:
+
+  -chardev socket,...,path=/tmp/qemu-memexpose-mem,id="mem-mem" \
+  -chardev socket,...,path=/tmp/qemu-memexpose-intr,id="mem-intr" \
+  -machine memexpose-ep=mem,memexpose-size=0xN...
+
+Normally one of the VMs would have the 'server,nowait' options set on these
+chardevs.
+
+At the moment the memory exposed to the other device always starts at 0
+(relative to system_memory). The shm_size/memexpose-size property indicates the
+size of the exposed region.
+
+The *_chardev/memexpose-ep properties point the memexpose device to the
+chardevs used to communicate with the other VM.
+
+
+== Memexpose PCI device interface ==
+
+The device has vendor ID 1af4, device ID 1111, revision 0.
+
+=== PCI BARs ===
+
+The device has two BARs:
+- BAR0 holds device registers and interrupt data (a 0x1000-byte MMIO region),
+- BAR1 maps memory from the other VM.
+
+To use the device, you must first enable it by writing 1 to BAR0 at address 0.
+This makes QEMU wait for the other VM to connect. Once that is done, you can
+access the other machine's memory via BAR1.
+
+Interrupts can be sent and received by configuring the device for interrupts
+and by reading and writing the registers in BAR0.
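+
+As a purely illustrative example, a minimal guest-side sketch of the enable
+handshake described above, followed by a read through BAR1, might look as
+follows. It is not part of QEMU or of any guest OS API: map_bar() stands in
+for whatever BAR-mapping facility the guest provides (e.g. pci_iomap() on
+Linux), and all names are hypothetical.
+
+  #include <stdint.h>
+
+  #define MEMEXPOSE_REG_ENABLE  0x0       /* BAR0 offset 0, bit 0: enable */
+
+  extern volatile void *map_bar(int bar); /* placeholder, guest-specific  */
+
+  static volatile uint64_t *bar0;         /* device registers, 0x1000 bytes */
+  static volatile uint8_t  *bar1;         /* window into the other VM's RAM */
+
+  void memexpose_enable(void)
+  {
+      bar0 = map_bar(0);
+      bar1 = map_bar(1);
+      /* Writing 1 to BAR0 at address 0 enables the device; QEMU then waits
+       * for the other VM to connect before BAR1 accesses are meaningful. */
+      bar0[MEMEXPOSE_REG_ENABLE / 8] = 1;
+  }
+
+  /* Once both VMs are connected, the other machine's memory (starting at
+   * its address 0) can be accessed through BAR1: */
+  uint32_t memexpose_read32(uint64_t offset)
+  {
+      return *(volatile uint32_t *)(bar1 + offset);
+  }
+
+The register layout used beyond offset 0 is described in the next section.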
+
+=== Device registers ===
+
+BAR0 has the following registers:
+
+  Offset  Size  Access      On reset  Function
+  0       8     read/write  0         Enable/disable device
+                                        bit 0: device enabled / disabled
+                                        bits 1..63: reserved
+  0x400   8     read/write  0         Interrupt RX address
+                                        bit 0: interrupt read
+                                        bits 1..63: reserved
+  0x408   8     read-only   UD        RX interrupt type
+  0x410   128   read-only   UD        RX interrupt data
+  0x800   8     read/write  0         Interrupt TX address
+  0x808   8     write-only  N/A       TX interrupt type
+  0x810   128   write-only  N/A       TX interrupt data
+
+All other addresses are reserved.
+
+=== Handling interrupts ===
+
+To send interrupts, write to the TX interrupt address. The contents of the TX
+interrupt type and data registers are sent along with the interrupt. The
+device holds an internal queue of 16 interrupts; any extra interrupts are
+silently dropped.
+
+To receive interrupts, read the interrupt RX address. If the value is 1, the
+RX interrupt type and data registers contain the type and data sent by the
+other VM. Otherwise (the value is 0), no more interrupts are queued and the RX
+interrupt type/data register contents are undefined.
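+
+As a purely illustrative sketch of the send and receive flows above (not part
+of QEMU or of any guest OS API), a guest driver might use the BAR0 registers
+like this. bar0 is the mapped register window from the earlier sketch, the
+value written to the TX address is arbitrary (only the write itself matters
+here), and all names are hypothetical.
+
+  #include <stdint.h>
+  #include <string.h>
+
+  #define REG_INTR_RX_ADDR  0x400   /* reads 1 while an interrupt is queued */
+  #define REG_INTR_RX_TYPE  0x408
+  #define REG_INTR_RX_DATA  0x410
+  #define REG_INTR_TX_ADDR  0x800
+  #define REG_INTR_TX_TYPE  0x808
+  #define REG_INTR_TX_DATA  0x810
+
+  extern volatile uint64_t *bar0;   /* mapped BAR0, as in the earlier sketch */
+
+  void memexpose_send_intr(uint64_t type, const uint8_t data[128])
+  {
+      /* Fill in TX type and data first; both are sent with the interrupt. */
+      bar0[REG_INTR_TX_TYPE / 8] = type;
+      memcpy((void *)&bar0[REG_INTR_TX_DATA / 8], data, 128);
+      /* Writing the TX interrupt address triggers the send.  The device
+       * queues at most 16 interrupts; extra ones are dropped. */
+      bar0[REG_INTR_TX_ADDR / 8] = 1;
+  }
+
+  /* Returns 1 and fills type/data if an interrupt was queued, 0 otherwise. */
+  int memexpose_poll_intr(uint64_t *type, uint8_t data[128])
+  {
+      if (bar0[REG_INTR_RX_ADDR / 8] != 1)
+          return 0;   /* nothing queued; RX type/data are undefined */
+      *type = bar0[REG_INTR_RX_TYPE / 8];
+      memcpy(data, (const void *)&bar0[REG_INTR_RX_DATA / 8], 128);
+      return 1;
+  }
+
+A real guest driver would typically hook the device's interrupt rather than
+poll the RX address as this sketch does.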
+
+
+=== Platform device protocol ===
+
+The other memexpose device type (provided on e.g. ARM via the device tree) is
+essentially identical to the PCI device. It provides two memory ranges that
+work exactly like the PCI BAR regions, and an interrupt for signaling
+interrupts from the other VM.
+
+== Memexpose peer protocol ==
+
+This section describes the current memexpose protocol. It is a WIP and likely
+to change.
+
+A connection between two VMs connected via memexpose happens on two sockets -
+an interrupt socket and a memory socket. All communication on the former is
+asynchronous, while communication on the latter is synchronous.
+
+When the device is enabled, QEMU waits for memexpose's chardevs to connect. No
+messages are exchanged upon connection. After the devices are connected, the
+following messages can be exchanged:
+
+1. Interrupt message, via the interrupt socket. This message contains the
+   interrupt type and data.
+
+2. Memory access request message, via the memory socket. It contains a target
+   address, the access size and, in case of writes, the value to write.
+
+3. Memory access return message. This contains an access result (as
+   MemTxResult) and, in case of reads, a value. If the accessed region can be
+   shared directly, then this region's start, size and shmem file descriptor
+   are also sent.
+
+4. Memory invalidation message. This is sent when a VM's memory region changes
+   status and contains that region's start and size. The other VM is expected
+   to drop any shared regions overlapping with it.
+
+5. Memory invalidation response. This is sent in response to a memory
+   invalidation message; after receiving it, the remote VM is guaranteed to
+   have scheduled the region's invalidation before accessing the region again.
+
+Because QEMU performs memory accesses synchronously, memory invalidation must
+be performed before returning to the guest OS, and both VMs might try to
+perform a remote memory access at the same time. For these reasons, all
+messages passed via the memory socket have an associated priority.
+
+At any time, only one message with a given priority is in flight. After
+sending a message, the VM reads messages on the memory socket, servicing all
+messages with a priority higher than its own. Once it receives a message with
+a priority lower than its own, it waits for a response to its own message
+before servicing it. This guarantees no deadlocks, assuming that messages
+don't trigger further messages.
+
+Message priorities, from highest to lowest, are as follows:
+
+1. Memory invalidation message/response.
+2. Memory access message/response.
+
+Additionally, one of the VMs is assigned a sub-priority higher than the
+other's, so that its messages of the same type take priority over the other
+VM's messages.
+
+Memory access messages have the lowest priority in order to guarantee that
+QEMU will not attempt to access memory while in the middle of a memory region
+listener.
+
+=== Memexpose memory sharing ===
+
+This section describes the memexpose memory sharing mechanism.
+
+Memory sharing is implemented lazily; initially no memory regions are shared
+between the devices. When a memory access is performed via the memory socket,
+the remote VM checks whether the underlying memory range is backed by
+shareable memory. If it is, the VM finds the maximum contiguous flat range
+backed by this memory and sends its file descriptor to the local VM, where it
+is mapped as a subregion.
+
+The memexpose device registers memory listeners for the memory region it's
+using. Whenever a flat range for this region (that is not this device's
+subregion) changes, that range is sent to the other VM and any directly shared
+memory region intersecting this range is scheduled for removal via a BH.
-- 
2.7.4