From: Manish Honap <[email protected]>

Capture the ownership model, bind sequence, region layout, and the
DVSEC + HDM + CM cap-array virtualization contract for vfio-pci Type-2
device passthrough in Documentation/driver-api/vfio-pci-cxl.rst.
cxl-core owns the CXL register virtualization through
devm_cxl_passthrough_create() and the cxl_passthrough_*_rw() helpers;
vfio-pci is a transport that forwards guest reads and writes through
them. The HDM HPA range is mapped by vfio for the mmappable HDM region.
Topology constraints and host-bridge decoder limitations are listed
under Known limitations.

Signed-off-by: Manish Honap <[email protected]>
---
 Documentation/driver-api/index.rst        |   1 +
 Documentation/driver-api/vfio-pci-cxl.rst | 282 ++++++++++++++++++++++
 2 files changed, 283 insertions(+)
 create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index eaf7161ff957..52f0c06a376a 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -47,6 +47,7 @@ of interest to most developers working on device drivers.
    vfio-mediated-device
    vfio
    vfio-pci-device-specific-driver-acceptance
+   vfio-pci-cxl
 
 Bus-level documentation
 =======================
diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
new file mode 100644
index 000000000000..1527b7dd85d0
--- /dev/null
+++ b/Documentation/driver-api/vfio-pci-cxl.rst
@@ -0,0 +1,282 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===========================================
+VFIO-PCI: CXL Type-2 device passthrough
+===========================================
+
+:Author: Manish Honap <[email protected]>
+
+Overview
+========
+
+vfio-pci-core, when built with ``CONFIG_VFIO_PCI_CXL=y``, passes a
+CXL Type-2 accelerator (CXL r4.0, HDM-D / HDM-DB) through to a KVM
+guest. The host firmware commits the endpoint's HDM decoder before
+vfio-pci binds; the guest sees a CXL Type-2 device whose CXL.mem
+range is already programmed and locked.
+The guest may inspect the HDM Decoder Capability block and DVSEC
+Device capability via spec-defined paths, and access the device's
+CXL.mem range as mmap'd memory.
+
+Scope
+=====
+
+The supported scope is intentionally narrow:
+
+* One CXL endpoint per host bridge.
+* The endpoint exposes exactly one HDM decoder (decoder 0).
+* No interleave.
+* Host firmware has committed the endpoint HDM decoder before
+  vfio-pci probes. Devices whose HDM decoder is *uncommitted* fail
+  vfio-pci bind cleanly.
+* The host bridge is in single-RP-passthrough mode (the CXL host
+  bridge's own HDM decoder is not used; CFMWS-to-RP decode flows
+  implicitly). This assumption is currently *not enforced* by
+  vfio-pci-core; it is a known limitation; see the Known
+  limitations section.
+
+Multi-decoder, interleave, FLR / reset state-machine integration,
+and host-bridge HDM decoder programming are explicitly out of scope.
+Each of them can be added later on top of the contract described
+below.
+
+Driver model
+============
+
+There is no dedicated ``vfio-cxl`` PCI driver. vfio-pci is the only
+driver that binds to the host PCI device. When built with
+``CONFIG_VFIO_PCI_CXL=y``, vfio-pci-core calls into the cxl subsystem
+to do five things at bind time:
+
+1. ``devm_cxl_dev_state_create()`` -- allocate per-device CXL state
+   embedded in ``struct vfio_pci_cxl_state``.
+2. ``cxl_pci_setup_regs()`` + ``cxl_get_hdm_info()`` -- probe the
+   Register Locator DVSEC and harvest the HDM block's BAR-relative
+   offset and size.
+3. ``cxl_await_range_active()`` -- wait for the firmware-committed
+   range to become live.
+4. ``devm_cxl_passthrough_create()`` -- snapshot the CXL Device DVSEC
+   body, the HDM Decoder block, and the CXL.cache/mem cap-array
+   prefix into shadows owned by cxl-core. All subsequent
+   register-virtualization happens inside ``drivers/cxl/core/passthrough.c``.
+5. ``devm_cxl_probe_mem()`` -- register a ``cxl_memdev``, enumerate
+   the endpoint port, and auto-attach the firmware-committed
+   region. cxl_mem binds to the memdev as it would for any other
+   Type-2 accelerator.
+
+Ownership split
+===============
+
+Each device-visible surface is owned by exactly one subsystem:
+
+============================================ ==============================================
+Surface                                      Owner
+============================================ ==============================================
+PCI config (non-DVSEC, non-CXL)              vfio-pci-core ``vconfig`` (existing perm-bits)
+CXL Device DVSEC body                        cxl-core ``cxl_passthrough_dvsec_rw()``
+HDM Decoder Capability block                 cxl-core ``cxl_passthrough_hdm_rw()``
+CM cap-array (read-only snapshot)            cxl-core ``cxl_passthrough_cm_rw()``
+``cxl_memdev`` / endpoint port / autoregion  cxl-core ``devm_cxl_probe_mem()``
+HDM HPA range mapping                        vfio-pci ``request_mem_region`` + ``memremap``
+Sparse mmap layout for the component BAR     vfio-pci
+============================================ ==============================================
+
+The vfio side holds no shadow buffer of its own. ``vfio_pci_cxl_state``
+caches small scalars (DVSEC offset/size, HDM offset/size, component
+BAR layout) for dispatch decisions; the actual virtualization
+semantics live in cxl-core.
+
+Bind sequence
+=============
+
+``vfio_pci_cxl_acquire()`` is called from
+``vfio_pci_core_register_device()`` at PCI bind time. The sequence::
+
+  0. devm_cxl_dev_state_create(parent, CXL_DEVTYPE_DEVMEM, dsn,
+                               dvsec_off, vfio_pci_cxl_state, cxlds,
+                               /*mbox=*/false)
+
+  1. pcie_is_cxl() and pci_find_dvsec_capability(CXL_DEVICE)
+       -> -ENODEV if either is absent
+       -> -ENODEV if the DVSEC's MEM_CAPABLE bit is clear
+
+  2. pci_enable_device_mem()
+
+  2a. cxl_pci_setup_regs(CXL_REGLOC_RBI_COMPONENT)
+  2b. cxl_get_hdm_info() -- REJECT hdm_count != 1 with -EOPNOTSUPP
+  2c. cxl_regblock_get_bar_info()
+  2d. cxl_await_range_active()
+  2e. devm_cxl_passthrough_create(&pdev->dev, &cxlds)
+
+  3. pci_disable_device()
+       Clears PCI_COMMAND_MASTER but NOT PCI_COMMAND_MEMORY (see
+       do_pci_disable_device() in drivers/pci/pci.c). Subsequent
+       MMIO from step 4 still succeeds.
+
+  4. devm_cxl_probe_mem(&cxlds, &hpa_range)
+       Registers the memdev, enumerates the endpoint port, attaches
+       the firmware-committed autoregion.
+
+  5. request_mem_region(hpa_base, hpa_size) + memremap_wb()
+
+  6. vdev->cxl = cxl (state published; HDM and COMP_REGS regions
+     are registered later when the VFIO fd is opened)
+
+Fail-closed semantics
+---------------------
+
+Three conditions are mapped to "not a CXL device; caller falls back to
+plain vfio-pci": ``pcie_is_cxl()`` false, DVSEC absent, ``MEM_CAPABLE``
+clear. All three return ``-ENODEV`` from
+``vfio_pci_cxl_acquire()``; the caller treats them as a silent
+fall-through.
+
+Any other negative errno from the bind sequence aborts the vfio-pci
+bind entirely. The guest never sees a half-initialised CXL device.
+Once ``devm_cxl_probe_mem()`` has succeeded the published memdev
+holds a pointer into the embedded ``cxl_dev_state``; a failure in
+``vfio_cxl_map_hdm()`` after that point cannot ``devm_kfree(cxl)``
+and leaves the state allocated for the lifetime of the PCI device
+(devres unwinds it at pdev removal).
+
+VFIO regions exposed
+====================
+
+When the VFIO fd is first opened, ``vfio_pci_cxl_open()`` registers
+two additional regions on top of the standard vfio-pci BARs / config
+region:
+
+HDM region (``VFIO_REGION_SUBTYPE_CXL``)
+  Mappable view of the device's firmware-committed HPA range.
+
+  * ``mmap``: fault handler does
+    ``vmf_insert_pfn(vma, addr, PHYS_PFN(hpa_base + off))``. The
+    guest gets the same backing physical memory the host sees.
+  * ``pread`` / ``pwrite``: served from the ``memremap_wb()`` kva
+    captured at bind time.
+
+COMP_REGS region (``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``)
+  Shadow of the CXL component register sub-range.
+  ``pread`` / ``pwrite`` only; ``mmap`` is intentionally not supported
+  (the VMM uses this region instead of mmapping the BAR). Dword-aligned
+  access only; sub-dword accesses return ``-EINVAL``.
+
+  Dispatch by offset:
+
+  ============================================ =================================
+  Offset range                                 cxl-core helper
+  ============================================ =================================
+  ``< CXL_CM_OFFSET``                          zero-fill (reserved)
+  ``CXL_CM_OFFSET .. hdm_reg_offset``          ``cxl_passthrough_cm_rw()``
+  ``hdm_reg_offset .. +hdm_reg_size``          ``cxl_passthrough_hdm_rw()``
+  ``>= hdm_reg_offset + hdm_reg_size``         zero-fill (reserved)
+  ============================================ =================================
+
+DVSEC virtualization contract
+=============================
+
+The CXL Device DVSEC body is reached through the standard PCI
+config-space path. ``vfio_pci_config_rw_single()`` clips chunks at
+the DVSEC body boundary via ``vfio_pci_cxl_config_boundary()`` and
+forwards body bytes to ``vfio_pci_cxl_config_rw()``, which in turn
+calls ``cxl_passthrough_dvsec_rw()``.
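The clipping step described above can be illustrated with a small user-space sketch. It assumes the DVSEC body is a half-open byte range ``[body_start, body_end)`` in config space and that an access is split so no chunk straddles either edge. ``clip_chunk()`` and ``chunk_in_body()`` are hypothetical names chosen for illustration; they are not the kernel's ``vfio_pci_cxl_config_boundary()``, whose actual signature and behaviour may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only): return how many bytes of the
 * access [pos, pos + count) may be handled as one chunk without crossing
 * an edge of the DVSEC body [body_start, body_end).
 */
static size_t clip_chunk(uint32_t pos, size_t count,
                         uint32_t body_start, uint32_t body_end)
{
        uint64_t end = (uint64_t)pos + count;

        assert(body_start <= body_end);

        if (count == 0)
                return 0;
        /* Access starts before the body and runs into it: stop at the start. */
        if (pos < body_start && end > body_start)
                return (size_t)(body_start - pos);
        /* Access starts inside the body and runs past it: stop at the end. */
        if (pos >= body_start && pos < body_end && end > body_end)
                return (size_t)(body_end - pos);
        /* Entirely inside or entirely outside: no clipping needed. */
        return count;
}

/* Hypothetical helper: does a (clipped) chunk starting at pos hit the body? */
static int chunk_in_body(uint32_t pos, uint32_t body_start, uint32_t body_end)
{
        return pos >= body_start && pos < body_end;
}
```

A caller would loop: clip the remaining access, forward the chunk to the DVSEC handler when ``chunk_in_body()`` is true and to the regular config path otherwise, then advance by the returned length. The offsets used to exercise the sketch are made up.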
+
+Per-field write semantics (CXL r4.0 §8.1.3):
+
+============================================ ==============================================
+Field (offset from DVSEC cap base)           Spec attribute / behaviour
+============================================ ==============================================
+CAPABILITY (0x0a)                            HwInit -- writes dropped
+CONTROL (0x0c)                               RWL -- gated on DVSEC CONFIG_LOCK
+STATUS (0x0e)                                RW1C
+CONTROL2 (0x10)                              RWL -- gated on DVSEC CONFIG_LOCK
+STATUS2 (0x12)                               RW1C
+LOCK (0x14)                                  RWO -- first 1-write latches CONFIG_LOCK
+Range1 SIZE_HI/LO BASE_HI/LO (0x18..0x27)    HwInit -- writes dropped
+Range2 SIZE_HI/LO BASE_HI/LO (0x28..0x37)    RsvdZ -- writes dropped
+============================================ ==============================================
+
+HDM virtualization contract
+===========================
+
+Per CXL r4.0 §8.2.4.20, on the single firmware-committed decoder:
+
+============================================ ==============================================
+Field (offset from HDM block base)           Spec attribute / behaviour
+============================================ ==============================================
+HDM Decoder Capability Header (0x00)         HwInit -- writes dropped
+HDM Decoder Global Control (0x04)            RW -- shadow
+Decoder 0 BASE_LO / BASE_HI                  RWL -- gated on COMMITTED or LOCK_ON_COMMIT
+Decoder 0 SIZE_LO / SIZE_HI                  RWL -- same gate
+Decoder 0 CTRL                               Implements COMMIT → COMMITTED handshake; once
+                                             COMMITTED, only COMMIT toggles are honoured
+============================================ ==============================================
+
+CM cap-array
+============
+
+The CM cap-array (CXL r4.0 §8.2.4) prefix is snapshotted from the
+device's component register MMIO at bind time and served read-only
+through ``cxl_passthrough_cm_rw()``. Guest writes to the cap-array
+are silently dropped.
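The write attributes named in the DVSEC and HDM tables, and the read-only behaviour of the cap-array, can be sketched as a single shadow-update function. This is a simplified user-space model and not the cxl-core implementation: it treats a whole 16-bit field as having one attribute, whereas real registers mix attributes per bit, and ``enum attr`` and ``shadow_write()`` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical attribute tags mirroring the table entries. */
enum attr { ATTR_HWINIT, ATTR_RW, ATTR_RW1C, ATTR_RWL, ATTR_RWO };

/*
 * Return the new shadow value after a guest writes `val` to a field whose
 * current shadow value is `old`. `*locked` models CONFIG_LOCK: it gates
 * RWL fields and is latched by the first 1-write to an RWO field.
 */
static uint16_t shadow_write(enum attr a, uint16_t old, uint16_t val,
                             int *locked)
{
        assert(locked);

        switch (a) {
        case ATTR_RW:           /* plain read-write: store the value */
                return val;
        case ATTR_RW1C:         /* writing 1 clears the bit, 0 leaves it */
                return old & (uint16_t)~val;
        case ATTR_RWL:          /* writable only until the lock is set */
                return *locked ? old : val;
        case ATTR_RWO:          /* first 1-write latches; later writes ignored */
                if (!*locked && (val & 1)) {
                        *locked = 1;
                        return old | 1;
                }
                return old;
        case ATTR_HWINIT:       /* HwInit and read-only snapshots: drop writes */
        default:
                return old;
        }
}
```

In this model a guest write to a read-only snapshot such as the cap-array behaves like ``ATTR_HWINIT``: the shadow is returned unchanged and no error is reported.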
+
+UAPI: CAP_CXL
+=============
+
+``VFIO_DEVICE_GET_INFO`` returns ``VFIO_DEVICE_FLAGS_CXL`` and a
+``VFIO_DEVICE_INFO_CAP_CXL`` capability::
+
+  struct vfio_device_info_cap_cxl {
+          struct vfio_info_cap_header header;
+          __u32 flags;
+  #define VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED (1 << 0)
+          __u32 hdm_region_idx;
+          __u32 comp_reg_region_idx;
+          __u32 comp_reg_bar;
+          __u32 __resv;
+          __u64 comp_reg_offset;
+          __u64 comp_reg_size;
+  };
+
+``VFIO_DEVICE_GET_REGION_INFO`` on the component BAR returns a
+``VFIO_REGION_INFO_CAP_SPARSE_MMAP`` that excludes
+``[comp_reg_offset, comp_reg_offset + comp_reg_size)`` from the
+mmappable areas.
+
+Known limitations
+=================
+
+* Host bridge HDM decoder programming is not driven by this driver.
+  The driver silently assumes single-RP-passthrough topology (the
+  CXL host bridge's own HDM decoder is not used). Two remediations
+  are possible: either refuse to bind when the topology is not
+  single-RP-passthrough, or extend the kernel ABI so a host-bridge
+  HDM decoder programmer can attest the lock before vfio bind.
+  Neither changes the existing contract beyond, at most, adding a
+  single boolean to CAP_CXL.
+
+* Function-level reset (FLR) does not re-snapshot the shadows.
+  Guests that issue FLR will see stale HDM and DVSEC state after
+  the reset.
+
+* Multi-decoder devices return ``-EOPNOTSUPP`` at bind.
+
+* Hotplug while the device is held by vfio is not supported.
+
+* Raw BAR read/write into the CXL component register sub-range is
+  unsupported. VMMs must use the COMP_REGS region.
+
+Selftest
+========
+
+``tools/testing/selftests/vfio/vfio_cxl_type2_test`` exercises the
+five surfaces:
+
+* ``device_is_cxl`` -- GET_INFO returns FLAGS_CXL + CAP_CXL.
+* ``hdm_region_mmap_rw`` -- mmap + read/write pattern.
+* ``component_bar_sparse_mmap`` -- SPARSE_MMAP cap excludes the CXL
+  block.
+* ``comp_regs_cm_cap_array_read`` -- CM cap-array header is served
+  from the cxl-core snapshot.
+* ``dvsec_lock_byte_read`` -- DVSEC config-rw clipping shim is wired.
-- 
2.25.1

