This moves page pinning (get_user_pages_fast()/put_page()) code out of
the platform IOMMU code and puts it into the VFIO IOMMU driver, where it
belongs, as the platform code does not deal with page pinning.
This makes iommu_take_ownership()/iommu_release_ownership() deal with
the IOMMU table bitmap onl
Normally a bitmap from the iommu_table is used to track what TCE entry
is in use. Since we are going to use iommu_table without its locks and
do xchg() instead, it becomes essential not to put bits which are not
implied in the direction flag as the old TCE value (more precisely -
the permission bit
This adds an iommu_table_ops struct and puts a pointer to it into
the iommu_table struct. This moves tce_build/tce_free/tce_get/tce_flush
callbacks from ppc_md to the new struct where they really belong.
This adds the requirement for @it_ops to be initialized before calling
iommu_init_table() to m
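The ops-struct arrangement described in this snippet can be sketched as below. This is only an illustration under stated assumptions: every demo_* name is hypothetical, the callback signatures are invented for the example, and the -EINVAL check is just one possible way to enforce that the ops pointer is set before initialization. It is not the kernel's iommu_table_ops definition.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TBL_ENTRIES 16

struct demo_tbl;

/* Hypothetical per-table callbacks, standing in for tce_build/tce_free/tce_get. */
struct demo_tbl_ops {
	int (*set)(struct demo_tbl *tbl, size_t index, uint64_t tce);
	void (*clear)(struct demo_tbl *tbl, size_t index);
	uint64_t (*get)(struct demo_tbl *tbl, size_t index);
};

/* Hypothetical table: the ops pointer lives in the table, not in a global. */
struct demo_tbl {
	uint64_t entries[DEMO_TBL_ENTRIES];
	const struct demo_tbl_ops *it_ops;
};

static int demo_flat_set(struct demo_tbl *tbl, size_t index, uint64_t tce)
{
	if (index >= DEMO_TBL_ENTRIES)
		return -EINVAL;
	tbl->entries[index] = tce;
	return 0;
}

static void demo_flat_clear(struct demo_tbl *tbl, size_t index)
{
	if (index < DEMO_TBL_ENTRIES)
		tbl->entries[index] = 0;
}

static uint64_t demo_flat_get(struct demo_tbl *tbl, size_t index)
{
	return index < DEMO_TBL_ENTRIES ? tbl->entries[index] : 0;
}

/* One concrete backend; another platform would supply its own ops. */
const struct demo_tbl_ops demo_flat_ops = {
	.set = demo_flat_set,
	.clear = demo_flat_clear,
	.get = demo_flat_get,
};

/* Assumption: initialization refuses to run unless it_ops was set first. */
int demo_init_table(struct demo_tbl *tbl)
{
	if (!tbl->it_ops)
		return -EINVAL;
	for (size_t i = 0; i < DEMO_TBL_ENTRIES; i++)
		tbl->it_ops->clear(tbl, i);
	return 0;
}
```

A caller would assign `it_ops` (for example `&demo_flat_ops`) and only then call `demo_init_table()`; keeping the callbacks in the table lets different tables use different backends.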
This adds missing locks in iommu_take_ownership()/
iommu_release_ownership().
This marks all pages busy in iommu_table::it_map in order to catch
errors if there is an attempt to use this table while ownership over it
is taken.
This only clears TCE content if there is no page marked busy in it_map
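The "mark everything busy while ownership is taken" idea from this snippet might look roughly like the sketch below. It is a simplified userspace illustration: the demo_* names are hypothetical, locking is omitted entirely (the patch itself adds locks), and returning -EBUSY when an entry is still in use is an assumption of this example rather than something the snippet states.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_ENTRIES 512

/* Hypothetical stand-in for iommu_table: one bit per TCE entry, 1 = busy. */
struct demo_table {
	unsigned char it_map[DEMO_ENTRIES / 8];
};

/* Ordinary allocation path: claim the first free bit. */
int demo_alloc_entry(struct demo_table *tbl)
{
	for (size_t i = 0; i < DEMO_ENTRIES; i++) {
		unsigned char mask = (unsigned char)(1u << (i % 8));

		if (!(tbl->it_map[i / 8] & mask)) {
			tbl->it_map[i / 8] |= mask;
			return (int)i;
		}
	}
	return -ENOSPC;
}

/* Assumption: ownership is refused while any entry is still in use. */
int demo_take_ownership(struct demo_table *tbl)
{
	for (size_t i = 0; i < sizeof(tbl->it_map); i++)
		if (tbl->it_map[i])
			return -EBUSY;

	/* Mark every entry busy so stray allocations fail while owned. */
	memset(tbl->it_map, 0xff, sizeof(tbl->it_map));
	return 0;
}

/* Hand the table back: all entries become free again. */
void demo_release_ownership(struct demo_table *tbl)
{
	memset(tbl->it_map, 0, sizeof(tbl->it_map));
}
```

With the whole bitmap set, any attempt to allocate through the normal path fails immediately, which is what makes accidental use of an owned table visible.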
The set_iommu_table_base_and_group() name suggests that the function
sets the table base and adds a device to an IOMMU group.
The actual purpose for table base setting is to put some reference
into a device so later iommu_add_device() can get the IOMMU group
reference and add the device to the group.
At t
This relies on the fact that a PCI device always has an IOMMU table
which may not be the case when we get dynamic DMA windows so
let's use a more reliable check for IOMMU group here.
As we do not rely on the table presence here, remove the workaround
from pnv_pci_ioda2_set_bypass(); also remove the
This is a part of moving DMA window programming to an iommu_ops
callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as
a first parameter (not pnv_ioda_pe) as it is going to be used as
a callback for VFIO DDW code.
This adds pnv_pci_ioda2_tvt_invalidate() to invalidate TVT as it is
a go
This moves iommu_table creation to the beginning to make following changes
easier to review. This starts using table parameters from the iommu_table
struct.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy
Reviewed-by: David Gibson
Reviewed-by: Gavin Shan
---
Change
This replaces direct accesses to TCE table with a helper which
returns a TCE entry address. This makes no difference now but will
when multi-level TCE tables get introduced.
No change in behavior is expected.
Signed-off-by: Alexey Kardashevskiy
Reviewed-by: David Gibson
Reviewed-by: Gavin
This is a part of moving TCE table allocation into an iommu_ops
callback to support multiple IOMMU groups per one VFIO container.
This moves the code which allocates the actual TCE tables to helpers:
pnv_pci_ioda2_table_alloc_pages() and pnv_pci_ioda2_table_free_pages().
These do not allocate/free
TCE tables might get too big in case of 4K IOMMU pages and DDW enabled
on huge guests (hundreds of GB of RAM) so the kernel might be unable to
allocate a contiguous chunk of physical memory to store the TCE table.
To address this, POWER8 CPU (actually, IODA2) supports multi-level
TCE tables, up to 5
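To make the size problem in this snippet concrete, here is a small worked calculation. It assumes 8-byte TCE entries and a flat, single-level table; the function name is hypothetical and only used for this illustration.

```c
#include <stdint.h>

/*
 * Bytes needed for a flat (single-level) TCE table that covers
 * window_bytes of DMA space with IOMMU pages of 2^page_shift bytes.
 * Assumption: each TCE entry is 8 bytes wide.
 */
uint64_t demo_tce_table_bytes(uint64_t window_bytes, unsigned int page_shift)
{
	uint64_t entries = window_bytes >> page_shift;

	return entries * sizeof(uint64_t);
}
```

Under these assumptions a 512 GB window with 4K IOMMU pages needs 2^27 entries, i.e. a 1 GB contiguous table, while the same window with 64K pages needs 64 MB. Splitting the table into levels avoids having to find one contiguous chunk that large.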
This enables an sPAPR-defined feature called Dynamic DMA windows (DDW).
Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus
where devices are allowed to do DMA. These ranges are called DMA windows.
By default, there is a single DMA window, 1 or 2GB big, mapped at zero
on a PC
At the moment writing new TCE value to the IOMMU table fails with EBUSY
if there is a valid entry already. However PAPR specification allows
the guest to write new TCE value without clearing it first.
Another problem this patch is addressing is the use of pool locks for
external IOMMU users such a
The existing code programmed TVT#0 with some address and then
immediately released that memory.
This makes use of pnv_pci_ioda2_unset_window() and
pnv_pci_ioda2_set_bypass() which do correct resource release and
TVT update.
Signed-off-by: Alexey Kardashevskiy
---
arch/powerpc/platforms/powernv/
This adds tce_iommu_take_ownership() and tce_iommu_release_ownership()
which call in a loop iommu_take_ownership()/iommu_release_ownership()
for every table on the group. As there is just one now, no change in
behaviour is expected.
At the moment the iommu_table struct has a set_bypass() which enabl
This is to make extended ownership and multiple groups support patches
simpler for review.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy
[aw: for the vfio related changes]
Acked-by: Alex Williamson
Reviewed-by: David Gibson
Reviewed-by: Gavin Shan
---
drivers/v
This makes use of the it_page_size from the iommu_table struct
as page size can differ.
This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code
as recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.
Signed-off-by: Alexey Kardashevskiy
Reviewed-by: David Gibson
Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
for TCE tables. Right now just one table is supported.
This defines iommu_table_group struct which stores pointers to
iommu_group and iommu_table(s). This rep
Before the IOMMU user (VFIO) would take control over the IOMMU table
belonging to a specific IOMMU group. This approach did not allow sharing
tables between IOMMU groups attached to the same container.
This introduces a new IOMMU ownership flavour when the user can not
just control the existing IO
The iommu_table struct keeps a list of IOMMU groups it is used for.
At the moment there is just a single group attached but further
patches will add TCE table sharing. When sharing is enabled, TCE cache
in each PE needs to be invalidated, which this patch does.
This does not change pnv_pci_ioda1_tce_in
The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would require
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and syst
This adds create/remove window ioctls to create and remove DMA windows.
sPAPR defines a Dynamic DMA windows capability which allows
para-virtualized guests to create additional DMA windows on a PCI bus.
The existing Linux kernels use this new window to map the entire guest
memory and switch to the
This extends iommu_table_group_ops by a set of callbacks to support
dynamic DMA windows management.
create_table() creates a TCE table with specific parameters.
It receives iommu_table_group to know nodeid in order to allocate
TCE table memory closer to the PHB. The exact format of allocated
multi
This moves locked pages accounting to helpers.
Later they will be reused for Dynamic DMA windows (DDW).
This reworks debug messages to show the current value and the limit.
This stores the locked pages number in the container so when unlocking
the iommu table pointer won't be needed. This does n
So far an iommu_table lifetime was the same as PE. Dynamic DMA windows
will change this and iommu_free_table() will not always require
the group to be released.
This moves iommu_group_put() out of iommu_free_table().
This adds an iommu_pseries_free_table() helper which does
iommu_group_put() and i
This adds a way for the IOMMU user to know how much a new table will
use so it can be accounted in the locked_vm limit before allocation
happens.
This stores the allocated table size in pnv_pci_ioda2_get_table_size()
so the locked_vm counter can be updated correctly when a table is
being disposed.
At the moment DMA map/unmap requests are handled irrespective of
the container's state. This allows the user space to pin memory which
it might not be allowed to pin.
This adds checks to MAP/UNMAP that the container is enabled, otherwise
-EPERM is returned.
Signed-off-by: Alexey Kardashevskiy
[a
The existing code has 3 calls to iommu_register_group() and
all 3 branches actually cover all possible cases.
This replaces 3 calls with one and moves the registration earlier;
the latter will make more sense when we add TCE table sharing.
Signed-off-by: Alexey Kardashevskiy
Reviewed-by: Gavin S
This is a pretty mechanical patch to make next patches simpler.
New tce_iommu_unuse_page() helper does put_page() now but it might skip
that after the memory registering patch is applied.
As we are here, this removes unnecessary checks for a value returned
by pfn_to_page() as it cannot possibly retu
We are adding support for DMA memory pre-registration to be used in
conjunction with VFIO. The idea is that the userspace which is going to
run a guest may want to pre-register a user space memory region so
it all gets pinned once and never goes away. Having this done,
a hypervisor will not have to
At the moment iommu_free_table() only releases memory if
the table was initialized for the platform code use, i.e. it had
it_map initialized (whose purpose is to track DMA memory space use).
With dynamic DMA windows, we will need to be able to release
iommu_table even if it was used for VFIO in wh
Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
for TCE tables. Right now just one table is supported.
For IODA, instead of embedding iommu_table, the new iommu_table_group
keeps pointers to those. The iomm
The pnv_pci_ioda_tce_invalidate() helper invalidates TCE cache. It is
supposed to be called on IODA1/2 and not called on p5ioc2. It receives
start and end host addresses of TCE table.
IODA2 actually needs PCI addresses to invalidate the cache. Those
can be calculated from host addresses but since
This checks that the TCE table page size is not bigger than the size of
the page we just pinned and whose physical address we are going to put
into the table.
Otherwise the hardware gets unwanted access to physical memory between
the end of the actual page and the end of the aligned up TCE page.
Since compoun
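The size check described in this snippet can be illustrated with the sketch below. The names are hypothetical and the sizes are expressed as shifts purely for the example; how the real code obtains the size of the pinned page is not shown in the snippet and is not assumed here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True when one TCE (IOMMU) page of 2^tce_shift bytes fits entirely
 * inside the pinned backing page of 2^backing_shift bytes.
 */
bool demo_tce_page_fits(unsigned int tce_shift, unsigned int backing_shift)
{
	return tce_shift <= backing_shift;
}

/*
 * Bytes beyond the pinned page that the device could reach if the
 * mapping were allowed anyway; 0 when the TCE page fits.
 */
uint64_t demo_exposed_bytes(unsigned int tce_shift, unsigned int backing_shift)
{
	if (demo_tce_page_fits(tce_shift, backing_shift))
		return 0;
	return (1ULL << tce_shift) - (1ULL << backing_shift);
}
```

For instance, a 64K TCE page backed by a single pinned 4K page would leave 60K of unrelated physical memory reachable by the device, which is the case the check is meant to reject.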
At the moment the DMA setup code looks for the "ibm,opal-tce-kill"
property which contains the TCE kill register address. Writing to
this register invalidates TCE cache on IODA/IODA2 hub.
This moves the register address from iommu_table to pnv_phb as this
register belongs to PHB and invalidates TC
This commit defines the API headers for guest debugging. There are two
architecture specific debug structures:
- kvm_guest_debug_arch, allows us to pass in HW debug registers
- kvm_debug_exit_arch, signals exception and possible faulting address
The type of debugging being used is controlled
Here is V5 of the KVM Guest Debug support for arm64.
The changes are fairly minimal from the last round:
- dropped KVM_GUESTDBG_USE_SW/HW_BP unifying patch (ABI break)
- new comment patch to fix comments in hyp.S (also sent separately)
- simplified singlestep code (no longer needs to preser
This is a precursor for later patches which will need to do more to
setup debug state before entering the hyp.S switch code. The existing
functionality for setting mdcr_el2 has been moved out of hyp.S and now
uses the value kept in vcpu->arch.mdcr_el2.
As the assembler used to previously mask and
This introduces a level of indirection for the debug registers. Instead
of using the sys_regs[] directly we store registers in a structure in
the vcpu. As we are no longer tied to the layout of the sys_regs[] we
can make the copies size appropriate for control and value registers.
This also entail
Bring into line with the comments for the other structures and their
KVM_EXIT_* cases. Also update api.txt to reflect use in kvm_run
documentation.
Signed-off-by: Alex Bennée
Reviewed-by: David Hildenbrand
Reviewed-by: Andrew Jones
Acked-by: Christoffer Dall
---
v2
- add comments for other
This adds support for SW breakpoints inserted by userspace.
We do this by trapping all guest software debug exceptions to the
hypervisor (MDCR_EL2.TDE). The exit handler sets an exit reason of
KVM_EXIT_DEBUG with the kvm_debug_exit_arch structure holding the
exception syndrome information.
It wil
The elr_el2 and spsr_el2 registers in fact contain the processor state
before entry into the hypervisor code. In the case of guest state it
could be in either el0 or el1.
Signed-off-by: Alex Bennée
---
arch/arm64/kvm/hyp.S | 8
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git
This is a precursor to sharing the code with the guest debug support.
This replaces the big macro that fishes data out of a fixed location
with a more general helper macro to restore a set of debug registers. It
uses macro substitution so it can be re-used for debug control and value
registers. It
This commit adds a stub function to support the KVM_SET_GUEST_DEBUG
ioctl. Any unsupported flag will return -EINVAL. For now, only
KVM_GUESTDBG_ENABLE is supported, although it won't have any effects.
Signed-off-by: Alex Bennée
Reviewed-by: Christoffer Dall
---
v2
- simplified form of the io
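The flag handling described in the commit message above (reject any unsupported flag with -EINVAL, accept only the enable flag) could be sketched as follows. All demo_* names and the flag value are hypothetical stand-ins chosen for this illustration; this is not the actual arm64 ioctl handler.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag value standing in for KVM_GUESTDBG_ENABLE. */
#define DEMO_GUESTDBG_ENABLE     0x1u
/* Only the enable flag is accepted in this sketch. */
#define DEMO_GUESTDBG_VALID_MASK DEMO_GUESTDBG_ENABLE

/* Hypothetical request passed in by userspace. */
struct demo_guest_debug {
	uint32_t control;
};

/* Hypothetical per-vcpu state. */
struct demo_vcpu {
	uint32_t guest_debug;
};

int demo_set_guest_debug(struct demo_vcpu *vcpu,
			 const struct demo_guest_debug *dbg)
{
	/* Any bit outside the supported mask is an error. */
	if (dbg->control & ~DEMO_GUESTDBG_VALID_MASK)
		return -EINVAL;

	/* Record the request; with only ENABLE supported it has no effect yet. */
	if (dbg->control & DEMO_GUESTDBG_ENABLE)
		vcpu->guest_debug = dbg->control;
	else
		vcpu->guest_debug = 0;

	return 0;
}
```

Validating against a mask up front means later patches can add flags by widening the mask, while old kernels keep rejecting requests they cannot honour.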
This adds support for single-stepping the guest. To do this we need to
manipulate the guest's PSTATE.SS and MDSCR_EL1.SS bits which we do in the
kvm_arm_setup/clear_debug() so we don't affect the apparent state of the
guest. Additionally while the host is debugging the guest we suppress
the ability
Finally advertise the KVM capability for SET_GUEST_DEBUG. Once arm
support is added this check can be moved to the common
kvm_vm_ioctl_check_extension() code.
Signed-off-by: Alex Bennée
Acked-by: Christoffer Dall
---
v3:
- separated capability check from previous patches
- moved into arm64 s
This includes trace points for:
kvm_arch_setup_guest_debug
kvm_arch_clear_guest_debug
I've also added some generic register setting trace events and also a
trace point to dump the array of hardware registers.
Signed-off-by: Alex Bennée
---
v3
- add trace event for debug access.
- remove
This adds support for userspace to control the HW debug registers for
guest debug. In the debug ioctl we copy the IMPDEF defined number of
registers into a new register set called host_debug_state. There is now
a new vcpu parameter called debug_ptr which selects which register set
is to be copied into
Currently we track which IRQ has been mapped to which VGIC list
register and also have to synchronize both. We used to do this
to hold some extra state (for instance the active bit).
It turns out that this extra state in the LRs is no longer needed and
this extra tracking causes some pain later.
Re
The GICv3 ITS (Interrupt Translation Service) is a part of the
ARM GICv3 interrupt controller used for implementing MSIs.
It specifies a new kind of interrupts (LPIs), which are mapped to
establish a connection between a device, its MSI payload value and
the target processor the IRQ is eventually d
The ARM GICv3 ITS MSI controller requires a device ID to be able to
assign the proper interrupt vector. On real hardware, this ID is
sampled from the bus. To be able to emulate an ITS controller, extend
the KVM MSI interface to let userspace provide such a device ID. For
PCI devices, the device ID
Currently we destroy the VGIC emulation in one function that cares for
all emulated models. The ITS emulation will require some
differentiation, so introduce a per-emulation-model destroy method.
Use it for a tiny GICv3 specific code already.
Signed-off-by: Andre Przywara
---
include/kvm/arm_vgi
The properties and status of the GICv3 LPIs are held in tables in
(guest) memory. To achieve reasonable performance, we cache this
data in our own data structures, so we need to sync those two views
from time to time. This behaviour is well described in the GICv3 spec
and is also exercised by hardw
The ARM GICv3 ITS emulation code goes into a separate file, but
needs to be connected to the GICv3 emulation, of which it is an
option.
Introduce the skeleton with function stubs to be filled later.
Introduce the basic ITS data structure and initialize it, but don't
return any success yet, as we a
As the actual LPI number in a guest can be quite high, but is mostly
assigned using a very sparse allocation scheme, bitmaps and arrays
for storing the virtual interrupt status are a waste of memory.
We use our equivalent of the "Interrupt Translation Table Entry"
(ITTE) to hold this extra status i
In the GICv3 redistributor there are the PENDBASER and PROPBASER
registers which we did not emulate so far, as they only make sense
when having an ITS. In preparation for that emulate those MMIO
accesses by storing the 64-bit data written into it into a variable
which we later read in the ITS emula
The GICv3 Interrupt Translation Service (ITS) uses tables in memory
to allow a sophisticated interrupt routing. It features device tables,
an interrupt table per device and a table connecting "collections" to
actual CPUs (aka. redistributors in the GICv3 lingo).
Since the interrupt numbers for the
Add emulation for some basic MMIO registers used in the ITS emulation.
This includes:
- GITS_{CTLR,TYPER,IIDR}
- ID registers
- GITS_{CBASER,CREAD,CWRITER}
those implement the ITS command buffer handling
Signed-off-by: Andre Przywara
---
include/kvm/arm_vgic.h | 3 +
include/linu
When userland wants to inject a MSI into the guest, we have to use
our data structures to find the LPI number and the VCPU to receive
the interrupt.
Use the wrapper functions to iterate the linked lists and find the
proper Interrupt Translation Table Entry. Then set the pending bit
in this ITTE to
The connection between a device, an event ID, the LPI number and the
allocated CPU is stored in in-memory tables in a GICv3, but their
format is not specified by the spec. Instead software uses a command
queue to let the ITS implementation use their own format.
Implement handlers for the various IT
The ARM GICv3 ITS controller requires a separate register frame to
cover ITS specific registers. Add a new VGIC address type and store
the address in a field in the vgic_dist structure.
Provide a function to check whether userland has provided the address,
so ITS functionality can be guarded by tha
If userspace has provided a base address for the ITS register frame,
we enable the bits that advertise LPIs in the GICv3.
When the guest has enabled LPIs and the ITS, we enable the emulation
part by initializing the ITS data structures and trapping on ITS
register frame accesses by the guest.
Also
dpkg in the guest fails when it tries to use fsync() on a directory:
openat(AT_FDCWD, "/var/lib/dpkg",
O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 4
fsync(4)= -1 EINVAL (Invalid argument)
stracing lkvm shows that this is converted to:
openat(AT_FDCWD
On Thu, May 28, 2015 at 01:34:33AM -0400, Wei Huang wrote:
> This patches enables ACPI support for KVM virtual GICv2. KVM parses
> ACPI table for virt GIC related information and initializes resources.
>
> Signed-off-by: Alexander Spyridaki
> Signed-off-by: Wei Huang
> ---
> virt/kvm/arm/vgic-v
After building, there is a lot of clutter from the dependency system.
Let's clean this up by using dir/.file.d style dependencies, similar
to those used in the Linux kernel.
In order to support this, rearrange the dependency generation to
create the dependency files as we build rather than as a se
On 05/29/2015 09:06 AM, Andrew Jones wrote:
> On Thu, May 28, 2015 at 01:34:33AM -0400, Wei Huang wrote:
>> This patches enables ACPI support for KVM virtual GICv2. KVM parses
>> ACPI table for virt GIC related information and initializes resources.
>>
>> Signed-off-by: Alexander Spyridaki
>> Si
As we haven't always had guest debug support we need to probe for it.
Additionally we don't do this in the start-up capability code so we
don't fall over on old kernels.
Signed-off-by: Alex Bennée
---
target-arm/kvm64.c | 18 ++
1 file changed, 18 insertions(+)
diff --git a/targ
This adds support for single-step. There isn't much to do on the QEMU
side as after we set-up the request for single step via the debug ioctl
it is all handled within the kernel.
Signed-off-by: Alex Bennée
---
v2
- convert to using HSR_EC
v3
- use internals.h definitions
---
target-arm/kvm.
Hi,
You may be wondering what happened to v3 and v4. They do exist but
they didn't change much from the original patches as I've been
mostly looking at the kernel side of the equation. So in summary the
changes are:
- updates to the kernel ABI
- don't fall over on kernels without debug suppo
I assume I'll properly merge the KVM Headers direct from Linux when
the kernel side is upstream. These headers came from:
https://git.linaro.org/people/alex.bennee/linux.git/shortlog/refs/heads/guest-debug/4.1-rc5-v5
Signed-off-by: Alex Bennée
---
v2
- update ABI to include ->far
v3
- updat
From: Alex Bennée
If we can't find details for the debug exception in our debug state
then we can assume the exception is due to debugging inside the guest.
To inject the exception into the guest state we re-use the TCG exception
code (do_interupt).
However while guest debugging is in effect we
This adds basic support for HW assisted debug. The ioctl interface to
KVM allows us to pass an implementation defined number of break and
watch point registers. When KVM_GUESTDBG_USE_HW_BP is specified these
debug registers will be installed in place on the world switch into the
guest.
The hardwar
These don't involve messing around with debug registers, just setting
the breakpoint instruction in memory. GDB will not use this mechanism if
it can't access the memory to write the breakpoint.
All the kernel has to do is ensure the hypervisor traps the breakpoint
exceptions and returns to usersp
If a GICv3-enabled guest tries to configure Group0, we print a
warning on the console (because we don't support Group0 interrupts).
This is fairly pointless, and would allow a guest to spam the
console. Let's just drop the warning.
Signed-off-by: Marc Zyngier
---
virt/kvm/arm/vgic-v3-emul.c | 2
2015-05-27 19:05+0200, Paolo Bonzini:
> This patch includes changes to the external API for SMM support.
> All the changes are predicated by the availability of a new
> capability, KVM_CAP_X86_SMM, which is added at the end of the
> patch series.
>
> Signed-off-by: Paolo Bonzini
> ---
> diff --gi
I found a corner case that doesn't fit any specific patch:
We allow INIT while in SMM. This brings some security complications as
we also don't reset hflags (another long standing bug?), but we don't
really need to because INIT in SMM is against the spec anyway;
APM May 2013 2:10.3.3 Exceptions a
2015-05-27 19:05+0200, Paolo Bonzini:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> @@ -1616,6 +1727,27 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const
> void *data,
| int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
| unsigned long len)
|
On 29/05/2015 21:03, Radim Krčmář wrote:
> I found a corner case that doesn't fit any specific patch:
>
> We allow INIT while in SMM. This brings some security complications as
> we also don't reset hflags (another long standing bug?), but we don't
> really need to because INIT in SMM is agains
On 05/28/2015 11:49 AM, Christoffer Dall wrote:
> Until now we have been calling kvm_guest_exit after re-enabling
> interrupts when we come back from the guest, but this has the
> unfortunate effect that CPU time accounting done in the context of timer
> interrupts occurring while the guest is runn