[RFC PATCH v2 0/3] KVM: allow mapping of compound tail pages for IO or PFNMAP mapping
This is v2 of the previous series [1] allowing mapping of compound tail pages for IO or PFNMAP mappings. Compared to v1, this version adds a selftest that checks KVM's ability to map memslots for MMIO BARs (VMAs with VM_IO | VM_PFNMAP set), as requested by Sean in [1]. The selftest can also be used to test the series "allow mapping non-refcounted pages" [2].

The RFC tag is added because a test driver is introduced in patch 2, which is new to the KVM selftests, and the test "set_memory_region_io" in patch 3 depends on that test driver being compiled and loaded in the kernel. Besides, patch 3 calls vm_set_user_memory_region() directly without modifying vm_mem_add(). So this series is sent to confirm that the main direction is right.

Thanks
Yan

[1] https://lore.kernel.org/all/20230719083332.4584-1-yan.y.z...@intel.com/
[2] https://lore.kernel.org/all/20230911021637.1941096-1-steve...@google.com/

v2: added patches 2 and 3 to provide a selftest for patch 1 (Sean).

Yan Zhao (3):
  KVM: allow mapping of compound tail pages for IO or PFNMAP mapping
  KVM: selftests: add selftest driver for KVM to test memory slots for
    MMIO BARs
  KVM: selftests: Add set_memory_region_io to test memslots for MMIO
    BARs

 lib/Kconfig.debug                        |  14 +
 lib/Makefile                             |   1 +
 lib/test_kvm_mock_device.c               | 281 ++
 lib/test_kvm_mock_device_uapi.h          |  16 +
 tools/testing/selftests/kvm/Makefile     |   1 +
 .../selftests/kvm/set_memory_region_io.c | 188
 virt/kvm/kvm_main.c                      |   2 +-
 7 files changed, 502 insertions(+), 1 deletion(-)
 create mode 100644 lib/test_kvm_mock_device.c
 create mode 100644 lib/test_kvm_mock_device_uapi.h
 create mode 100644 tools/testing/selftests/kvm/set_memory_region_io.c

base-commit: 8ed26ab8d59111c2f7b86d200d1eb97d2a458fd1
--
2.17.1
[RFC PATCH v2 1/3] KVM: allow mapping of compound tail pages for IO or PFNMAP mapping
Allow mapping of tail pages of compound pages for IO or PFNMAP mappings by trying to get a reference on the head page instead.

An IO or PFNMAP mapping is sometimes backed by compound pages. KVM currently just returns an error when mapping a tail page of such a compound page, because the reference count of a tail page is always 0. So, rather than checking and taking the reference count of the tail page itself, check and take the reference count of its folio (head page) to allow mapping of compound tail pages.

This does not break the original intention of disallowing mapping of tail pages of non-compound higher-order allocations, because the folio of a non-compound tail page is the page itself. On the put side, put_page() already converts the page to its folio before dropping the reference.

Signed-off-by: Yan Zhao
---
 virt/kvm/kvm_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index acd67fb40183..f53b58446ac7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2892,7 +2892,7 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
 	if (!page)
 		return 1;
 
-	return get_page_unless_zero(page);
+	return folio_try_get(page_folio(page));
 }
 
 static int hva_to_pfn_remapped(struct vm_area_struct *vma,
-- 
2.17.1
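[Editorial note] For readers less familiar with the folio API, here is a minimal annotated sketch of the reasoning above (illustrative only, not part of the patch; the helper name is made up):

/*
 * Illustrative sketch, not the patched KVM code: why
 * get_page_unless_zero() fails on compound tail pages and why taking
 * the reference on the folio is equivalent for the non-compound case.
 */
static bool try_get_pfn_page(struct page *page)
{
	struct folio *folio = page_folio(page);

	/*
	 * For a compound page, the elevated reference count lives in the
	 * head page (the folio); every tail page has a refcount of 0, so
	 * get_page_unless_zero(tail) always fails and the mapping errors
	 * out.
	 *
	 * For a non-compound page, page_folio(page) is the page itself,
	 * so folio_try_get() behaves exactly like get_page_unless_zero()
	 * and tail pages of non-compound higher-order allocations are
	 * still rejected.
	 *
	 * put_page() already resolves the page to its folio before
	 * dropping the reference, so get/put remain balanced.
	 */
	return folio_try_get(folio);
}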
[RFC PATCH v2 2/3] KVM: selftests: add selftest driver for KVM to test memory slots for MMIO BARs
This driver is for testing KVM memory slots for device MMIO BARs that are mapped to pages serving as device resources. This driver implements a mock device whose device resources are pages array that can be mmaped into user space. It provides ioctl interface to users to configure whether the pages are allocated as a compound huge page or not. KVM selftest code can then map the mock device resource to KVM memslots to check if any error encountered. After VM shutdown, mock device resource's page reference counters are checked to ensure KVM does not hold extra reference count during memslot add/removal. Signed-off-by: Yan Zhao --- lib/Kconfig.debug | 14 ++ lib/Makefile| 1 + lib/test_kvm_mock_device.c | 281 lib/test_kvm_mock_device_uapi.h | 16 ++ 4 files changed, 312 insertions(+) create mode 100644 lib/test_kvm_mock_device.c create mode 100644 lib/test_kvm_mock_device_uapi.h diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index cc7d53d9dc01..c0fd4b53db89 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2922,6 +2922,20 @@ config TEST_HMM If unsure, say N. +config TEST_KVM_MOCK_DEVICE + tristate "Test page-backended BAR to KVM mock device" + help + This is a mock KVM assigned device whose MMIO BAR is backended by + struct page. + Say M here if you want to build the "test_kvm_mock_device" module. + Doing so will allow you to run KVM selftest + tools/testing/selftest/kvm/set_memory_region_io, which tests + functionality of adding page-backended MMIO memslots in KVM and + ensures that reference count of the backend pages are correctly + handled. + + If unsure, say N. + config TEST_FREE_PAGES tristate "Test freeing pages" help diff --git a/lib/Makefile b/lib/Makefile index 6b09731d8e61..894a185bbabd 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -83,6 +83,7 @@ obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_key_base.o obj-$(CONFIG_TEST_DYNAMIC_DEBUG) += test_dynamic_debug.o obj-$(CONFIG_TEST_PRINTF) += test_printf.o obj-$(CONFIG_TEST_SCANF) += test_scanf.o +obj-$(CONFIG_TEST_KVM_MOCK_DEVICE) += test_kvm_mock_device.o obj-$(CONFIG_TEST_BITMAP) += test_bitmap.o ifeq ($(CONFIG_CC_IS_CLANG)$(CONFIG_KASAN),yy) diff --git a/lib/test_kvm_mock_device.c b/lib/test_kvm_mock_device.c new file mode 100644 index ..4e7527c230cd --- /dev/null +++ b/lib/test_kvm_mock_device.c @@ -0,0 +1,281 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * This is a module to test KVM DEVICE MMIO PASSTHROUGH. 
+ */ +#include +#include +#include +#include +#include +#include +#include + +#include "test_kvm_mock_device_uapi.h" + +/* kvm mock device */ +struct kvm_mock_dev { + dev_t devt; + struct device device; + struct cdev cdev; +}; +static struct kvm_mock_dev kvm_mock_dev; + +struct kvm_mock_device { + bool compound; + struct page *resource; + u64 bar_size; + int order; + int *ref_array; + struct mutex lock; + bool prepared; +}; + +static bool opened; + +#define BAR_SIZE 0x20UL +#define DEFAULT_COMPOUND true + +static vm_fault_t kvm_mock_device_mmap_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct kvm_mock_device *kmdev = vma->vm_private_data; + struct page *p = kmdev->resource; + vm_fault_t ret = VM_FAULT_NOPAGE; + unsigned long addr; + int i; + + for (addr = vma->vm_start, i = vma->vm_pgoff; addr < vma->vm_end; +addr += PAGE_SIZE, i++) { + + ret = vmf_insert_pfn(vma, addr, page_to_pfn(p + i)); + if (ret == VM_FAULT_NOPAGE) + continue; + + zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start); + return ret; + + } + return ret; +} + +static const struct vm_operations_struct kvm_mock_device_mmap_ops = { + .fault = kvm_mock_device_mmap_fault, +}; + +static int kvm_mock_device_fops_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct kvm_mock_device *kmdev = file->private_data; + u64 offset, req_len; + int ret = 0; + + mutex_lock(&kmdev->lock); + if (!kmdev->prepared) { + ret = -ENODEV; + goto out; + } + + offset = vma->vm_pgoff << PAGE_SHIFT; + req_len = vma->vm_end - vma->vm_start; + if (offset + req_len > BAR_SIZE) { + ret = -EINVAL; + goto out; + } + + vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP); + vma->vm_ops = &kvm_mock_device_mmap_ops; + vma->vm_private_data = kmdev; +out: + mutex_unlock(&kmdev->lock); + return ret; +} + +static int kvm_mock_device_prepare_resource(struct kvm_mock_device *kmdev) +{ + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO; + unsigned int order = kmdev
[RFC PATCH v2 3/3] KVM: selftests: Add set_memory_region_io to test memslots for MMIO BARs
Added a selftest set_memory_region_io to test memslots for MMIO BARs. The MMIO BARs' backends are compound/non-compound huge pages serving as device resources allocated by a mock device driver. This selftest will assert and report "errno=14 - Bad address" in vcpu_run() if any failure is met to add such MMIO BAR memslots. After MMIO memslots removal, page reference counts of the device resources are also checked. As this selftest will interacts with a mock device "/dev/kvm_mock_device", it depends on test driver test_kvm_mock_device.ko in the kernel. CONFIG_TEST_KVM_MOCK_DEVICE=m must be enabled in the kernel. Currently, this selftest is only compiled for __x86_64__. Signed-off-by: Yan Zhao --- tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/set_memory_region_io.c | 188 ++ 2 files changed, 189 insertions(+) create mode 100644 tools/testing/selftests/kvm/set_memory_region_io.c diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 4412b42d95de..9d39514b6403 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -144,6 +144,7 @@ TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test TEST_GEN_PROGS_x86_64 += memslot_perf_test TEST_GEN_PROGS_x86_64 += rseq_test TEST_GEN_PROGS_x86_64 += set_memory_region_test +TEST_GEN_PROGS_x86_64 += set_memory_region_io TEST_GEN_PROGS_x86_64 += steal_time TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test TEST_GEN_PROGS_x86_64 += system_counter_offset_test diff --git a/tools/testing/selftests/kvm/set_memory_region_io.c b/tools/testing/selftests/kvm/set_memory_region_io.c new file mode 100644 index ..e221103091f4 --- /dev/null +++ b/tools/testing/selftests/kvm/set_memory_region_io.c @@ -0,0 +1,188 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE /* for program_invocation_short_name */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include +#include +#include + +#include <../../../../lib/test_kvm_mock_device_uapi.h> + +/* + * Somewhat arbitrary location and slot, intended to not overlap anything. + */ +#define MEM_REGION_GPA_BASE0xc000 +#define RANDOM_OFFSET 0x1000 +#define MEM_REGION_GPA_RANDOM (MEM_REGION_GPA_BASE + RANDOM_OFFSET) +#define MEM_REGION_SLOT_ID 10 + +static const bool non_compound_supported; + +static const uint64_t BASE_VAL = 0x; +static const uint64_t RANDOM_VAL = 0x; + +static unsigned long bar_size; + +static sem_t vcpu_ready; + +static void guest_code_read_bar(void) +{ + uint64_t val; + + GUEST_SYNC(0); + + val = READ_ONCE(*((uint64_t *)MEM_REGION_GPA_RANDOM)); + GUEST_ASSERT_EQ(val, RANDOM_VAL); + + val = READ_ONCE(*((uint64_t *)MEM_REGION_GPA_BASE)); + GUEST_ASSERT_EQ(val, BASE_VAL); + + GUEST_DONE(); +} + +static void *vcpu_worker(void *data) +{ + struct kvm_vcpu *vcpu = data; + struct kvm_run *run = vcpu->run; + struct ucall uc; + uint64_t cmd; + + /* +* Loop until the guest is done. Re-enter the guest on all MMIO exits, +* which will occur if the guest attempts to access a memslot after it +* has been deleted or while it is being moved . 
+*/ + while (1) { + vcpu_run(vcpu); + + if (run->exit_reason == KVM_EXIT_IO) { + cmd = get_ucall(vcpu, &uc); + if (cmd != UCALL_SYNC) + break; + + sem_post(&vcpu_ready); + continue; + } + + if (run->exit_reason != KVM_EXIT_MMIO) + break; + + TEST_ASSERT(!run->mmio.is_write, "Unexpected exit mmio write"); + TEST_ASSERT(run->mmio.len == 8, + "Unexpected exit mmio size = %u", run->mmio.len); + + TEST_ASSERT(run->mmio.phys_addr < MEM_REGION_GPA_BASE || + run->mmio.phys_addr >= MEM_REGION_GPA_BASE + bar_size, + "Unexpected exit mmio address = 0x%llx", + run->mmio.phys_addr); + } + + if (run->exit_reason == KVM_EXIT_IO && cmd == UCALL_ABORT) + REPORT_GUEST_ASSERT(uc); + + return NULL; +} + +static void wait_for_vcpu(void) +{ + struct timespec ts; + + TEST_ASSERT(!clock_gettime(CLOCK_REALTIME, &ts), + "clock_gettime() failed: %d\n", errno); + + ts.tv_sec += 2; + TEST_ASSERT(!sem_timedwait(&vcpu_ready, &ts), + "sem_timedwait() failed: %d\n", errno); + + /* Wait for the vCPU thread to reenter the guest. */ + usleep(10); +} + +static void test_kvm_mock_device_bar(bool compound) +{ + struct kvm_vm *vm; + void *mem; + struct kvm_vcpu *vcpu; + pthread_t vcpu_thread; + int fd, r
Re: [PATCH v10 10/10] iommu/vt-d: Add iotlb flush for nested domain
On Wed, Jan 03, 2024 at 11:06:19AM +0800, Baolu Lu wrote: > On 2024/1/3 9:33, Yi Liu wrote: > > On 2024/1/3 02:44, Jason Gunthorpe wrote: > > > On Tue, Jan 02, 2024 at 06:38:34AM -0800, Yi Liu wrote: > > > > > > > +static void intel_nested_flush_cache(struct dmar_domain > > > > *domain, u64 addr, > > > > + unsigned long npages, bool ih, u32 *error) > > > > +{ > > > > + struct iommu_domain_info *info; > > > > + unsigned long i; > > > > + unsigned mask; > > > > + u32 fault; > > > > + > > > > + xa_for_each(&domain->iommu_array, i, info) > > > > + qi_flush_piotlb(info->iommu, > > > > + domain_id_iommu(domain, info->iommu), > > > > + IOMMU_NO_PASID, addr, npages, ih, NULL); > > > > > > This locking on the xarray is messed up throughout the driver. There > > > could be a concurrent detach at this point which will free info and > > > UAF this. > > > > hmmm, xa_for_each() takes and releases rcu lock, and according to the > > domain_detach_iommu(), info is freed after xa_erase(). For an existing > > info stored in xarray, xa_erase() should return after rcu lock is released. > > is it? Any idea? @Baolu > > I once thought locking for xarray is self-contained. I need more thought > on this before taking further action. The locking of xarray itself is self-contained, but once it returns a value then the user has to provide locking to protect the value. In this case the xarray storage memory itself will not UAF but the info pointer to memory returned from the xarray will. I've been thinking arm/amd/intel all need the same datastructure here, and it is a bit complicated. We should try to make a library to handle it.. It is straightforward except for the RCU list walk for invalidation.. Jason
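[Editorial note] A simplified sketch of the race Jason is pointing at (illustrative only; the function and field names are the ones quoted above):

/*
 * xa_for_each() takes the RCU read lock, which keeps the xarray nodes
 * alive, but nothing keeps the pointed-to 'info' alive:
 *
 *   CPU 0: intel_nested_flush_cache()     CPU 1: domain_detach_iommu()
 *   ---------------------------------     ----------------------------
 *   xa_for_each(&domain->iommu_array,
 *               i, info) {
 *                                         xa_erase(&domain->iommu_array, i);
 *                                         kfree(info);        <- info freed
 *       qi_flush_piotlb(info->iommu,      <- use-after-free
 *                       ...);
 *   }
 *
 * Because the detach path frees 'info' with kfree() rather than
 * deferring the free past an RCU grace period, the RCU read lock held
 * inside xa_for_each() does not protect the returned pointer; the
 * caller has to provide its own locking (or the free has to be
 * RCU-delayed) for the values looked up from the xarray.
 */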
Re: [PATCH net-next v2 2/3] net: gro: parse ipv6 ext headers without frag0 invalidation
Eric Dumazet wrote: > On Tue, Jan 2, 2024 at 2:25 PM Richard Gobert > wrote: >> >> The existing code always pulls the IPv6 header and sets the transport >> offset initially. Then optionally again pulls any extension headers in >> ipv6_gso_pull_exthdrs and sets the transport offset again on return from >> that call. skb->data is set at the start of the first extension header >> before calling ipv6_gso_pull_exthdrs, and must disable the frag0 >> optimization because that function uses pskb_may_pull/pskb_pull instead of >> skb_gro_ helpers. It sets the GRO offset to the TCP header with >> skb_gro_pull and sets the transport header. Then returns skb->data to its >> position before this block. >> >> This commit introduces a new helper function - ipv6_gro_pull_exthdrs - >> which is used in ipv6_gro_receive to pull ipv6 ext headers instead of >> ipv6_gso_pull_exthdrs. Thus, there is no modification of skb->data, all >> operations use skb_gro_* helpers, and the frag0 fast path can be taken for >> IPv6 packets with ext headers. >> >> Signed-off-by: Richard Gobert >> Reviewed-by: Willem de Bruijn >> --- >> include/net/ipv6.h | 1 + >> net/ipv6/ip6_offload.c | 51 +- >> 2 files changed, 42 insertions(+), 10 deletions(-) >> >> diff --git a/include/net/ipv6.h b/include/net/ipv6.h >> index 78d38dd88aba..217240efa182 100644 >> --- a/include/net/ipv6.h >> +++ b/include/net/ipv6.h >> @@ -26,6 +26,7 @@ struct ip_tunnel_info; >> #define SIN6_LEN_RFC2133 24 >> >> #define IPV6_MAXPLEN 65535 >> +#define IPV6_MIN_EXTHDR_LEN8 > > // Hmm see my following comment. > >> >> /* >> * NextHeader field of IPv6 header >> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c >> index 0e0b5fed0995..c07111d8f56a 100644 >> --- a/net/ipv6/ip6_offload.c >> +++ b/net/ipv6/ip6_offload.c >> @@ -37,6 +37,40 @@ >> INDIRECT_CALL_L4(cb, f2, f1, head, skb);\ >> }) >> >> +static int ipv6_gro_pull_exthdrs(struct sk_buff *skb, int off, int proto) >> +{ >> + const struct net_offload *ops = NULL; >> + struct ipv6_opt_hdr *opth; >> + >> + for (;;) { >> + int len; >> + >> + ops = rcu_dereference(inet6_offloads[proto]); >> + >> + if (unlikely(!ops)) >> + break; >> + >> + if (!(ops->flags & INET6_PROTO_GSO_EXTHDR)) >> + break; >> + >> + opth = skb_gro_header(skb, off + IPV6_MIN_EXTHDR_LEN, off); > > I do not see a compelling reason for adding yet another constant here. > > I would stick to > >opth = skb_gro_header(skb, off + sizeof(*opth), off); > > Consistency with similar helpers is desirable. > In terms of consistency - similar helper functions (ipv6_gso_pull_exthdrs, ipv6_parse_hopopts) also pull 8 bytes at the beginning of every IPv6 extension header, because the minimum extension header length is 8 bytes. sizeof(*opth) = 2, so for an IPv6 packet with one extension header with a common length of 8 bytes, pskb_may_pull will be called twice: first with length = 2 and again with length = 8, which might not be ideal when parsing non-linear packets. Willem suggested adding a constant to make the code more self-documenting. >> + if (unlikely(!opth)) >> + break; >> + >> + len = ipv6_optlen(opth); >> + >> + opth = skb_gro_header(skb, off + len, off); > > Note this call will take care of precise pull. 
> >> + if (unlikely(!opth)) >> + break; >> + proto = opth->nexthdr; >> + >> + off += len; >> + } >> + >> + skb_gro_pull(skb, off - skb_network_offset(skb)); >> + return proto; >> +} >> + >> static int ipv6_gso_pull_exthdrs(struct sk_buff *skb, int proto) >> { >> const struct net_offload *ops = NULL; >> @@ -203,28 +237,25 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff >> *ipv6_gro_receive(struct list_head *head, >> goto out; >> >> skb_set_network_header(skb, off); >> - skb_gro_pull(skb, sizeof(*iph)); >> - skb_set_transport_header(skb, skb_gro_offset(skb)); >> >> - flush += ntohs(iph->payload_len) != skb_gro_len(skb); >> + flush += ntohs(iph->payload_len) != skb->len - hlen; >> >> proto = iph->nexthdr; >> ops = rcu_dereference(inet6_offloads[proto]); >> if (!ops || !ops->callbacks.gro_receive) { >> - pskb_pull(skb, skb_gro_offset(skb)); >> - skb_gro_frag0_invalidate(skb); >> - proto = ipv6_gso_pull_exthdrs(skb, proto); >> - skb_gro_pull(skb, -skb_transport_offset(skb)); >> - skb_reset_transport_header(skb); >> - __skb_push(skb, skb_gro_offset(skb)); >> + proto = ipv6_gro_pull_exthdrs(skb, hlen, proto); >> >> ops = rcu_dereference(inet6_offloads[proto]); >> if (!op
Re: [PATCH net-next v2 2/3] net: gro: parse ipv6 ext headers without frag0 invalidation
On Wed, Jan 3, 2024 at 2:08 PM Richard Gobert wrote: > > > > Eric Dumazet wrote: > > On Tue, Jan 2, 2024 at 2:25 PM Richard Gobert > > wrote: > >> > >> The existing code always pulls the IPv6 header and sets the transport > >> offset initially. Then optionally again pulls any extension headers in > >> ipv6_gso_pull_exthdrs and sets the transport offset again on return from > >> that call. skb->data is set at the start of the first extension header > >> before calling ipv6_gso_pull_exthdrs, and must disable the frag0 > >> optimization because that function uses pskb_may_pull/pskb_pull instead of > >> skb_gro_ helpers. It sets the GRO offset to the TCP header with > >> skb_gro_pull and sets the transport header. Then returns skb->data to its > >> position before this block. > >> > >> This commit introduces a new helper function - ipv6_gro_pull_exthdrs - > >> which is used in ipv6_gro_receive to pull ipv6 ext headers instead of > >> ipv6_gso_pull_exthdrs. Thus, there is no modification of skb->data, all > >> operations use skb_gro_* helpers, and the frag0 fast path can be taken for > >> IPv6 packets with ext headers. > >> > >> Signed-off-by: Richard Gobert > >> Reviewed-by: Willem de Bruijn > >> --- > >> include/net/ipv6.h | 1 + > >> net/ipv6/ip6_offload.c | 51 +- > >> 2 files changed, 42 insertions(+), 10 deletions(-) > >> > >> diff --git a/include/net/ipv6.h b/include/net/ipv6.h > >> index 78d38dd88aba..217240efa182 100644 > >> --- a/include/net/ipv6.h > >> +++ b/include/net/ipv6.h > >> @@ -26,6 +26,7 @@ struct ip_tunnel_info; > >> #define SIN6_LEN_RFC2133 24 > >> > >> #define IPV6_MAXPLEN 65535 > >> +#define IPV6_MIN_EXTHDR_LEN8 > > > > // Hmm see my following comment. > > > >> > >> /* > >> * NextHeader field of IPv6 header > >> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c > >> index 0e0b5fed0995..c07111d8f56a 100644 > >> --- a/net/ipv6/ip6_offload.c > >> +++ b/net/ipv6/ip6_offload.c > >> @@ -37,6 +37,40 @@ > >> INDIRECT_CALL_L4(cb, f2, f1, head, skb);\ > >> }) > >> > >> +static int ipv6_gro_pull_exthdrs(struct sk_buff *skb, int off, int proto) > >> +{ > >> + const struct net_offload *ops = NULL; > >> + struct ipv6_opt_hdr *opth; > >> + > >> + for (;;) { > >> + int len; > >> + > >> + ops = rcu_dereference(inet6_offloads[proto]); > >> + > >> + if (unlikely(!ops)) > >> + break; > >> + > >> + if (!(ops->flags & INET6_PROTO_GSO_EXTHDR)) > >> + break; > >> + > >> + opth = skb_gro_header(skb, off + IPV6_MIN_EXTHDR_LEN, off); > > > > I do not see a compelling reason for adding yet another constant here. > > > > I would stick to > > > >opth = skb_gro_header(skb, off + sizeof(*opth), off); > > > > Consistency with similar helpers is desirable. > > > > In terms of consistency - similar helper functions (ipv6_gso_pull_exthdrs, > ipv6_parse_hopopts) also pull 8 bytes at the beginning of every IPv6 > extension header, because the minimum extension header length is 8 bytes. > > sizeof(*opth) = 2, so for an IPv6 packet with one extension header with a > common length of 8 bytes, pskb_may_pull will be called twice: first with > length = 2 and again with length = 8, which might not be ideal when parsing > non-linear packets. > > Willem suggested adding a constant to make the code more self-documenting. Hmm... I was looking at skb_checksum_setup_ipv6() , it uses skb_maybe_pull_tail( ... sizeof(struct ipv6_opt_hdr)) ipv6_skip_exthdr() also uses sizeof(struct ipv6_opt_hdr) ip6_tnl_parse_tlv_enc_lim also uses the same. hbh_mt6(), ipv6header_mt6(), .. same... 
ip6_find_1stfragopt(), get_ipv6_ext_hdrs(), tcf_csum_ipv6(), mip6_rthdr_offset() same

So it seems you found two helpers that went the other way.

If you think pulling 8 bytes first is a win, I would suggest a standalone patch, adding the magic constant and using it in all places, so that a casual reader can make sense of the magical 8 value.
Re: [PATCH net-next v2 2/3] net: gro: parse ipv6 ext headers without frag0 invalidation
Eric Dumazet wrote: > > > Hmm... I was looking at > > skb_checksum_setup_ipv6() , it uses skb_maybe_pull_tail( ... > sizeof(struct ipv6_opt_hdr)) > ipv6_skip_exthdr() also uses sizeof(struct ipv6_opt_hdr) > ip6_tnl_parse_tlv_enc_lim also uses the same. > hbh_mt6(), ipv6header_mt6(), .. same... > ip6_find_1stfragopt(), get_ipv6_ext_hdrs(), tcf_csum_ipv6(), > mip6_rthdr_offset() same > > So it seems you found two helpers that went the other way. > > If you think pulling 8 bytes first is a win, I would suggest a stand > alone patch, adding the magic constant > using it in all places, so that a casual reader can make sense of the > magical 8 value. I guess pulling 8 bytes first is not such a big advantage. I will submit a v3 with sizeof(*opth) as you suggested.
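[Editorial note] For context, a short worked illustration of where the "8" in this exchange comes from, using only definitions already quoted in the thread (sketch, not a patch):

/*
 * struct ipv6_opt_hdr is just the 2-byte prefix shared by the
 * extension headers:
 *
 *   struct ipv6_opt_hdr {
 *           __u8 nexthdr;
 *           __u8 hdrlen;    // length in 8-byte units, excluding the first 8
 *   };
 *
 * ipv6_optlen(p) expands to ((p)->hdrlen + 1) << 3, so the smallest
 * possible extension header is 8 bytes (hdrlen == 0).
 *
 * With sizeof(*opth) as the first pull, the common 8-byte header needs
 * two header checks:
 *
 *   opth = skb_gro_header(skb, off + sizeof(*opth), off);  // 2 bytes
 *   len  = ipv6_optlen(opth);                              // >= 8
 *   opth = skb_gro_header(skb, off + len, off);            // full header
 *
 * whereas pulling 8 bytes up front makes the second check effectively
 * free for hdrlen == 0.  The conclusion above is that the saving is
 * not worth a new constant, so v3 keeps the conventional sizeof(*opth)
 * form.
 */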
[PATCH net-next v3 0/3] net: gro: reduce extension header parsing overhead
This series attempts to reduce the parsing overhead of IPv6 extension headers in GRO and GSO, by removing extension header specific code and enabling the frag0 fast path. The following changes were made: - Removed some unnecessary HBH conditionals by adding HBH offload to inet6_offloads - Added a utility function to support frag0 fast path in ipv6_gro_receive - Added selftests for IPv6 packets with extension headers in GRO Richard v2 -> v3: - Removed previously added IPv6 extension header length constant and using sizeof(*opth) instead. - Removed unnecessary conditional in gro selftest framework - v2: https://lore.kernel.org/netdev/127b8199-1cd4-42d7-9b2b-875abaad9...@gmail.com/ v1 -> v2: - Added a minimum IPv6 extension header length constant to make code self documenting. - Added new selftest which checks that packets with different extension header payloads do not coalesce. - Added more info in the second commit message regarding the code changes. - v1: https://lore.kernel.org/netdev/f4eff69d-3917-4c42-8c6b-d09597ac4...@gmail.com/ Richard Gobert (3): net: gso: add HBH extension header offload support net: gro: parse ipv6 ext headers without frag0 invalidation selftests/net: fix GRO coalesce test and add ext header coalesce tests net/ipv6/exthdrs_offload.c| 11 net/ipv6/ip6_offload.c| 76 + tools/testing/selftests/net/gro.c | 93 +-- 3 files changed, 150 insertions(+), 30 deletions(-) -- 2.36.1
[PATCH net-next v3 1/3] net: gso: add HBH extension header offload support
This commit adds net_offload to IPv6 Hop-by-Hop extension headers (as it is done for routing and dstopts) since it is supported in GSO and GRO. This allows to remove specific HBH conditionals in GSO and GRO when pulling and parsing an incoming packet. Signed-off-by: Richard Gobert Reviewed-by: Willem de Bruijn Reviewed-by: David Ahern Reviewed-by: Eric Dumazet --- net/ipv6/exthdrs_offload.c | 11 +++ net/ipv6/ip6_offload.c | 25 +++-- 2 files changed, 22 insertions(+), 14 deletions(-) diff --git a/net/ipv6/exthdrs_offload.c b/net/ipv6/exthdrs_offload.c index 06750d65d480..4c00398f4dca 100644 --- a/net/ipv6/exthdrs_offload.c +++ b/net/ipv6/exthdrs_offload.c @@ -16,6 +16,10 @@ static const struct net_offload dstopt_offload = { .flags = INET6_PROTO_GSO_EXTHDR, }; +static const struct net_offload hbh_offload = { + .flags = INET6_PROTO_GSO_EXTHDR, +}; + int __init ipv6_exthdrs_offload_init(void) { int ret; @@ -28,9 +32,16 @@ int __init ipv6_exthdrs_offload_init(void) if (ret) goto out_rt; + ret = inet6_add_offload(&hbh_offload, IPPROTO_HOPOPTS); + if (ret) + goto out_dstopts; + out: return ret; +out_dstopts: + inet6_del_offload(&dstopt_offload, IPPROTO_DSTOPTS); + out_rt: inet6_del_offload(&rthdr_offload, IPPROTO_ROUTING); goto out; diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index d6314287338d..0e0b5fed0995 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -45,15 +45,13 @@ static int ipv6_gso_pull_exthdrs(struct sk_buff *skb, int proto) struct ipv6_opt_hdr *opth; int len; - if (proto != NEXTHDR_HOP) { - ops = rcu_dereference(inet6_offloads[proto]); + ops = rcu_dereference(inet6_offloads[proto]); - if (unlikely(!ops)) - break; + if (unlikely(!ops)) + break; - if (!(ops->flags & INET6_PROTO_GSO_EXTHDR)) - break; - } + if (!(ops->flags & INET6_PROTO_GSO_EXTHDR)) + break; if (unlikely(!pskb_may_pull(skb, 8))) break; @@ -171,13 +169,12 @@ static int ipv6_exthdrs_len(struct ipv6hdr *iph, proto = iph->nexthdr; for (;;) { - if (proto != NEXTHDR_HOP) { - *opps = rcu_dereference(inet6_offloads[proto]); - if (unlikely(!(*opps))) - break; - if (!((*opps)->flags & INET6_PROTO_GSO_EXTHDR)) - break; - } + *opps = rcu_dereference(inet6_offloads[proto]); + if (unlikely(!(*opps))) + break; + if (!((*opps)->flags & INET6_PROTO_GSO_EXTHDR)) + break; + opth = (void *)opth + optlen; optlen = ipv6_optlen(opth); len += optlen; -- 2.36.1
[PATCH net-next v3 2/3] net: gro: parse ipv6 ext headers without frag0 invalidation
The existing code always pulls the IPv6 header and sets the transport offset initially. Then optionally again pulls any extension headers in ipv6_gso_pull_exthdrs and sets the transport offset again on return from that call. skb->data is set at the start of the first extension header before calling ipv6_gso_pull_exthdrs, and must disable the frag0 optimization because that function uses pskb_may_pull/pskb_pull instead of skb_gro_ helpers. It sets the GRO offset to the TCP header with skb_gro_pull and sets the transport header. Then returns skb->data to its position before this block. This commit introduces a new helper function - ipv6_gro_pull_exthdrs - which is used in ipv6_gro_receive to pull ipv6 ext headers instead of ipv6_gso_pull_exthdrs. Thus, there is no modification of skb->data, all operations use skb_gro_* helpers, and the frag0 fast path can be taken for IPv6 packets with ext headers. Signed-off-by: Richard Gobert Reviewed-by: Willem de Bruijn Reviewed-by: David Ahern --- net/ipv6/ip6_offload.c | 51 +- 1 file changed, 41 insertions(+), 10 deletions(-) diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index 0e0b5fed0995..cca64c7809be 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -37,6 +37,40 @@ INDIRECT_CALL_L4(cb, f2, f1, head, skb);\ }) +static int ipv6_gro_pull_exthdrs(struct sk_buff *skb, int off, int proto) +{ + const struct net_offload *ops = NULL; + struct ipv6_opt_hdr *opth; + + for (;;) { + int len; + + ops = rcu_dereference(inet6_offloads[proto]); + + if (unlikely(!ops)) + break; + + if (!(ops->flags & INET6_PROTO_GSO_EXTHDR)) + break; + + opth = skb_gro_header(skb, off + sizeof(*opth), off); + if (unlikely(!opth)) + break; + + len = ipv6_optlen(opth); + + opth = skb_gro_header(skb, off + len, off); + if (unlikely(!opth)) + break; + proto = opth->nexthdr; + + off += len; + } + + skb_gro_pull(skb, off - skb_network_offset(skb)); + return proto; +} + static int ipv6_gso_pull_exthdrs(struct sk_buff *skb, int proto) { const struct net_offload *ops = NULL; @@ -203,28 +237,25 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head, goto out; skb_set_network_header(skb, off); - skb_gro_pull(skb, sizeof(*iph)); - skb_set_transport_header(skb, skb_gro_offset(skb)); - flush += ntohs(iph->payload_len) != skb_gro_len(skb); + flush += ntohs(iph->payload_len) != skb->len - hlen; proto = iph->nexthdr; ops = rcu_dereference(inet6_offloads[proto]); if (!ops || !ops->callbacks.gro_receive) { - pskb_pull(skb, skb_gro_offset(skb)); - skb_gro_frag0_invalidate(skb); - proto = ipv6_gso_pull_exthdrs(skb, proto); - skb_gro_pull(skb, -skb_transport_offset(skb)); - skb_reset_transport_header(skb); - __skb_push(skb, skb_gro_offset(skb)); + proto = ipv6_gro_pull_exthdrs(skb, hlen, proto); ops = rcu_dereference(inet6_offloads[proto]); if (!ops || !ops->callbacks.gro_receive) goto out; - iph = ipv6_hdr(skb); + iph = skb_gro_network_header(skb); + } else { + skb_gro_pull(skb, sizeof(*iph)); } + skb_set_transport_header(skb, skb_gro_offset(skb)); + NAPI_GRO_CB(skb)->proto = proto; flush--; -- 2.36.1
[PATCH net-next v3 3/3] selftests/net: fix GRO coalesce test and add ext header coalesce tests
Currently there is no test which checks that IPv6 extension header packets successfully coalesce. This commit adds a test, which verifies two IPv6 packets with HBH extension headers do coalesce, and another test which checks that packets with different extension header data do not coalesce in GRO. I changed the receive socket filter to accept a packet with one extension header. This change exposed a bug in the fragment test -- the old BPF did not accept the fragment packet. I updated correct_num_packets in the fragment test accordingly. Signed-off-by: Richard Gobert Reviewed-by: Willem de Bruijn --- tools/testing/selftests/net/gro.c | 93 +-- 1 file changed, 87 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/net/gro.c b/tools/testing/selftests/net/gro.c index 30024d0ed373..353e1e867fbb 100644 --- a/tools/testing/selftests/net/gro.c +++ b/tools/testing/selftests/net/gro.c @@ -71,6 +71,12 @@ #define MAX_PAYLOAD (IP_MAXPACKET - sizeof(struct tcphdr) - sizeof(struct ipv6hdr)) #define NUM_LARGE_PKT (MAX_PAYLOAD / MSS) #define MAX_HDR_LEN (ETH_HLEN + sizeof(struct ipv6hdr) + sizeof(struct tcphdr)) +#define MIN_EXTHDR_SIZE 8 +#define EXT_PAYLOAD_1 "\x00\x00\x00\x00\x00\x00" +#define EXT_PAYLOAD_2 "\x11\x11\x11\x11\x11\x11" + +#define ipv6_optlen(p) (((p)->hdrlen+1) << 3) /* calculate IPv6 extension header len */ +#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)])) static const char *addr6_src = "fdaa::2"; static const char *addr6_dst = "fdaa::1"; @@ -104,7 +110,7 @@ static void setup_sock_filter(int fd) const int dport_off = tcp_offset + offsetof(struct tcphdr, dest); const int ethproto_off = offsetof(struct ethhdr, h_proto); int optlen = 0; - int ipproto_off; + int ipproto_off, opt_ipproto_off; int next_off; if (proto == PF_INET) @@ -116,14 +122,30 @@ static void setup_sock_filter(int fd) if (strcmp(testname, "ip") == 0) { if (proto == PF_INET) optlen = sizeof(struct ip_timestamp); - else - optlen = sizeof(struct ip6_frag); + else { + BUILD_BUG_ON(sizeof(struct ip6_hbh) > MIN_EXTHDR_SIZE); + BUILD_BUG_ON(sizeof(struct ip6_dest) > MIN_EXTHDR_SIZE); + BUILD_BUG_ON(sizeof(struct ip6_frag) > MIN_EXTHDR_SIZE); + + /* same size for HBH and Fragment extension header types */ + optlen = MIN_EXTHDR_SIZE; + opt_ipproto_off = ETH_HLEN + sizeof(struct ipv6hdr) + + offsetof(struct ip6_ext, ip6e_nxt); + } } + /* this filter validates the following: +* - packet is IPv4/IPv6 according to the running test. +* - packet is TCP. Also handles the case of one extension header and then TCP. +* - checks the packet tcp dport equals to DPORT. Also handles the case of one +*extension header and then TCP. 
+*/ struct sock_filter filter[] = { BPF_STMT(BPF_LD + BPF_H + BPF_ABS, ethproto_off), - BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, ntohs(ethhdr_proto), 0, 7), + BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, ntohs(ethhdr_proto), 0, 9), BPF_STMT(BPF_LD + BPF_B + BPF_ABS, ipproto_off), + BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_TCP, 2, 0), + BPF_STMT(BPF_LD + BPF_B + BPF_ABS, opt_ipproto_off), BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_TCP, 0, 5), BPF_STMT(BPF_LD + BPF_H + BPF_ABS, dport_off), BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, DPORT, 2, 0), @@ -576,6 +598,39 @@ static void add_ipv4_ts_option(void *buf, void *optpkt) iph->check = checksum_fold(iph, sizeof(struct iphdr) + optlen, 0); } +static void add_ipv6_exthdr(void *buf, void *optpkt, __u8 exthdr_type, char *ext_payload) +{ + struct ipv6_opt_hdr *exthdr = (struct ipv6_opt_hdr *)(optpkt + tcp_offset); + struct ipv6hdr *iph = (struct ipv6hdr *)(optpkt + ETH_HLEN); + char *exthdr_payload_start = (char *)(exthdr + 1); + + exthdr->hdrlen = 0; + exthdr->nexthdr = IPPROTO_TCP; + + memcpy(exthdr_payload_start, ext_payload, MIN_EXTHDR_SIZE - sizeof(*exthdr)); + + memcpy(optpkt, buf, tcp_offset); + memcpy(optpkt + tcp_offset + MIN_EXTHDR_SIZE, buf + tcp_offset, + sizeof(struct tcphdr) + PAYLOAD_LEN); + + iph->nexthdr = exthdr_type; + iph->payload_len = htons(ntohs(iph->payload_len) + MIN_EXTHDR_SIZE); +} + +static void send_ipv6_exthdr(int fd, struct sockaddr_ll *daddr, char *ext_data1, char *ext_data2) +{ + static char buf[MAX_HDR_LEN + PAYLOAD_LEN]; + static char exthdr_pck[sizeof(buf) + MIN_EXTHD
Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
On Wed, Jan 03, 2024 at 10:24:42AM +0800, Yi Liu wrote: > On 2024/1/3 07:38, Jason Gunthorpe wrote: > > On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote: > > > > I think I misread Yi's narrative: dev_id is a working approach > > > > for VMM to convert to a vRID, while he is asking for a better > > > > alternative :) > > > > > > In concept, dev_id works, but in reality we have problem to get a dev_id > > > for a given device in intel iommu driver, hence I'm asking for help here. > > > :) > > > > I think we just need to solve this one way or another.. Even if you > > use a viommu object you still end up having difficult coupling to > > iommufd > > > > Some: > >iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev) > > > > Callable by a driver (using the driver-callable function > > infrastructure we made for dirty tracking) Is really all that is > > needed here. > > yep, I noticed IOMMUFD_DRIVER was selected by intel iommu driver when > IOMMUFD is configed. Maybe such an API could be compiled when > IOMMUFD_DRIVER is enabled as well. This does address my concern on making > intel iommu driver depending on iommufd. But still need a way to pass in > the iommufd_ctx pointer to intel iommu driver, and store it. Hence intel > iommu driver could call the above iommufd_get_dev_id(). One possible way is > passing it when attaching device to domain and clear it in detach. However, > this seems not ideal as iommufd_ctx information should be static between > bind_iommufd and unbind. But we don't call into intel iommu driver in the > bind and unbind operations. May need to add new iommu op. Any suggestion > here? You can pass the ctx to the invalidate op, it is already implied because the passed iommu_domain is linked to a single iommufd ctx. Jason
Re: [PATCH net-next v3 2/3] net: gro: parse ipv6 ext headers without frag0 invalidation
On Wed, Jan 3, 2024 at 3:44 PM Richard Gobert wrote: > > The existing code always pulls the IPv6 header and sets the transport > offset initially. Then optionally again pulls any extension headers in > ipv6_gso_pull_exthdrs and sets the transport offset again on return from > that call. skb->data is set at the start of the first extension header > before calling ipv6_gso_pull_exthdrs, and must disable the frag0 > optimization because that function uses pskb_may_pull/pskb_pull instead of > skb_gro_ helpers. It sets the GRO offset to the TCP header with > skb_gro_pull and sets the transport header. Then returns skb->data to its > position before this block. > > This commit introduces a new helper function - ipv6_gro_pull_exthdrs - > which is used in ipv6_gro_receive to pull ipv6 ext headers instead of > ipv6_gso_pull_exthdrs. Thus, there is no modification of skb->data, all > operations use skb_gro_* helpers, and the frag0 fast path can be taken for > IPv6 packets with ext headers. > > Signed-off-by: Richard Gobert > Reviewed-by: Willem de Bruijn > Reviewed-by: David Ahern Reviewed-by: Eric Dumazet Thanks !
Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
On Wed, Jan 03, 2024 at 12:01:08PM -0400, Jason Gunthorpe wrote: > On Wed, Jan 03, 2024 at 10:24:42AM +0800, Yi Liu wrote: > > On 2024/1/3 07:38, Jason Gunthorpe wrote: > > > On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote: > > > > > I think I misread Yi's narrative: dev_id is a working approach > > > > > for VMM to convert to a vRID, while he is asking for a better > > > > > alternative :) > > > > > > > > In concept, dev_id works, but in reality we have problem to get a dev_id > > > > for a given device in intel iommu driver, hence I'm asking for help > > > > here. :) > > > > > > I think we just need to solve this one way or another.. Even if you > > > use a viommu object you still end up having difficult coupling to > > > iommufd > > > > > > Some: > > >iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev) > > > > > > Callable by a driver (using the driver-callable function > > > infrastructure we made for dirty tracking) Is really all that is > > > needed here. > > > > yep, I noticed IOMMUFD_DRIVER was selected by intel iommu driver when > > IOMMUFD is configed. Maybe such an API could be compiled when > > IOMMUFD_DRIVER is enabled as well. This does address my concern on making > > intel iommu driver depending on iommufd. But still need a way to pass in > > the iommufd_ctx pointer to intel iommu driver, and store it. Hence intel > > iommu driver could call the above iommufd_get_dev_id(). One possible way is > > passing it when attaching device to domain and clear it in detach. However, > > this seems not ideal as iommufd_ctx information should be static between > > bind_iommufd and unbind. But we don't call into intel iommu driver in the > > bind and unbind operations. May need to add new iommu op. Any suggestion > > here? > > You can pass the ctx to the invalidate op, it is already implied > because the passed iommu_domain is linked to a single iommufd ctx. The device virtual id lookup API needs something similar, yet it likely needs a viommu pointer (or its id) instead? As the table is attached to a viommu while an ictx can have multiple viommus, right? Thanks Nic
Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote: > > You can pass the ctx to the invalidate op, it is already implied > > because the passed iommu_domain is linked to a single iommufd ctx. > > The device virtual id lookup API needs something similar, yet it > likely needs a viommu pointer (or its id) instead? As the table > is attached to a viommu while an ictx can have multiple viommus, > right? Yes, when we get to an API for that it will have to be some op 'invalidate_viommu(..)' and it can get the necessary pointers. The viommu object will have to be some driver object like the iommu_domain. Jason
Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
On Wed, Jan 03, 2024 at 12:58:48PM -0400, Jason Gunthorpe wrote: > On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote: > > > You can pass the ctx to the invalidate op, it is already implied > > > because the passed iommu_domain is linked to a single iommufd ctx. > > > > The device virtual id lookup API needs something similar, yet it > > likely needs a viommu pointer (or its id) instead? As the table > > is attached to a viommu while an ictx can have multiple viommus, > > right? > > Yes, when we get to an API for that it will have to be some op > 'invalidate_viommu(..)' and it can get the necessary pointers. OK! I will try that first. > The viommu object will have to be some driver object like the > iommu_domain. I drafted something like this, linking it to struct iommu_device: +struct iommufd_viommu { + struct iommufd_object obj; + struct iommufd_ctx *ictx; + struct iommu_device *iommu_dev; + struct iommufd_hwpt_paging *hwpt; + /* array of struct iommufd_device, indexed by device virtual id */ + struct xarray device_ids; +}; Thanks Nicolin
Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
On Wed, Jan 03, 2024 at 09:06:23AM -0800, Nicolin Chen wrote: > On Wed, Jan 03, 2024 at 12:58:48PM -0400, Jason Gunthorpe wrote: > > On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote: > > > > You can pass the ctx to the invalidate op, it is already implied > > > > because the passed iommu_domain is linked to a single iommufd ctx. > > > > > > The device virtual id lookup API needs something similar, yet it > > > likely needs a viommu pointer (or its id) instead? As the table > > > is attached to a viommu while an ictx can have multiple viommus, > > > right? > > > > Yes, when we get to an API for that it will have to be some op > > 'invalidate_viommu(..)' and it can get the necessary pointers. > > OK! I will try that first. > > > The viommu object will have to be some driver object like the > > iommu_domain. > > I drafted something like this, linking it to struct iommu_device: > > +struct iommufd_viommu { > + struct iommufd_object obj; > + struct iommufd_ctx *ictx; > + struct iommu_device *iommu_dev; > + struct iommufd_hwpt_paging *hwpt; > + /* array of struct iommufd_device, indexed by device virtual id */ > + struct xarray device_ids; > +}; The driver would have to create it and there would be some driver specific enclosing struct to go with it Perhaps device_ids goes in the driver specific struct, I don't know. Not sure it should have hwpt at all, probably vmid should come from the driver specific struct in some driver specific way Jason
Re: [PATCH RESEND v4 0/3] livepatch: Move modules to selftests and add a new test
On Wed, Dec 20, 2023 at 01:53:11PM -0300, Marcos Paulo de Souza wrote: > Changes in v4: > * Documented how to compile the livepatch selftests without running the > tests (Joe) > * Removed the mention to lib/livepatch on MAINTAINERS file, reported by > checkpatch. > > Changes in v3: > * Rebased on top of v6.6-rc5 > * The commits messages were improved (Thanks Petr!) > * Created TEST_GEN_MODS_DIR variable to point to a directly that contains > kernel > modules, and adapt selftests to build it before running the test. > * Moved test_klp-call_getpid out of test_programs, since the gen_tar > would just copy the generated test programs to the livepatches dir, > and so scripts relying on test_programs/test_klp-call_getpid will fail. > * Added a module_param for klp_pids, describing it's usage. > * Simplified the call_getpid program to ignore the return of getpid syscall, > since we only want to make sure the process transitions correctly to the > patched stated > * The test-syscall.sh not prints a log message showing the number of remaining > processes to transition into to livepatched state, and check_output expects > it > to be 0. > * Added MODULE_AUTHOR and MODULE_DESCRIPTION to test_klp_syscall.c > > - Link to v3: > https://lore.kernel.org/r/20231031-send-lp-kselftests-v3-0-2b1655c26...@suse.com > - Link to v2: > https://lore.kernel.org/linux-kselftest/20220630141226.2802-1-mpdeso...@suse.com/ > > This patchset moves the current kernel testing livepatch modules from > lib/livepatches to tools/testing/selftest/livepatch/test_modules, and compiles > them as out-of-tree modules before testing. > > There is also a new test being added. This new test exercises multiple > processes > calling a syscall, while a livepatch patched the syscall. > > Why this move is an improvement: > * The modules are now compiled as out-of-tree modules against the current > running kernel, making them capable of being tested on different systems > with > newer or older kernels. > * Such approach now needs kernel-devel package to be installed, since they are > out-of-tree modules. These can be generated by running "make rpm-pkg" in the > kernel source. > > What needs to be solved: > * Currently gen_tar only packages the resulting binaries of the tests, and not > the sources. For the current approach, the newly added modules would be > compiled and then packaged. It works when testing on a system with the same > kernel version. But it will fail when running on a machine with different > kernel > version, since module was compiled against the kernel currently running. > > This is not a new problem, just aligning the expectations. For the current > approach to be truly system agnostic gen_tar would need to include the > module > and program sources to be compiled in the target systems. > > Thanks in advance! 
> Marcos > > Signed-off-by: Marcos Paulo de Souza > --- > Marcos Paulo de Souza (3): > kselftests: lib.mk: Add TEST_GEN_MODS_DIR variable > livepatch: Move tests from lib/livepatch to selftests/livepatch > selftests: livepatch: Test livepatching a heavily called syscall > > Documentation/dev-tools/kselftest.rst | 4 + > MAINTAINERS| 1 - > arch/s390/configs/debug_defconfig | 1 - > arch/s390/configs/defconfig| 1 - > lib/Kconfig.debug | 22 > lib/Makefile | 2 - > lib/livepatch/Makefile | 14 --- > tools/testing/selftests/lib.mk | 20 +++- > tools/testing/selftests/livepatch/Makefile | 5 +- > tools/testing/selftests/livepatch/README | 25 +++-- > tools/testing/selftests/livepatch/config | 1 - > tools/testing/selftests/livepatch/functions.sh | 34 +++--- > .../testing/selftests/livepatch/test-callbacks.sh | 50 - > tools/testing/selftests/livepatch/test-ftrace.sh | 6 +- > .../testing/selftests/livepatch/test-livepatch.sh | 10 +- > .../selftests/livepatch/test-shadow-vars.sh| 2 +- > tools/testing/selftests/livepatch/test-state.sh| 18 ++-- > tools/testing/selftests/livepatch/test-syscall.sh | 53 ++ > tools/testing/selftests/livepatch/test-sysfs.sh| 6 +- > .../selftests/livepatch/test_klp-call_getpid.c | 44 > .../selftests/livepatch/test_modules/Makefile | 20 > .../test_modules}/test_klp_atomic_replace.c| 0 > .../test_modules}/test_klp_callbacks_busy.c| 0 > .../test_modules}/test_klp_callbacks_demo.c| 0 > .../test_modules}/test_klp_callbacks_demo2.c | 0 > .../test_modules}/test_klp_callbacks_mod.c | 0 > .../livepatch/test_modules}/test_klp_livepatch.c | 0 > .../livepatch/test_modules}/test_klp_shadow_vars.c | 0 > .../livepatch/test_modules}/test_klp_state.c
Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
On Wed, Jan 03, 2024 at 01:52:02PM -0400, Jason Gunthorpe wrote: > On Wed, Jan 03, 2024 at 09:06:23AM -0800, Nicolin Chen wrote: > > On Wed, Jan 03, 2024 at 12:58:48PM -0400, Jason Gunthorpe wrote: > > > On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote: > > > > > You can pass the ctx to the invalidate op, it is already implied > > > > > because the passed iommu_domain is linked to a single iommufd ctx. > > > > > > > > The device virtual id lookup API needs something similar, yet it > > > > likely needs a viommu pointer (or its id) instead? As the table > > > > is attached to a viommu while an ictx can have multiple viommus, > > > > right? > > > > > > Yes, when we get to an API for that it will have to be some op > > > 'invalidate_viommu(..)' and it can get the necessary pointers. > > > > OK! I will try that first. > > > > > The viommu object will have to be some driver object like the > > > iommu_domain. > > > > I drafted something like this, linking it to struct iommu_device: > > > > +struct iommufd_viommu { > > + struct iommufd_object obj; > > + struct iommufd_ctx *ictx; > > + struct iommu_device *iommu_dev; > > + struct iommufd_hwpt_paging *hwpt; > > + /* array of struct iommufd_device, indexed by device virtual id */ > > + struct xarray device_ids; > > +}; > > The driver would have to create it and there would be some driver > specific enclosing struct to go with it > > Perhaps device_ids goes in the driver specific struct, I don't know. +struct iommufd_viommu { + struct iommufd_object obj; + struct iommufd_ctx *ictx; + struct iommu_device *iommu_dev; + struct iommufd_hwpt_paging *hwpt; /* maybe unneeded */ + + int vmid; + + union iommu_driver_user_data { + struct iommu_driver_user_vtd; + struct iommu_driver_user_arm_smmuv3; + struct iommu_driver_user_amd_viommu; + }; +}; Then iommu_ops would need something like: iommu_user_alloc/free(struct iommu_device *iommu_dev, union *iommu_driver_user_data, int *vmid); iommu_user_set/unset_dev_id(union iommu_driver_user_data, struct device* dev. u32/u64 id); iommu_user_invalidate(union iommu_driver_user_data *iommu, struct iommu_user_data_array *array); Comments and ideas on better naming convention? > Not sure it should have hwpt at all, probably vmid should come from > the driver specific struct in some driver specific way The idea having a hwpt pointer is to share the paging hwpt among the viommu objects. I don't think it "shouldn't have", yet likely we can avoid it depending on whether it will have some use or not in the context. Nicolin
Re: [PATCH RESEND v4 1/3] kselftests: lib.mk: Add TEST_GEN_MODS_DIR variable
On 1/2/24 15:31, Joe Lawrence wrote: On Wed, Dec 20, 2023 at 01:53:12PM -0300, Marcos Paulo de Souza wrote: Add TEST_GEN_MODS_DIR variable for kselftests. It can point to a directory containing kernel modules that will be used by selftest scripts. The modules are built as external modules for the running kernel. As a result they are always binary compatible and the same tests can be used for older or newer kernels. The build requires "kernel-devel" package to be installed. For example, in the upstream sources, the rpm devel package is produced by "make rpm-pkg" The modules can be built independently by make -C tools/testing/selftests/livepatch/ or they will be automatically built before running the tests via make -C tools/testing/selftests/livepatch/ run_tests Note that they are _not_ built when running the standalone tests by calling, for example, ./test-state.sh. Signed-off-by: Marcos Paulo de Souza --- Documentation/dev-tools/kselftest.rst | 4 tools/testing/selftests/lib.mk| 20 +++- 2 files changed, 19 insertions(+), 5 deletions(-) diff --git a/Documentation/dev-tools/kselftest.rst b/Documentation/dev-tools/kselftest.rst index ab376b316c36..7f3582a67318 100644 --- a/Documentation/dev-tools/kselftest.rst +++ b/Documentation/dev-tools/kselftest.rst @@ -245,6 +245,10 @@ Contributing new tests (details) TEST_PROGS, TEST_GEN_PROGS mean it is the executable tested by default. + TEST_GEN_MODS_DIR should be used by tests that require modules to be built + before the test starts. The variable will contain the name of the directory + containing the modules. + TEST_CUSTOM_PROGS should be used by tests that require custom build rules and prevent common build rule use. diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk index 118e0964bda9..6c7c5a0112cf 100644 --- a/tools/testing/selftests/lib.mk +++ b/tools/testing/selftests/lib.mk @@ -70,12 +70,15 @@ KHDR_INCLUDES := -isystem $(KHDR_DIR) # TEST_PROGS are for test shell scripts. # TEST_CUSTOM_PROGS and TEST_PROGS will be run by common run_tests # and install targets. Common clean doesn't touch them. +# TEST_GEN_MODS_DIR is used to specify a directory with modules to be built +# before the test executes. These modules are cleaned on the clean target as well. 
TEST_GEN_PROGS := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS)) TEST_GEN_PROGS_EXTENDED := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS_EXTENDED)) TEST_GEN_FILES := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_FILES)) +TEST_GEN_MODS_DIR := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_MODS_DIR)) all: kernel_header_files $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) \ - $(TEST_GEN_FILES) + $(TEST_GEN_FILES) $(if $(TEST_GEN_MODS_DIR),gen_mods_dir) kernel_header_files: @ls $(KHDR_DIR)/linux/*.h >/dev/null 2>/dev/null; \ @@ -105,8 +108,8 @@ endef run_tests: all ifdef building_out_of_srctree - @if [ "X$(TEST_PROGS)$(TEST_PROGS_EXTENDED)$(TEST_FILES)" != "X" ]; then \ - rsync -aq --copy-unsafe-links $(TEST_PROGS) $(TEST_PROGS_EXTENDED) $(TEST_FILES) $(OUTPUT); \ + @if [ "X$(TEST_PROGS)$(TEST_PROGS_EXTENDED)$(TEST_FILES)$(TEST_GEN_MODS_DIR)" != "X" ]; then \ + rsync -aq --copy-unsafe-links $(TEST_PROGS) $(TEST_PROGS_EXTENDED) $(TEST_FILES) $(TEST_GEN_MODS_DIR) $(OUTPUT); \ fi @if [ "X$(TEST_PROGS)" != "X" ]; then \ $(call RUN_TESTS, $(TEST_GEN_PROGS) $(TEST_CUSTOM_PROGS) \ @@ -118,6 +121,12 @@ else @$(call RUN_TESTS, $(TEST_GEN_PROGS) $(TEST_CUSTOM_PROGS) $(TEST_PROGS)) endif +gen_mods_dir: + $(Q)$(MAKE) -C $(TEST_GEN_MODS_DIR) + +clean_mods_dir: + $(Q)$(MAKE) -C $(TEST_GEN_MODS_DIR) clean + define INSTALL_SINGLE_RULE $(if $(INSTALL_LIST),@mkdir -p $(INSTALL_PATH)) $(if $(INSTALL_LIST),rsync -a --copy-unsafe-links $(INSTALL_LIST) $(INSTALL_PATH)/) @@ -131,6 +140,7 @@ define INSTALL_RULE $(eval INSTALL_LIST = $(TEST_CUSTOM_PROGS)) $(INSTALL_SINGLE_RULE) $(eval INSTALL_LIST = $(TEST_GEN_PROGS_EXTENDED)) $(INSTALL_SINGLE_RULE) $(eval INSTALL_LIST = $(TEST_GEN_FILES)) $(INSTALL_SINGLE_RULE) + $(eval INSTALL_LIST = $(TEST_GEN_MODS_DIR)) $(INSTALL_SINGLE_RULE) Hi Marcos, Sorry for the late reply on this, but I'm reviewing this version by trying to retrofit it into our selftest packaging (pre-build the test module .ko's and stash those into an rpm rather than building on the test host). Since $TEST_GEN_MODS_DIR is treated as a directory, I found that the selftest install target copies a bunch of intermediate object and kbuild files: $ mkdir /tmp/test-install $ make KDIR=$(pwd) INSTALL_PATH=/tmp/test-install TARGETS=livepatch \ -C tools/testing/selftests/ install [ ... builds livepatch selftests ... ] the rsync in question: rsync -a --copy-unsafe-links /home/jolawren/src/kernel/tools/testing/s
[PATCH bpf-next 0/2] Annotate kfuncs in .BTF_ids section
This is a bpf-treewide change that annotates all kfuncs as such inside .BTF_ids. This annotation eventually allows us to automatically generate kfunc prototypes from bpftool. We store this metadata inside a yet-unused flags field inside struct btf_id_set8 (thanks Kumar!). pahole will be taught where to look. More details about the full chain of events are available in commit 2's description. Daniel Xu (2): bpf: btf: Support optional flags for BTF_SET8 sets bpf: treewide: Annotate BPF kfuncs in BTF drivers/hid/bpf/hid_bpf_dispatch.c | 4 ++-- fs/verity/measure.c | 2 +- include/linux/btf_ids.h | 17 - kernel/bpf/btf.c| 3 +++ kernel/bpf/cpumask.c| 2 +- kernel/bpf/helpers.c| 4 ++-- kernel/bpf/map_iter.c | 2 +- kernel/cgroup/rstat.c | 2 +- kernel/trace/bpf_trace.c| 4 ++-- net/bpf/test_run.c | 4 ++-- net/core/filter.c | 8 net/core/xdp.c | 2 +- net/ipv4/bpf_tcp_ca.c | 2 +- net/ipv4/fou_bpf.c | 2 +- net/ipv4/tcp_bbr.c | 2 +- net/ipv4/tcp_cubic.c| 2 +- net/ipv4/tcp_dctcp.c| 2 +- net/netfilter/nf_conntrack_bpf.c| 2 +- net/netfilter/nf_nat_bpf.c | 2 +- net/xfrm/xfrm_interface_bpf.c | 2 +- net/xfrm/xfrm_state_bpf.c | 2 +- .../selftests/bpf/bpf_testmod/bpf_testmod.c | 2 +- 22 files changed, 42 insertions(+), 32 deletions(-) -- 2.42.1
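[Editorial note] To make the intended flow concrete, here is a minimal sketch of a kfunc set tagged with the new flag, following the pattern the series applies treewide; the kfunc itself and the program type chosen for registration are placeholders, not part of the series:

__bpf_kfunc_start_defs();

/* Hypothetical kfunc, used only for illustration. */
__bpf_kfunc int bpf_example_add_one(int x)
{
	return x + 1;
}

__bpf_kfunc_end_defs();

/* Tagging the set with BTF_SET8_KFUNC is what lets pahole emit the
 * "bpf_kfunc" BTF_DECL_TAG for every function in the set. */
BTF_SET8_START(example_kfunc_ids, BTF_SET8_KFUNC)
BTF_ID_FLAGS(func, bpf_example_add_one)
BTF_SET8_END(example_kfunc_ids)

static const struct btf_kfunc_id_set example_kfunc_set = {
	.owner = THIS_MODULE,
	.set   = &example_kfunc_ids,
};

/* Registered as usual, e.g. from module init; a set registered without
 * the flag would trip the WARN_ON() added in patch 2:
 *
 *	register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &example_kfunc_set);
 */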
[PATCH bpf-next 2/2] bpf: treewide: Annotate BPF kfuncs in BTF
This commit marks kfuncs as such inside the .BTF_ids section. The upshot of these annotations is that we'll be able to automatically generate kfunc prototypes for downstream users. The process is as follows: 1. In source, tag kfunc BTF_SET8 sets with BTF_SET8_KFUNC flag 2. During build, pahole injects into BTF a "bpf_kfunc" BTF_DECL_TAG for each function inside BTF_SET8_KFUNC sets 3. At runtime, vmlinux or module BTF is made available in sysfs 4. At runtime, bpftool (or similar) can look at provided BTF and generate appropriate prototypes for functions with "bpf_kfunc" tag To ensure future kfunc are similarly tagged, we add a WARN_ON() inside kfunc registration so developers will notice issues duringn development. Signed-off-by: Daniel Xu --- drivers/hid/bpf/hid_bpf_dispatch.c| 4 ++-- fs/verity/measure.c | 2 +- include/linux/btf_ids.h | 3 +++ kernel/bpf/btf.c | 3 +++ kernel/bpf/cpumask.c | 2 +- kernel/bpf/helpers.c | 4 ++-- kernel/bpf/map_iter.c | 2 +- kernel/cgroup/rstat.c | 2 +- kernel/trace/bpf_trace.c | 4 ++-- net/bpf/test_run.c| 4 ++-- net/core/filter.c | 8 net/core/xdp.c| 2 +- net/ipv4/bpf_tcp_ca.c | 2 +- net/ipv4/fou_bpf.c| 2 +- net/ipv4/tcp_bbr.c| 2 +- net/ipv4/tcp_cubic.c | 2 +- net/ipv4/tcp_dctcp.c | 2 +- net/netfilter/nf_conntrack_bpf.c | 2 +- net/netfilter/nf_nat_bpf.c| 2 +- net/xfrm/xfrm_interface_bpf.c | 2 +- net/xfrm/xfrm_state_bpf.c | 2 +- tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c | 2 +- 22 files changed, 33 insertions(+), 27 deletions(-) diff --git a/drivers/hid/bpf/hid_bpf_dispatch.c b/drivers/hid/bpf/hid_bpf_dispatch.c index d9ef45fcaeab..e277b74e8831 100644 --- a/drivers/hid/bpf/hid_bpf_dispatch.c +++ b/drivers/hid/bpf/hid_bpf_dispatch.c @@ -172,7 +172,7 @@ hid_bpf_get_data(struct hid_bpf_ctx *ctx, unsigned int offset, const size_t rdwr * The following set contains all functions we agree BPF programs * can use. 
*/ -BTF_SET8_START(hid_bpf_kfunc_ids) +BTF_SET8_START(hid_bpf_kfunc_ids, BTF_SET8_KFUNC) BTF_ID_FLAGS(func, hid_bpf_get_data, KF_RET_NULL) BTF_SET8_END(hid_bpf_kfunc_ids) @@ -440,7 +440,7 @@ static const struct btf_kfunc_id_set hid_bpf_fmodret_set = { }; /* for syscall HID-BPF */ -BTF_SET8_START(hid_bpf_syscall_kfunc_ids) +BTF_SET8_START(hid_bpf_syscall_kfunc_ids, BTF_SET8_KFUNC) BTF_ID_FLAGS(func, hid_bpf_attach_prog) BTF_ID_FLAGS(func, hid_bpf_allocate_context, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, hid_bpf_release_context, KF_RELEASE) diff --git a/fs/verity/measure.c b/fs/verity/measure.c index bf7a5f4cccaf..7e6c4ad9f8ce 100644 --- a/fs/verity/measure.c +++ b/fs/verity/measure.c @@ -159,7 +159,7 @@ __bpf_kfunc int bpf_get_fsverity_digest(struct file *file, struct bpf_dynptr_ker __bpf_kfunc_end_defs(); -BTF_SET8_START(fsverity_set_ids) +BTF_SET8_START(fsverity_set_ids, BTF_SET8_KFUNC) BTF_ID_FLAGS(func, bpf_get_fsverity_digest, KF_TRUSTED_ARGS) BTF_SET8_END(fsverity_set_ids) diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h index 88f914579fa1..771e29762a2d 100644 --- a/include/linux/btf_ids.h +++ b/include/linux/btf_ids.h @@ -8,6 +8,9 @@ struct btf_id_set { u32 ids[]; }; +/* This flag implies BTF_SET8 holds kfunc(s) */ +#define BTF_SET8_KFUNC (1 << 0) + struct btf_id_set8 { u32 cnt; u32 flags; diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 51e8b4bee0c8..b8ba00a4179f 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -7769,6 +7769,9 @@ static int __register_btf_kfunc_id_set(enum btf_kfunc_hook hook, struct btf *btf; int ret, i; + /* All kfuncs need to be tagged as such in BTF */ + WARN_ON(!(kset->set->flags & BTF_SET8_KFUNC)); + btf = btf_get_module_btf(kset->owner); if (!btf) { if (!kset->owner && IS_ENABLED(CONFIG_DEBUG_INFO_BTF)) { diff --git a/kernel/bpf/cpumask.c b/kernel/bpf/cpumask.c index 2e73533a3811..4155c479cc2f 100644 --- a/kernel/bpf/cpumask.c +++ b/kernel/bpf/cpumask.c @@ -424,7 +424,7 @@ __bpf_kfunc u32 bpf_cpumask_weight(const struct cpumask *cpumask) __bpf_kfunc_end_defs(); -BTF_SET8_START(cpumask_kfunc_btf_ids) +BTF_SET8_START(cpumask_kfunc_btf_ids, BTF_SET8_KFUNC) BTF_ID_FLAGS(func, bpf_cpumask_create, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_cpumask_release, KF_RELEASE) BTF_ID_FLAG
Re: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
On Wed, Jan 03, 2024 at 12:18:35PM -0800, Nicolin Chen wrote: > > The driver would have to create it and there would be some driver > > specific enclosing struct to go with it > > > > Perhaps device_ids goes in the driver specific struct, I don't know. > > +struct iommufd_viommu { > + struct iommufd_object obj; > + struct iommufd_ctx *ictx; > + struct iommu_device *iommu_dev; > + struct iommufd_hwpt_paging *hwpt; /* maybe unneeded */ > + > + int vmid; > + > + union iommu_driver_user_data { > + struct iommu_driver_user_vtd; > + struct iommu_driver_user_arm_smmuv3; > + struct iommu_driver_user_amd_viommu; > + }; Not like that, in the usual container_of way Jason
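[Editorial note] To spell out "the usual container_of way" for readers of the thread, a sketch under assumed names (this is not an agreed-upon design): the core iommufd object stays generic, and each driver embeds it in its own enclosing structure, keeping things like the VMID and the virtual-ID table driver-private instead of in a union:

/* Sketch only; the driver-side names are hypothetical. */
struct iommufd_viommu {
	struct iommufd_object obj;
	struct iommufd_ctx *ictx;
	struct iommu_device *iommu_dev;
};

/* Driver-specific enclosing struct, allocated by the driver itself. */
struct arm_smmu_viommu {
	struct iommufd_viommu core;
	u16 vmid;			/* driver-owned, not in the core object */
	struct xarray device_ids;	/* virtual ID -> device, driver-owned */
};

static inline struct arm_smmu_viommu *
to_smmu_viommu(struct iommufd_viommu *viommu)
{
	return container_of(viommu, struct arm_smmu_viommu, core);
}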
Re: [PATCH v2 net-next] selftests/net: change shebang to bash to support "source"
Hello: This patch was applied to netdev/net-next.git (main) by Jakub Kicinski : On Fri, 29 Dec 2023 21:19:31 +0800 you wrote: > The patch set [1] added a general lib.sh in net selftests, and converted > several test scripts to source the lib.sh. > > unicast_extensions.sh (converted in [1]) and pmtu.sh (converted in [2]) > have a /bin/sh shebang which may point to various shells in different > distributions, but "source" is only available in some of them. For > example, "source" is a built-it function in bash, but it cannot be > used in dash. > > [...] Here is the summary with links: - [v2,net-next] selftests/net: change shebang to bash to support "source" https://git.kernel.org/netdev/net-next/c/05d92cb0e919 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html