Re: [PATCH v4 1/2] compiler_types: Introduce __flex_counter() and family

2025-03-17 Thread Przemek Kitszel

On 3/17/25 10:26, Przemek Kitszel wrote:

On 3/15/25 04:15, Kees Cook wrote:

Introduce __flex_counter() which wraps __builtin_counted_by_ref(),
as newly introduced by GCC[1] and Clang[2]. Use of __flex_counter()
allows access to the counter member of a struct's flexible array member
when it has been annotated with __counted_by().

Introduce typeof_flex_counter(), can_set_flex_counter(), and
set_flex_counter() to provide the needed _Generic() wrappers to get sane
results out of __flex_counter().

For example, with:

struct foo {
    int counter;
    short array[] __counted_by(counter);
} *p;

__flex_counter(p->array) will resolve to: &p->counter

typeof_flex_counter(p->array) will resolve to "int". (If p->array was not
annotated, it would resolve to "size_t".)

can_set_flex_counter(p->array, COUNT) is the same as:

COUNT <= type_max(p->counter) && COUNT >= type_min(p->counter)

(If p->array was not annotated it would return true since everything
fits in size_t.)

set_flex_counter(p->array, COUNT) is the same as:

p->counter = COUNT;

(It is a no-op if p->array is not annotated with __counted_by().)
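
For illustration, a minimal allocation helper built on these wrappers
might look like this (a sketch; alloc_foo is hypothetical and not part
of the patch):

struct foo *alloc_foo(int count)
{
	struct foo *p = kzalloc(struct_size(p, array, count), GFP_KERNEL);

	if (!p)
		return NULL;

	/* Refuse a count that p->counter cannot represent. */
	if (!can_set_flex_counter(p->array, count)) {
		kfree(p);
		return NULL;
	}

	set_flex_counter(p->array, count);	/* i.e., p->counter = count */

	return p;
}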

Signed-off-by: Kees Cook 


I agree that there is no suitable fallback handy, but I see the counter
as an integral part of the struct (in contrast to being merely an
annotation); IOW, without set_flex_counter() doing the assignment,
someone will reference it later anyway, without any warning when the
struct is kzalloc()'d.

So, maybe BUILD_BUG() instead of a no-op?


I get that so far this is only used as an internal helper (in the next
patch), so for me it would also be fine to just add a __ prefix:
__set_flex_counter(), at least until the following is true:

 "manual initialization of the flexible array counter is still
required (at some point) after allocation as not all compiler versions
support the __counted_by annotation yet"




+#define set_flex_counter(FAM, COUNT)    \
+({    \
+    *_Generic(__flex_counter(FAM),    \
+  void *:  &(size_t){ 0 },    \
+  default: __flex_counter(FAM)) = (COUNT);    \
+})
+
  #endif /* __LINUX_OVERFLOW_H */







Re: [PATCH] Documentation: kcsan: fix "Plain Accesses and Data Races" URL in kcsan.rst

2025-03-17 Thread Jonathan Corbet
Ignacio Encinas Rubio  writes:

> On 15/3/25 3:41, Akira Yokosawa wrote:
>> This might be something Jon would like to keep secret, but ...
>> 
>> See the message and the thread it belongs at:
>> 
>> 
>> https://lore.kernel.org/lkml/pine.lnx.4.44l0.1907310947340.1497-100...@iolanthe.rowland.org/
>> 
>> It happened in 2019, responding to Mauro's attempted conversion of
>> the LKMM docs.
>> 
>> I haven't seen any change in sentiment among LKMM maintainers since.
>
> Thanks for the information!

FWIW, I don't think it has really been discussed since.

>> Your way forward would be to keep those .txt files "pure plain text"
>> and to convert them on-the-fly into reST.  Of course, only if such an
>> effort sounds worthwhile to you.
>
> With this you mean producing a .rst from the original .txt file using a
> script before building the documentation, right? I'm not sure how hard 
> this is, but I can look into it.
>
>> Another approach might be to include those docs literally.
>> A similar approach has been applied to
>> 
>> Documentation/
>>  atomic_t.txt
>>  atomic_bitops.txt
>>  memory-barriers.txt
>
> Right, I got to [1]. 
>
> It looks like there are several options here:
>
>   A) Include the text files like in [1]
>   B) Explore the "on-the-fly" translation
>   C) Do A) and then B)
>
> Does any of the above sound good, Jon?

Using the wrapper technique will surely work and should be an
improvement over what we have now.  I don't hold out much hope for "on
the fly" mangling of the text - it sounds brittle and never quite good
enough, but I'm willing to be proved wrong on that front.

The original discussion from all those years ago centered around worries
about inserting lots of markup into the plain-text file.  But I'm not
convinced that anything requires all that markup; indeed, the proposed
conversion at the time didn't do that.  The question was quickly dropped
because we had so much to do back then...

I think there might be value in trying another minimal-markup
conversion; it would be *nicer* to use more fonts in the HTML version,
but not doing so seems better than not having an HTML version at all.
But, obviously, there are no guarantees that it will clear the bar.

Thanks,

jon



Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-17 Thread Jason Wang
On Mon, Mar 17, 2025 at 3:07 PM Akihiko Odaki  wrote:
>
> On 2025/03/17 10:12, Jason Wang wrote:
> > On Wed, Mar 12, 2025 at 1:03 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/03/12 11:35, Jason Wang wrote:
> >>> On Tue, Mar 11, 2025 at 2:11 PM Akihiko Odaki  
> >>> wrote:
> 
>  On 2025/03/11 9:38, Jason Wang wrote:
> > On Mon, Mar 10, 2025 at 3:45 PM Akihiko Odaki 
> >  wrote:
> >>
> >> On 2025/03/10 12:55, Jason Wang wrote:
> >>> On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki 
> >>>  wrote:
> 
 Hash reporting
 ==============
> 
>  Allow the guest to reuse the hash value to make receive steering
>  consistent between the host and guest, and to save hash computation.
> 
>  RSS
>  ===
> 
>  RSS is a receive steering algorithm that can be negotiated to use with
>  virtio_net. Conventionally the hash calculation was done by the VMM.
>  However, computing the hash after the queue was chosen defeats the
>  purpose of RSS.
> 
>  Another approach is to use an eBPF steering program. This approach has
>  another downside: it cannot report the calculated hash due to the
>  restrictive nature of the eBPF steering program.
> 
>  Introduce the code to perform RSS in the kernel in order to overcome
>  these challenges. An alternative solution is to extend the eBPF
>  steering program so that it can report to the userspace, but I didn't
>  opt for it: extending the current mechanism of the eBPF steering
>  program as is would rely on legacy context rewriting, and introducing
>  kfunc-based eBPF would result in a non-UAPI dependency, while the
>  other relevant virtualization APIs such as KVM and vhost_net are
>  UAPIs.
> 
>  Signed-off-by: Akihiko Odaki 
>  Tested-by: Lei Yang 
>  ---
>   Documentation/networking/tuntap.rst |   7 ++
>   drivers/net/Kconfig |   1 +
>   drivers/net/tap.c   |  68 ++-
>   drivers/net/tun.c   |  98 +-
>   drivers/net/tun_vnet.h  | 159 ++--
>   include/linux/if_tap.h  |   2 +
>   include/linux/skbuff.h  |   3 +
>   include/uapi/linux/if_tun.h |  75 +
>   net/core/skbuff.c   |   4 +
>   9 files changed, 386 insertions(+), 31 deletions(-)
> 
>  diff --git a/Documentation/networking/tuntap.rst 
>  b/Documentation/networking/tuntap.rst
>  index 
>  4d7087f727be5e37dfbf5066a9e9c872cc98898d..86b4ae8caa8ad062c1e558920be42ce0d4217465
>   100644
>  --- a/Documentation/networking/tuntap.rst
>  +++ b/Documentation/networking/tuntap.rst
>  @@ -206,6 +206,13 @@ enable is true we enable it, otherwise we 
>  disable it::
> return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
> }
> 
> >>>
> >>> [...]
> >>>
>  +static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash_container __rcu **hashp,
>  +                                          bool can_rss, void __user *argp)
> >>>
> >>> So again, can_rss seems to be tricky. Looking at its caller, it tries
> >>> to make eBPF and RSS mutually exclusive. I still don't understand why
> >>> we need this. Allowing an eBPF program to override some of the path
> >>> seems to be common practice.
> >>>
> >>> What's more, we didn't try (or even can't) to make automq and eBPF
> >>> mutually exclusive. So I still don't see what we gain from this, and
> >>> it complicates the code and may lead to ambiguous uAPI/behaviour.
> >>
> >> automq and eBPF are mutually exclusive; automq is disabled when an eBPF
> >> steering program is set, so I followed the example here.
> >
> > I meant from the view of uAPI, the kernel doesn't or can't reject eBPF
> > while using automq.
> >
> >> We don't even have an interface for eBPF to let it fall back to another
> >> algorithm.
> >
> > It doesn't even need this; e.g., XDP overrides the default receiving path.
> >
> >> I could make it fall back to RSS if the eBPF steering program is
> >> designed to fall back to automq when it returns e.g., -1. But such an
> >> interface is currently not defined and defining one is out of the
> >> scope of this patch series.
> >
> > Just to make sure we are on the same page, I meant we just need to
> > make the behaviour consistent: allow eBPF to override the behaviour of
> > both automq and rss.

Re: [PATCH net-next v11 00/10] tun: Introduce virtio-net hashing feature

2025-03-17 Thread Lei Yang
QE tested this series of patches (v11) under the linux-next repo with
virtio-net regression tests; everything works fine.

Tested-by: Lei Yang 

On Tue, Mar 18, 2025 at 8:29 AM Jason Wang  wrote:
>
> On Mon, Mar 17, 2025 at 6:58 PM Akihiko Odaki  
> wrote:
> >
> > virtio-net has two usages of hashes: one is RSS and the other is hash
> > reporting. Conventionally the hash calculation was done by the VMM.
> > However, computing the hash after the queue was chosen defeats the
> > purpose of RSS.
> >
> > Another approach is to use an eBPF steering program. This approach has
> > another downside: it cannot report the calculated hash due to the
> > restrictive nature of eBPF.
> >
> > Introduce the code to compute hashes in the kernel in order to overcome
> > these challenges.
> >
> > An alternative solution is to extend the eBPF steering program so that it
> > will be able to report to the userspace, but it is based on context
> > rewrites, which is in feature freeze. We can adopt kfuncs, but they will
> > not be UAPIs. We opt for an ioctl to align with other relevant UAPIs (KVM
> > and vhost_net).
> >
> > The patches for QEMU to use this new feature were submitted as an RFC
> > and are available at:
> > https://patchew.org/QEMU/20250313-hash-v4-0-c75c494b4...@daynix.com/
> >
> > This work was presented at LPC 2024:
> > https://lpc.events/event/18/contributions/1963/
> >
> > V1 -> V2:
> >   Changed to introduce a new BPF program type.
> >
> > Signed-off-by: Akihiko Odaki 
> > ---
> > Changes in v11:
> > - Added the missing code to free vnet_hash in patch
> >   "tap: Introduce virtio-net hash feature".
> > - Link to v10: 
> > https://lore.kernel.org/r/20250313-rss-v10-0-3185d73a9...@daynix.com
> >
>
> We only have 2 or 3 points that need to be sorted out. Let's hold off
> on the iteration until we have an agreement.
>
> Thanks
>




Re: [PATCH v4] docs: clarify rules wrt tagging other people

2025-03-17 Thread Jonathan Corbet
Jonathan Corbet  writes:

> Sorry for being slow ... but also, I guess, for not communicating my
> point very well.  My concern wasn't about somebody not wanting to appear
> in the repository at all; it was more with somebody not wanting their
> tag in a specific patch where they had not offered it.
>
> It seems I'm the only one who is worried about this, though.  It seems
> like we should go ahead and get this change in before the merge window
> hits.

OK, I have gone ahead and applied it ... though I'm still not 100%
comfortable with the wording as it is... :)

Thanks,

jon



Re: [PATCH net-next v11 00/10] tun: Introduce virtio-net hashing feature

2025-03-17 Thread Jason Wang
On Mon, Mar 17, 2025 at 6:58 PM Akihiko Odaki  wrote:
>
> virtio-net has two usages of hashes: one is RSS and the other is hash
> reporting. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use an eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF.
>
> Introduce the code to compute hashes in the kernel in order to overcome
> these challenges.
>
> An alternative solution is to extend the eBPF steering program so that it
> will be able to report to the userspace, but it is based on context
> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
> not be UAPIs. We opt for an ioctl to align with other relevant UAPIs (KVM
> and vhost_net).
>
> The patches for QEMU to use this new feature were submitted as an RFC
> and are available at:
> https://patchew.org/QEMU/20250313-hash-v4-0-c75c494b4...@daynix.com/
>
> This work was presented at LPC 2024:
> https://lpc.events/event/18/contributions/1963/
>
> V1 -> V2:
>   Changed to introduce a new BPF program type.
>
> Signed-off-by: Akihiko Odaki 
> ---
> Changes in v11:
> - Added the missing code to free vnet_hash in patch
>   "tap: Introduce virtio-net hash feature".
> - Link to v10: 
> https://lore.kernel.org/r/20250313-rss-v10-0-3185d73a9...@daynix.com
>

We only have 2 or 3 points that need to be sorted out. Let's hold off
on the iteration until we have an agreement.

Thanks




[PATCH net-next v11 01/10] virtio_net: Add functions for hashing

2025-03-17 Thread Akihiko Odaki
They are useful to implement VIRTIO_NET_F_RSS and
VIRTIO_NET_F_HASH_REPORT.
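
For orientation, a sketch of how the helpers below compose
(toeplitz_hash is a hypothetical caller, not part of the patch; the key
must first be byte-swapped in place with
virtio_net_toeplitz_convert_key()):

static u32 toeplitz_hash(const __be32 *input, size_t len, const u32 *key)
{
	struct virtio_net_toeplitz_state state = { .key = key };

	virtio_net_toeplitz_calc(&state, input, len);
	return state.hash;
}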

Signed-off-by: Akihiko Odaki 
Tested-by: Lei Yang 
---
 include/linux/virtio_net.h | 188 +
 1 file changed, 188 insertions(+)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 02a9f4dc594d..426f33b4b824 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -9,6 +9,194 @@
 #include 
 #include 
 
+struct virtio_net_hash {
+   u32 value;
+   u16 report;
+};
+
+struct virtio_net_toeplitz_state {
+   u32 hash;
+   const u32 *key;
+};
+
+#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
+
+#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
+
+static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
+{
+   while (len >= sizeof(*input)) {
+   *input = be32_to_cpu((__force __be32)*input);
+   input++;
+   len -= sizeof(*input);
+   }
+}
+
+static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
+					    const __be32 *input, size_t len)
+{
+   while (len >= sizeof(*input)) {
+   for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
+   u32 i = ffs(map);
+
+   state->hash ^= state->key[0] << (32 - i) |
+  (u32)((u64)state->key[1] >> i);
+   }
+
+   state->key++;
+   input++;
+   len -= sizeof(*input);
+   }
+}
+
+static inline u8 virtio_net_hash_key_length(u32 types)
+{
+   size_t len = 0;
+
+   if (types & VIRTIO_NET_HASH_REPORT_IPv4)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs));
+
+   if (types &
+   (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   if (types & VIRTIO_NET_HASH_REPORT_IPv6)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs));
+
+   if (types &
+   (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   return len + sizeof(u32);
+}
+
+static inline u32 virtio_net_hash_report(u32 types,
+const struct flow_keys_basic *keys)
+{
+   switch (keys->basic.n_proto) {
+   case cpu_to_be16(ETH_P_IP):
+   if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
+   if (keys->basic.ip_proto == IPPROTO_TCP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
+   return VIRTIO_NET_HASH_REPORT_TCPv4;
+
+   if (keys->basic.ip_proto == IPPROTO_UDP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
+   return VIRTIO_NET_HASH_REPORT_UDPv4;
+   }
+
+   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
+   return VIRTIO_NET_HASH_REPORT_IPv4;
+
+   return VIRTIO_NET_HASH_REPORT_NONE;
+
+   case cpu_to_be16(ETH_P_IPV6):
+   if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
+   if (keys->basic.ip_proto == IPPROTO_TCP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
+   return VIRTIO_NET_HASH_REPORT_TCPv6;
+
+   if (keys->basic.ip_proto == IPPROTO_UDP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
+   return VIRTIO_NET_HASH_REPORT_UDPv6;
+   }
+
+   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
+   return VIRTIO_NET_HASH_REPORT_IPv6;
+
+   return VIRTIO_NET_HASH_REPORT_NONE;
+
+   default:
+   return VIRTIO_NET_HASH_REPORT_NONE;
+   }
+}
+
+static inline void virtio_net_hash_rss(const struct sk_buff *skb,
+  u32 types, const u32 *key,
+  struct virtio_net_hash *hash)
+{
+   struct virtio_net_toeplitz_state toeplitz_state = { .key = key };
+   struct flow_keys flow;
+   struct flow_keys_basic flow_basic;
+   u16 repor

[PATCH net-next v11 09/10] selftest: tap: Add tests for virtio-net ioctls

2025-03-17 Thread Akihiko Odaki
They only test that the ioctls are wired up to the implementation shared
with tun, as that implementation is already tested for tun.

Signed-off-by: Akihiko Odaki 
---
 tools/testing/selftests/net/tap.c | 97 ++-
 1 file changed, 95 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/tap.c 
b/tools/testing/selftests/net/tap.c
index 247c3b3ac1c9..fbd38b08fdfa 100644
--- a/tools/testing/selftests/net/tap.c
+++ b/tools/testing/selftests/net/tap.c
@@ -363,6 +363,7 @@ size_t 
build_test_packet_crash_tap_invalid_eth_proto(uint8_t *buf,
 FIXTURE(tap)
 {
int fd;
+   bool deleted;
 };
 
 FIXTURE_SETUP(tap)
@@ -387,8 +388,10 @@ FIXTURE_TEARDOWN(tap)
if (self->fd != -1)
close(self->fd);
 
-   ret = dev_delete(param_dev_tap_name);
-   EXPECT_EQ(ret, 0);
+   if (!self->deleted) {
+   ret = dev_delete(param_dev_tap_name);
+   EXPECT_EQ(ret, 0);
+   }
 
ret = dev_delete(param_dev_dummy_name);
EXPECT_EQ(ret, 0);
@@ -431,4 +434,94 @@ TEST_F(tap, test_packet_crash_tap_invalid_eth_proto)
ASSERT_EQ(errno, EINVAL);
 }
 
+TEST_F(tap, test_vnethdrsz)
+{
+   int sz = sizeof(struct virtio_net_hdr_v1_hash);
+
+   ASSERT_FALSE(dev_delete(param_dev_tap_name));
+   self->deleted = true;
+
+   ASSERT_FALSE(ioctl(self->fd, TUNSETVNETHDRSZ, &sz));
+   sz = 0;
+   ASSERT_FALSE(ioctl(self->fd, TUNGETVNETHDRSZ, &sz));
+   EXPECT_EQ(sizeof(struct virtio_net_hdr_v1_hash), sz);
+}
+
+TEST_F(tap, test_vnetle)
+{
+   int le = 1;
+
+   ASSERT_FALSE(dev_delete(param_dev_tap_name));
+   self->deleted = true;
+
+   ASSERT_FALSE(ioctl(self->fd, TUNSETVNETLE, &le));
+   le = 0;
+   ASSERT_FALSE(ioctl(self->fd, TUNGETVNETLE, &le));
+   EXPECT_EQ(1, le);
+}
+
+TEST_F(tap, test_vnetbe)
+{
+   int be = 1;
+   int ret;
+
+   ASSERT_FALSE(dev_delete(param_dev_tap_name));
+   self->deleted = true;
+
+   ret = ioctl(self->fd, TUNSETVNETBE, &be);
+   if (ret == -1 && errno == EINVAL)
+   SKIP(return, "TUNSETVNETBE not supported");
+
+   ASSERT_FALSE(ret);
+   be = 0;
+   ASSERT_FALSE(ioctl(self->fd, TUNGETVNETBE, &be));
+   EXPECT_EQ(1, be);
+}
+
+TEST_F(tap, test_getvnethashcap)
+{
+   static const struct tun_vnet_hash expected = {
+   .flags = TUN_VNET_HASH_REPORT | TUN_VNET_HASH_RSS,
+   .types = VIRTIO_NET_RSS_HASH_TYPE_IPv4 |
+VIRTIO_NET_RSS_HASH_TYPE_TCPv4 |
+VIRTIO_NET_RSS_HASH_TYPE_UDPv4 |
+VIRTIO_NET_RSS_HASH_TYPE_IPv6 |
+VIRTIO_NET_RSS_HASH_TYPE_TCPv6 |
+VIRTIO_NET_RSS_HASH_TYPE_UDPv6
+   };
+   struct tun_vnet_hash seen;
+   int ret;
+
+   ASSERT_FALSE(dev_delete(param_dev_tap_name));
+   self->deleted = true;
+
+   ret = ioctl(self->fd, TUNGETVNETHASHCAP, &seen);
+
+   if (ret == -1 && errno == EINVAL)
+   SKIP(return, "TUNGETVNETHASHCAP not supported");
+
+   EXPECT_FALSE(ret);
+   EXPECT_FALSE(memcmp(&expected, &seen, sizeof(expected)));
+}
+
+TEST_F(tap, test_setvnethash_alive)
+{
+   struct tun_vnet_hash hash = { .flags = 0 };
+
+   EXPECT_FALSE(ioctl(self->fd, TUNSETVNETHASH, &hash));
+}
+
+TEST_F(tap, test_setvnethash_deleted)
+{
+   ASSERT_FALSE(dev_delete(param_dev_tap_name));
+   self->deleted = true;
+
+   ASSERT_EQ(-1, ioctl(self->fd, TUNSETVNETHASH));
+
+   if (errno == EINVAL)
+   SKIP(return, "TUNSETVNETHASH not supported");
+
+   EXPECT_EQ(EBADFD, errno);
+}
+
 TEST_HARNESS_MAIN

-- 
2.48.1




[PATCH net-next v11 05/10] tun: Introduce virtio-net hash feature

2025-03-17 Thread Akihiko Odaki
Add ioctls and storage required for the virtio-net hash feature to TUN.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/Kconfig|  1 +
 drivers/net/tun.c  | 54 ++
 include/linux/skbuff.h |  3 +++
 net/core/skbuff.c  |  4 
 4 files changed, 54 insertions(+), 8 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 1fd5acdc73c6..aecfd244dd83 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
tristate "Universal TUN/TAP device driver support"
depends on INET
select CRC32
+   select SKB_EXTENSIONS
help
  TUN/TAP provides packet reception and transmission for user space
  programs.  It can be viewed as a simple Point-to-Point or Ethernet
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 03d47799e9bd..b2d74e0ec932 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -209,6 +209,7 @@ struct tun_struct {
struct bpf_prog __rcu *xdp_prog;
struct tun_prog __rcu *steering_prog;
struct tun_prog __rcu *filter_prog;
+   struct tun_vnet_hash_container __rcu *vnet_hash;
struct ethtool_link_ksettings link_ksettings;
/* init args */
struct file *file;
@@ -451,9 +452,14 @@ static inline void tun_flow_save_rps_rxhash(struct 
tun_flow_entry *e, u32 hash)
e->rps_rxhash = hash;
 }
 
+static struct virtio_net_hash *tun_add_hash(struct sk_buff *skb)
+{
+   return skb_ext_add(skb, SKB_EXT_TUN_VNET_HASH);
+}
+
 static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
 {
-   return NULL;
+   return skb_ext_find(skb, SKB_EXT_TUN_VNET_HASH);
 }
 
 /* We try to identify a flow through its rxhash. The reason that
@@ -462,14 +468,21 @@ static const struct virtio_net_hash *tun_find_hash(const 
struct sk_buff *skb)
  * the userspace application move between processors, we may get a
  * different rxq no. here.
  */
-static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
+static u16 tun_automq_select_queue(struct tun_struct *tun,
+  const struct tun_vnet_hash_container 
*vnet_hash,
+  struct sk_buff *skb)
 {
+   struct flow_keys keys;
+   struct flow_keys_basic keys_basic;
struct tun_flow_entry *e;
u32 txq, numqueues;
 
numqueues = READ_ONCE(tun->numqueues);
 
-   txq = __skb_get_hash_symmetric(skb);
+   memset(&keys, 0, sizeof(keys));
+   skb_flow_dissect(skb, &flow_keys_dissector_symmetric, &keys, 0);
+
+   txq = flow_hash_from_keys(&keys);
e = tun_flow_find(&tun->flows[tun_hashfn(txq)], txq);
if (e) {
tun_flow_save_rps_rxhash(e, txq);
@@ -478,6 +491,13 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, 
struct sk_buff *skb)
txq = reciprocal_scale(txq, numqueues);
}
 
+   keys_basic = (struct flow_keys_basic) {
+   .control = keys.control,
+   .basic = keys.basic
+   };
+   tun_vnet_hash_report(vnet_hash, skb, &keys_basic, skb->l4_hash ? skb->hash : txq,
+			tun_add_hash);
+
return txq;
 }
 
@@ -513,8 +533,15 @@ static u16 tun_select_queue(struct net_device *dev, struct 
sk_buff *skb,
u16 ret;
 
rcu_read_lock();
-   if (!tun_ebpf_select_queue(tun, skb, &ret))
-   ret = tun_automq_select_queue(tun, skb);
+   if (!tun_ebpf_select_queue(tun, skb, &ret)) {
+   struct tun_vnet_hash_container *vnet_hash = 
rcu_dereference(tun->vnet_hash);
+
+   if (vnet_hash && (vnet_hash->common.flags & TUN_VNET_HASH_RSS))
+   ret = tun_vnet_rss_select_queue(READ_ONCE(tun->numqueues), vnet_hash,
+				   skb, tun_add_hash);
+   else
+   ret = tun_automq_select_queue(tun, vnet_hash, skb);
+   }
rcu_read_unlock();
 
return ret;
@@ -2235,6 +2262,7 @@ static void tun_free_netdev(struct net_device *dev)
security_tun_dev_free_security(tun->security);
__tun_set_ebpf(tun, &tun->steering_prog, NULL);
__tun_set_ebpf(tun, &tun->filter_prog, NULL);
+   kfree_rcu_mightsleep(rcu_access_pointer(tun->vnet_hash));
 }
 
 static void tun_setup(struct net_device *dev)
@@ -3014,16 +3042,22 @@ static long __tun_chr_ioctl(struct file *file, unsigned 
int cmd,
} else {
memset(&ifr, 0, sizeof(ifr));
}
-   if (cmd == TUNGETFEATURES) {
+   switch (cmd) {
+   case TUNGETFEATURES:
/* Currently this just means: "what IFF flags are valid?".
 * This is needed because we never checked for invalid flags on
 * TUNSETIFF.
 */
return put_user(IFF_TUN | IFF_TAP | IFF_NO_CARRIER |
TUN_FEATURES,

[PATCH net-next v11 00/10] tun: Introduce virtio-net hashing feature

2025-03-17 Thread Akihiko Odaki
virtio-net has two usages of hashes: one is RSS and the other is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Introduce the code to compute hashes in the kernel in order to overcome
these challenges.

An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which is in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt for an ioctl to align with other relevant UAPIs (KVM
and vhost_net).

The patches for QEMU to use this new feature were submitted as an RFC
and are available at:
https://patchew.org/QEMU/20250313-hash-v4-0-c75c494b4...@daynix.com/

This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/

V1 -> V2:
  Changed to introduce a new BPF program type.

Signed-off-by: Akihiko Odaki 
---
Changes in v11:
- Added the missing code to free vnet_hash in patch
  "tap: Introduce virtio-net hash feature".
- Link to v10: 
https://lore.kernel.org/r/20250313-rss-v10-0-3185d73a9...@daynix.com

Changes in v10:
- Split common code and TUN/TAP-specific code into separate patches.
- Reverted a spurious style change in patch "tun: Introduce virtio-net
  hash feature".
- Added a comment explaining disable_ipv6 in tests.
- Used AF_PACKET for patch "selftest: tun: Add tests for
  virtio-net hashing". I also added the usage of FIXTURE_VARIANT() as
  the testing function now needs access to more variant-specific
  variables.
- Corrected the message of patch "selftest: tun: Add tests for
  virtio-net hashing"; it mentioned validation of configuration, but
  that is not in the scope of this patch.
- Expanded the description of patch "selftest: tun: Add tests for
  virtio-net hashing".
- Added patch "tun: Allow steering eBPF program to fall back".
- Changed to handle TUNGETVNETHASHCAP before taking the rtnl lock.
- Removed redundant tests for tun_vnet_ioctl().
- Added patch "selftest: tap: Add tests for virtio-net ioctls".
- Added a design explanation of ioctls for extensibility and migration.
- Removed a few branches in patch
  "vhost/net: Support VIRTIO_NET_F_HASH_REPORT".
- Link to v9: 
https://lore.kernel.org/r/20250307-rss-v9-0-df7662402...@daynix.com

Changes in v9:
- Added a missing return statement in patch
  "tun: Introduce virtio-net hash feature".
- Link to v8: 
https://lore.kernel.org/r/20250306-rss-v8-0-7ab4f56ff...@daynix.com

Changes in v8:
- Disabled IPv6 to eliminate noises in tests.
- Added a branch in tap to avoid unnecessary dissection when hash
  reporting is disabled.
- Removed unnecessary rtnl_lock().
- Extracted code to handle new ioctls into separate functions to avoid
  adding extra NULL checks to the code handling other ioctls.
- Introduced variable named "fd" to __tun_chr_ioctl().
- s/-/=/g in a patch message to avoid confusing Git.
- Link to v7: 
https://lore.kernel.org/r/20250228-rss-v7-0-844205cbb...@daynix.com

Changes in v7:
- Ensured to set hash_report to VIRTIO_NET_HASH_REPORT_NONE for
  VHOST_NET_F_VIRTIO_NET_HDR.
- s/4/sizeof(u32)/ in patch "virtio_net: Add functions for hashing".
- Added tap_skb_cb type.
- Rebased.
- Link to v6: 
https://lore.kernel.org/r/20250109-rss-v6-0-b1c90ad70...@daynix.com

Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
  Introduce virtio-net hash reporting feature", and "tun: Introduce
  virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: 
https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df0...@daynix.com

Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
  https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324cab4
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
  have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: 
https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0...@daynix.com

Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
  "tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size

[PATCH net-next v11 03/10] tun: Allow steering eBPF program to fall back

2025-03-17 Thread Akihiko Odaki
This clarifies that a steering eBPF program takes precedence over the
other steering algorithms.

Signed-off-by: Akihiko Odaki 
---
 Documentation/networking/tuntap.rst |  7 +++
 drivers/net/tun.c   | 28 +---
 include/uapi/linux/if_tun.h |  9 +
 3 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/Documentation/networking/tuntap.rst 
b/Documentation/networking/tuntap.rst
index 4d7087f727be..86b4ae8caa8a 100644
--- a/Documentation/networking/tuntap.rst
+++ b/Documentation/networking/tuntap.rst
@@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
   return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
   }
 
+3.4 Reference
+-------------
+
+``linux/if_tun.h`` defines the interface described below:
+
+.. kernel-doc:: include/uapi/linux/if_tun.h
+
 Universal TUN/TAP device driver Frequently Asked Question
 =========================================================
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d8f4d3e996a7..9133ab9ed3f5 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -476,21 +476,29 @@ static u16 tun_automq_select_queue(struct tun_struct 
*tun, struct sk_buff *skb)
return txq;
 }
 
-static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
+static bool tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb,
+ u16 *ret)
 {
struct tun_prog *prog;
u32 numqueues;
-   u16 ret = 0;
+   u32 prog_ret;
+
+   prog = rcu_dereference(tun->steering_prog);
+   if (!prog)
+   return false;
 
numqueues = READ_ONCE(tun->numqueues);
-   if (!numqueues)
-   return 0;
+   if (!numqueues) {
+   *ret = 0;
+   return true;
+   }
 
-   prog = rcu_dereference(tun->steering_prog);
-   if (prog)
-   ret = bpf_prog_run_clear_cb(prog->prog, skb);
+   prog_ret = bpf_prog_run_clear_cb(prog->prog, skb);
+   if (prog_ret == TUN_STEERINGEBPF_FALLBACK)
+   return false;
 
-   return ret % numqueues;
+   *ret = (u16)prog_ret % numqueues;
+   return true;
 }
 
 static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
@@ -500,9 +508,7 @@ static u16 tun_select_queue(struct net_device *dev, struct 
sk_buff *skb,
u16 ret;
 
rcu_read_lock();
-   if (rcu_dereference(tun->steering_prog))
-   ret = tun_ebpf_select_queue(tun, skb);
-   else
+   if (!tun_ebpf_select_queue(tun, skb, &ret))
ret = tun_automq_select_queue(tun, skb);
rcu_read_unlock();
 
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 287cdc81c939..980de74724fc 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -115,4 +115,13 @@ struct tun_filter {
__u8   addr[][ETH_ALEN];
 };
 
+/**
+ * define TUN_STEERINGEBPF_FALLBACK - A steering eBPF return value to fall back
+ *
+ * A steering eBPF program may return this value to fall back to the steering
+ * algorithm that should have been used if the program was not set. This allows
+ * selectively overriding the steering decision.
+ */
+#define TUN_STEERINGEBPF_FALLBACK -1
+
 #endif /* _UAPI__IF_TUN_H */
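
For illustration, a steering program using the new fallback value might
look like this (a hypothetical sketch in libbpf style; programs are
attached with TUNSETSTEERINGEBPF and run as socket filters):

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/if_tun.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("socket")
int steer(struct __sk_buff *skb)
{
	/* Pin IPv6 traffic to queue 0; let the kernel's own algorithm
	 * (automq or RSS) steer everything else.
	 */
	if (skb->protocol == bpf_htons(ETH_P_IPV6))
		return 0;

	return TUN_STEERINGEBPF_FALLBACK;
}

char _license[] SEC("license") = "GPL";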

-- 
2.48.1




[PATCH net-next v11 02/10] net: flow_dissector: Export flow_keys_dissector_symmetric

2025-03-17 Thread Akihiko Odaki
flow_keys_dissector_symmetric is useful for deriving a symmetric hash
and for knowing its sources, such as IPv4, IPv6, TCP, and UDP.
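
For context, a condensed sketch of the intended use, mirroring what tun
does later in this series (symmetric_flow_hash is illustrative):

static u32 symmetric_flow_hash(const struct sk_buff *skb)
{
	struct flow_keys keys;

	memset(&keys, 0, sizeof(keys));
	skb_flow_dissect(skb, &flow_keys_dissector_symmetric, &keys, 0);

	return flow_hash_from_keys(&keys);
}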

Signed-off-by: Akihiko Odaki 
Tested-by: Lei Yang 
---
 include/net/flow_dissector.h | 1 +
 net/core/flow_dissector.c| 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index ced79dc8e856..d01c1ec77b7d 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -423,6 +423,7 @@ __be32 flow_get_u32_src(const struct flow_keys *flow);
 __be32 flow_get_u32_dst(const struct flow_keys *flow);
 
 extern struct flow_dissector flow_keys_dissector;
+extern struct flow_dissector flow_keys_dissector_symmetric;
 extern struct flow_dissector flow_keys_basic_dissector;
 
 /* struct flow_keys_digest:
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 9cd8de6bebb5..32c7ee31330c 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1862,7 +1862,8 @@ void make_flow_keys_digest(struct flow_keys_digest 
*digest,
 }
 EXPORT_SYMBOL(make_flow_keys_digest);
 
-static struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
+struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
+EXPORT_SYMBOL(flow_keys_dissector_symmetric);
 
 u32 __skb_get_hash_symmetric_net(const struct net *net, const struct sk_buff 
*skb)
 {

-- 
2.48.1




[PATCH net-next v11 04/10] tun: Add common virtio-net hash feature code

2025-03-17 Thread Akihiko Odaki
Add common code required for the features being added to TUN and TAP.
They will be enabled for each of them in the following patches.

Added Features
==============

Hash reporting
--------------

Allow the guest to reuse the hash value to make receive steering
consistent between the host and guest, and to save hash computation.

Receive Side Scaling (RSS)
--------------------------

RSS is a receive steering algorithm that can be negotiated to use with
virtio_net. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of the eBPF steering program.

Introduce the code to perform RSS in the kernel in order to overcome
these challenges. An alternative solution is to extend the eBPF steering
program so that it can report to the userspace, but I didn't opt for it:
extending the current mechanism of the eBPF steering program as is would
rely on legacy context rewriting, and introducing kfunc-based eBPF would
result in a non-UAPI dependency, while the other relevant virtualization
APIs such as KVM and vhost_net are UAPIs.

Added ioctls
============

They are designed with extensibility and VM migration compatibility in
mind. This change only adds the implementation and does not expose the
ioctls to userspace.

TUNGETVNETHASHCAP
-----------------

This ioctl reports the supported features and hash types. It is useful
for checking whether a VM can be migrated to the current host.

TUNSETVNETHASH
--------------

This ioctl allows setting features and hash types to be enabled. It
limits the features exposed to the guest to ensure proper migration. It
also sets RSS parameters, depending on the enabled features and hash
types.
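
A sketch of the intended userspace flow (enable_hash_report is
hypothetical and error handling is abbreviated; report-only mode takes
just the struct, while enabling RSS involves additional parameters):

int enable_hash_report(int fd)
{
	struct tun_vnet_hash cap;
	struct tun_vnet_hash req = { .flags = TUN_VNET_HASH_REPORT };

	/* Ask the host what it supports; fails on kernels without
	 * this feature.
	 */
	if (ioctl(fd, TUNGETVNETHASHCAP, &cap) < 0)
		return -1;

	/* Enable hash reporting for every hash type the host supports. */
	req.types = cap.types;
	return ioctl(fd, TUNSETVNETHASH, &req);
}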

Signed-off-by: Akihiko Odaki 
Tested-by: Lei Yang 
---
 drivers/net/tap.c   |  10 ++-
 drivers/net/tun.c   |  12 +++-
 drivers/net/tun_vnet.h  | 155 +---
 include/uapi/linux/if_tun.h |  73 +
 4 files changed, 236 insertions(+), 14 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index d4ece538f1b2..25c60ff2d3f2 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
sock_put(&q->sk);
 }
 
+static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
+{
+   return NULL;
+}
+
 /*
  * Select a queue based on the rxq of the device on which this packet
  * arrived. If the incoming device is not mq, calculate a flow hash
@@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
int total;
 
if (q->flags & IFF_VNET_HDR) {
-   struct virtio_net_hdr vnet_hdr;
+   struct virtio_net_hdr_v1_hash vnet_hdr;
 
vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
 
-   ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
+   ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
+   tap_find_hash, &vnet_hdr);
if (ret)
return ret;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9133ab9ed3f5..03d47799e9bd 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct 
tun_flow_entry *e, u32 hash)
e->rps_rxhash = hash;
 }
 
+static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
+{
+   return NULL;
+}
+
 /* We try to identify a flow through its rxhash. The reason that
  * we do not check rxq no. is because some cards(e.g 82599), chooses
  * the rxq based on the txq where the last packet of the flow comes. As
@@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
ssize_t ret;
 
if (tun->flags & IFF_VNET_HDR) {
-   struct virtio_net_hdr gso = { 0 };
+   struct virtio_net_hdr_v1_hash gso = { 0 };
 
vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
@@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
}
 
if (vnet_hdr_sz) {
-   struct virtio_net_hdr gso;
+   struct virtio_net_hdr_v1_hash gso;
 
-   ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
+   ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
+   skb, tun_find_hash, &gso);
if (ret)
return ret;
 
diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
index 58b9ac7a5fc4..578adaac0671 100644
--- a/drivers/net/tun_vnet.h
+++ b/drivers/net/tun_vnet.h
@@ -6,6 +6,16 @@
 #define TUN_VNET_LE 0x8000
 #define TUN_VNET_BE 0x4000
 

[PATCH net-next v11 06/10] tap: Introduce virtio-net hash feature

2025-03-17 Thread Akihiko Odaki
Add ioctls and storage required for the virtio-net hash feature to TAP.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/ipvlan/ipvtap.c |  2 +-
 drivers/net/macvtap.c   |  2 +-
 drivers/net/tap.c   | 70 +
 include/linux/if_tap.h  |  4 ++-
 4 files changed, 69 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ipvlan/ipvtap.c b/drivers/net/ipvlan/ipvtap.c
index 1afc4c47be73..305438abf7ae 100644
--- a/drivers/net/ipvlan/ipvtap.c
+++ b/drivers/net/ipvlan/ipvtap.c
@@ -114,7 +114,7 @@ static void ipvtap_dellink(struct net_device *dev,
struct ipvtap_dev *vlan = netdev_priv(dev);
 
netdev_rx_handler_unregister(dev);
-   tap_del_queues(&vlan->tap);
+   tap_del(&vlan->tap);
ipvlan_link_delete(dev, head);
 }
 
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 29a5929d48e5..e72144d05ef4 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -122,7 +122,7 @@ static void macvtap_dellink(struct net_device *dev,
struct macvtap_dev *vlantap = netdev_priv(dev);
 
netdev_rx_handler_unregister(dev);
-   tap_del_queues(&vlantap->tap);
+   tap_del(&vlantap->tap);
macvlan_dellink(dev, head);
 }
 
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 25c60ff2d3f2..2213a2aa83a8 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -49,6 +49,10 @@ struct major_info {
struct list_head next;
 };
 
+struct tap_skb_cb {
+   struct virtio_net_hash hash;
+};
+
 #define GOODCOPY_LEN 128
 
 static const struct proto_ops tap_socket_ops;
@@ -179,9 +183,20 @@ static void tap_put_queue(struct tap_queue *q)
sock_put(&q->sk);
 }
 
+static struct tap_skb_cb *tap_skb_cb(const struct sk_buff *skb)
+{
+   BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct tap_skb_cb));
+   return (struct tap_skb_cb *)skb->cb;
+}
+
+static struct virtio_net_hash *tap_add_hash(struct sk_buff *skb)
+{
+   return &tap_skb_cb(skb)->hash;
+}
+
 static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
 {
-   return NULL;
+   return &tap_skb_cb(skb)->hash;
 }
 
 /*
@@ -194,6 +209,7 @@ static const struct virtio_net_hash *tap_find_hash(const 
struct sk_buff *skb)
 static struct tap_queue *tap_get_queue(struct tap_dev *tap,
   struct sk_buff *skb)
 {
+   struct flow_keys_basic keys_basic;
struct tap_queue *queue = NULL;
/* Access to taps array is protected by rcu, but access to numvtaps
 * isn't. Below we use it to lookup a queue, but treat it as a hint
@@ -201,17 +217,47 @@ static struct tap_queue *tap_get_queue(struct tap_dev 
*tap,
 * racing against queue removal.
 */
int numvtaps = READ_ONCE(tap->numvtaps);
+   struct tun_vnet_hash_container *vnet_hash = rcu_dereference(tap->vnet_hash);
__u32 rxq;
 
+   *tap_skb_cb(skb) = (struct tap_skb_cb) {
+   .hash = { .report = VIRTIO_NET_HASH_REPORT_NONE }
+   };
+
if (!numvtaps)
goto out;
 
if (numvtaps == 1)
goto single;
 
+   if (vnet_hash) {
+   if ((vnet_hash->common.flags & TUN_VNET_HASH_RSS)) {
+   rxq = tun_vnet_rss_select_queue(numvtaps, vnet_hash, skb, tap_add_hash);
+   queue = rcu_dereference(tap->taps[rxq]);
+   goto out;
+   }
+
+   if (!skb->l4_hash && !skb->sw_hash) {
+   struct flow_keys keys;
+
+   skb_flow_dissect_flow_keys(skb, &keys, FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+   rxq = flow_hash_from_keys(&keys);
+   keys_basic = (struct flow_keys_basic) {
+   .control = keys.control,
+   .basic = keys.basic
+   };
+   } else {
+   skb_flow_dissect_flow_keys_basic(NULL, skb, &keys_basic, NULL, 0, 0, 0,
+				    FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+   rxq = skb->hash;
+   }
+   } else {
+   rxq = skb_get_hash(skb);
+   }
+
/* Check if we can use flow to select a queue */
-   rxq = skb_get_hash(skb);
if (rxq) {
+   tun_vnet_hash_report(vnet_hash, skb, &keys_basic, rxq, tap_add_hash);
queue = rcu_dereference(tap->taps[rxq % numvtaps]);
goto out;
}
@@ -234,10 +280,10 @@ static struct tap_queue *tap_get_queue(struct tap_dev 
*tap,
 
 /*
  * The net_device is going away, give up the reference
- * that it holds on all queues and safely set the pointer
- * from the queues to NULL.
+ * that it holds on all queues, safely set the pointer
+ * from the queues to NULL, and free vnet_hash.
  */
-void tap_del_queues(struct tap_dev *tap)
+void tap_del(struct tap_dev *tap)
 {
struct tap_

[PATCH net-next v11 08/10] selftest: tun: Add tests for virtio-net hashing

2025-03-17 Thread Akihiko Odaki
The added tests confirm that tun can perform RSS for all supported hash
types to select the receive queue and report hash values.

Signed-off-by: Akihiko Odaki 
Tested-by: Lei Yang 
---
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/tun.c| 455 ++-
 2 files changed, 447 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 73ee88d6b043..9772f691a9a0 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -123,6 +123,6 @@ $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma
 $(OUTPUT)/tcp_mmap: LDLIBS += -lpthread -lcrypto
 $(OUTPUT)/tcp_inq: LDLIBS += -lpthread
 $(OUTPUT)/bind_bhash: LDLIBS += -lpthread
-$(OUTPUT)/io_uring_zerocopy_tx: CFLAGS += -I../../../include/
+$(OUTPUT)/io_uring_zerocopy_tx $(OUTPUT)/tun: CFLAGS += -I../../../include/
 
 include bpf.mk
diff --git a/tools/testing/selftests/net/tun.c 
b/tools/testing/selftests/net/tun.c
index ad168c15c02d..dfb84da50d91 100644
--- a/tools/testing/selftests/net/tun.c
+++ b/tools/testing/selftests/net/tun.c
@@ -2,21 +2,38 @@
 
 #define _GNU_SOURCE
 
+#include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
 #include 
-#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
 #include 
+#include 
 #include 
 #include 
-#include 
-#include 
+#include 
+#include 
+#include 
+#include 
 
 #include "../kselftest_harness.h"
 
+#define TUN_HWADDR_SOURCE { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00 }
+#define TUN_HWADDR_DEST { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 }
+#define TUN_IPADDR_SOURCE htonl((172 << 24) | (17 << 16) | 0)
+#define TUN_IPADDR_DEST htonl((172 << 24) | (17 << 16) | 1)
+
 static int tun_attach(int fd, char *dev)
 {
struct ifreq ifr;
@@ -39,7 +56,7 @@ static int tun_detach(int fd, char *dev)
return ioctl(fd, TUNSETQUEUE, (void *) &ifr);
 }
 
-static int tun_alloc(char *dev)
+static int tun_alloc(char *dev, short flags)
 {
struct ifreq ifr;
int fd, err;
@@ -52,7 +69,8 @@ static int tun_alloc(char *dev)
 
memset(&ifr, 0, sizeof(ifr));
strcpy(ifr.ifr_name, dev);
-   ifr.ifr_flags = IFF_TAP | IFF_NAPI | IFF_MULTI_QUEUE;
+   ifr.ifr_flags = flags | IFF_TAP | IFF_NAPI | IFF_NO_PI |
+   IFF_MULTI_QUEUE;
 
err = ioctl(fd, TUNSETIFF, (void *) &ifr);
if (err < 0) {
@@ -64,6 +82,20 @@ static int tun_alloc(char *dev)
return fd;
 }
 
+static bool tun_set_flags(int local_fd, const char *name, short flags)
+{
+   struct ifreq ifreq = { .ifr_flags = flags };
+
+   strcpy(ifreq.ifr_name, name);
+
+   if (ioctl(local_fd, SIOCSIFFLAGS, &ifreq)) {
+   perror("SIOCSIFFLAGS");
+   return false;
+   }
+
+   return true;
+}
+
 static int tun_delete(char *dev)
 {
struct {
@@ -102,6 +134,107 @@ static int tun_delete(char *dev)
return ret;
 }
 
+static uint32_t tun_sum(const void *buf, size_t len)
+{
+   const uint16_t *sbuf = buf;
+   uint32_t sum = 0;
+
+   while (len > 1) {
+   sum += *sbuf++;
+   len -= 2;
+   }
+
+   if (len)
+   sum += *(uint8_t *)sbuf;
+
+   return sum;
+}
+
+static uint16_t tun_build_ip_check(uint32_t sum)
+{
+   return ~((sum & 0xffff) + (sum >> 16));
+}
+
+static uint32_t tun_build_ip_pseudo_sum(const void *iphdr)
+{
+   uint16_t tot_len = ntohs(((struct iphdr *)iphdr)->tot_len);
+
+   return tun_sum((char *)iphdr + offsetof(struct iphdr, saddr), 8) +
+  htons(((struct iphdr *)iphdr)->protocol) +
+  htons(tot_len - sizeof(struct iphdr));
+}
+
+static uint32_t tun_build_ipv6_pseudo_sum(const void *ipv6hdr)
+{
+   return tun_sum((char *)ipv6hdr + offsetof(struct ipv6hdr, saddr), 32) +
+  ((struct ipv6hdr *)ipv6hdr)->payload_len +
+  htons(((struct ipv6hdr *)ipv6hdr)->nexthdr);
+}
+
+static void tun_build_iphdr(void *dest, uint16_t len, uint8_t protocol)
+{
+   struct iphdr iphdr = {
+   .ihl = sizeof(iphdr) / 4,
+   .version = 4,
+   .tot_len = htons(sizeof(iphdr) + len),
+   .ttl = 255,
+   .protocol = protocol,
+   .saddr = TUN_IPADDR_SOURCE,
+   .daddr = TUN_IPADDR_DEST
+   };
+
+   iphdr.check = tun_build_ip_check(tun_sum(&iphdr, sizeof(iphdr)));
+   memcpy(dest, &iphdr, sizeof(iphdr));
+}
+
+static void tun_build_ipv6hdr(void *dest, uint16_t len, uint8_t protocol)
+{
+   struct ipv6hdr ipv6hdr = {
+   .version = 6,
+   .payload_len = htons(len),
+   .nexthdr = protocol,
+   .saddr = {
+   .s6_addr32 = {
+   htonl(0x), 0, 0, TUN_IPADDR_SOURCE
+   }
+   },
+   .daddr = {
+   .s6_addr32 =

[PATCH net-next v11 10/10] vhost/net: Support VIRTIO_NET_F_HASH_REPORT

2025-03-17 Thread Akihiko Odaki
VIRTIO_NET_F_HASH_REPORT allows reporting hash values calculated on the
host. When VHOST_NET_F_VIRTIO_NET_HDR is employed, no hash values are
reported (i.e., the hash_report member is always set to
VIRTIO_NET_HASH_REPORT_NONE). Otherwise, the values reported by the
underlying socket are passed through.

VIRTIO_NET_F_HASH_REPORT requires VIRTIO_F_VERSION_1.
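
For reference, the header written to the guest is struct
virtio_net_hdr_v1_hash; a sketch of its UAPI layout, reproduced here for
orientation:

struct virtio_net_hdr_v1_hash {
	struct virtio_net_hdr_v1 hdr;
	__le32 hash_value;
	__le16 hash_report;	/* VIRTIO_NET_HASH_REPORT_NONE (0) when
				 * VHOST_NET_F_VIRTIO_NET_HDR is used */
	__le16 padding;
};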

Signed-off-by: Akihiko Odaki 
Tested-by: Lei Yang 
---
 drivers/vhost/net.c | 68 +++--
 1 file changed, 35 insertions(+), 33 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b9b9e9d40951..fc5b43e43a06 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -73,6 +73,7 @@ enum {
VHOST_NET_FEATURES = VHOST_FEATURES |
 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+(1ULL << VIRTIO_NET_F_HASH_REPORT) |
 (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
 (1ULL << VIRTIO_F_RING_RESET)
 };
@@ -1097,10 +1098,6 @@ static void handle_rx(struct vhost_net *net)
.msg_controllen = 0,
.msg_flags = MSG_DONTWAIT,
};
-   struct virtio_net_hdr hdr = {
-   .flags = 0,
-   .gso_type = VIRTIO_NET_HDR_GSO_NONE
-   };
size_t total_len = 0;
int err, mergeable;
s16 headcount;
@@ -1174,11 +1171,15 @@ static void handle_rx(struct vhost_net *net)
/* We don't need to be notified again. */
iov_iter_init(&msg.msg_iter, ITER_DEST, vq->iov, in, vhost_len);
fixup = msg.msg_iter;
-   if (unlikely((vhost_hlen))) {
-   /* We will supply the header ourselves
-* TODO: support TSO.
-*/
-   iov_iter_advance(&msg.msg_iter, vhost_hlen);
+   /*
+* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR
+* TODO: support TSO.
+*/
+   if (unlikely(vhost_hlen) &&
+   iov_iter_zero(vhost_hlen, &msg.msg_iter) != vhost_hlen) {
+   vq_err(vq, "Unable to write vnet_hdr at addr %p\n",
+  vq->iov->iov_base);
+   goto out;
}
err = sock->ops->recvmsg(sock, &msg,
 sock_len, MSG_DONTWAIT | MSG_TRUNC);
@@ -1191,30 +1192,24 @@ static void handle_rx(struct vhost_net *net)
vhost_discard_vq_desc(vq, headcount);
continue;
}
-   /* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
-   if (unlikely(vhost_hlen)) {
-   if (copy_to_iter(&hdr, sizeof(hdr),
-&fixup) != sizeof(hdr)) {
-   vq_err(vq, "Unable to write vnet_hdr "
-  "at addr %p\n", vq->iov->iov_base);
-   goto out;
-   }
-   } else {
-   /* Header came from socket; we'll need to patch
-* ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF
-*/
-   iov_iter_advance(&fixup, sizeof(hdr));
-   }
/* TODO: Should check and handle checksum. */
 
+   /*
+* We'll need to patch ->num_buffers over if
+* VIRTIO_NET_F_MRG_RXBUF or VIRTIO_F_VERSION_1
+*/
num_buffers = cpu_to_vhost16(vq, headcount);
-   if (likely(set_num_buffers) &&
-   copy_to_iter(&num_buffers, sizeof num_buffers,
-&fixup) != sizeof num_buffers) {
-   vq_err(vq, "Failed num_buffers write");
-   vhost_discard_vq_desc(vq, headcount);
-   goto out;
+   if (likely(set_num_buffers)) {
+   iov_iter_advance(&fixup, offsetof(struct virtio_net_hdr_v1, num_buffers));
+
+   if (copy_to_iter(&num_buffers, sizeof(num_buffers),
+&fixup) != sizeof(num_buffers)) {
+   vq_err(vq, "Failed num_buffers write");
+   vhost_discard_vq_desc(vq, headcount);
+   goto out;
+   }
}
+
nvq->done_idx += headcount;
if (nvq->done_idx > VHOST_NET_BATCH)
vhost_net_signal_used(nvq);
@@ -1607,10 +1602,13 @@ static int vhost_net_set_features(struct vhost_net *n, 
u64 features)
size_t vhost_hlen, sock_hlen, hdr_len;
int i;
 
-   hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
-   

[PATCH net-next v11 07/10] selftest: tun: Test vnet ioctls without device

2025-03-17 Thread Akihiko Odaki
Ensure that vnet ioctls result in EBADFD when the underlying device is
deleted.

Signed-off-by: Akihiko Odaki 
Tested-by: Lei Yang 
---
 tools/testing/selftests/net/tun.c | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/tools/testing/selftests/net/tun.c 
b/tools/testing/selftests/net/tun.c
index fa83918b62d1..ad168c15c02d 100644
--- a/tools/testing/selftests/net/tun.c
+++ b/tools/testing/selftests/net/tun.c
@@ -159,4 +159,42 @@ TEST_F(tun, reattach_close_delete) {
EXPECT_EQ(tun_delete(self->ifname), 0);
 }
 
+FIXTURE(tun_deleted)
+{
+   char ifname[IFNAMSIZ];
+   int fd;
+};
+
+FIXTURE_SETUP(tun_deleted)
+{
+   self->ifname[0] = 0;
+   self->fd = tun_alloc(self->ifname);
+   ASSERT_LE(0, self->fd);
+
+   ASSERT_EQ(0, tun_delete(self->ifname));
+   EXPECT_EQ(0, close(self->fd));
+}
+
+FIXTURE_TEARDOWN(tun_deleted)
+{
+   EXPECT_EQ(0, close(self->fd));
+}
+
+TEST_F(tun_deleted, getvnethdrsz)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNGETVNETHDRSZ));
+   EXPECT_EQ(EBADFD, errno);
+}
+
+TEST_F(tun_deleted, getvnethashcap)
+{
+   struct tun_vnet_hash cap;
+   int i = ioctl(self->fd, TUNGETVNETHASHCAP, &cap);
+
+   if (i == -1 && errno == EBADFD)
+   SKIP(return, "TUNGETVNETHASHCAP not supported");
+
+   EXPECT_EQ(0, i);
+}
+
 TEST_HARNESS_MAIN

-- 
2.48.1




Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-17 Thread Akihiko Odaki

On 2025/03/17 10:12, Jason Wang wrote:

On Wed, Mar 12, 2025 at 1:03 PM Akihiko Odaki  wrote:


On 2025/03/12 11:35, Jason Wang wrote:

On Tue, Mar 11, 2025 at 2:11 PM Akihiko Odaki  wrote:


On 2025/03/11 9:38, Jason Wang wrote:

On Mon, Mar 10, 2025 at 3:45 PM Akihiko Odaki  wrote:


On 2025/03/10 12:55, Jason Wang wrote:

On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki  wrote:


Hash reporting
==============

Allow the guest to reuse the hash value to make receive steering
consistent between the host and guest, and to save hash computation.

RSS
===

RSS is a receive steering algorithm that can be negotiated to use with
virtio_net. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of the eBPF steering program.

Introduce the code to perform RSS in the kernel in order to overcome
these challenges. An alternative solution is to extend the eBPF steering
program so that it can report to the userspace, but I didn't opt for it:
extending the current mechanism of the eBPF steering program as is would
rely on legacy context rewriting, and introducing kfunc-based eBPF would
result in a non-UAPI dependency, while the other relevant virtualization
APIs such as KVM and vhost_net are UAPIs.

Signed-off-by: Akihiko Odaki 
Tested-by: Lei Yang 
---
 Documentation/networking/tuntap.rst |   7 ++
 drivers/net/Kconfig |   1 +
 drivers/net/tap.c   |  68 ++-
 drivers/net/tun.c   |  98 +-
 drivers/net/tun_vnet.h  | 159 
++--
 include/linux/if_tap.h  |   2 +
 include/linux/skbuff.h  |   3 +
 include/uapi/linux/if_tun.h |  75 +
 net/core/skbuff.c   |   4 +
 9 files changed, 386 insertions(+), 31 deletions(-)

diff --git a/Documentation/networking/tuntap.rst 
b/Documentation/networking/tuntap.rst
index 
4d7087f727be5e37dfbf5066a9e9c872cc98898d..86b4ae8caa8ad062c1e558920be42ce0d4217465
 100644
--- a/Documentation/networking/tuntap.rst
+++ b/Documentation/networking/tuntap.rst
@@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
   return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
   }



[...]


+static inline long tun_vnet_ioctl_sethash(struct tun_vnet_hash_container __rcu **hashp,
+					  bool can_rss, void __user *argp)


So again, can_rss seems to be tricky. Looking at its caller, it tries
to make eBPF and RSS mutually exclusive. I still don't understand why
we need this. Allowing an eBPF program to override some of the path
seems to be common practice.

What's more, we didn't try (or even can't) to make automq and eBPF
mutually exclusive. So I still don't see what we gain from this, and it
complicates the code and may lead to ambiguous uAPI/behaviour.


automq and eBPF are mutually exclusive; automq is disabled when an eBPF
steering program is set, so I followed the example here.


I meant from the view of uAPI, the kernel doesn't or can't reject eBPF
while using automq.


We don't even have an interface for eBPF to let it fall back to another
algorithm.


It doesn't even need this; e.g., XDP overrides the default receiving path.


I could make it fall back to RSS if the eBPF steering program is
designed to fall back to automq when it returns e.g., -1. But such an
interface is currently not defined and defining one is out of the scope
of this patch series.


Just to make sure we are on the same page, I meant we just need to
make the behaviour consistent: allow eBPF to override the behaviour of
both automq and rss.


That assumes eBPF takes precedence over RSS, which is not obvious to me.


Well, it's kind of obvious. Leaving aside the eBPF selector, we have
other eBPF mechanisms like skbedit, etc.



Let's add an interface for the eBPF steering program to fall back to
another steering algorithm. I said it was out of scope before, but it
makes it clear that the eBPF steering program takes precedence over the
other algorithms and allows us to delete the configuration validation
code in this patch.


Fallback is out of scope, but that's not what I meant.

I meant that in the current uAPI, eBPF takes precedence over automq.
It's much simpler to stick with this precedence unless we see an
obvious advantage.


We still have three different design options that preserve the current
precedence:

1) Precedence order: eBPF -> RSS -> automq
2) Precedence order: RSS -> eBPF -> automq
3) Precedence order: eBPF OR RSS -> automq where eBPF and RSS are
mutually exclusive

I think this is a unique situation for this steering program, and I
could not find another example among other eBPF features.
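
To make the options concrete, here is a minimal sketch (illustrative
only; tun_ebpf_select(), tun_rss_select() and tun_automq_select() are
hypothetical helpers, not functions from this series) of what option 1's
precedence would look like in the queue selection path:

static u16 tun_select_queue_sketch(struct tun_struct *tun,
				   struct sk_buff *skb)
{
	int txq;

	/* 1) An attached eBPF steering program always wins. */
	txq = tun_ebpf_select(tun, skb);
	if (txq >= 0)
		return txq;

	/* 2) Otherwise use RSS when a hash configuration is set. */
	txq = tun_rss_select(tun, skb);
	if (txq >= 0)
		return txq;

	/* 3) Finally, fall back to automatic flow steering (automq). */
	return tun_automq_select(tun, skb);
}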


As described above, queue ma

Re: [PATCH net-next v10 06/10] tap: Introduce virtio-net hash feature

2025-03-17 Thread Paolo Abeni
On 3/13/25 8:01 AM, Akihiko Odaki wrote:
> @@ -998,6 +1044,16 @@ static long tap_ioctl(struct file *file, unsigned int 
> cmd,
>   rtnl_unlock();
>   return ret;
>  
> + case TUNGETVNETHASHCAP:
> + return tun_vnet_ioctl_gethashcap(argp);
> +
> + case TUNSETVNETHASH:
> + rtnl_lock();
> + tap = rtnl_dereference(q->tap);
> + ret = tap ? tun_vnet_ioctl_sethash(&tap->vnet_hash, argp) : 
> -EBADFD;


Not really a review, but apparently this is causing an intermittent
memory leak in the self tests:

xx__-> echo scan > /sys/kernel/debug/kmemleak && cat
/sys/kernel/debug/kmemleak
unreferenced object 0x88800c6ec248 (size 8):
  comm "tap", pid 21124, jiffies 4299141559
  hex dump (first 8 bytes):
00 00 00 00 00 00 00 00  
  backtrace (crc 0):
__kmalloc_cache_noprof+0x2df/0x390
tun_vnet_ioctl_sethash+0xbf/0x3a0
tap_ioctl+0x6f2/0xc10
__x64_sys_ioctl+0x11f/0x180
do_syscall_64+0xc1/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Could you please have a look?

Thanks!

Paolo
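
(For readers following the leak report: the usual pattern for replacing
an RCU-managed configuration object without leaking the previous one
looks roughly like the sketch below. The names are inferred from the
backtrace above and are assumptions, not the actual tun_vnet.h code.)

/*
 * Hypothetical sketch: swap in a new hash container and free the old
 * one after an RCU grace period. Dropping the old pointer without
 * freeing it (or never freeing it on device teardown) produces exactly
 * the kmemleak report above.
 */
static void tun_vnet_replace_hash_sketch(struct tun_vnet_hash_container __rcu **hashp,
					 struct tun_vnet_hash_container *new)
{
	struct tun_vnet_hash_container *old;

	old = rcu_replace_pointer(*hashp, new, lockdep_rtnl_is_held());
	if (old)
		kfree_rcu(old, rcu);	/* assumes a struct rcu_head "rcu" member */
}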




Re: [PATCH v4 1/2] compiler_types: Introduce __flex_counter() and family

2025-03-17 Thread Przemek Kitszel

On 3/15/25 04:15, Kees Cook wrote:

Introduce __flex_counter() which wraps __builtin_counted_by_ref(),
as newly introduced by GCC[1] and Clang[2]. Use of __flex_counter()
allows access to the counter member of a struct's flexible array member
when it has been annotated with __counted_by().

Introduce typeof_flex_counter(), can_set_flex_counter(), and
set_flex_counter() to provide the needed _Generic() wrappers to get sane
results out of __flex_counter().

For example, with:

struct foo {
int counter;
short array[] __counted_by(counter);
} *p;

__flex_counter(p->array) will resolve to: &p->counter

typeof_flex_counter(p->array) will resolve to "int". (If p->array was not
annotated, it would resolve to "size_t".)

can_set_flex_counter(p->array, COUNT) is the same as:

COUNT <= type_max(p->counter) && COUNT >= type_min(p->counter)

(If p->array was not annotated it would return true since everything
fits in size_t.)

set_flex_counter(p->array, COUNT) is the same as:

p->counter = COUNT;

(It is a no-op if p->array is not annotated with __counted_by().)

Signed-off-by: Kees Cook 


I agree that there is no suitable fallback handy, but I see counter
as integral part of the struct (in contrast to being merely annotation),
IOW, without set_flex_counter() doing the assignment, someone will
reference it later anyway, without any warning when kzalloc()'d

So, maybe BUILD_BUG() instead of no-op?


+#define set_flex_counter(FAM, COUNT)   \
+({ \
+   *_Generic(__flex_counter(FAM),  \
+ void *:  &(size_t){ 0 },  \
+ default: __flex_counter(FAM)) = (COUNT);  \
+})
+
  #endif /* __LINUX_OVERFLOW_H */





[PATCH v4 08/18] riscv: misaligned: move emulated access uniformity check in a function

2025-03-17 Thread Clément Léger
Split the code that checks for the uniformity of misaligned access
performance on all CPUs out of check_unaligned_access_emulated_all_cpus()
into its own function, which will be used for the delegation check. No
functional changes intended.

Signed-off-by: Clément Léger 
Reviewed-by: Andrew Jones 
---
 arch/riscv/kernel/traps_misaligned.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/riscv/kernel/traps_misaligned.c 
b/arch/riscv/kernel/traps_misaligned.c
index 8175b3449b73..3c77fc78fe4f 100644
--- a/arch/riscv/kernel/traps_misaligned.c
+++ b/arch/riscv/kernel/traps_misaligned.c
@@ -672,10 +672,20 @@ static int 
cpu_online_check_unaligned_access_emulated(unsigned int cpu)
return 0;
 }
 
-bool check_unaligned_access_emulated_all_cpus(void)
+static bool all_cpus_unaligned_scalar_access_emulated(void)
 {
int cpu;
 
+   for_each_online_cpu(cpu)
+   if (per_cpu(misaligned_access_speed, cpu) !=
+   RISCV_HWPROBE_MISALIGNED_SCALAR_EMULATED)
+   return false;
+
+   return true;
+}
+
+bool check_unaligned_access_emulated_all_cpus(void)
+{
/*
 * We can only support PR_UNALIGN controls if all CPUs have misaligned
 * accesses emulated since tasks requesting such control can run on any
@@ -683,10 +693,8 @@ bool check_unaligned_access_emulated_all_cpus(void)
 */
on_each_cpu(check_unaligned_access_emulated, NULL, 1);
 
-   for_each_online_cpu(cpu)
-   if (per_cpu(misaligned_access_speed, cpu)
-   != RISCV_HWPROBE_MISALIGNED_SCALAR_EMULATED)
-   return false;
+   if (!all_cpus_unaligned_scalar_access_emulated())
+   return false;
 
unaligned_ctl = true;
return true;
-- 
2.47.2
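
(For context, the split pays off later in this series: patch 09 reuses
the new helper for the delegability check, roughly as follows.)

bool misaligned_traps_can_delegate(void)
{
	/*
	 * Either delegation was explicitly requested via FWFT, or the
	 * SBI already delegates misaligned exceptions by default.
	 */
	return misaligned_traps_delegated ||
	       all_cpus_unaligned_scalar_access_emulated();
}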




Re: [PATCH v4 1/2] compiler_types: Introduce __flex_counter() and family

2025-03-17 Thread Kees Cook
On Mon, Mar 17, 2025 at 10:43:38AM +0100, Przemek Kitszel wrote:
> On 3/17/25 10:26, Przemek Kitszel wrote:
> > On 3/15/25 04:15, Kees Cook wrote:
> > > Introduce __flex_counter() which wraps __builtin_counted_by_ref(),
> > > as newly introduced by GCC[1] and Clang[2]. Use of __flex_counter()
> > > allows access to the counter member of a struct's flexible array member
> > > when it has been annotated with __counted_by().
> > > 
> > > Introduce typeof_flex_counter(), can_set_flex_counter(), and
> > > set_flex_counter() to provide the needed _Generic() wrappers to get sane
> > > results out of __flex_counter().
> > > 
> > > For example, with:
> > > 
> > > struct foo {
> > >     int counter;
> > >     short array[] __counted_by(counter);
> > > } *p;
> > > 
> > > __flex_counter(p->array) will resolve to: &p->counter
> > > 
> > > typeof_flex_counter(p->array) will resolve to "int". (If p->array was not
> > > annotated, it would resolve to "size_t".)
> > > 
> > > can_set_flex_counter(p->array, COUNT) is the same as:
> > > 
> > > COUNT <= type_max(p->counter) && COUNT >= type_min(p->counter)
> > > 
> > > (If p->array was not annotated it would return true since everything
> > > fits in size_t.)
> > > 
> > > set_flex_counter(p->array, COUNT) is the same as:
> > > 
> > > p->counter = COUNT;
> > > 
> > > (It is a no-op if p->array is not annotated with __counted_by().)
> > > 
> > > Signed-off-by: Kees Cook 
> > 
> > I agree that there is no suitable fallback handy, but I see counter
> > as integral part of the struct (in contrast to being merely annotation),
> > IOW, without set_flex_counter() doing the assignment, someone will
> > reference it later anyway, without any warning when kzalloc()'d
> > 
> > So, maybe BUILD_BUG() instead of no-op?
> 
> I get that so far this is only used as an internal helper (in the next
> patch), so for me it would be also fine to just add __ prefix:
> __set_flex_counter(), at least until the following is true:
>  "manual initialization of the flexible array counter is still
> required (at some point) after allocation as not all compiler versions
> support the __counted_by annotation yet"

Yeah, that's fair. I will rename set_... and can_set_...

Though FWIW I'm not sure we'll ever want a BUILD_BUG_ON(), just because
there will be flex arrays with future annotations that can't have their
counter set (e.g. annotations that indicate globals, expressions, etc. --
support for these cases is coming, if slowly [1]).

-Kees

[1] long thread:
https://gcc.gnu.org/pipermail/gcc-patches/2025-March/677024.html

-- 
Kees Cook
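
(For reference, a minimal usage sketch of the helpers under discussion,
using the struct foo example from the patch description; the allocation
scaffolding is illustrative, and per the reply above the helpers may end
up spelled __can_set_flex_counter()/__set_flex_counter():)

struct foo *p = kzalloc(struct_size(p, array, count), GFP_KERNEL);

if (!p)
	return -ENOMEM;

if (!can_set_flex_counter(p->array, count)) {
	kfree(p);
	return -E2BIG;
}
set_flex_counter(p->array, count);	/* no-op if array lacks __counted_by() */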



Re: [PATCH v8 12/14] iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster

2025-03-17 Thread Jason Gunthorpe
On Tue, Mar 11, 2025 at 10:43:08AM -0700, Nicolin Chen wrote:
> > > +int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
> > > + struct arm_smmu_nested_domain 
> > > *nested_domain)
> > > +{
> > > + struct arm_smmu_vmaster *vmaster;
> > > + unsigned long vsid;
> > > + int ret;
> > > +
> > > + iommu_group_mutex_assert(state->master->dev);
> > > +
> > > + /* Skip invalid vSTE */
> > > + if (!(nested_domain->ste[0] & cpu_to_le64(STRTAB_STE_0_V)))
> > > + return 0;
> > 
> > Ok, and we don't need to set 'state->vmaster' in this case because we
> > only report stage-1 faults back to the vSMMU?
> 
> This is a good question that I didn't ask myself hard enough..
> 
> I think we should probably drop it. An invalid STE should trigger
> a C_BAD_STE event that is in the supported vEVENT list. I'll run
> some test before removing this line from v9.

It won't trigger C_BAD_STE; recall Robin was opposed to that, so we have this:

static void arm_smmu_make_nested_domain_ste(
struct arm_smmu_ste *target, struct arm_smmu_master *master,
struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
{
unsigned int cfg =
FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(nested_domain->ste[0]));

/*
 * Userspace can request a non-valid STE through the nesting interface.
 * We relay that into an abort physical STE with the intention that
 * C_BAD_STE for this SID can be generated to userspace.
 */
if (!(nested_domain->ste[0] & cpu_to_le64(STRTAB_STE_0_V)))
cfg = STRTAB_STE_0_CFG_ABORT;

So, in the case of a non-valid STE, and a device access, the HW will
generate one of the translation faults and that will be forwarded.

Some software component will have to transform those fault events into
C_BAD_STE for the VM.

I imagined userspace would do this, but it could be done in the kernel
too. Regardless, I think Will is right and the viommu should be
set even in this case to capture the events.

Jason
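
(Illustrative sketch of the transformation described above -- turning a
forwarded fault for a SID whose vSTE is invalid into C_BAD_STE before
injecting it into the guest's event queue. All vsmmu_*() names are
hypothetical VMM-side helpers, not kernel or driver API:)

static void vsmmu_forward_event_sketch(struct vsmmu *vsmmu,
				       struct vsmmu_event *evt)
{
	/* SMMUv3 spec: C_BAD_STE is event type 0x04 */
	if (!vsmmu_vste_valid(vsmmu, evt->sid))
		evt->type = 0x04;

	vsmmu_inject_event(vsmmu, evt);
}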



[PATCH v4 12/18] riscv: misaligned: use get_user() instead of __get_user()

2025-03-17 Thread Clément Léger
Now that we can safely handle user memory accesses while in the
misaligned access handlers, use get_user() instead of __get_user() so
that user memory accesses are checked.

Signed-off-by: Clément Léger 
---
 arch/riscv/kernel/traps_misaligned.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/riscv/kernel/traps_misaligned.c 
b/arch/riscv/kernel/traps_misaligned.c
index 0fb663ac200f..90466a171f58 100644
--- a/arch/riscv/kernel/traps_misaligned.c
+++ b/arch/riscv/kernel/traps_misaligned.c
@@ -269,7 +269,7 @@ static unsigned long get_f32_rs(unsigned long insn, u8 
fp_reg_offset,
int __ret;  \
\
if (user_mode(regs)) {  \
-   __ret = __get_user(insn, (type __user *) insn_addr); \
+   __ret = get_user(insn, (type __user *) insn_addr); \
} else {\
insn = *(type *)insn_addr;  \
__ret = 0;  \
-- 
2.47.2
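
(For context: the practical difference is that get_user() validates the
user pointer with access_ok() before the access, while __get_user()
skips that check. A simplified model -- the real implementations are
arch-specific macros with exception fixups:)

#define my_get_user(x, ptr)					\
({								\
	/* validate the user pointer, then do the raw access */	\
	access_ok(ptr, sizeof(*(ptr))) ?			\
		__get_user(x, ptr) : -EFAULT;			\
})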




[PATCH v4 13/18] Documentation/sysctl: add riscv to unaligned-trap supported archs

2025-03-17 Thread Clément Léger
riscv supports the "unaligned-trap" sysctl variable, add it to the list
of supported architectures.

Signed-off-by: Clément Léger 
---
 Documentation/admin-guide/sysctl/kernel.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst 
b/Documentation/admin-guide/sysctl/kernel.rst
index dd49a89a62d3..a38e91c4d92c 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1595,8 +1595,8 @@ unaligned-trap
 
 On architectures where unaligned accesses cause traps, and where this
 feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_ALLOW``; currently,
-``arc``, ``parisc`` and ``loongarch``), controls whether unaligned traps
-are caught and emulated (instead of failing).
+``arc``, ``parisc``, ``loongarch`` and ``riscv``), controls whether unaligned
+traps are caught and emulated (instead of failing).
 
 = 
 0 Do not emulate unaligned accesses.
-- 
2.47.2




[PATCH v4 17/18] RISC-V: KVM: add support for FWFT SBI extension

2025-03-17 Thread Clément Léger
Add basic infrastructure to support the FWFT extension in KVM.

Signed-off-by: Clément Léger 
Reviewed-by: Andrew Jones 
---
 arch/riscv/include/asm/kvm_host.h  |   4 +
 arch/riscv/include/asm/kvm_vcpu_sbi.h  |   1 +
 arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h |  29 +++
 arch/riscv/include/uapi/asm/kvm.h  |   1 +
 arch/riscv/kvm/Makefile|   1 +
 arch/riscv/kvm/vcpu_sbi.c  |   4 +
 arch/riscv/kvm/vcpu_sbi_fwft.c | 216 +
 7 files changed, 256 insertions(+)
 create mode 100644 arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h
 create mode 100644 arch/riscv/kvm/vcpu_sbi_fwft.c

diff --git a/arch/riscv/include/asm/kvm_host.h 
b/arch/riscv/include/asm/kvm_host.h
index bb93d2995ea2..c0db61ba691a 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -281,6 +282,9 @@ struct kvm_vcpu_arch {
/* Performance monitoring context */
struct kvm_pmu pmu_context;
 
+   /* Firmware feature SBI extension context */
+   struct kvm_sbi_fwft fwft_context;
+
/* 'static' configurations which are set only once */
struct kvm_vcpu_config cfg;
 
diff --git a/arch/riscv/include/asm/kvm_vcpu_sbi.h 
b/arch/riscv/include/asm/kvm_vcpu_sbi.h
index cb68b3a57c8f..ffd03fed0c06 100644
--- a/arch/riscv/include/asm/kvm_vcpu_sbi.h
+++ b/arch/riscv/include/asm/kvm_vcpu_sbi.h
@@ -98,6 +98,7 @@ extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_hsm;
 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_dbcn;
 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_susp;
 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_sta;
+extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_fwft;
 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_experimental;
 extern const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_vendor;
 
diff --git a/arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h 
b/arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h
new file mode 100644
index ..9ba841355758
--- /dev/null
+++ b/arch/riscv/include/asm/kvm_vcpu_sbi_fwft.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2025 Rivos Inc.
+ *
+ * Authors:
+ * Clément Léger 
+ */
+
+#ifndef __KVM_VCPU_RISCV_FWFT_H
+#define __KVM_VCPU_RISCV_FWFT_H
+
+#include 
+
+struct kvm_sbi_fwft_feature;
+
+struct kvm_sbi_fwft_config {
+   const struct kvm_sbi_fwft_feature *feature;
+   bool supported;
+   unsigned long flags;
+};
+
+/* FWFT data structure per vcpu */
+struct kvm_sbi_fwft {
+   struct kvm_sbi_fwft_config *configs;
+};
+
+#define vcpu_to_fwft(vcpu) (&(vcpu)->arch.fwft_context)
+
+#endif /* !__KVM_VCPU_RISCV_FWFT_H */
diff --git a/arch/riscv/include/uapi/asm/kvm.h 
b/arch/riscv/include/uapi/asm/kvm.h
index f06bc5efcd79..fa6eee1caf41 100644
--- a/arch/riscv/include/uapi/asm/kvm.h
+++ b/arch/riscv/include/uapi/asm/kvm.h
@@ -202,6 +202,7 @@ enum KVM_RISCV_SBI_EXT_ID {
KVM_RISCV_SBI_EXT_DBCN,
KVM_RISCV_SBI_EXT_STA,
KVM_RISCV_SBI_EXT_SUSP,
+   KVM_RISCV_SBI_EXT_FWFT,
KVM_RISCV_SBI_EXT_MAX,
 };
 
diff --git a/arch/riscv/kvm/Makefile b/arch/riscv/kvm/Makefile
index 4e0bba91d284..06e2d52a9b88 100644
--- a/arch/riscv/kvm/Makefile
+++ b/arch/riscv/kvm/Makefile
@@ -26,6 +26,7 @@ kvm-y += vcpu_onereg.o
 kvm-$(CONFIG_RISCV_PMU_SBI) += vcpu_pmu.o
 kvm-y += vcpu_sbi.o
 kvm-y += vcpu_sbi_base.o
+kvm-y += vcpu_sbi_fwft.o
 kvm-y += vcpu_sbi_hsm.o
 kvm-$(CONFIG_RISCV_PMU_SBI) += vcpu_sbi_pmu.o
 kvm-y += vcpu_sbi_replace.o
diff --git a/arch/riscv/kvm/vcpu_sbi.c b/arch/riscv/kvm/vcpu_sbi.c
index 50be079b5528..0748810c0252 100644
--- a/arch/riscv/kvm/vcpu_sbi.c
+++ b/arch/riscv/kvm/vcpu_sbi.c
@@ -78,6 +78,10 @@ static const struct kvm_riscv_sbi_extension_entry sbi_ext[] 
= {
.ext_idx = KVM_RISCV_SBI_EXT_STA,
.ext_ptr = &vcpu_sbi_ext_sta,
},
+   {
+   .ext_idx = KVM_RISCV_SBI_EXT_FWFT,
+   .ext_ptr = &vcpu_sbi_ext_fwft,
+   },
{
.ext_idx = KVM_RISCV_SBI_EXT_EXPERIMENTAL,
.ext_ptr = &vcpu_sbi_ext_experimental,
diff --git a/arch/riscv/kvm/vcpu_sbi_fwft.c b/arch/riscv/kvm/vcpu_sbi_fwft.c
new file mode 100644
index ..8a7cfe1fe7a7
--- /dev/null
+++ b/arch/riscv/kvm/vcpu_sbi_fwft.c
@@ -0,0 +1,216 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2025 Rivos Inc.
+ *
+ * Authors:
+ * Clément Léger 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct kvm_sbi_fwft_feature {
+   /**
+* @id: Feature ID
+*/
+   enum sbi_fwft_feature_t id;
+
+   /**
+* @supported: Check if the feature is supported on the vcpu
+*
+* This callback is optional, if not provided the feature is assumed to
+* be supported
+*/
+   bool (*supported)(struct kvm_

[PATCH v4 16/18] RISC-V: KVM: add SBI extension reset callback

2025-03-17 Thread Clément Léger
Currently, only the STA extension needs a reset function, but that's
going to be the case for FWFT as well. Add a reset callback that can be
implemented by SBI extensions.

Signed-off-by: Clément Léger 
Reviewed-by: Andrew Jones 
---
 arch/riscv/include/asm/kvm_host.h |  1 -
 arch/riscv/include/asm/kvm_vcpu_sbi.h |  2 ++
 arch/riscv/kvm/vcpu.c |  2 +-
 arch/riscv/kvm/vcpu_sbi.c | 24 
 arch/riscv/kvm/vcpu_sbi_sta.c |  3 ++-
 5 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/include/asm/kvm_host.h 
b/arch/riscv/include/asm/kvm_host.h
index cc33e35cd628..bb93d2995ea2 100644
--- a/arch/riscv/include/asm/kvm_host.h
+++ b/arch/riscv/include/asm/kvm_host.h
@@ -409,7 +409,6 @@ void __kvm_riscv_vcpu_power_on(struct kvm_vcpu *vcpu);
 void kvm_riscv_vcpu_power_on(struct kvm_vcpu *vcpu);
 bool kvm_riscv_vcpu_stopped(struct kvm_vcpu *vcpu);
 
-void kvm_riscv_vcpu_sbi_sta_reset(struct kvm_vcpu *vcpu);
 void kvm_riscv_vcpu_record_steal_time(struct kvm_vcpu *vcpu);
 
 #endif /* __RISCV_KVM_HOST_H__ */
diff --git a/arch/riscv/include/asm/kvm_vcpu_sbi.h 
b/arch/riscv/include/asm/kvm_vcpu_sbi.h
index bcb90757b149..cb68b3a57c8f 100644
--- a/arch/riscv/include/asm/kvm_vcpu_sbi.h
+++ b/arch/riscv/include/asm/kvm_vcpu_sbi.h
@@ -57,6 +57,7 @@ struct kvm_vcpu_sbi_extension {
 */
int (*init)(struct kvm_vcpu *vcpu);
void (*deinit)(struct kvm_vcpu *vcpu);
+   void (*reset)(struct kvm_vcpu *vcpu);
 };
 
 void kvm_riscv_vcpu_sbi_forward(struct kvm_vcpu *vcpu, struct kvm_run *run);
@@ -78,6 +79,7 @@ bool riscv_vcpu_supports_sbi_ext(struct kvm_vcpu *vcpu, int 
idx);
 int kvm_riscv_vcpu_sbi_ecall(struct kvm_vcpu *vcpu, struct kvm_run *run);
 void kvm_riscv_vcpu_sbi_init(struct kvm_vcpu *vcpu);
 void kvm_riscv_vcpu_sbi_deinit(struct kvm_vcpu *vcpu);
+void kvm_riscv_vcpu_sbi_reset(struct kvm_vcpu *vcpu);
 
 int kvm_riscv_vcpu_get_reg_sbi_sta(struct kvm_vcpu *vcpu, unsigned long 
reg_num,
   unsigned long *reg_val);
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 877bcc85c067..542747e2c7f5 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -94,7 +94,7 @@ static void kvm_riscv_reset_vcpu(struct kvm_vcpu *vcpu)
vcpu->arch.hfence_tail = 0;
memset(vcpu->arch.hfence_queue, 0, sizeof(vcpu->arch.hfence_queue));
 
-   kvm_riscv_vcpu_sbi_sta_reset(vcpu);
+   kvm_riscv_vcpu_sbi_reset(vcpu);
 
/* Reset the guest CSRs for hotplug usecase */
if (loaded)
diff --git a/arch/riscv/kvm/vcpu_sbi.c b/arch/riscv/kvm/vcpu_sbi.c
index 3139f171c20f..50be079b5528 100644
--- a/arch/riscv/kvm/vcpu_sbi.c
+++ b/arch/riscv/kvm/vcpu_sbi.c
@@ -536,3 +536,27 @@ void kvm_riscv_vcpu_sbi_deinit(struct kvm_vcpu *vcpu)
ext->deinit(vcpu);
}
 }
+
+void kvm_riscv_vcpu_sbi_reset(struct kvm_vcpu *vcpu)
+{
+   struct kvm_vcpu_sbi_context *scontext = &vcpu->arch.sbi_context;
+   const struct kvm_riscv_sbi_extension_entry *entry;
+   const struct kvm_vcpu_sbi_extension *ext;
+   int idx, i;
+
+   for (i = 0; i < ARRAY_SIZE(sbi_ext); i++) {
+   entry = &sbi_ext[i];
+   ext = entry->ext_ptr;
+   idx = entry->ext_idx;
+
+   if (idx < 0 || idx >= ARRAY_SIZE(scontext->ext_status))
+   continue;
+
+   if (scontext->ext_status[idx] != 
KVM_RISCV_SBI_EXT_STATUS_ENABLED ||
+   !ext->reset)
+   continue;
+
+   ext->reset(vcpu);
+   }
+}
+
diff --git a/arch/riscv/kvm/vcpu_sbi_sta.c b/arch/riscv/kvm/vcpu_sbi_sta.c
index 5f35427114c1..cc6cb7c8f0e4 100644
--- a/arch/riscv/kvm/vcpu_sbi_sta.c
+++ b/arch/riscv/kvm/vcpu_sbi_sta.c
@@ -16,7 +16,7 @@
 #include 
 #include 
 
-void kvm_riscv_vcpu_sbi_sta_reset(struct kvm_vcpu *vcpu)
+static void kvm_riscv_vcpu_sbi_sta_reset(struct kvm_vcpu *vcpu)
 {
vcpu->arch.sta.shmem = INVALID_GPA;
vcpu->arch.sta.last_steal = 0;
@@ -156,6 +156,7 @@ const struct kvm_vcpu_sbi_extension vcpu_sbi_ext_sta = {
.extid_end = SBI_EXT_STA,
.handler = kvm_sbi_ext_sta_handler,
.probe = kvm_sbi_ext_sta_probe,
+   .reset = kvm_riscv_vcpu_sbi_sta_reset,
 };
 
 int kvm_riscv_vcpu_get_reg_sbi_sta(struct kvm_vcpu *vcpu,
-- 
2.47.2




[PATCH v4 00/18] riscv: add SBI FWFT misaligned exception delegation support

2025-03-17 Thread Clément Léger
The SBI Firmware Feature extension allows the S-mode to request some
specific features (either hardware or software) to be enabled. This
series uses this extension to request misaligned access exception
delegation to S-mode in order to let the kernel handle it. It also adds
support for the KVM FWFT SBI extension based on the misaligned access
handling infrastructure.

The FWFT SBI extension is part of the SBI V3.0 specification [1]. It can
be tested using the QEMU provided at [2], which contains the series from
[3]. kvm-unit-tests [4] can be used inside KVM to test the correct
delegation of misaligned exceptions. Upstream OpenSBI can be used.

Note: Since SBI V3.0 is not yet ratified, the FWFT extension API is
split between interface and implementation, allowing us to pick up only
the interface, which does not have hard dependencies on SBI.

The tests can be run using the included kselftest:

$ qemu-system-riscv64 \
-cpu rv64,trap-misaligned-access=true,v=true \
-M virt \
-m 1024M \
-bios fw_dynamic.bin \
-kernel Image
 ...

 # ./misaligned
 TAP version 13
 1..23
 # Starting 23 tests from 1 test cases.
 #  RUN   global.gp_load_lh ...
 #OK  global.gp_load_lh
 ok 1 global.gp_load_lh
 #  RUN   global.gp_load_lhu ...
 #OK  global.gp_load_lhu
 ok 2 global.gp_load_lhu
 #  RUN   global.gp_load_lw ...
 #OK  global.gp_load_lw
 ok 3 global.gp_load_lw
 #  RUN   global.gp_load_lwu ...
 #OK  global.gp_load_lwu
 ok 4 global.gp_load_lwu
 #  RUN   global.gp_load_ld ...
 #OK  global.gp_load_ld
 ok 5 global.gp_load_ld
 #  RUN   global.gp_load_c_lw ...
 #OK  global.gp_load_c_lw
 ok 6 global.gp_load_c_lw
 #  RUN   global.gp_load_c_ld ...
 #OK  global.gp_load_c_ld
 ok 7 global.gp_load_c_ld
 #  RUN   global.gp_load_c_ldsp ...
 #OK  global.gp_load_c_ldsp
 ok 8 global.gp_load_c_ldsp
 #  RUN   global.gp_load_sh ...
 #OK  global.gp_load_sh
 ok 9 global.gp_load_sh
 #  RUN   global.gp_load_sw ...
 #OK  global.gp_load_sw
 ok 10 global.gp_load_sw
 #  RUN   global.gp_load_sd ...
 #OK  global.gp_load_sd
 ok 11 global.gp_load_sd
 #  RUN   global.gp_load_c_sw ...
 #OK  global.gp_load_c_sw
 ok 12 global.gp_load_c_sw
 #  RUN   global.gp_load_c_sd ...
 #OK  global.gp_load_c_sd
 ok 13 global.gp_load_c_sd
 #  RUN   global.gp_load_c_sdsp ...
 #OK  global.gp_load_c_sdsp
 ok 14 global.gp_load_c_sdsp
 #  RUN   global.fpu_load_flw ...
 #OK  global.fpu_load_flw
 ok 15 global.fpu_load_flw
 #  RUN   global.fpu_load_fld ...
 #OK  global.fpu_load_fld
 ok 16 global.fpu_load_fld
 #  RUN   global.fpu_load_c_fld ...
 #OK  global.fpu_load_c_fld
 ok 17 global.fpu_load_c_fld
 #  RUN   global.fpu_load_c_fldsp ...
 #OK  global.fpu_load_c_fldsp
 ok 18 global.fpu_load_c_fldsp
 #  RUN   global.fpu_store_fsw ...
 #OK  global.fpu_store_fsw
 ok 19 global.fpu_store_fsw
 #  RUN   global.fpu_store_fsd ...
 #OK  global.fpu_store_fsd
 ok 20 global.fpu_store_fsd
 #  RUN   global.fpu_store_c_fsd ...
 #OK  global.fpu_store_c_fsd
 ok 21 global.fpu_store_c_fsd
 #  RUN   global.fpu_store_c_fsdsp ...
 #OK  global.fpu_store_c_fsdsp
 ok 22 global.fpu_store_c_fsdsp
 #  RUN   global.gen_sigbus ...
 [12797.988647] misaligned[618]: unhandled signal 7 code 0x1 at 
0x00014dc0 in misaligned[4dc0,1+76000]
 [12797.988990] CPU: 0 UID: 0 PID: 618 Comm: misaligned Not tainted 
6.13.0-rc6-8-g4ec4468967c9-dirty #51
 [12797.989169] Hardware name: riscv-virtio,qemu (DT)
 [12797.989264] epc : 00014dc0 ra : 00014d00 sp : 
7fffe165d100
 [12797.989407]  gp : 0008f6e8 tp : 00095760 t0 : 
0008
 [12797.989544]  t1 : 000965d8 t2 : 0008e830 s0 : 
7fffe165d160
 [12797.989692]  s1 : 001a a0 :  a1 : 
0002
 [12797.989831]  a2 :  a3 :  a4 : 
deadbeef
 [12797.989964]  a5 : 0008ef61 a6 : 626769735f6e a7 : 
f000
 [12797.990094]  s2 : 0001 s3 : 7fffe165d838 s4 : 
7fffe165d848
 [12797.990238]  s5 : 001a s6 : 00010442 s7 : 
00010200
 [12797.990391]  s8 : 003a s9 : 00094508 s10: 

 [12797.990526]  s11: 67460668 t3 : 7fffe165d070 t4 : 
000965d0
 [12797.990656]  t5 : fefefefefefefeff t6 : 0073
 [12797.990756] status: 00024020 badaddr: 0008ef61 cause: 
0006
 [12797.990911] Code: 8793 8791 3423 fcf4 3783 fc84 c737 dead 0713 eef7 (c398) 
0001
 #OK  global.gen_sigbus
 ok 23 global.gen_sigbus
 # PASSED: 23 / 23 tests pas

[PATCH v4 03/18] riscv: sbi: add FWFT extension interface

2025-03-17 Thread Clément Léger
This SBI extension enables supervisor mode to control features that are
under M-mode control (for instance, the Svadu menvcfg ADUE bit, Ssdbltrp
DTE, etc.). Add an interface to set local features for a specific cpu
mask, as well as for the online cpu mask.

Signed-off-by: Clément Léger 
---
 arch/riscv/include/asm/sbi.h | 20 +++
 arch/riscv/kernel/sbi.c  | 69 
 2 files changed, 89 insertions(+)

diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index d11d22717b49..1cecfa82c2e5 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -503,6 +503,26 @@ int sbi_remote_hfence_vvma_asid(const struct cpumask 
*cpu_mask,
unsigned long asid);
 long sbi_probe_extension(int ext);
 
+int sbi_fwft_local_set_cpumask(const cpumask_t *mask, u32 feature,
+  unsigned long value, unsigned long flags);
+/**
+ * sbi_fwft_local_set() - Set a feature on all online cpus
+ * @feature: The feature to be set
+ * @value: The feature value to be set
+ * @flags: FWFT feature set flags
+ *
+ * Return: 0 on success, appropriate linux error code otherwise.
+ */
+static inline int sbi_fwft_local_set(u32 feature, unsigned long value,
+				     unsigned long flags)
+{
+	return sbi_fwft_local_set_cpumask(cpu_online_mask, feature, value,
+					  flags);
+}
+
+int sbi_fwft_get(u32 feature, unsigned long *value);
+int sbi_fwft_set(u32 feature, unsigned long value, unsigned long flags);
+
 /* Check if current SBI specification version is 0.1 or not */
 static inline int sbi_spec_is_0_1(void)
 {
diff --git a/arch/riscv/kernel/sbi.c b/arch/riscv/kernel/sbi.c
index 1989b8cade1b..d41a5642be24 100644
--- a/arch/riscv/kernel/sbi.c
+++ b/arch/riscv/kernel/sbi.c
@@ -299,6 +299,75 @@ static int __sbi_rfence_v02(int fid, const struct cpumask 
*cpu_mask,
return 0;
 }
 
+/**
+ * sbi_fwft_get() - Get a feature for the local hart
+ * @feature: The feature ID to get
+ * @value: Will contain the feature value on success
+ *
+ * Return: 0 on success, appropriate linux error code otherwise.
+ */
+int sbi_fwft_get(u32 feature, unsigned long *value)
+{
+   return -EOPNOTSUPP;
+}
+
+/**
+ * sbi_fwft_set() - Set a feature on the local hart
+ * @feature: The feature ID to be set
+ * @value: The feature value to be set
+ * @flags: FWFT feature set flags
+ *
+ * Return: 0 on success, appropriate linux error code otherwise.
+ */
+int sbi_fwft_set(u32 feature, unsigned long value, unsigned long flags)
+{
+   return -EOPNOTSUPP;
+}
+
+struct fwft_set_req {
+   u32 feature;
+   unsigned long value;
+   unsigned long flags;
+   atomic_t error;
+};
+
+static void cpu_sbi_fwft_set(void *arg)
+{
+   struct fwft_set_req *req = arg;
+   int ret;
+
+   ret = sbi_fwft_set(req->feature, req->value, req->flags);
+   if (ret)
+   atomic_set(&req->error, ret);
+}
+
+/**
+ * sbi_fwft_local_set_cpumask() - Set a feature for the specified cpumask
+ * @mask: CPU mask of cpus that need the feature to be set
+ * @feature: The feature ID to be set
+ * @value: The feature value to be set
+ * @flags: FWFT feature set flags
+ *
+ * Return: 0 on success, appropriate linux error code otherwise.
+ */
+int sbi_fwft_local_set_cpumask(const cpumask_t *mask, u32 feature,
+  unsigned long value, unsigned long flags)
+{
+   struct fwft_set_req req = {
+   .feature = feature,
+   .value = value,
+   .flags = flags,
+   .error = ATOMIC_INIT(0),
+   };
+
+   if (feature & SBI_FWFT_GLOBAL_FEATURE_BIT)
+   return -EINVAL;
+
+   on_each_cpu_mask(mask, cpu_sbi_fwft_set, &req, 1);
+
+   return atomic_read(&req.error);
+}
+
 /**
  * sbi_set_timer() - Program the timer for next timer event.
  * @stime_value: The value after which next timer event should fire.
-- 
2.47.2
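
(A minimal caller sketch for the new interface, mirroring how a later
patch in this series uses it to request misaligned exception delegation;
SBI_FWFT_MISALIGNED_EXC_DELEG comes from patch 01:)

static void __init request_misaligned_deleg_sketch(void)
{
	/* value = 1: delegate; flags = 0: do not lock the feature */
	int ret = sbi_fwft_local_set(SBI_FWFT_MISALIGNED_EXC_DELEG, 1, 0);

	if (ret)
		pr_info("misaligned delegation not available: %d\n", ret);
}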




[PATCH v4 05/18] riscv: misaligned: request misaligned exception from SBI

2025-03-17 Thread Clément Léger
Now that the kernel can handle misaligned accesses in S-mode, request
misaligned access exception delegation from SBI. This uses the FWFT SBI
extension defined in SBI version 3.0.

Signed-off-by: Clément Léger 
Reviewed-by: Andrew Jones 
---
 arch/riscv/include/asm/cpufeature.h|  3 +-
 arch/riscv/kernel/traps_misaligned.c   | 77 +-
 arch/riscv/kernel/unaligned_access_speed.c | 11 +++-
 3 files changed, 86 insertions(+), 5 deletions(-)

diff --git a/arch/riscv/include/asm/cpufeature.h 
b/arch/riscv/include/asm/cpufeature.h
index 569140d6e639..ad7d26788e6a 100644
--- a/arch/riscv/include/asm/cpufeature.h
+++ b/arch/riscv/include/asm/cpufeature.h
@@ -64,8 +64,9 @@ void __init riscv_user_isa_enable(void);
_RISCV_ISA_EXT_DATA(_name, _id, _sub_exts, ARRAY_SIZE(_sub_exts), 
_validate)
 
 bool check_unaligned_access_emulated_all_cpus(void);
+void unaligned_access_init(void);
+int cpu_online_unaligned_access_init(unsigned int cpu);
 #if defined(CONFIG_RISCV_SCALAR_MISALIGNED)
-void check_unaligned_access_emulated(struct work_struct *work __always_unused);
 void unaligned_emulation_finish(void);
 bool unaligned_ctl_available(void);
 DECLARE_PER_CPU(long, misaligned_access_speed);
diff --git a/arch/riscv/kernel/traps_misaligned.c 
b/arch/riscv/kernel/traps_misaligned.c
index 7cc108aed74e..fa7f100b95bd 100644
--- a/arch/riscv/kernel/traps_misaligned.c
+++ b/arch/riscv/kernel/traps_misaligned.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define INSN_MATCH_LB  0x3
@@ -635,7 +636,7 @@ bool check_vector_unaligned_access_emulated_all_cpus(void)
 
 static bool unaligned_ctl __read_mostly;
 
-void check_unaligned_access_emulated(struct work_struct *work __always_unused)
+static void check_unaligned_access_emulated(struct work_struct *work 
__always_unused)
 {
int cpu = smp_processor_id();
long *mas_ptr = per_cpu_ptr(&misaligned_access_speed, cpu);
@@ -646,6 +647,13 @@ void check_unaligned_access_emulated(struct work_struct 
*work __always_unused)
__asm__ __volatile__ (
"   "REG_L" %[tmp], 1(%[ptr])\n"
: [tmp] "=r" (tmp_val) : [ptr] "r" (&tmp_var) : "memory");
+}
+
+static int cpu_online_check_unaligned_access_emulated(unsigned int cpu)
+{
+   long *mas_ptr = per_cpu_ptr(&misaligned_access_speed, cpu);
+
+   check_unaligned_access_emulated(NULL);
 
/*
 * If unaligned_ctl is already set, this means that we detected that all
@@ -654,9 +662,10 @@ void check_unaligned_access_emulated(struct work_struct 
*work __always_unused)
 */
if (unlikely(unaligned_ctl && (*mas_ptr != 
RISCV_HWPROBE_MISALIGNED_SCALAR_EMULATED))) {
pr_crit("CPU misaligned accesses non homogeneous (expected all 
emulated)\n");
-   while (true)
-   cpu_relax();
+   return -EINVAL;
}
+
+   return 0;
 }
 
 bool check_unaligned_access_emulated_all_cpus(void)
@@ -688,4 +697,66 @@ bool check_unaligned_access_emulated_all_cpus(void)
 {
return false;
 }
+static int cpu_online_check_unaligned_access_emulated(unsigned int cpu)
+{
+   return 0;
+}
 #endif
+
+#ifdef CONFIG_RISCV_SBI
+
+static bool misaligned_traps_delegated;
+
+static int cpu_online_sbi_unaligned_setup(unsigned int cpu)
+{
+   if (sbi_fwft_set(SBI_FWFT_MISALIGNED_EXC_DELEG, 1, 0) &&
+   misaligned_traps_delegated) {
+   pr_crit("Misaligned trap delegation non homogeneous (expected 
delegated)");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static void unaligned_sbi_request_delegation(void)
+{
+   int ret;
+
+   ret = sbi_fwft_local_set(SBI_FWFT_MISALIGNED_EXC_DELEG, 1, 0);
+   if (ret)
+   return;
+
+   misaligned_traps_delegated = true;
+   pr_info("SBI misaligned access exception delegation ok\n");
+   /*
+* Note that we don't have to take any specific action here, if
+* the delegation is successful, then
+* check_unaligned_access_emulated() will verify that indeed the
+* platform traps on misaligned accesses.
+*/
+}
+
+void unaligned_access_init(void)
+{
+   if (sbi_probe_extension(SBI_EXT_FWFT) > 0)
+   unaligned_sbi_request_delegation();
+}
+#else
+void unaligned_access_init(void) {}
+
+static int cpu_online_sbi_unaligned_setup(unsigned int cpu __always_unused)
+{
+   return 0;
+}
+#endif
+
+int cpu_online_unaligned_access_init(unsigned int cpu)
+{
+   int ret;
+
+   ret = cpu_online_sbi_unaligned_setup(cpu);
+   if (ret)
+   return ret;
+
+   return cpu_online_check_unaligned_access_emulated(cpu);
+}
diff --git a/arch/riscv/kernel/unaligned_access_speed.c 
b/arch/riscv/kernel/unaligned_access_speed.c
index 91f189cf1611..2f3aba073297 100644
--- a/arch/riscv/kernel/unaligned_access_speed.c
+++ b/arch/riscv/kernel/unaligned_access_speed.c
@@ -188,13 +188,20 @@ 
arch_i

[PATCH v4 10/18] riscv: misaligned: factorize trap handling

2025-03-17 Thread Clément Léger
Misaligned access traps are not NMIs and should be treated as normal
traps using irqentry_enter()/exit(). Since both load/store and
user/kernel should use almost the same path, and since we are going to
add some code around that, factorize it.

Signed-off-by: Clément Léger 
---
 arch/riscv/kernel/traps.c | 49 ---
 1 file changed, 20 insertions(+), 29 deletions(-)

diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 8ff8e8b36524..55d9f3450398 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -198,47 +198,38 @@ asmlinkage __visible __trap_section void 
do_trap_insn_illegal(struct pt_regs *re
 DO_ERROR_INFO(do_trap_load_fault,
SIGSEGV, SEGV_ACCERR, "load access fault");
 
-asmlinkage __visible __trap_section void do_trap_load_misaligned(struct 
pt_regs *regs)
+enum misaligned_access_type {
+   MISALIGNED_STORE,
+   MISALIGNED_LOAD,
+};
+
+static void do_trap_misaligned(struct pt_regs *regs, enum 
misaligned_access_type type)
 {
-   if (user_mode(regs)) {
-   irqentry_enter_from_user_mode(regs);
+   irqentry_state_t state = irqentry_enter(regs);
 
+   if (type ==  MISALIGNED_LOAD) {
if (handle_misaligned_load(regs))
do_trap_error(regs, SIGBUS, BUS_ADRALN, regs->epc,
- "Oops - load address misaligned");
-
-   irqentry_exit_to_user_mode(regs);
+ "Oops - load address misaligned");
} else {
-   irqentry_state_t state = irqentry_nmi_enter(regs);
-
-   if (handle_misaligned_load(regs))
+   if (handle_misaligned_store(regs))
do_trap_error(regs, SIGBUS, BUS_ADRALN, regs->epc,
- "Oops - load address misaligned");
-
-   irqentry_nmi_exit(regs, state);
+ "Oops - store (or AMO) address 
misaligned");
}
+
+   irqentry_exit(regs, state);
 }
 
-asmlinkage __visible __trap_section void do_trap_store_misaligned(struct 
pt_regs *regs)
+asmlinkage __visible __trap_section void do_trap_load_misaligned(struct 
pt_regs *regs)
 {
-   if (user_mode(regs)) {
-   irqentry_enter_from_user_mode(regs);
-
-   if (handle_misaligned_store(regs))
-   do_trap_error(regs, SIGBUS, BUS_ADRALN, regs->epc,
-   "Oops - store (or AMO) address misaligned");
-
-   irqentry_exit_to_user_mode(regs);
-   } else {
-   irqentry_state_t state = irqentry_nmi_enter(regs);
-
-   if (handle_misaligned_store(regs))
-   do_trap_error(regs, SIGBUS, BUS_ADRALN, regs->epc,
-   "Oops - store (or AMO) address misaligned");
+   do_trap_misaligned(regs, MISALIGNED_LOAD);
+}
 
-   irqentry_nmi_exit(regs, state);
-   }
+asmlinkage __visible __trap_section void do_trap_store_misaligned(struct 
pt_regs *regs)
+{
+   do_trap_misaligned(regs, MISALIGNED_STORE);
 }
+
 DO_ERROR_INFO(do_trap_store_fault,
SIGSEGV, SEGV_ACCERR, "store (or AMO) access fault");
 DO_ERROR_INFO(do_trap_ecall_s,
-- 
2.47.2




[PATCH v4 09/18] riscv: misaligned: add a function to check misalign trap delegability

2025-03-17 Thread Clément Léger
Checking for the delegability of the misaligned access trap is needed
for the KVM FWFT extension implementation. Add a function reporting
whether the misaligned trap exception can be delegated.

Signed-off-by: Clément Léger 
Reviewed-by: Andrew Jones 
---
 arch/riscv/include/asm/cpufeature.h  |  5 +
 arch/riscv/kernel/traps_misaligned.c | 17 +++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/include/asm/cpufeature.h 
b/arch/riscv/include/asm/cpufeature.h
index ad7d26788e6a..8b97cba99fc3 100644
--- a/arch/riscv/include/asm/cpufeature.h
+++ b/arch/riscv/include/asm/cpufeature.h
@@ -69,12 +69,17 @@ int cpu_online_unaligned_access_init(unsigned int cpu);
 #if defined(CONFIG_RISCV_SCALAR_MISALIGNED)
 void unaligned_emulation_finish(void);
 bool unaligned_ctl_available(void);
+bool misaligned_traps_can_delegate(void);
 DECLARE_PER_CPU(long, misaligned_access_speed);
 #else
 static inline bool unaligned_ctl_available(void)
 {
return false;
 }
+static inline bool misaligned_traps_can_delegate(void)
+{
+   return false;
+}
 #endif
 
 bool check_vector_unaligned_access_emulated_all_cpus(void);
diff --git a/arch/riscv/kernel/traps_misaligned.c 
b/arch/riscv/kernel/traps_misaligned.c
index 3c77fc78fe4f..0fb663ac200f 100644
--- a/arch/riscv/kernel/traps_misaligned.c
+++ b/arch/riscv/kernel/traps_misaligned.c
@@ -715,10 +715,10 @@ static int 
cpu_online_check_unaligned_access_emulated(unsigned int cpu)
 }
 #endif
 
-#ifdef CONFIG_RISCV_SBI
-
 static bool misaligned_traps_delegated;
 
+#ifdef CONFIG_RISCV_SBI
+
 static int cpu_online_sbi_unaligned_setup(unsigned int cpu)
 {
if (sbi_fwft_set(SBI_FWFT_MISALIGNED_EXC_DELEG, 1, 0) &&
@@ -760,6 +760,7 @@ static int cpu_online_sbi_unaligned_setup(unsigned int cpu 
__always_unused)
 {
return 0;
 }
+
 #endif
 
 int cpu_online_unaligned_access_init(unsigned int cpu)
@@ -772,3 +773,15 @@ int cpu_online_unaligned_access_init(unsigned int cpu)
 
return cpu_online_check_unaligned_access_emulated(cpu);
 }
+
+bool misaligned_traps_can_delegate(void)
+{
+   /*
+	 * Either we successfully requested misaligned trap delegation for all
+	 * CPUs, or the SBI does not implement the FWFT extension but delegates
+	 * the exception by default.
+*/
+   return misaligned_traps_delegated ||
+  all_cpus_unaligned_scalar_access_emulated();
+}
+EXPORT_SYMBOL_GPL(misaligned_traps_can_delegate);
-- 
2.47.2
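
(For context, the consumer added later in the series (patch 18) gates
the guest-visible feature on exactly this helper:)

static bool kvm_sbi_fwft_misaligned_delegation_supported(struct kvm_vcpu *vcpu)
{
	return misaligned_traps_can_delegate();
}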




[PATCH v4 15/18] RISC-V: KVM: add SBI extension init()/deinit() functions

2025-03-17 Thread Clément Léger
The FWFT SBI extension will need to dynamically allocate memory and do
init-time-specific initialization. Add init/deinit callbacks that allow
extensions to do so.

Signed-off-by: Clément Léger 
---
 arch/riscv/include/asm/kvm_vcpu_sbi.h |  9 +
 arch/riscv/kvm/vcpu.c |  2 ++
 arch/riscv/kvm/vcpu_sbi.c | 26 ++
 3 files changed, 37 insertions(+)

diff --git a/arch/riscv/include/asm/kvm_vcpu_sbi.h 
b/arch/riscv/include/asm/kvm_vcpu_sbi.h
index 4ed6203cdd30..bcb90757b149 100644
--- a/arch/riscv/include/asm/kvm_vcpu_sbi.h
+++ b/arch/riscv/include/asm/kvm_vcpu_sbi.h
@@ -49,6 +49,14 @@ struct kvm_vcpu_sbi_extension {
 
/* Extension specific probe function */
unsigned long (*probe)(struct kvm_vcpu *vcpu);
+
+   /*
+	 * Init/deinit functions called once during VCPU init/destroy. These
+	 * might be used if the SBI extension needs to allocate memory or do
+	 * init-time-only configuration.
+*/
+   int (*init)(struct kvm_vcpu *vcpu);
+   void (*deinit)(struct kvm_vcpu *vcpu);
 };
 
 void kvm_riscv_vcpu_sbi_forward(struct kvm_vcpu *vcpu, struct kvm_run *run);
@@ -69,6 +77,7 @@ const struct kvm_vcpu_sbi_extension *kvm_vcpu_sbi_find_ext(
 bool riscv_vcpu_supports_sbi_ext(struct kvm_vcpu *vcpu, int idx);
 int kvm_riscv_vcpu_sbi_ecall(struct kvm_vcpu *vcpu, struct kvm_run *run);
 void kvm_riscv_vcpu_sbi_init(struct kvm_vcpu *vcpu);
+void kvm_riscv_vcpu_sbi_deinit(struct kvm_vcpu *vcpu);
 
 int kvm_riscv_vcpu_get_reg_sbi_sta(struct kvm_vcpu *vcpu, unsigned long 
reg_num,
   unsigned long *reg_val);
diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 60d684c76c58..877bcc85c067 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -185,6 +185,8 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
+   kvm_riscv_vcpu_sbi_deinit(vcpu);
+
/* Cleanup VCPU AIA context */
kvm_riscv_vcpu_aia_deinit(vcpu);
 
diff --git a/arch/riscv/kvm/vcpu_sbi.c b/arch/riscv/kvm/vcpu_sbi.c
index d1c83a77735e..3139f171c20f 100644
--- a/arch/riscv/kvm/vcpu_sbi.c
+++ b/arch/riscv/kvm/vcpu_sbi.c
@@ -508,5 +508,31 @@ void kvm_riscv_vcpu_sbi_init(struct kvm_vcpu *vcpu)
scontext->ext_status[idx] = ext->default_disabled ?
KVM_RISCV_SBI_EXT_STATUS_DISABLED :
KVM_RISCV_SBI_EXT_STATUS_ENABLED;
+
+   if (ext->init && ext->init(vcpu) != 0)
+   scontext->ext_status[idx] = 
KVM_RISCV_SBI_EXT_STATUS_UNAVAILABLE;
+   }
+}
+
+void kvm_riscv_vcpu_sbi_deinit(struct kvm_vcpu *vcpu)
+{
+   struct kvm_vcpu_sbi_context *scontext = &vcpu->arch.sbi_context;
+   const struct kvm_riscv_sbi_extension_entry *entry;
+   const struct kvm_vcpu_sbi_extension *ext;
+   int idx, i;
+
+   for (i = 0; i < ARRAY_SIZE(sbi_ext); i++) {
+   entry = &sbi_ext[i];
+   ext = entry->ext_ptr;
+   idx = entry->ext_idx;
+
+   if (idx < 0 || idx >= ARRAY_SIZE(scontext->ext_status))
+   continue;
+
+   if (scontext->ext_status[idx] == 
KVM_RISCV_SBI_EXT_STATUS_UNAVAILABLE ||
+   !ext->deinit)
+   continue;
+
+   ext->deinit(vcpu);
}
 }
-- 
2.47.2




[PATCH v4 02/18] riscv: sbi: add new SBI error mappings

2025-03-17 Thread Clément Léger
A few new errors have been added with SBI V3.0; map them as closely as
possible to errno values.

Signed-off-by: Clément Léger 
---
 arch/riscv/include/asm/sbi.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index bb077d0c912f..d11d22717b49 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -536,11 +536,20 @@ static inline int sbi_err_map_linux_errno(int err)
case SBI_SUCCESS:
return 0;
case SBI_ERR_DENIED:
+   case SBI_ERR_DENIED_LOCKED:
return -EPERM;
case SBI_ERR_INVALID_PARAM:
+   case SBI_ERR_INVALID_STATE:
+   case SBI_ERR_BAD_RANGE:
return -EINVAL;
case SBI_ERR_INVALID_ADDRESS:
return -EFAULT;
+   case SBI_ERR_NO_SHMEM:
+   return -ENOMEM;
+   case SBI_ERR_TIMEOUT:
+   return -ETIME;
+   case SBI_ERR_IO:
+   return -EIO;
case SBI_ERR_NOT_SUPPORTED:
case SBI_ERR_FAILURE:
default:
-- 
2.47.2




[PATCH v4 04/18] riscv: sbi: add SBI FWFT extension calls

2025-03-17 Thread Clément Léger
Add the FWFT extension calls. This extension will be ratified with SBI
V3.0; hence, it is provided as a separate commit that can be left out if
needed.

Signed-off-by: Clément Léger 
---
 arch/riscv/kernel/sbi.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/kernel/sbi.c b/arch/riscv/kernel/sbi.c
index d41a5642be24..54d9ceb7b723 100644
--- a/arch/riscv/kernel/sbi.c
+++ b/arch/riscv/kernel/sbi.c
@@ -299,6 +299,8 @@ static int __sbi_rfence_v02(int fid, const struct cpumask 
*cpu_mask,
return 0;
 }
 
+static bool sbi_fwft_supported;
+
 /**
  * sbi_fwft_get() - Get a feature for the local hart
  * @feature: The feature ID to get
@@ -308,7 +310,15 @@ static int __sbi_rfence_v02(int fid, const struct cpumask 
*cpu_mask,
  */
 int sbi_fwft_get(u32 feature, unsigned long *value)
 {
-   return -EOPNOTSUPP;
+   struct sbiret ret;
+
+   if (!sbi_fwft_supported)
+   return -EOPNOTSUPP;
+
+   ret = sbi_ecall(SBI_EXT_FWFT, SBI_EXT_FWFT_GET,
+   feature, 0, 0, 0, 0, 0);
+
+   if (ret.error)
+   return sbi_err_map_linux_errno(ret.error);
+
+   *value = ret.value;
+
+   return 0;
 }
 
 /**
@@ -321,7 +331,15 @@ int sbi_fwft_get(u32 feature, unsigned long *value)
  */
 int sbi_fwft_set(u32 feature, unsigned long value, unsigned long flags)
 {
-   return -EOPNOTSUPP;
+   struct sbiret ret;
+
+   if (!sbi_fwft_supported)
+   return -EOPNOTSUPP;
+
+   ret = sbi_ecall(SBI_EXT_FWFT, SBI_EXT_FWFT_SET,
+   feature, value, flags, 0, 0, 0);
+
+   return sbi_err_map_linux_errno(ret.error);
 }
 
 struct fwft_set_req {
@@ -360,6 +378,9 @@ int sbi_fwft_local_set_cpumask(const cpumask_t *mask, u32 
feature,
.error = ATOMIC_INIT(0),
};
 
+   if (!sbi_fwft_supported)
+   return -EOPNOTSUPP;
+
if (feature & SBI_FWFT_GLOBAL_FEATURE_BIT)
return -EINVAL;
 
@@ -691,6 +712,11 @@ void __init sbi_init(void)
pr_info("SBI DBCN extension detected\n");
sbi_debug_console_available = true;
}
+   if ((sbi_spec_version >= sbi_mk_version(3, 0)) &&
+   (sbi_probe_extension(SBI_EXT_FWFT) > 0)) {
+   pr_info("SBI FWFT extension detected\n");
+   sbi_fwft_supported = true;
+   }
} else {
__sbi_set_timer = __sbi_set_timer_v01;
__sbi_send_ipi  = __sbi_send_ipi_v01;
-- 
2.47.2
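
(With the returned value propagated -- see the fix in sbi_fwft_get()
above, which the original hunk was missing -- a caller sketch looks
like:)

static void __init report_misaligned_deleg_sketch(void)
{
	unsigned long val;

	if (!sbi_fwft_get(SBI_FWFT_MISALIGNED_EXC_DELEG, &val))
		pr_info("misaligned exception delegation: %lu\n", val);
}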




[PATCH v4 01/18] riscv: add Firmware Feature (FWFT) SBI extensions definitions

2025-03-17 Thread Clément Léger
The Firmware Features extension (FWFT) was added as part of the SBI 3.0
specification. Add SBI definitions to use this extension.

Signed-off-by: Clément Léger 
Reviewed-by: Samuel Holland 
Tested-by: Samuel Holland 
Reviewed-by: Deepak Gupta 
Reviewed-by: Andrew Jones 
---
 arch/riscv/include/asm/sbi.h | 33 +
 1 file changed, 33 insertions(+)

diff --git a/arch/riscv/include/asm/sbi.h b/arch/riscv/include/asm/sbi.h
index 3d250824178b..bb077d0c912f 100644
--- a/arch/riscv/include/asm/sbi.h
+++ b/arch/riscv/include/asm/sbi.h
@@ -35,6 +35,7 @@ enum sbi_ext_id {
SBI_EXT_DBCN = 0x4442434E,
SBI_EXT_STA = 0x535441,
SBI_EXT_NACL = 0x4E41434C,
+   SBI_EXT_FWFT = 0x46574654,
 
/* Experimentals extensions must lie within this range */
SBI_EXT_EXPERIMENTAL_START = 0x0800,
@@ -402,6 +403,33 @@ enum sbi_ext_nacl_feature {
 #define SBI_NACL_SHMEM_SRET_X(__i) ((__riscv_xlen / 8) * (__i))
 #define SBI_NACL_SHMEM_SRET_X_LAST 31
 
+/* SBI function IDs for FW feature extension */
+#define SBI_EXT_FWFT_SET   0x0
+#define SBI_EXT_FWFT_GET   0x1
+
+enum sbi_fwft_feature_t {
+   SBI_FWFT_MISALIGNED_EXC_DELEG   = 0x0,
+   SBI_FWFT_LANDING_PAD= 0x1,
+   SBI_FWFT_SHADOW_STACK   = 0x2,
+   SBI_FWFT_DOUBLE_TRAP= 0x3,
+   SBI_FWFT_PTE_AD_HW_UPDATING = 0x4,
+   SBI_FWFT_POINTER_MASKING_PMLEN  = 0x5,
+   SBI_FWFT_LOCAL_RESERVED_START   = 0x6,
+   SBI_FWFT_LOCAL_RESERVED_END = 0x3fff,
+   SBI_FWFT_LOCAL_PLATFORM_START   = 0x4000,
+   SBI_FWFT_LOCAL_PLATFORM_END = 0x7fff,
+
+   SBI_FWFT_GLOBAL_RESERVED_START  = 0x8000,
+   SBI_FWFT_GLOBAL_RESERVED_END= 0xbfff,
+   SBI_FWFT_GLOBAL_PLATFORM_START  = 0xc000,
+   SBI_FWFT_GLOBAL_PLATFORM_END= 0x,
+};
+
+#define SBI_FWFT_PLATFORM_FEATURE_BIT  BIT(30)
+#define SBI_FWFT_GLOBAL_FEATURE_BITBIT(31)
+
+#define SBI_FWFT_SET_FLAG_LOCK BIT(0)
+
 /* SBI spec version fields */
 #define SBI_SPEC_VERSION_DEFAULT   0x1
 #define SBI_SPEC_VERSION_MAJOR_SHIFT   24
@@ -419,6 +447,11 @@ enum sbi_ext_nacl_feature {
 #define SBI_ERR_ALREADY_STARTED -7
 #define SBI_ERR_ALREADY_STOPPED -8
 #define SBI_ERR_NO_SHMEM   -9
+#define SBI_ERR_INVALID_STATE  -10
+#define SBI_ERR_BAD_RANGE  -11
+#define SBI_ERR_TIMEOUT-12
+#define SBI_ERR_IO -13
+#define SBI_ERR_DENIED_LOCKED  -14
 
 extern unsigned long sbi_spec_version;
 struct sbiret {
-- 
2.47.2
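
(The feature ID space is partitioned by the top two bits; a small
classification sketch, illustrative rather than part of the patch:)

static inline bool sbi_fwft_feature_is_global(u32 feature)
{
	/* Bit 31 set: feature applies to all harts at once */
	return !!(feature & SBI_FWFT_GLOBAL_FEATURE_BIT);
}

static inline bool sbi_fwft_feature_is_platform(u32 feature)
{
	/* Bit 30 set: platform-specific feature range */
	return !!(feature & SBI_FWFT_PLATFORM_FEATURE_BIT);
}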




[PATCH v4 07/18] riscv: misaligned: use correct CONFIG_ ifdef for misaligned_access_speed

2025-03-17 Thread Clément Léger
misaligned_access_speed is defined under CONFIG_RISCV_SCALAR_MISALIGNED
but was used under CONFIG_RISCV_PROBE_UNALIGNED_ACCESS. Fix that by
using the correct config option.

Signed-off-by: Clément Léger 
Reviewed-by: Andrew Jones 
---
 arch/riscv/kernel/traps_misaligned.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/riscv/kernel/traps_misaligned.c 
b/arch/riscv/kernel/traps_misaligned.c
index 4584f2e1d39d..8175b3449b73 100644
--- a/arch/riscv/kernel/traps_misaligned.c
+++ b/arch/riscv/kernel/traps_misaligned.c
@@ -362,7 +362,7 @@ static int handle_scalar_misaligned_load(struct pt_regs 
*regs)
 
perf_sw_event(PERF_COUNT_SW_ALIGNMENT_FAULTS, 1, regs, addr);
 
-#ifdef CONFIG_RISCV_PROBE_UNALIGNED_ACCESS
+#ifdef CONFIG_RISCV_SCALAR_MISALIGNED
*this_cpu_ptr(&misaligned_access_speed) = 
RISCV_HWPROBE_MISALIGNED_SCALAR_EMULATED;
 #endif
 
-- 
2.47.2




[PATCH v4 06/18] riscv: misaligned: use on_each_cpu() for scalar misaligned access probing

2025-03-17 Thread Clément Léger
schedule_on_each_cpu() was used without any good reason, while being
documented as very slow. This call was in the boot path, so better use
on_each_cpu() for the scalar misaligned checking. The vector misaligned
check still needs to use schedule_on_each_cpu() since it requires irqs
to be enabled, but that's less of a problem since this code is run in a
kthread. Add a comment to make that explicit.

Signed-off-by: Clément Léger 
Reviewed-by: Andrew Jones 
---
 arch/riscv/kernel/traps_misaligned.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/kernel/traps_misaligned.c 
b/arch/riscv/kernel/traps_misaligned.c
index fa7f100b95bd..4584f2e1d39d 100644
--- a/arch/riscv/kernel/traps_misaligned.c
+++ b/arch/riscv/kernel/traps_misaligned.c
@@ -616,6 +616,10 @@ bool check_vector_unaligned_access_emulated_all_cpus(void)
return false;
}
 
+   /*
+* While being documented as very slow, schedule_on_each_cpu() is used 
since
+* kernel_vector_begin() expects irqs to be enabled or it will panic()
+*/
schedule_on_each_cpu(check_vector_unaligned_access_emulated);
 
for_each_online_cpu(cpu)
@@ -636,7 +640,7 @@ bool check_vector_unaligned_access_emulated_all_cpus(void)
 
 static bool unaligned_ctl __read_mostly;
 
-static void check_unaligned_access_emulated(struct work_struct *work 
__always_unused)
+static void check_unaligned_access_emulated(void *arg __always_unused)
 {
int cpu = smp_processor_id();
long *mas_ptr = per_cpu_ptr(&misaligned_access_speed, cpu);
@@ -677,7 +681,7 @@ bool check_unaligned_access_emulated_all_cpus(void)
 * accesses emulated since tasks requesting such control can run on any
 * CPU.
 */
-   schedule_on_each_cpu(check_unaligned_access_emulated);
+   on_each_cpu(check_unaligned_access_emulated, NULL, 1);
 
for_each_online_cpu(cpu)
if (per_cpu(misaligned_access_speed, cpu)
-- 
2.47.2




[PATCH v4 11/18] riscv: misaligned: enable IRQs while handling misaligned accesses

2025-03-17 Thread Clément Léger
We can safely re-enable IRQs if they were enabled in the previous
context. This allows accessing user memory that could potentially
trigger a page fault.

Signed-off-by: Clément Léger 
---
 arch/riscv/kernel/traps.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/riscv/kernel/traps.c b/arch/riscv/kernel/traps.c
index 55d9f3450398..3eecc2addc41 100644
--- a/arch/riscv/kernel/traps.c
+++ b/arch/riscv/kernel/traps.c
@@ -206,6 +206,11 @@ enum misaligned_access_type {
 static void do_trap_misaligned(struct pt_regs *regs, enum 
misaligned_access_type type)
 {
irqentry_state_t state = irqentry_enter(regs);
+   bool enable_irqs = !regs_irqs_disabled(regs);
+
+   /* Enable interrupts if they were enabled in the interrupted context. */
+   if (enable_irqs)
+   local_irq_enable();
 
if (type ==  MISALIGNED_LOAD) {
if (handle_misaligned_load(regs))
@@ -217,6 +222,9 @@ static void do_trap_misaligned(struct pt_regs *regs, enum 
misaligned_access_type
  "Oops - store (or AMO) address 
misaligned");
}
 
+   if (enable_irqs)
+   local_irq_disable();
+
irqentry_exit(regs, state);
 }
 
-- 
2.47.2




[PATCH v4 14/18] selftests: riscv: add misaligned access testing

2025-03-17 Thread Clément Léger
Now that the kernel can emulate misaligned accesses and control their
behavior, add a selftest for that. This selftest tests all the currently
emulated instructions (except for the RV32 compressed ones, which are
left as a future exercise for an RV32 user). For the FPU instructions,
all the FPU registers are tested.

Signed-off-by: Clément Léger 
---
 .../selftests/riscv/misaligned/.gitignore |   1 +
 .../selftests/riscv/misaligned/Makefile   |  12 +
 .../selftests/riscv/misaligned/common.S   |  33 +++
 .../testing/selftests/riscv/misaligned/fpu.S  | 180 +
 tools/testing/selftests/riscv/misaligned/gp.S | 103 +++
 .../selftests/riscv/misaligned/misaligned.c   | 254 ++
 6 files changed, 583 insertions(+)
 create mode 100644 tools/testing/selftests/riscv/misaligned/.gitignore
 create mode 100644 tools/testing/selftests/riscv/misaligned/Makefile
 create mode 100644 tools/testing/selftests/riscv/misaligned/common.S
 create mode 100644 tools/testing/selftests/riscv/misaligned/fpu.S
 create mode 100644 tools/testing/selftests/riscv/misaligned/gp.S
 create mode 100644 tools/testing/selftests/riscv/misaligned/misaligned.c

diff --git a/tools/testing/selftests/riscv/misaligned/.gitignore 
b/tools/testing/selftests/riscv/misaligned/.gitignore
new file mode 100644
index ..5eff15a1f981
--- /dev/null
+++ b/tools/testing/selftests/riscv/misaligned/.gitignore
@@ -0,0 +1 @@
+misaligned
diff --git a/tools/testing/selftests/riscv/misaligned/Makefile 
b/tools/testing/selftests/riscv/misaligned/Makefile
new file mode 100644
index ..1aa40110c50d
--- /dev/null
+++ b/tools/testing/selftests/riscv/misaligned/Makefile
@@ -0,0 +1,12 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2021 ARM Limited
+# Originally tools/testing/arm64/abi/Makefile
+
+CFLAGS += -I$(top_srcdir)/tools/include
+
+TEST_GEN_PROGS := misaligned
+
+include ../../lib.mk
+
+$(OUTPUT)/misaligned: misaligned.c fpu.S gp.S
+   $(CC) -g3 -static -o$@ -march=rv64imafdc $(CFLAGS) $(LDFLAGS) $^
diff --git a/tools/testing/selftests/riscv/misaligned/common.S 
b/tools/testing/selftests/riscv/misaligned/common.S
new file mode 100644
index ..8fa00035bd5d
--- /dev/null
+++ b/tools/testing/selftests/riscv/misaligned/common.S
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2025 Rivos Inc.
+ *
+ * Authors:
+ * Clément Léger 
+ */
+
+.macro lb_sb temp, offset, src, dst
+   lb \temp, \offset(\src)
+   sb \temp, \offset(\dst)
+.endm
+
+.macro copy_long_to temp, src, dst
+   lb_sb \temp, 0, \src, \dst,
+   lb_sb \temp, 1, \src, \dst,
+   lb_sb \temp, 2, \src, \dst,
+   lb_sb \temp, 3, \src, \dst,
+   lb_sb \temp, 4, \src, \dst,
+   lb_sb \temp, 5, \src, \dst,
+   lb_sb \temp, 6, \src, \dst,
+   lb_sb \temp, 7, \src, \dst,
+.endm
+
+.macro sp_stack_prologue offset
+   addi sp, sp, -8
+   sub sp, sp, \offset
+.endm
+
+.macro sp_stack_epilogue offset
+   add sp, sp, \offset
+   addi sp, sp, 8
+.endm
diff --git a/tools/testing/selftests/riscv/misaligned/fpu.S 
b/tools/testing/selftests/riscv/misaligned/fpu.S
new file mode 100644
index ..d008bff58310
--- /dev/null
+++ b/tools/testing/selftests/riscv/misaligned/fpu.S
@@ -0,0 +1,180 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2025 Rivos Inc.
+ *
+ * Authors:
+ * Clément Léger 
+ */
+
+#include "common.S"
+
+#define CASE_ALIGN 4
+
+.macro fpu_load_inst fpreg, inst, precision, load_reg
+.align CASE_ALIGN
+   \inst \fpreg, 0(\load_reg)
+   fmv.\precision fa0, \fpreg
+   j 2f
+.endm
+
+#define flw(__fpreg) fpu_load_inst __fpreg, flw, s, s1
+#define fld(__fpreg) fpu_load_inst __fpreg, fld, d, s1
+#define c_flw(__fpreg) fpu_load_inst __fpreg, c.flw, s, s1
+#define c_fld(__fpreg) fpu_load_inst __fpreg, c.fld, d, s1
+#define c_fldsp(__fpreg) fpu_load_inst __fpreg, c.fldsp, d, sp
+
+.macro fpu_store_inst fpreg, inst, precision, store_reg
+.align CASE_ALIGN
+   fmv.\precision \fpreg, fa0
+   \inst \fpreg, 0(\store_reg)
+   j 2f
+.endm
+
+#define fsw(__fpreg) fpu_store_inst __fpreg, fsw, s, s1
+#define fsd(__fpreg) fpu_store_inst __fpreg, fsd, d, s1
+#define c_fsw(__fpreg) fpu_store_inst __fpreg, c.fsw, s, s1
+#define c_fsd(__fpreg) fpu_store_inst __fpreg, c.fsd, d, s1
+#define c_fsdsp(__fpreg) fpu_store_inst __fpreg, c.fsdsp, d, sp
+
+.macro fp_test_prologue
+   move s1, a1
+   /*
+* Compute jump offset to store the correct FP register since we don't
+* have indirect FP register access (or at least we don't use this
+* extension so that works on all archs)
+*/
+   sll t0, a0, CASE_ALIGN
+   la t2, 1f
+   add t0, t0, t2
+   jr t0
+.align CASE_ALIGN
+1:
+.endm
+
+.macro fp_test_prologue_compressed
+   /* FP registers for compressed instructions starts from 8 to 16 */
+   addi a0, a0, -8
+   fp_test_prologue
+.endm
+
+#define fp_test

[PATCH v4 18/18] RISC-V: KVM: add support for SBI_FWFT_MISALIGNED_DELEG

2025-03-17 Thread Clément Léger
SBI_FWFT_MISALIGNED_EXC_DELEG needs hedeleg to be modified to delegate
misaligned load/store exceptions. Save and restore it during vCPU
load/put.

Signed-off-by: Clément Léger 
Reviewed-by: Deepak Gupta 
Reviewed-by: Andrew Jones 
---
 arch/riscv/kvm/vcpu.c  |  3 +++
 arch/riscv/kvm/vcpu_sbi_fwft.c | 36 ++
 2 files changed, 39 insertions(+)

diff --git a/arch/riscv/kvm/vcpu.c b/arch/riscv/kvm/vcpu.c
index 542747e2c7f5..d98e379945c3 100644
--- a/arch/riscv/kvm/vcpu.c
+++ b/arch/riscv/kvm/vcpu.c
@@ -646,6 +646,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
void *nsh;
struct kvm_vcpu_csr *csr = &vcpu->arch.guest_csr;
+   struct kvm_vcpu_config *cfg = &vcpu->arch.cfg;
 
vcpu->cpu = -1;
 
@@ -671,6 +672,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
csr->vstval = nacl_csr_read(nsh, CSR_VSTVAL);
csr->hvip = nacl_csr_read(nsh, CSR_HVIP);
csr->vsatp = nacl_csr_read(nsh, CSR_VSATP);
+   cfg->hedeleg = nacl_csr_read(nsh, CSR_HEDELEG);
} else {
csr->vsstatus = csr_read(CSR_VSSTATUS);
csr->vsie = csr_read(CSR_VSIE);
@@ -681,6 +683,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
csr->vstval = csr_read(CSR_VSTVAL);
csr->hvip = csr_read(CSR_HVIP);
csr->vsatp = csr_read(CSR_VSATP);
+   cfg->hedeleg = csr_read(CSR_HEDELEG);
}
 }
 
diff --git a/arch/riscv/kvm/vcpu_sbi_fwft.c b/arch/riscv/kvm/vcpu_sbi_fwft.c
index 8a7cfe1fe7a7..b0556d66e775 100644
--- a/arch/riscv/kvm/vcpu_sbi_fwft.c
+++ b/arch/riscv/kvm/vcpu_sbi_fwft.c
@@ -14,6 +14,8 @@
 #include 
 #include 
 
+#define MIS_DELEG (BIT_ULL(EXC_LOAD_MISALIGNED) | BIT_ULL(EXC_STORE_MISALIGNED))
+
 struct kvm_sbi_fwft_feature {
/**
 * @id: Feature ID
@@ -68,7 +70,41 @@ static bool kvm_fwft_is_defined_feature(enum sbi_fwft_feature_t feature)
return false;
 }
 
+static bool kvm_sbi_fwft_misaligned_delegation_supported(struct kvm_vcpu *vcpu)
+{
+   return misaligned_traps_can_delegate();
+}
+
+static long kvm_sbi_fwft_set_misaligned_delegation(struct kvm_vcpu *vcpu,
+   struct kvm_sbi_fwft_config *conf,
+   unsigned long value)
+{
+   if (value == 1)
+   csr_set(CSR_HEDELEG, MIS_DELEG);
+   else if (value == 0)
+   csr_clear(CSR_HEDELEG, MIS_DELEG);
+   else
+   return SBI_ERR_INVALID_PARAM;
+
+   return SBI_SUCCESS;
+}
+
+static long kvm_sbi_fwft_get_misaligned_delegation(struct kvm_vcpu *vcpu,
+   struct kvm_sbi_fwft_config *conf,
+   unsigned long *value)
+{
+   *value = (csr_read(CSR_HEDELEG) & MIS_DELEG) != 0;
+
+   return SBI_SUCCESS;
+}
+
 static const struct kvm_sbi_fwft_feature features[] = {
+   {
+   .id = SBI_FWFT_MISALIGNED_EXC_DELEG,
+   .supported = kvm_sbi_fwft_misaligned_delegation_supported,
+   .set = kvm_sbi_fwft_set_misaligned_delegation,
+   .get = kvm_sbi_fwft_get_misaligned_delegation,
+   },
 };
 
 static struct kvm_sbi_fwft_config *
-- 
2.47.2

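
As a usage illustration (not part of the patch): a guest enabling this
delegation would go through the SBI FWFT set-feature call. A minimal sketch,
assuming the extension/function IDs from the SBI FWFT specification and the
arch/riscv sbi_ecall()/sbi_err_map_linux_errno() helpers from <asm/sbi.h>:

/* Hedged sketch of the guest-side call; "FWFT" = 0x46574654, set = FID 0
 * per the SBI spec. These IDs are assumptions, not taken from this patch.
 */
static int guest_enable_misaligned_deleg(void)
{
	struct sbiret ret;

	/* value == 1 makes the KVM handler above set MIS_DELEG in hedeleg */
	ret = sbi_ecall(SBI_EXT_FWFT, SBI_EXT_FWFT_SET,
			SBI_FWFT_MISALIGNED_EXC_DELEG, 1, 0, 0, 0, 0);

	return sbi_err_map_linux_errno(ret.error);
}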



Re: [PATCH v8 12/14] iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster

2025-03-17 Thread Nicolin Chen
On Mon, Mar 17, 2025 at 12:44:23PM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 11, 2025 at 10:43:08AM -0700, Nicolin Chen wrote:
> > > > +int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
> > > > +   struct arm_smmu_nested_domain *nested_domain)
> > > > +{
> > > > +   struct arm_smmu_vmaster *vmaster;
> > > > +   unsigned long vsid;
> > > > +   int ret;
> > > > +
> > > > +   iommu_group_mutex_assert(state->master->dev);
> > > > +
> > > > +   /* Skip invalid vSTE */
> > > > +   if (!(nested_domain->ste[0] & cpu_to_le64(STRTAB_STE_0_V)))
> > > > +   return 0;
> > > 
> > > Ok, and we don't need to set 'state->vmaster' in this case because we
> > > only report stage-1 faults back to the vSMMU?
> > 
> > This is a good question that I didn't ask myself hard enough..
> > 
> > I think we should probably drop it. An invalid STE should trigger
> > a C_BAD_STE event that is in the supported vEVENT list. I'll run
> > some test before removing this line from v9.
> 
> It won't trigger C_BAD_STE, recall Robin was opposed to that, so we have this:
> 
> static void arm_smmu_make_nested_domain_ste(
>   struct arm_smmu_ste *target, struct arm_smmu_master *master,
>   struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
> {
>   unsigned int cfg =
>   FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(nested_domain->ste[0]));
> 
>   /*
>* Userspace can request a non-valid STE through the nesting interface.
>* We relay that into an abort physical STE with the intention that
>* C_BAD_STE for this SID can be generated to userspace.
>*/
>   if (!(nested_domain->ste[0] & cpu_to_le64(STRTAB_STE_0_V)))
>   cfg = STRTAB_STE_0_CFG_ABORT;
> 
> So, in the case of a non-valid STE, and a device access, the HW will
> generate one of the translation faults and that will be forwarded.
> 
> Some software component will have to transform those fault events into
> C_BAD_STE for the VM.

Hmm, double-checked the spec. It does say that C_BAD_STE would be
triggered:

" V, bit [0] STE Valid.
  [...]
  Device transactions that select an STE with this field configured
  to 0 are terminated with an abort reported back to the device and
  a C_BAD_STE event is recorded."

I also did a hack test unsetting the V bit in the kernel. Then, the
HW did report C_BAD_STE (0x4) back to the VM (via vEVENTQ).

Thanks
Nicolin



Re: [PATCH v9 00/14] iommufd: Add vIOMMU infrastructure (Part-3: vEVENTQ)

2025-03-17 Thread Jason Gunthorpe
On Tue, Mar 11, 2025 at 12:44:18PM -0700, Nicolin Chen wrote:
> As the vIOMMU infrastructure series part-3, this introduces a new vEVENTQ
> object. The existing FAULT object provides a nice notification pathway to
> the user space with a queue already, so let vEVENTQ reuse that.
> 
> Mimicking the HWPT structure, add a common EVENTQ structure to support its
> derivatives: IOMMUFD_OBJ_FAULT (existing) and IOMMUFD_OBJ_VEVENTQ (new).
> 
> An IOMMUFD_CMD_VEVENTQ_ALLOC is introduced to allocate vEVENTQ object for
> vIOMMUs. One vIOMMU can have multiple vEVENTQs of different types, but it
> cannot have multiple vEVENTQs of the same type.
> 
> The forwarding part is fairly simple but might need to replace a physical
> device ID with a virtual device ID in a driver-level event data structure.
> So, this also adds some helpers for drivers to use.
> 
> As usual, this series comes with the selftest coverage for this new ioctl
> and with a real world use case in the ARM SMMUv3 driver.

> Nicolin Chen (14):
>   iommufd/fault: Move two fault functions out of the header
>   iommufd/fault: Add an iommufd_fault_init() helper
>   iommufd: Abstract an iommufd_eventq from iommufd_fault
>   iommufd: Rename fault.c to eventq.c
>   iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
>   iommufd/viommu: Add iommufd_viommu_get_vdev_id helper
>   iommufd/viommu: Add iommufd_viommu_report_event helper
>   iommufd/selftest: Require vdev_id when attaching to a nested domain
>   iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_VEVENT for vEVENTQ
> coverage
>   iommufd/selftest: Add IOMMU_VEVENTQ_ALLOC test coverage
>   Documentation: userspace-api: iommufd: Update FAULT and VEVENTQ
>   iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster
>   iommu/arm-smmu-v3: Report events that belong to devices attached to
> vIOMMU
>   iommu/arm-smmu-v3: Set MEV bit in nested STE for DoS mitigations

Applied, thanks

Jason
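
For orientation, a rough userspace sketch of the allocation flow the cover
letter describes. The struct fields below are my reading of the series' uapi
and should be treated as assumptions; the merged header in
include/uapi/linux/iommufd.h is authoritative.

#include <sys/ioctl.h>
#include <err.h>
#include <linux/iommufd.h>

/* Hedged sketch: allocate a vEVENTQ on an existing vIOMMU object. */
static int alloc_veventq(int iommufd, __u32 viommu_id)
{
	struct iommu_veventq_alloc cmd = {
		.size = sizeof(cmd),
		.viommu_id = viommu_id,			/* parent vIOMMU */
		.type = IOMMU_VEVENTQ_TYPE_ARM_SMMUV3,	/* one queue per type */
		.veventq_depth = 64,			/* queue capacity */
	};

	if (ioctl(iommufd, IOMMU_VEVENTQ_ALLOC, &cmd))
		err(1, "IOMMU_VEVENTQ_ALLOC");

	/* forwarded events are then read from this fd */
	return cmd.out_veventq_fd;
}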



Re: [PATCH v8 12/14] iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster

2025-03-17 Thread Jason Gunthorpe
On Mon, Mar 17, 2025 at 11:49:14AM -0700, Nicolin Chen wrote:
> On Mon, Mar 17, 2025 at 12:44:23PM -0300, Jason Gunthorpe wrote:
> > On Tue, Mar 11, 2025 at 10:43:08AM -0700, Nicolin Chen wrote:
> > > > > +int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
> > > > > + struct arm_smmu_nested_domain *nested_domain)
> > > > > +{
> > > > > + struct arm_smmu_vmaster *vmaster;
> > > > > + unsigned long vsid;
> > > > > + int ret;
> > > > > +
> > > > > + iommu_group_mutex_assert(state->master->dev);
> > > > > +
> > > > > + /* Skip invalid vSTE */
> > > > > + if (!(nested_domain->ste[0] & cpu_to_le64(STRTAB_STE_0_V)))
> > > > > + return 0;
> > > > 
> > > > Ok, and we don't need to set 'state->vmaster' in this case because we
> > > > only report stage-1 faults back to the vSMMU?
> > > 
> > > This is a good question that I didn't ask myself hard enough..
> > > 
> > > I think we should probably drop it. An invalid STE should trigger
> > > a C_BAD_STE event that is in the supported vEVENT list. I'll run
> > > some test before removing this line from v9.
> > 
> > It won't trigger C_BAD_STE, recall Robin was opposed to that, so we have
> > this:
> > 
> > static void arm_smmu_make_nested_domain_ste(
> > struct arm_smmu_ste *target, struct arm_smmu_master *master,
> > struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
> > {
> > unsigned int cfg =
> > FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(nested_domain->ste[0]));
> > 
> > /*
> >  * Userspace can request a non-valid STE through the nesting interface.
> >  * We relay that into an abort physical STE with the intention that
> >  * C_BAD_STE for this SID can be generated to userspace.
> >  */
> > if (!(nested_domain->ste[0] & cpu_to_le64(STRTAB_STE_0_V)))
> > cfg = STRTAB_STE_0_CFG_ABORT;
> > 
> > So, in the case of a non-valid STE, and a device access, the HW will
> > generate one of the translation faults and that will be forwarded.
> > 
> > Some software component will have to transform those fault events into
> > C_BAD_STE for the VM.
> 
> Hmm, double checked the spec. It does say that C_BAD_STE would be
> triggered:
> 
> " V, bit [0] STE Valid.
>   [...]
>   Device transactions that select an STE with this field configured
>   to 0 are terminated with an abort reported back to the device and
>   a C_BAD_STE event is recorded."
> 
> I also did a hack test unsetting the V bit in the kernel. Then, the
> HW did report C_BAD_STE (0x4) back to the VM (via vEVENTQ).

Yes, I expect that C_BAD_STE will forward just fine.

But, as above, it should never be generated by HW because the
hypervisor kernel will never install a bad STE, we detect that and
convert it to abort.

Jason
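
To make that last point concrete, a purely hypothetical VMM-side fragment —
every identifier below except STRTAB_STE_0_V is invented for illustration:
when a forwarded fault arrives for a SID whose vSTE has V=0, the VMM would
synthesize C_BAD_STE (0x4) into the guest's event queue rather than deliver
the raw translation fault it received from the host.

/* Hypothetical VMM fragment; all helper names are invented. */
#define EVT_ID_C_BAD_STE	0x04

static void vmm_forward_smmu_event(struct vsmmu *vsmmu, struct fwd_evt *evt)
{
	uint64_t vste0 = vsmmu_read_ste_word0(vsmmu, evt->sid);

	if (!(vste0 & STRTAB_STE_0_V)) {
		/* The guest installed an invalid STE; the host converted it
		 * to an abort STE, so translate the resulting fault back
		 * into the C_BAD_STE event the guest expects. */
		vsmmu_inject_event(vsmmu, evt->sid, EVT_ID_C_BAD_STE);
		return;
	}
	vsmmu_inject_event_raw(vsmmu, evt);
}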