Re: [PATCH RFC 3/4] barriers: convert a control to a data dependency

2019-01-06 Thread Jason Wang



On 2019/1/3 上午4:57, Michael S. Tsirkin wrote:

It's not uncommon to have two access two unrelated memory locations in a
specific order.  At the moment one has to use a memory barrier for this.

However, if the first access was a read and the second used an address
depending on the first one we would have a data dependency and no
barrier would be necessary.

This adds a new interface: dependent_ptr_mb which does exactly this: it
returns a pointer with a data dependency on the supplied value.

Signed-off-by: Michael S. Tsirkin 
---
  Documentation/memory-barriers.txt | 20 
  arch/alpha/include/asm/barrier.h  |  1 +
  include/asm-generic/barrier.h | 18 ++
  include/linux/compiler.h  |  4 
  4 files changed, 43 insertions(+)

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index c1d913944ad8..9dbaa2e1dbf6 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -691,6 +691,18 @@ case what's actually required is:
p = READ_ONCE(b);
}
  
+Alternatively, a control dependency can be converted to a data dependency,

+e.g.:
+
+   q = READ_ONCE(a);
+   if (q) {
+   b = dependent_ptr_mb(b, q);
+   p = READ_ONCE(b);
+   }
+
+Note how the result of dependent_ptr_mb must be used with the following
+accesses in order to have an effect.
+
  However, stores are not speculated.  This means that ordering -is- provided
  for load-store control dependencies, as in the following example:
  
@@ -836,6 +848,12 @@ out-guess your code.  More generally, although READ_ONCE() does force

  the compiler to actually emit code for a given load, it does not force
  the compiler to use the results.
  
+Converting to a data dependency helps with this too:

+
+   q = READ_ONCE(a);
+   b = dependent_ptr_mb(b, q);
+   WRITE_ONCE(b, 1);
+
  In addition, control dependencies apply only to the then-clause and
  else-clause of the if-statement in question.  In particular, it does
  not necessarily apply to code following the if-statement:
@@ -875,6 +893,8 @@ to the CPU containing it.  See the section on "Multicopy 
atomicity"
  for more information.
  
  
+

+
  In summary:
  
(*) Control dependencies can order prior loads against later stores.

diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
index 92ec486a4f9e..b4934e8c551b 100644
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -59,6 +59,7 @@
   * as Alpha, "y" could be set to 3 and "x" to 0.  Use rmb()
   * in cases like this where there are no data dependencies.
   */
+#define ARCH_NEEDS_READ_BARRIER_DEPENDS 1
  #define read_barrier_depends() __asm__ __volatile__("mb": : :"memory")
  
  #ifdef CONFIG_SMP

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 2cafdbb9ae4c..fa2e2ef72b68 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -70,6 +70,24 @@
  #define __smp_read_barrier_depends()  read_barrier_depends()
  #endif
  
+#if defined(COMPILER_HAS_OPTIMIZER_HIDE_VAR) && \

+   !defined(ARCH_NEEDS_READ_BARRIER_DEPENDS)
+
+#define dependent_ptr_mb(ptr, val) ({  \
+   long dependent_ptr_mb_val = (long)(val);\
+   long dependent_ptr_mb_ptr = (long)(ptr) - dependent_ptr_mb_val; \
+   \
+   BUILD_BUG_ON(sizeof(val) > sizeof(long));\
+   OPTIMIZER_HIDE_VAR(dependent_ptr_mb_val);   \
+   (typeof(ptr))(dependent_ptr_mb_ptr + dependent_ptr_mb_val); \
+})
+
+#else
+
+#define dependent_ptr_mb(ptr, val) ({ mb(); (ptr); })



So for the example of patch 4, we'd better fall back to rmb() or need a 
dependent_ptr_rmb()?


Thanks



+
+#endif
+
  #ifdef CONFIG_SMP
  
  #ifndef smp_mb

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 6601d39e8c48..f599c30f1b28 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -152,9 +152,13 @@ void ftrace_likely_update(struct ftrace_likely_data *f, 
int val,
  #endif
  
  #ifndef OPTIMIZER_HIDE_VAR

+
  /* Make the optimizer believe the variable can be manipulated arbitrarily. */
  #define OPTIMIZER_HIDE_VAR(var)   
\
__asm__ ("" : "=rm" (var) : "0" (var))
+
+#define COMPILER_HAS_OPTIMIZER_HIDE_VAR 1
+
  #endif
  
  /* Not-quite-unique ID. */


Re: [PATCH RFC 3/4] barriers: convert a control to a data dependency

2019-01-06 Thread Jason Wang



On 2019/1/7 下午12:23, Michael S. Tsirkin wrote:

On Mon, Jan 07, 2019 at 11:58:23AM +0800, Jason Wang wrote:

On 2019/1/3 上午4:57, Michael S. Tsirkin wrote:

It's not uncommon to have two access two unrelated memory locations in a
specific order.  At the moment one has to use a memory barrier for this.

However, if the first access was a read and the second used an address
depending on the first one we would have a data dependency and no
barrier would be necessary.

This adds a new interface: dependent_ptr_mb which does exactly this: it
returns a pointer with a data dependency on the supplied value.

Signed-off-by: Michael S. Tsirkin
---
   Documentation/memory-barriers.txt | 20 
   arch/alpha/include/asm/barrier.h  |  1 +
   include/asm-generic/barrier.h | 18 ++
   include/linux/compiler.h  |  4 
   4 files changed, 43 insertions(+)

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index c1d913944ad8..9dbaa2e1dbf6 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -691,6 +691,18 @@ case what's actually required is:
p = READ_ONCE(b);
}
+Alternatively, a control dependency can be converted to a data dependency,
+e.g.:
+
+   q = READ_ONCE(a);
+   if (q) {
+   b = dependent_ptr_mb(b, q);
+   p = READ_ONCE(b);
+   }
+
+Note how the result of dependent_ptr_mb must be used with the following
+accesses in order to have an effect.
+
   However, stores are not speculated.  This means that ordering -is- provided
   for load-store control dependencies, as in the following example:
@@ -836,6 +848,12 @@ out-guess your code.  More generally, although READ_ONCE() 
does force
   the compiler to actually emit code for a given load, it does not force
   the compiler to use the results.
+Converting to a data dependency helps with this too:
+
+   q = READ_ONCE(a);
+   b = dependent_ptr_mb(b, q);
+   WRITE_ONCE(b, 1);
+
   In addition, control dependencies apply only to the then-clause and
   else-clause of the if-statement in question.  In particular, it does
   not necessarily apply to code following the if-statement:
@@ -875,6 +893,8 @@ to the CPU containing it.  See the section on "Multicopy 
atomicity"
   for more information.
+
+
   In summary:
 (*) Control dependencies can order prior loads against later stores.
diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
index 92ec486a4f9e..b4934e8c551b 100644
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -59,6 +59,7 @@
* as Alpha, "y" could be set to 3 and "x" to 0.  Use rmb()
* in cases like this where there are no data dependencies.
*/
+#define ARCH_NEEDS_READ_BARRIER_DEPENDS 1
   #define read_barrier_depends() __asm__ __volatile__("mb": : :"memory")
   #ifdef CONFIG_SMP
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 2cafdbb9ae4c..fa2e2ef72b68 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -70,6 +70,24 @@
   #define __smp_read_barrier_depends() read_barrier_depends()
   #endif
+#if defined(COMPILER_HAS_OPTIMIZER_HIDE_VAR) && \
+   !defined(ARCH_NEEDS_READ_BARRIER_DEPENDS)
+
+#define dependent_ptr_mb(ptr, val) ({  \
+   long dependent_ptr_mb_val = (long)(val);\
+   long dependent_ptr_mb_ptr = (long)(ptr) - dependent_ptr_mb_val; \
+   \
+   BUILD_BUG_ON(sizeof(val) > sizeof(long));\
+   OPTIMIZER_HIDE_VAR(dependent_ptr_mb_val);   \
+   (typeof(ptr))(dependent_ptr_mb_ptr + dependent_ptr_mb_val); \
+})
+
+#else
+
+#define dependent_ptr_mb(ptr, val) ({ mb(); (ptr); })

So for the example of patch 4, we'd better fall back to rmb() or need a
dependent_ptr_rmb()?

Thanks

You mean for strongly ordered architectures like Intel?
Yes, maybe it makes sense to have dependent_ptr_smp_rmb,
dependent_ptr_dma_rmb and dependent_ptr_virt_rmb.

mb variant is unused right now so I'll remove it.




Yes.

Thanks




Re: [PATCH RFC v5 07/10] tun: Introduce virtio-net RSS

2024-10-09 Thread Jason Wang
On Tue, Oct 8, 2024 at 2:55 PM Akihiko Odaki  wrote:
>
> RSS is a receive steering algorithm that can be negotiated to use with
> virtio_net. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF steering program.
>
> Introduce the code to perform RSS to the kernel in order to overcome
> thse challenges. An alternative solution is to extend the eBPF steering
> program so that it will be able to report to the userspace, but I didn't
> opt for it because extending the current mechanism of eBPF steering
> program as is because it relies on legacy context rewriting, and
> introducing kfunc-based eBPF will result in non-UAPI dependency while
> the other relevant virtualization APIs such as KVM and vhost_net are
> UAPIs.
>
> Signed-off-by: Akihiko Odaki 
> ---
>  drivers/net/tap.c   | 11 +-
>  drivers/net/tun.c   | 57 ---
>  drivers/net/tun_vnet.h  | 96 
> +
>  include/linux/if_tap.h  |  4 +-
>  include/uapi/linux/if_tun.h | 27 +
>  5 files changed, 169 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index 5e2fbe63ca47..a58b83285af4 100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -207,6 +207,7 @@ static struct tap_queue *tap_get_queue(struct tap_dev 
> *tap,
>  * racing against queue removal.
>  */
> int numvtaps = READ_ONCE(tap->numvtaps);
> +   struct tun_vnet_hash_container *vnet_hash = 
> rcu_dereference(tap->vnet_hash);
> __u32 rxq;
>
> *tap_add_hash(skb) = (struct virtio_net_hash) { .report = 
> VIRTIO_NET_HASH_REPORT_NONE };
> @@ -217,6 +218,12 @@ static struct tap_queue *tap_get_queue(struct tap_dev 
> *tap,
> if (numvtaps == 1)
> goto single;
>
> +   if (vnet_hash && (vnet_hash->common.flags & TUN_VNET_HASH_RSS)) {
> +   rxq = tun_vnet_rss_select_queue(numvtaps, vnet_hash, skb, 
> tap_add_hash);
> +   queue = rcu_dereference(tap->taps[rxq]);
> +   goto out;
> +   }
> +
> if (!skb->l4_hash && !skb->sw_hash) {
> struct flow_keys keys;
>
> @@ -234,7 +241,7 @@ static struct tap_queue *tap_get_queue(struct tap_dev 
> *tap,
>
> /* Check if we can use flow to select a queue */
> if (rxq) {
> -   tun_vnet_hash_report(&tap->vnet_hash, skb, &keys_basic, rxq, 
> tap_add_hash);
> +   tun_vnet_hash_report(vnet_hash, skb, &keys_basic, rxq, 
> tap_add_hash);
> queue = rcu_dereference(tap->taps[rxq % numvtaps]);
> goto out;
> }
> @@ -1058,7 +1065,7 @@ static long tap_ioctl(struct file *file, unsigned int 
> cmd,
> tap = rtnl_dereference(q->tap);
> ret = tun_vnet_ioctl(&q->vnet_hdr_sz, &q->flags,
>  tap ? &tap->vnet_hash : NULL, -EINVAL,
> -cmd, sp);
> +true, cmd, sp);
> rtnl_unlock();
> return ret;
> }
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 27308417b834..18528568aed7 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -209,7 +209,7 @@ struct tun_struct {
> struct bpf_prog __rcu *xdp_prog;
> struct tun_prog __rcu *steering_prog;
> struct tun_prog __rcu *filter_prog;
> -   struct tun_vnet_hash vnet_hash;
> +   struct tun_vnet_hash_container __rcu *vnet_hash;
> struct ethtool_link_ksettings link_ksettings;
> /* init args */
> struct file *file;
> @@ -468,7 +468,9 @@ static const struct virtio_net_hash *tun_find_hash(const 
> struct sk_buff *skb)
>   * the userspace application move between processors, we may get a
>   * different rxq no. here.
>   */
> -static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff 
> *skb)
> +static u16 tun_automq_select_queue(struct tun_struct *tun,
> +  const struct tun_vnet_hash_container 
> *vnet_hash,
> +  struct sk_buff *skb)
>  {
> struct flow_keys keys;
> struct flow_keys_basic keys_basic;
> @@ -493,7 +495,7 @@ static u16 tun_automq_select_queue(struct tun_struct 
> *tun, struct sk_buff *skb)
> .control = keys.control,
> .basic = keys.basic
> };
> -   tun_vnet_hash_report(&tun->vnet_hash, skb, &keys_basic, skb->l4_hash 
> ? skb->hash : txq,
> +   tun_vnet_hash_report(vnet_hash, skb, &keys_basic, skb->l4_hash ? 
> skb->hash : txq,
>  tun_add_hash);
>
> return txq;
> @@ -523,10 +525,17 @@ static u16 tun_select_queue(st

Re: [PATCH RFC v5 05/10] tun: Pad virtio header with zero

2024-10-09 Thread Jason Wang
On Tue, Oct 8, 2024 at 2:55 PM Akihiko Odaki  wrote:
>
> tun used to simply advance iov_iter when it needs to pad virtio header,
> which leaves the garbage in the buffer as is. This is especially
> problematic when tun starts to allow enabling the hash reporting
> feature; even if the feature is enabled, the packet may lack a hash
> value and may contain a hole in the virtio header because the packet
> arrived before the feature gets enabled or does not contain the
> header fields to be hashed. If the hole is not filled with zero, it is
> impossible to tell if the packet lacks a hash value.
>
> In theory, a user of tun can fill the buffer with zero before calling
> read() to avoid such a problem, but leaving the garbage in the buffer is
> awkward anyway so fill the buffer in tun.
>
> Signed-off-by: Akihiko Odaki 

This sounds like an independent fix that is worth going to -net first.

Thanks

> ---
>  drivers/net/tun_vnet.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
> index 7c7f3f6d85e9..c40bde0fdf8c 100644
> --- a/drivers/net/tun_vnet.h
> +++ b/drivers/net/tun_vnet.h
> @@ -138,7 +138,8 @@ static inline int tun_vnet_hdr_put(int sz, struct 
> iov_iter *iter,
> if (copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr))
> return -EFAULT;
>
> -   iov_iter_advance(iter, sz - sizeof(*hdr));
> +   if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
> +   return -EFAULT;
>
> return 0;
>  }
>
> --
> 2.46.2
>




Re: [PATCH RFC v5 06/10] tun: Introduce virtio-net hash reporting feature

2024-10-09 Thread Jason Wang
On Tue, Oct 8, 2024 at 2:55 PM Akihiko Odaki  wrote:
>
> Allow the guest to reuse the hash value to make receive steering
> consistent between the host and guest, and to save hash computation.
>
> Signed-off-by: Akihiko Odaki 

I wonder if this would cause overhead when hash reporting is not enabled?

> ---
>  Documentation/networking/tuntap.rst |   7 +++
>  drivers/net/Kconfig |   1 +
>  drivers/net/tap.c   |  45 ++--

Tile should be for tap as well or is this just for tun?

>  drivers/net/tun.c   |  46 
>  drivers/net/tun_vnet.h  | 102 
> +++-
>  include/linux/if_tap.h  |   2 +
>  include/uapi/linux/if_tun.h |  48 +
>  7 files changed, 223 insertions(+), 28 deletions(-)
>
> diff --git a/Documentation/networking/tuntap.rst 
> b/Documentation/networking/tuntap.rst
> index 4d7087f727be..86b4ae8caa8a 100644
> --- a/Documentation/networking/tuntap.rst
> +++ b/Documentation/networking/tuntap.rst
> @@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
>return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
>}
>
> +3.4 Reference
> +-
> +
> +``linux/if_tun.h`` defines the interface described below:
> +
> +.. kernel-doc:: include/uapi/linux/if_tun.h
> +
>  Universal TUN/TAP device driver Frequently Asked Question
>  =
>
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 9920b3a68ed1..e2a7bd703550 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -395,6 +395,7 @@ config TUN
> tristate "Universal TUN/TAP device driver support"
> depends on INET
> select CRC32
> +   select SKB_EXTENSIONS

Then we need this for macvtap at least as well?

> help
>   TUN/TAP provides packet reception and transmission for user space
>   programs.  It can be viewed as a simple Point-to-Point or Ethernet
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index 9a34ceed0c2c..5e2fbe63ca47 100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -179,6 +179,16 @@ static void tap_put_queue(struct tap_queue *q)
> sock_put(&q->sk);
>  }
>
> +static struct virtio_net_hash *tap_add_hash(struct sk_buff *skb)
> +{
> +   return (struct virtio_net_hash *)skb->cb;

Any reason that tap uses skb->cb but not skb extensions? (And is it
safe to use that without cloning?)

> +}
> +
> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
> +{
> +   return (const struct virtio_net_hash *)skb->cb;
> +}
> +
>  /*
>   * Select a queue based on the rxq of the device on which this packet
>   * arrived. If the incoming device is not mq, calculate a flow hash
> @@ -189,6 +199,7 @@ static void tap_put_queue(struct tap_queue *q)
>  static struct tap_queue *tap_get_queue(struct tap_dev *tap,
>struct sk_buff *skb)
>  {
> +   struct flow_keys_basic keys_basic;
> struct tap_queue *queue = NULL;
> /* Access to taps array is protected by rcu, but access to numvtaps
>  * isn't. Below we use it to lookup a queue, but treat it as a hint
> @@ -198,15 +209,32 @@ static struct tap_queue *tap_get_queue(struct tap_dev 
> *tap,
> int numvtaps = READ_ONCE(tap->numvtaps);
> __u32 rxq;
>
> +   *tap_add_hash(skb) = (struct virtio_net_hash) { .report = 
> VIRTIO_NET_HASH_REPORT_NONE };
> +
> if (!numvtaps)
> goto out;
>
> if (numvtaps == 1)
> goto single;
>
> +   if (!skb->l4_hash && !skb->sw_hash) {
> +   struct flow_keys keys;
> +
> +   skb_flow_dissect_flow_keys(skb, &keys, 
> FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
> +   rxq = flow_hash_from_keys(&keys);
> +   keys_basic = (struct flow_keys_basic) {
> +   .control = keys.control,
> +   .basic = keys.basic
> +   };
> +   } else {
> +   skb_flow_dissect_flow_keys_basic(NULL, skb, &keys_basic, 
> NULL, 0, 0, 0,
> +
> FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
> +   rxq = skb->hash;
> +   }
> +
> /* Check if we can use flow to select a queue */
> -   rxq = skb_get_hash(skb);
> if (rxq) {
> +   tun_vnet_hash_report(&tap->vnet_hash, skb, &keys_basic, rxq, 
> tap_add_hash);
> queue = rcu_dereference(tap->taps[rxq % numvtaps]);
> goto out;
> }
> @@ -713,15 +741,16 @@ static ssize_t tap_put_user(struct tap_queue *q,
> int total;
>
> if (q->flags & IFF_VNET_HDR) {
> -   struct virtio_net_hdr vnet_hdr;
> +   struct virtio_net_hdr_v1_hash vnet_hdr;
>
> vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
>
> -   ret = tun_vnet_hdr_from_

Re: [PATCH RFC v5 07/10] tun: Introduce virtio-net RSS

2024-10-18 Thread Jason Wang
On Sat, Oct 12, 2024 at 6:29 PM Akihiko Odaki  wrote:
>
> On 2024/10/09 17:14, Jason Wang wrote:
> > On Tue, Oct 8, 2024 at 2:55 PM Akihiko Odaki  
> > wrote:
> >>
> >> RSS is a receive steering algorithm that can be negotiated to use with
> >> virtio_net. Conventionally the hash calculation was done by the VMM.
> >> However, computing the hash after the queue was chosen defeats the
> >> purpose of RSS.
> >>
> >> Another approach is to use eBPF steering program. This approach has
> >> another downside: it cannot report the calculated hash due to the
> >> restrictive nature of eBPF steering program.
> >>
> >> Introduce the code to perform RSS to the kernel in order to overcome
> >> thse challenges. An alternative solution is to extend the eBPF steering
> >> program so that it will be able to report to the userspace, but I didn't
> >> opt for it because extending the current mechanism of eBPF steering
> >> program as is because it relies on legacy context rewriting, and
> >> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >> the other relevant virtualization APIs such as KVM and vhost_net are
> >> UAPIs.
> >>
> >> Signed-off-by: Akihiko Odaki 
> >> ---
> >>   drivers/net/tap.c   | 11 +-
> >>   drivers/net/tun.c   | 57 ---
> >>   drivers/net/tun_vnet.h  | 96 
> >> +
> >>   include/linux/if_tap.h  |  4 +-
> >>   include/uapi/linux/if_tun.h | 27 +
> >>   5 files changed, 169 insertions(+), 26 deletions(-)
> >>

[...]

> >> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> >> index d11e79b4e0dc..4887f97500a8 100644
> >> --- a/include/uapi/linux/if_tun.h
> >> +++ b/include/uapi/linux/if_tun.h
> >> @@ -75,6 +75,14 @@
> >>*
> >>* The argument is a pointer to &struct tun_vnet_hash.
> >>*
> >> + * The argument is a pointer to the compound of the following in order if
> >> + * %TUN_VNET_HASH_RSS is set:
> >> + *
> >> + * 1. &struct tun_vnet_hash
> >> + * 2. &struct tun_vnet_hash_rss
> >> + * 3. Indirection table
> >> + * 4. Key
> >> + *
> >
> > Let's try not modify uAPI. We can introduce new ioctl if necessary.
>
> 2, 3, and 4 are new additions. Adding a separate ioctl for them means we
> need to call two ioctls to configure RSS and it is hard to design the
> interactions with them.
>
> For example, if we set TUN_VNET_HASH_RSS with TUNSETVNETHASH before
> setting struct tun_vnet_hash_rss with another ioctl, tuntap will enable
> RSS with undefined parameters. Setting struct tun_vnet_hash_rss with
> TUN_VNET_HASH_RSS unset also sounds unreasnoable.
>
> Letting the new ioctl set TUN_VNET_HASH_RSS does not help either.
> TUNSETVNETHASH still sets the bitmask of allowed hash types so RSS will
> depend on two ioctls.

I meant let's avoid introducing an ioctl with function 1 in one patch,
and adding 2,3,4 in exactly the same ioctl in the following. It breaks
the uABI consistency and bisection.

We can add all in one patch.

Thanks




Re: [PATCH RFC v4 7/9] tun: Introduce virtio-net RSS

2024-09-24 Thread Jason Wang
On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki  wrote:
>
> RSS is a receive steering algorithm that can be negotiated to use with
> virtio_net. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF steering program.
>
> Introduce the code to perform RSS to the kernel in order to overcome
> thse challenges. An alternative solution is to extend the eBPF steering
> program so that it will be able to report to the userspace, but I didn't
> opt for it because extending the current mechanism of eBPF steering
> program as is because it relies on legacy context rewriting, and
> introducing kfunc-based eBPF will result in non-UAPI dependency while
> the other relevant virtualization APIs such as KVM and vhost_net are
> UAPIs.
>
> Signed-off-by: Akihiko Odaki 

If we decide to go this way, we need to make it reusable for macvtap as well.

Thanks




Re: [PATCH RFC v4 0/9] tun: Introduce virtio-net hashing feature

2024-09-24 Thread Jason Wang
On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki  wrote:
>
> virtio-net have two usage of hashes: one is RSS and another is hash
> reporting. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF.
>
> Introduce the code to compute hashes to the kernel in order to overcome
> thse challenges.
>
> An alternative solution is to extend the eBPF steering program so that it
> will be able to report to the userspace, but it is based on context
> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
> and vhost_net).
>

I wonder if we could clone the skb and reuse some to store the hash,
then the steering eBPF program can access these fields without
introducing full RSS in the kernel?

Thanks




Re: [PATCH RFC v4 0/9] tun: Introduce virtio-net hashing feature

2024-09-26 Thread Jason Wang
On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki  wrote:
>
> On 2024/09/25 12:30, Jason Wang wrote:
> > On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki  
> > wrote:
> >>
> >> virtio-net have two usage of hashes: one is RSS and another is hash
> >> reporting. Conventionally the hash calculation was done by the VMM.
> >> However, computing the hash after the queue was chosen defeats the
> >> purpose of RSS.
> >>
> >> Another approach is to use eBPF steering program. This approach has
> >> another downside: it cannot report the calculated hash due to the
> >> restrictive nature of eBPF.
> >>
> >> Introduce the code to compute hashes to the kernel in order to overcome
> >> thse challenges.
> >>
> >> An alternative solution is to extend the eBPF steering program so that it
> >> will be able to report to the userspace, but it is based on context
> >> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
> >> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
> >> and vhost_net).
> >>
> >
> > I wonder if we could clone the skb and reuse some to store the hash,
> > then the steering eBPF program can access these fields without
> > introducing full RSS in the kernel?
>
> I don't get how cloning the skb can solve the issue.
>
> We can certainly implement Toeplitz function in the kernel or even with
> tc-bpf to store a hash value that can be used for eBPF steering program
> and virtio hash reporting. However we don't have a means of storing a
> hash type, which is specific to virtio hash reporting and lacks a
> corresponding skb field.

I may miss something but looking at sk_filter_is_valid_access(). It
looks to me we can make use of skb->cb[0..4]?

Thanks

>
> Regards,
> Akihiko Odaki
>




Re: [PATCH RFC v4 0/9] tun: Introduce virtio-net hashing feature

2024-09-28 Thread Jason Wang
On Fri, Sep 27, 2024 at 3:51 PM Akihiko Odaki  wrote:
>
> On 2024/09/27 13:31, Jason Wang wrote:
> > On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki  
> > wrote:
> >>
> >> On 2024/09/25 12:30, Jason Wang wrote:
> >>> On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> virtio-net have two usage of hashes: one is RSS and another is hash
> >>>> reporting. Conventionally the hash calculation was done by the VMM.
> >>>> However, computing the hash after the queue was chosen defeats the
> >>>> purpose of RSS.
> >>>>
> >>>> Another approach is to use eBPF steering program. This approach has
> >>>> another downside: it cannot report the calculated hash due to the
> >>>> restrictive nature of eBPF.
> >>>>
> >>>> Introduce the code to compute hashes to the kernel in order to overcome
> >>>> thse challenges.
> >>>>
> >>>> An alternative solution is to extend the eBPF steering program so that it
> >>>> will be able to report to the userspace, but it is based on context
> >>>> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
> >>>> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
> >>>> and vhost_net).
> >>>>
> >>>
> >>> I wonder if we could clone the skb and reuse some to store the hash,
> >>> then the steering eBPF program can access these fields without
> >>> introducing full RSS in the kernel?
> >>
> >> I don't get how cloning the skb can solve the issue.
> >>
> >> We can certainly implement Toeplitz function in the kernel or even with
> >> tc-bpf to store a hash value that can be used for eBPF steering program
> >> and virtio hash reporting. However we don't have a means of storing a
> >> hash type, which is specific to virtio hash reporting and lacks a
> >> corresponding skb field.
> >
> > I may miss something but looking at sk_filter_is_valid_access(). It
> > looks to me we can make use of skb->cb[0..4]?
>
> I didn't opt to using cb. Below is the rationale:
>
> cb is for tail call so it means we reuse the field for a different
> purpose. The context rewrite allows adding a field without increasing
> the size of the underlying storage (the real sk_buff) so we should add a
> new field instead of reusing an existing field to avoid confusion.
>
> We are however no longer allowed to add a new field. In my
> understanding, this is because it is an UAPI, and eBPF maintainers found
> it is difficult to maintain its stability.
>
> Reusing cb for hash reporting is a workaround to avoid having a new
> field, but it does not solve the underlying problem (i.e., keeping eBPF
> as stable as UAPI is unreasonably hard). In my opinion, adding an ioctl
> is a reasonable option to keep the API as stable as other virtualization
> UAPIs while respecting the underlying intention of the context rewrite
> feature freeze.

Fair enough.

Btw, I remember DPDK implements tuntap RSS via eBPF as well (probably
via cls or other). It might worth to see if anything we miss here.

Thanks

>
> Regards,
> Akihiko Odaki
>




Re: [PATCH v2 3/3] tun: Set num_buffers for virtio 1.0

2025-01-09 Thread Jason Wang
On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki  wrote:
>
> The specification says the device MUST set num_buffers to 1 if
> VIRTIO_NET_F_MRG_RXBUF has not been negotiated.

Have we agreed on how to fix the spec or not?

As I replied in the spec patch, if we just remove this "MUST", it
looks like we are all fine?

Thanks




Re: [PATCH v2 2/3] tun: Pad virtio header with zero

2025-01-09 Thread Jason Wang
On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki  wrote:
>
> tun used to simply advance iov_iter when it needs to pad virtio header,
> which leaves the garbage in the buffer as is. This is especially
> problematic when tun starts to allow enabling the hash reporting
> feature; even if the feature is enabled, the packet may lack a hash
> value and may contain a hole in the virtio header because the packet
> arrived before the feature gets enabled or does not contain the
> header fields to be hashed. If the hole is not filled with zero, it is
> impossible to tell if the packet lacks a hash value.

I'm not sure I will get here, could we do this in the series of hash reporting?

>
> In theory, a user of tun can fill the buffer with zero before calling
> read() to avoid such a problem, but leaving the garbage in the buffer is
> awkward anyway so fill the buffer in tun.
>
> Signed-off-by: Akihiko Odaki 
> ---
>  drivers/net/tun_vnet.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/tun_vnet.c b/drivers/net/tun_vnet.c
> index fe842df9e9ef..ffb2186facd3 100644
> --- a/drivers/net/tun_vnet.c
> +++ b/drivers/net/tun_vnet.c
> @@ -138,7 +138,8 @@ int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
> if (copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr))
> return -EFAULT;
>
> -   iov_iter_advance(iter, sz - sizeof(*hdr));
> +   if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
> +   return -EFAULT;
>
> return 0;

There're various callers of iov_iter_advance(), do we need to fix them all?

Thanks

>  }

>
> --
> 2.47.1
>




Re: [PATCH v2 1/3] tun: Unify vnet implementation

2025-01-09 Thread Jason Wang
On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki  wrote:
>
> Both tun and tap exposes the same set of virtio-net-related features.
> Unify their implementations to ease future changes.
>
> Signed-off-by: Akihiko Odaki 
> ---
>  MAINTAINERS|   1 +
>  drivers/net/Kconfig|   5 ++
>  drivers/net/Makefile   |   1 +
>  drivers/net/tap.c  | 172 ++--
>  drivers/net/tun.c  | 208 
> -
>  drivers/net/tun_vnet.c | 186 +++
>  drivers/net/tun_vnet.h |  24 ++
>  7 files changed, 273 insertions(+), 324 deletions(-)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 910305c11e8a..1be8a452d11f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -23903,6 +23903,7 @@ F:  Documentation/networking/tuntap.rst
>  F: arch/um/os-Linux/drivers/
>  F: drivers/net/tap.c
>  F: drivers/net/tun.c
> +F: drivers/net/tun_vnet.h
>
>  TURBOCHANNEL SUBSYSTEM
>  M: "Maciej W. Rozycki" 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 1fd5acdc73c6..255c8f9f1d7c 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -395,6 +395,7 @@ config TUN
> tristate "Universal TUN/TAP device driver support"
> depends on INET
> select CRC32
> +   select TUN_VNET

I don't think we need a dedicated Kconfig option here.

Btw, fixes should come first as it simplifies the backporting.

Thanks




Re: [PATCH net v3 9/9] tap: Use tun's vnet-related code

2025-01-19 Thread Jason Wang
On Fri, Jan 17, 2025 at 6:35 PM Akihiko Odaki  wrote:
>
> On 2025/01/17 18:23, Willem de Bruijn wrote:
> > Akihiko Odaki wrote:
> >> tun and tap implements the same vnet-related features so reuse the code.
> >>
> >> Signed-off-by: Akihiko Odaki 
> >> ---
> >>   drivers/net/Kconfig|   1 +
> >>   drivers/net/Makefile   |   6 +-
> >>   drivers/net/tap.c  | 152 
> >> +
> >>   drivers/net/tun_vnet.c |   5 ++
> >>   4 files changed, 24 insertions(+), 140 deletions(-)
> >>
> >> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> >> index 1fd5acdc73c6..c420418473fc 100644
> >> --- a/drivers/net/Kconfig
> >> +++ b/drivers/net/Kconfig
> >> @@ -395,6 +395,7 @@ config TUN
> >>  tristate "Universal TUN/TAP device driver support"
> >>  depends on INET
> >>  select CRC32
> >> +select TAP
> >>  help
> >>TUN/TAP provides packet reception and transmission for user space
> >>programs.  It can be viewed as a simple Point-to-Point or Ethernet
> >> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> >> index bb8eb3053772..2275309a97ee 100644
> >> --- a/drivers/net/Makefile
> >> +++ b/drivers/net/Makefile
> >> @@ -29,9 +29,9 @@ obj-y += mdio/
> >>   obj-y += pcs/
> >>   obj-$(CONFIG_RIONET) += rionet.o
> >>   obj-$(CONFIG_NET_TEAM) += team/
> >> -obj-$(CONFIG_TUN) += tun-drv.o
> >> -tun-drv-y := tun.o tun_vnet.o
> >> -obj-$(CONFIG_TAP) += tap.o
> >> +obj-$(CONFIG_TUN) += tun.o
> >
> > Is reversing the previous changes to tun.ko intentional?
> >
> > Perhaps the previous approach with a new CONFIG_TUN_VNET is preferable
> > over this. In particular over making TUN select TAP, a new dependency.
>
> Jason, you also commented about CONFIG_TUN_VNET for the previous
> version. Do you prefer the old approach, or the new one? (Or if you have
> another idea, please tell me.)

Ideally, if we can make TUN select TAP that would be better. But there
are some subtle differences in the multi queue implementation. We will
end up with some useless code for TUN unless we can unify the multi
queue logic. It might not be worth it to change the TUN's multi queue
logic so having a new file seems to be better.

Thanks


>
> >
> >> +obj-$(CONFIG_TAP) += tap-drv.o
> >> +tap-drv-y := tap.o tun_vnet.o
> >>   obj-$(CONFIG_VETH) += veth.o
> >>   obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
> >>   obj-$(CONFIG_VXLAN) += vxlan/
>




Re: [PATCH v2 3/3] tun: Set num_buffers for virtio 1.0

2025-01-19 Thread Jason Wang
On Thu, Jan 16, 2025 at 1:30 PM Akihiko Odaki  wrote:
>
> On 2025/01/16 10:06, Jason Wang wrote:
> > On Wed, Jan 15, 2025 at 1:07 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/01/13 12:04, Jason Wang wrote:
> >>> On Fri, Jan 10, 2025 at 7:12 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> On 2025/01/10 19:23, Michael S. Tsirkin wrote:
> >>>>> On Fri, Jan 10, 2025 at 11:27:13AM +0800, Jason Wang wrote:
> >>>>>> On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki 
> >>>>>>  wrote:
> >>>>>>>
> >>>>>>> The specification says the device MUST set num_buffers to 1 if
> >>>>>>> VIRTIO_NET_F_MRG_RXBUF has not been negotiated.
> >>>>>>
> >>>>>> Have we agreed on how to fix the spec or not?
> >>>>>>
> >>>>>> As I replied in the spec patch, if we just remove this "MUST", it
> >>>>>> looks like we are all fine?
> >>>>>>
> >>>>>> Thanks
> >>>>>
> >>>>> We should replace MUST with SHOULD but it is not all fine,
> >>>>> ignoring SHOULD is a quality of implementation issue.
> >>>>>
> >>>
> >>> So is this something that the driver should notice?
> >>>
> >>>>
> >>>> Should we really replace it? It would mean that a driver conformant with
> >>>> the current specification may not be compatible with a device conformant
> >>>> with the future specification.
> >>>
> >>> I don't get this. We are talking about devices and we want to relax so
> >>> it should compatibile.
> >>
> >>
> >> The problem is:
> >> 1) On the device side, the num_buffers can be left uninitialized due to 
> >> bugs
> >> 2) On the driver side, the specification allows assuming the num_buffers
> >> is set to one.
> >>
> >> Relaxing the device requirement will replace "due to bugs" with
> >> "according to the specification" in 1). It still contradicts with 2) so
> >> does not fix compatibility.
> >
> > Just to clarify I meant we can simply remove the following:
> >
> > """
> > The device MUST use only a single descriptor if VIRTIO_NET_F_MRG_RXBUF
> > was not negotiated. Note: This means that num_buffers will always be 1
> > if VIRTIO_NET_F_MRG_RXBUF is not negotiated.
> > """
> >
> > And
> >
> > """
> > If VIRTIO_NET_F_MRG_RXBUF has not been negotiated, the device MUST set
> > num_buffers to 1.
> > """
> >
> > This seems easier as it reflects the fact where some devices don't set
> > it. And it eases the transitional device as it doesn't need to have
> > any special care.
>
> That can potentially break existing drivers that are compliant with the
> current and assumes the num_buffers is set to 1.

Those drivers are already 'broken'. Aren't they?

Thanks

>
> Regards,
> Akihiko Odaki
>
> >
> > Then we don't need any driver normative so I don't see any conflict.
> >
> > Michael suggests we use "SHOULD", but if this is something that the
> > driver needs to be aware of I don't know how "SHOULD" can help a lot
> > or not.
> >
> >>
> >> Instead, we should make the driver requirement stricter to change 2).
> >> That is what "[PATCH v3] virtio-net: Ignore num_buffers when unused" does:
> >> https://lore.kernel.org/r/20250110-reserved-v3-1-2ade0a5d2...@daynix.com
> >>
> >>>
> >>>>
> >>>> We are going to fix all implementations known to buggy (QEMU and Linux)
> >>>> anyway so I think it's just fine to leave that part of specification as 
> >>>> is.
> >>>
> >>> I don't think we can fix it all.
> >>
> >> It essentially only requires storing 16 bits. There are details we need
> >> to work out, but it should be possible to fix.
> >
> > I meant it's not realistic to fix all the hypervisors. Note that
> > modern devices have been implemented for about a decade so we may have
> > too many versions of various hypervisors. (E.g DPDK seems to stick
> > with the same behaviour of the current kernel).
>  > >>
> >> Regards,
> >> Akihiko Odaki
> >>
> >
> > Thanks
> >
>




Re: [PATCH v2 3/3] tun: Set num_buffers for virtio 1.0

2025-01-12 Thread Jason Wang
On Fri, Jan 10, 2025 at 7:12 PM Akihiko Odaki  wrote:
>
> On 2025/01/10 19:23, Michael S. Tsirkin wrote:
> > On Fri, Jan 10, 2025 at 11:27:13AM +0800, Jason Wang wrote:
> >> On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki  
> >> wrote:
> >>>
> >>> The specification says the device MUST set num_buffers to 1 if
> >>> VIRTIO_NET_F_MRG_RXBUF has not been negotiated.
> >>
> >> Have we agreed on how to fix the spec or not?
> >>
> >> As I replied in the spec patch, if we just remove this "MUST", it
> >> looks like we are all fine?
> >>
> >> Thanks
> >
> > We should replace MUST with SHOULD but it is not all fine,
> > ignoring SHOULD is a quality of implementation issue.
> >

So is this something that the driver should notice?

>
> Should we really replace it? It would mean that a driver conformant with
> the current specification may not be compatible with a device conformant
> with the future specification.

I don't get this. We are talking about devices and we want to relax so
it should compatibile.

>
> We are going to fix all implementations known to buggy (QEMU and Linux)
> anyway so I think it's just fine to leave that part of specification as is.

I don't think we can fix it all.

Thanks

>




Re: [PATCH v2 3/3] tun: Set num_buffers for virtio 1.0

2025-01-15 Thread Jason Wang
On Wed, Jan 15, 2025 at 1:07 PM Akihiko Odaki  wrote:
>
> On 2025/01/13 12:04, Jason Wang wrote:
> > On Fri, Jan 10, 2025 at 7:12 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/01/10 19:23, Michael S. Tsirkin wrote:
> >>> On Fri, Jan 10, 2025 at 11:27:13AM +0800, Jason Wang wrote:
> >>>> On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki  
> >>>> wrote:
> >>>>>
> >>>>> The specification says the device MUST set num_buffers to 1 if
> >>>>> VIRTIO_NET_F_MRG_RXBUF has not been negotiated.
> >>>>
> >>>> Have we agreed on how to fix the spec or not?
> >>>>
> >>>> As I replied in the spec patch, if we just remove this "MUST", it
> >>>> looks like we are all fine?
> >>>>
> >>>> Thanks
> >>>
> >>> We should replace MUST with SHOULD but it is not all fine,
> >>> ignoring SHOULD is a quality of implementation issue.
> >>>
> >
> > So is this something that the driver should notice?
> >
> >>
> >> Should we really replace it? It would mean that a driver conformant with
> >> the current specification may not be compatible with a device conformant
> >> with the future specification.
> >
> > I don't get this. We are talking about devices and we want to relax so
> > it should compatibile.
>
>
> The problem is:
> 1) On the device side, the num_buffers can be left uninitialized due to bugs
> 2) On the driver side, the specification allows assuming the num_buffers
> is set to one.
>
> Relaxing the device requirement will replace "due to bugs" with
> "according to the specification" in 1). It still contradicts with 2) so
> does not fix compatibility.

Just to clarify I meant we can simply remove the following:

"""
The device MUST use only a single descriptor if VIRTIO_NET_F_MRG_RXBUF
was not negotiated. Note: This means that num_buffers will always be 1
if VIRTIO_NET_F_MRG_RXBUF is not negotiated.
"""

And

"""
If VIRTIO_NET_F_MRG_RXBUF has not been negotiated, the device MUST set
num_buffers to 1.
"""

This seems easier as it reflects the fact where some devices don't set
it. And it eases the transitional device as it doesn't need to have
any special care.

Then we don't need any driver normative so I don't see any conflict.

Michael suggests we use "SHOULD", but if this is something that the
driver needs to be aware of I don't know how "SHOULD" can help a lot
or not.

>
> Instead, we should make the driver requirement stricter to change 2).
> That is what "[PATCH v3] virtio-net: Ignore num_buffers when unused" does:
> https://lore.kernel.org/r/20250110-reserved-v3-1-2ade0a5d2...@daynix.com
>
> >
> >>
> >> We are going to fix all implementations known to buggy (QEMU and Linux)
> >> anyway so I think it's just fine to leave that part of specification as is.
> >
> > I don't think we can fix it all.
>
> It essentially only requires storing 16 bits. There are details we need
> to work out, but it should be possible to fix.

I meant it's not realistic to fix all the hypervisors. Note that
modern devices have been implemented for about a decade so we may have
too many versions of various hypervisors. (E.g DPDK seems to stick
with the same behaviour of the current kernel).

>
> Regards,
> Akihiko Odaki
>

Thanks




Re: [PATCH net-next] tun: Pad virtio headers

2025-02-16 Thread Jason Wang
On Sat, Feb 15, 2025 at 1:25 PM Akihiko Odaki  wrote:
>
> On 2025/02/14 0:43, Michael S. Tsirkin wrote:
> > On Thu, Feb 13, 2025 at 06:23:55PM +0900, Akihiko Odaki wrote:
> >> On 2025/02/13 16:18, Michael S. Tsirkin wrote:
> >>>
> >>> Commit log needs some work.
> >>>
> >>> So my understanding is, this patch does not do much functionally,
> >>> but makes adding the hash feature easier. OK.
> >>>
> >>> On Thu, Feb 13, 2025 at 03:54:06PM +0900, Akihiko Odaki wrote:
>  tun used to simply advance iov_iter when it needs to pad virtio header,
>  which leaves the garbage in the buffer as is. This is especially
>  problematic
> >>>
> >>> I think you mean "this will become especially problematic"
> >>>
>  when tun starts to allow enabling the hash reporting
>  feature; even if the feature is enabled, the packet may lack a hash
>  value and may contain a hole in the virtio header because the packet
>  arrived before the feature gets enabled or does not contain the
>  header fields to be hashed. If the hole is not filled with zero, it is
>  impossible to tell if the packet lacks a hash value.
> 
>  In theory, a user of tun can fill the buffer with zero before calling
>  read() to avoid such a problem, but leaving the garbage in the buffer is
>  awkward anyway so fill the buffer in tun.
> >>>
> >>>
> >>> What is missing here is description of what the patch does.
> >>> I think it is
> >>> "Replace advancing the iterator with writing zeros".
> >>>
> >>> There could be performance cost to the dirtying extra cache lines, though.
> >>> Could you try checking that please?
> >>
> >> It will not dirty extra cache lines; an explanation follows later. Because
> >> of that, any benchmark are likely to show only noises, but if you have an
> >> idea of workloads that should be tested, please tell me.
> >
> > pktgen usually
>
> I tried it but it didn't give meaningful results so I may be doing
> wrong. It didn't show an obvious performance regression at least. I ran
> the following procedure:
>
> 1. create a veth pair
> 2. connect it to macvtap
> 3. run Fedora 41 on QEMU with vhost=on
> 4. run samples/pktgen/pktgen_sample01_simple.sh for the other end of veth
> 5. observe the rx packet rate of macvtap with ifstat for 60 seconds
>
> ifstat showed that it received:
> 532K packets / 60 seconds without this patch
> 578K packets / 60 seconds with this patch

The pps seems to be too poor.

If you want to test, I would suggest to use:

pktgen on the host with DPDK testpmd + virtio user:

https://doc.dpdk.org/guides/howto/virtio_user_as_exception_path.html

Thanks

>
> This is 8.6 % uplift, not degradation. I guess it's just a noise.
>
> Below are actual commands I ran:
>
> The commands I set up the veth pair and macvtap is as follows:
>
> ip link add veth_host type veth peer name veth_guest
> ip link set veth_host up
> ip link set veth_guest up
> ip link add link macvtap0 link veth_guest type macvtap
> ip link set macvtap0 address 02:00:00:01:00:00 mtu 1486 up
> ip address add 10.0.2.0 dev veth_host
> ip route add 10.0.2.1 dev veth_host
>
> The command for the pktgen is:
> samples/pktgen/pktgen_sample01_simple.sh -i veth_host -d 10.0.2.1 -m
> 02:00:00:01:00:00 -n 0
>
> After I started pktgen, I ran: ifstat -d 60 macvtap0
> I waited 60 seconds, and observed the rx rate with: ifstat -as macvtap0
>
> >
> >
> >
> >>>
> >>> I think we should mention the risks of the patch, too.
> >>> Maybe:
> >>>
> >>> Also in theory, a user might have initialized the buffer
> >>> to some non-zero value, expecting tun to skip writing it.
> >>> As this was never a documented feature, this seems unlikely.
> 
> 
>  The specification also says the device MUST set num_buffers to 1 when
>  the field is present so set it when the specified header size is big
>  enough to contain the field.
> >>>
> >>> This part I dislike. tun has no idea what the number of buffers is.
> >>> Why 1 specifically?
> >>
> >> That's a valid point. I rewrote the commit log to clarify, but perhaps we
> >> can drop the code to set the num_buffers as "[PATCH] vhost/net: Set
> >> num_buffers for virtio 1.0" already landed.
> >
> >
> > I think I'd prefer that second option. it allows userspace
> > to reliably detect the new behaviour, by setting the value
> > to != 0.
>
> I'll leave num_buffers zero in the next version.
>
> >
> >
> >>
> >> Below is the rewritten commit log, which incorporates your suggestions and
> >> is extended to cover the performance implication and reason the num_buffers
> >> initialization:
> >>
> >> tun simply advances iov_iter when it needs to pad virtio header,
> >> which leaves the garbage in the buffer as is. This will become
> >> especially problematic when tun starts to allow enabling the hash
> >> reporting feature; even if the feature is enabled, the packet may lack a
> >> hash value and may contain a hole in the virtio header because the
> >> packet arrived before the feature get

Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-19 Thread Jason Wang
On Wed, Mar 19, 2025 at 1:29 PM Akihiko Odaki  wrote:
>
> On 2025/03/19 9:58, Jason Wang wrote:
> > On Tue, Mar 18, 2025 at 6:10 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/03/18 9:15, Jason Wang wrote:
> >>> On Mon, Mar 17, 2025 at 3:07 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> On 2025/03/17 10:12, Jason Wang wrote:
> >>>>> On Wed, Mar 12, 2025 at 1:03 PM Akihiko Odaki 
> >>>>>  wrote:
> >>>>>>
> >>>>>> On 2025/03/12 11:35, Jason Wang wrote:
> >>>>>>> On Tue, Mar 11, 2025 at 2:11 PM Akihiko Odaki 
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>> On 2025/03/11 9:38, Jason Wang wrote:
> >>>>>>>>> On Mon, Mar 10, 2025 at 3:45 PM Akihiko Odaki 
> >>>>>>>>>  wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 2025/03/10 12:55, Jason Wang wrote:
> >>>>>>>>>>> On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki 
> >>>>>>>>>>>  wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hash reporting
> >>>>>>>>>>>> ==
> >>>>>>>>>>>>
> >>>>>>>>>>>> Allow the guest to reuse the hash value to make receive steering
> >>>>>>>>>>>> consistent between the host and guest, and to save hash 
> >>>>>>>>>>>> computation.
> >>>>>>>>>>>>
> >>>>>>>>>>>> RSS
> >>>>>>>>>>>> ===
> >>>>>>>>>>>>
> >>>>>>>>>>>> RSS is a receive steering algorithm that can be negotiated to 
> >>>>>>>>>>>> use with
> >>>>>>>>>>>> virtio_net. Conventionally the hash calculation was done by the 
> >>>>>>>>>>>> VMM.
> >>>>>>>>>>>> However, computing the hash after the queue was chosen defeats 
> >>>>>>>>>>>> the
> >>>>>>>>>>>> purpose of RSS.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Another approach is to use eBPF steering program. This approach 
> >>>>>>>>>>>> has
> >>>>>>>>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>>>>>>>> restrictive nature of eBPF steering program.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Introduce the code to perform RSS to the kernel in order to 
> >>>>>>>>>>>> overcome
> >>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF 
> >>>>>>>>>>>> steering
> >>>>>>>>>>>> program so that it will be able to report to the userspace, but 
> >>>>>>>>>>>> I didn't
> >>>>>>>>>>>> opt for it because extending the current mechanism of eBPF 
> >>>>>>>>>>>> steering
> >>>>>>>>>>>> program as is because it relies on legacy context rewriting, and
> >>>>>>>>>>>> introducing kfunc-based eBPF will result in non-UAPI dependency 
> >>>>>>>>>>>> while
> >>>>>>>>>>>> the other relevant virtualization APIs such as KVM and vhost_net 
> >>>>>>>>>>>> are
> >>>>>>>>>>>> UAPIs.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Signed-off-by: Akihiko Odaki 
> >>>>>>>>>>>> Tested-by: Lei Yang 
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>Documentation/networking/tuntap.rst |   7 ++
> >>>>>>>>>>>>drivers/net/Kconfig |   1 +
> >>>>>>>>>>>>drivers/net/tap.c

Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-20 Thread Jason Wang
On Thu, Mar 20, 2025 at 1:33 PM Akihiko Odaki  wrote:
>
> On 2025/03/20 10:31, Jason Wang wrote:
> > On Wed, Mar 19, 2025 at 1:29 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/03/19 9:58, Jason Wang wrote:
> >>> On Tue, Mar 18, 2025 at 6:10 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> On 2025/03/18 9:15, Jason Wang wrote:
> >>>>> On Mon, Mar 17, 2025 at 3:07 PM Akihiko Odaki 
> >>>>>  wrote:
> >>>>>>
> >>>>>> On 2025/03/17 10:12, Jason Wang wrote:
> >>>>>>> On Wed, Mar 12, 2025 at 1:03 PM Akihiko Odaki 
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>> On 2025/03/12 11:35, Jason Wang wrote:
> >>>>>>>>> On Tue, Mar 11, 2025 at 2:11 PM Akihiko Odaki 
> >>>>>>>>>  wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 2025/03/11 9:38, Jason Wang wrote:
> >>>>>>>>>>> On Mon, Mar 10, 2025 at 3:45 PM Akihiko Odaki 
> >>>>>>>>>>>  wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 2025/03/10 12:55, Jason Wang wrote:
> >>>>>>>>>>>>> On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki 
> >>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hash reporting
> >>>>>>>>>>>>>> ==
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Allow the guest to reuse the hash value to make receive 
> >>>>>>>>>>>>>> steering
> >>>>>>>>>>>>>> consistent between the host and guest, and to save hash 
> >>>>>>>>>>>>>> computation.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> RSS
> >>>>>>>>>>>>>> ===
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> RSS is a receive steering algorithm that can be negotiated to 
> >>>>>>>>>>>>>> use with
> >>>>>>>>>>>>>> virtio_net. Conventionally the hash calculation was done by 
> >>>>>>>>>>>>>> the VMM.
> >>>>>>>>>>>>>> However, computing the hash after the queue was chosen defeats 
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> purpose of RSS.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Another approach is to use eBPF steering program. This 
> >>>>>>>>>>>>>> approach has
> >>>>>>>>>>>>>> another downside: it cannot report the calculated hash due to 
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>> restrictive nature of eBPF steering program.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Introduce the code to perform RSS to the kernel in order to 
> >>>>>>>>>>>>>> overcome
> >>>>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF 
> >>>>>>>>>>>>>> steering
> >>>>>>>>>>>>>> program so that it will be able to report to the userspace, 
> >>>>>>>>>>>>>> but I didn't
> >>>>>>>>>>>>>> opt for it because extending the current mechanism of eBPF 
> >>>>>>>>>>>>>> steering
> >>>>>>>>>>>>>> program as is because it relies on legacy context rewriting, 
> >>>>>>>>>>>>>> and
> >>>>>>>>>>>>>> introducing kfunc-based eBPF will result in non-UAPI 
> >>>>>>>>>>>>>> dependency while
> >>>>>>

Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-16 Thread Jason Wang
On Wed, Mar 12, 2025 at 1:03 PM Akihiko Odaki  wrote:
>
> On 2025/03/12 11:35, Jason Wang wrote:
> > On Tue, Mar 11, 2025 at 2:11 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/03/11 9:38, Jason Wang wrote:
> >>> On Mon, Mar 10, 2025 at 3:45 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> On 2025/03/10 12:55, Jason Wang wrote:
> >>>>> On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki  
> >>>>> wrote:
> >>>>>>
> >>>>>> Hash reporting
> >>>>>> ==
> >>>>>>
> >>>>>> Allow the guest to reuse the hash value to make receive steering
> >>>>>> consistent between the host and guest, and to save hash computation.
> >>>>>>
> >>>>>> RSS
> >>>>>> ===
> >>>>>>
> >>>>>> RSS is a receive steering algorithm that can be negotiated to use with
> >>>>>> virtio_net. Conventionally the hash calculation was done by the VMM.
> >>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>> purpose of RSS.
> >>>>>>
> >>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>> restrictive nature of eBPF steering program.
> >>>>>>
> >>>>>> Introduce the code to perform RSS to the kernel in order to overcome
> >>>>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>>>> program so that it will be able to report to the userspace, but I 
> >>>>>> didn't
> >>>>>> opt for it because extending the current mechanism of eBPF steering
> >>>>>> program as is because it relies on legacy context rewriting, and
> >>>>>> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >>>>>> the other relevant virtualization APIs such as KVM and vhost_net are
> >>>>>> UAPIs.
> >>>>>>
> >>>>>> Signed-off-by: Akihiko Odaki 
> >>>>>> Tested-by: Lei Yang 
> >>>>>> ---
> >>>>>> Documentation/networking/tuntap.rst |   7 ++
> >>>>>> drivers/net/Kconfig |   1 +
> >>>>>> drivers/net/tap.c   |  68 ++-
> >>>>>> drivers/net/tun.c   |  98 +-
> >>>>>> drivers/net/tun_vnet.h  | 159 
> >>>>>> ++--
> >>>>>> include/linux/if_tap.h  |   2 +
> >>>>>> include/linux/skbuff.h  |   3 +
> >>>>>> include/uapi/linux/if_tun.h |  75 +
> >>>>>> net/core/skbuff.c   |   4 +
> >>>>>> 9 files changed, 386 insertions(+), 31 deletions(-)
> >>>>>>
> >>>>>> diff --git a/Documentation/networking/tuntap.rst 
> >>>>>> b/Documentation/networking/tuntap.rst
> >>>>>> index 
> >>>>>> 4d7087f727be5e37dfbf5066a9e9c872cc98898d..86b4ae8caa8ad062c1e558920be42ce0d4217465
> >>>>>>  100644
> >>>>>> --- a/Documentation/networking/tuntap.rst
> >>>>>> +++ b/Documentation/networking/tuntap.rst
> >>>>>> @@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable 
> >>>>>> it::
> >>>>>>   return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
> >>>>>>   }
> >>>>>>
> >
> > [...]
> >
> >>>>>> +static inline long tun_vnet_ioctl_sethash(struct 
> >>>>>> tun_vnet_hash_container __rcu **hashp,
> >>>>>> + bool can_rss, void __user 
> >>>>>> *argp)
> >>>>>
> >>>>> So again, can_rss seems to be tricky. Looking at its caller, it tires
> >>>>> to make eBPF and RSS mutually exclusive. I still don't understand why
> >>>>> we need this. Allow eBPF program to override some of the path seems to
> >>>>> be 

Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-16 Thread Jason Wang
On Wed, Mar 12, 2025 at 1:55 PM Akihiko Odaki  wrote:
>
> On 2025/03/12 11:59, Jason Wang wrote:
> > On Tue, Mar 11, 2025 at 2:17 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/03/11 9:38, Jason Wang wrote:
> >>> On Mon, Mar 10, 2025 at 3:45 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> On 2025/03/10 12:55, Jason Wang wrote:
> >>>>> On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki  
> >>>>> wrote:
> >>>>>>
> >>>>>> Hash reporting
> >>>>>> ==
> >>>>>>
> >>>>>> Allow the guest to reuse the hash value to make receive steering
> >>>>>> consistent between the host and guest, and to save hash computation.
> >>>>>>
> >>>>>> RSS
> >>>>>> ===
> >>>>>>
> >>>>>> RSS is a receive steering algorithm that can be negotiated to use with
> >>>>>> virtio_net. Conventionally the hash calculation was done by the VMM.
> >>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>> purpose of RSS.
> >>>>>>
> >>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>> restrictive nature of eBPF steering program.
> >>>>>>
> >>>>>> Introduce the code to perform RSS to the kernel in order to overcome
> >>>>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>>>> program so that it will be able to report to the userspace, but I 
> >>>>>> didn't
> >>>>>> opt for it because extending the current mechanism of eBPF steering
> >>>>>> program as is because it relies on legacy context rewriting, and
> >>>>>> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >>>>>> the other relevant virtualization APIs such as KVM and vhost_net are
> >>>>>> UAPIs.
> >>>>>>
> >>>>>> Signed-off-by: Akihiko Odaki 
> >>>>>> Tested-by: Lei Yang 
> >>>>>> ---
> >>>>>> Documentation/networking/tuntap.rst |   7 ++
> >>>>>> drivers/net/Kconfig |   1 +
> >>>>>> drivers/net/tap.c   |  68 ++-
> >>>>>> drivers/net/tun.c   |  98 +-
> >>>>>> drivers/net/tun_vnet.h  | 159 
> >>>>>> ++--
> >>>>>> include/linux/if_tap.h  |   2 +
> >>>>>> include/linux/skbuff.h  |   3 +
> >>>>>> include/uapi/linux/if_tun.h |  75 +
> >>>>>> net/core/skbuff.c   |   4 +
> >>>>>> 9 files changed, 386 insertions(+), 31 deletions(-)
> >>>>>>
> >
> > [...]
> >
> >>>>> Let's has a consistent name for this and the uapi to be consistent
> >>>>> with TUNSETIFF/TUNGETIFF. Probably TUNSETVNETHASH and
> >>>>> tun_vnet_ioctl_gethash().
> >>>>
> >>>> They have different semantics so they should have different names.
> >>>> TUNGETIFF reports the value currently set while TUNGETVNETHASHCAP
> >>>> reports the value that can be set later.
> >>>
> >>> I'm not sure I will get here. I meant a symmetric name
> >>>
> >>> TUNSETVNETHASH and TUNVETVNETHASH.
> >>
> >> TUNGETVNETHASHCAP does not correspond to TUNGETIFF. The correspondence
> >> of ioctl names is as follows:
> >> TUNGETFEATURES - TUNGETVNETHASHCAP
> >
> > TUNGETFEATURES returns the value set from TUNSETIFF. This differs from
> > TUNGETVNETHASHCAP semantic which just return the capabilities.
> >
> > +static inline long tun_vnet_ioctl_gethashcap(void __user *argp)
> > +{
> > +   static const struct tun_vnet_hash cap = {
> > +   .flags = TUN_VNET_HASH_REPORT | TUN_VNET_HASH_RSS,
> > +   .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
> > +   };
> > +
> > +   return copy_to_user(argp, &cap, sizeof(cap)) ? -EFAULT : 0;
> > +}
> >
> > TUNGETFEATURES doesn't' help too much for non-persist TAP as userspace
> > knows what value it set before.
> >
> >> TUNSETIFF - TUNSETVNETHASH
> >> TUNGETIFF - no corresponding ioctl for the virtio-net hash features
> >
> > And this sounds odd and a hint for a incomplete uAPI as userspace
> > needs to know knowing what can set before doing TUNSETVNETHASH.
>
> You are confused with TUNGETFEATURES and TUNGETIFF. Below is the code
> that implements TUNGETFEATURES:
> if (cmd == TUNGETFEATURES) {
> /* Currently this just means: "what IFF flags are valid?".
>  * This is needed because we never checked for invalid flags on
>  * TUNSETIFF.
>  */
> return put_user(IFF_TUN | IFF_TAP | IFF_NO_CARRIER |
> TUN_FEATURES, (unsigned int __user*)argp);
> } else if (cmd == TUNSETQUEUE) {

Right.

Thanks

>
> Regards,
> Akihiko Odaki
>
> >
> >>
> >> Regards,
> >> Akihiko Odaki
> >>
> >
> > Thanks
> >
>




Re: [PATCH net-next v9 6/6] vhost/net: Support VIRTIO_NET_F_HASH_REPORT

2025-03-16 Thread Jason Wang
On Wed, Mar 12, 2025 at 1:59 PM Akihiko Odaki  wrote:
>
> On 2025/03/12 12:36, Jason Wang wrote:
> > On Tue, Mar 11, 2025 at 2:24 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/03/11 9:42, Jason Wang wrote:
> >>> On Mon, Mar 10, 2025 at 3:04 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> On 2025/03/10 13:43, Jason Wang wrote:
> >>>>> On Fri, Mar 7, 2025 at 7:02 PM Akihiko Odaki  
> >>>>> wrote:
> >>>>>>
> >>>>>> VIRTIO_NET_F_HASH_REPORT allows to report hash values calculated on the
> >>>>>> host. When VHOST_NET_F_VIRTIO_NET_HDR is employed, it will report no
> >>>>>> hash values (i.e., the hash_report member is always set to
> >>>>>> VIRTIO_NET_HASH_REPORT_NONE). Otherwise, the values reported by the
> >>>>>> underlying socket will be reported.
> >>>>>>
> >>>>>> VIRTIO_NET_F_HASH_REPORT requires VIRTIO_F_VERSION_1.
> >>>>>>
> >>>>>> Signed-off-by: Akihiko Odaki 
> >>>>>> Tested-by: Lei Yang 
> >>>>>> ---
> >>>>>> drivers/vhost/net.c | 49 
> >>>>>> +
> >>>>>> 1 file changed, 29 insertions(+), 20 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >>>>>> index 
> >>>>>> b9b9e9d40951856d881d77ac74331d914473cd56..16b241b44f89820a42c302f3586ea6bb5e0d4289
> >>>>>>  100644
> >>>>>> --- a/drivers/vhost/net.c
> >>>>>> +++ b/drivers/vhost/net.c
> >>>>>> @@ -73,6 +73,7 @@ enum {
> >>>>>>VHOST_NET_FEATURES = VHOST_FEATURES |
> >>>>>> (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
> >>>>>> (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
> >>>>>> +(1ULL << VIRTIO_NET_F_HASH_REPORT) |
> >>>>>> (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
> >>>>>> (1ULL << VIRTIO_F_RING_RESET)
> >>>>>> };
> >>>>>> @@ -1097,9 +1098,11 @@ static void handle_rx(struct vhost_net *net)
> >>>>>>.msg_controllen = 0,
> >>>>>>.msg_flags = MSG_DONTWAIT,
> >>>>>>};
> >>>>>> -   struct virtio_net_hdr hdr = {
> >>>>>> -   .flags = 0,
> >>>>>> -   .gso_type = VIRTIO_NET_HDR_GSO_NONE
> >>>>>> +   struct virtio_net_hdr_v1_hash hdr = {
> >>>>>> +   .hdr = {
> >>>>>> +   .flags = 0,
> >>>>>> +   .gso_type = VIRTIO_NET_HDR_GSO_NONE
> >>>>>> +   }
> >>>>>>};
> >>>>>>size_t total_len = 0;
> >>>>>>int err, mergeable;
> >>>>>> @@ -1110,7 +1113,6 @@ static void handle_rx(struct vhost_net *net)
> >>>>>>bool set_num_buffers;
> >>>>>>struct socket *sock;
> >>>>>>struct iov_iter fixup;
> >>>>>> -   __virtio16 num_buffers;
> >>>>>>int recv_pkts = 0;
> >>>>>>
> >>>>>>mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_RX);
> >>>>>> @@ -1191,30 +1193,30 @@ static void handle_rx(struct vhost_net *net)
> >>>>>>vhost_discard_vq_desc(vq, headcount);
> >>>>>>continue;
> >>>>>>}
> >>>>>> +   hdr.hdr.num_buffers = cpu_to_vhost16(vq, headcount);
> >>>>>>/* Supply virtio_net_hdr if 
> >>>>>> VHOST_NET_F_VIRTIO_NET_HDR */
> >>>>>>if (unlikely(vhost_hlen)) {
> >>>>>> -   if (copy_to_iter(&hdr, sizeof(hdr),
> >>>>>> -&fixup) != sizeof(hdr)) {
> >>>>>> +  

Re: [PATCH net-next v9 1/6] virtio_net: Add functions for hashing

2025-03-16 Thread Jason Wang
On Tue, Mar 11, 2025 at 1:49 PM Akihiko Odaki  wrote:
>
> On 2025/03/11 9:47, Jason Wang wrote:
> > On Mon, Mar 10, 2025 at 2:53 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/03/10 12:55, Jason Wang wrote:
> >>> On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> They are useful to implement VIRTIO_NET_F_RSS and
> >>>> VIRTIO_NET_F_HASH_REPORT.
> >>>>
> >>>> Signed-off-by: Akihiko Odaki 
> >>>> Tested-by: Lei Yang 
> >>>> ---
> >>>>include/linux/virtio_net.h | 188 
> >>>> +
> >>>>1 file changed, 188 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> >>>> index 
> >>>> 02a9f4dc594d02372a6c1850cd600eff9d000d8d..426f33b4b82440d61b2af9fdc4c0b0d4c571b2c5
> >>>>  100644
> >>>> --- a/include/linux/virtio_net.h
> >>>> +++ b/include/linux/virtio_net.h
> >>>> @@ -9,6 +9,194 @@
> >>>>#include 
> >>>>#include 
> >>>>
> >>>> +struct virtio_net_hash {
> >>>> +   u32 value;
> >>>> +   u16 report;
> >>>> +};
> >>>> +
> >>>> +struct virtio_net_toeplitz_state {
> >>>> +   u32 hash;
> >>>> +   const u32 *key;
> >>>> +};
> >>>> +
> >>>> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 
> >>>> | \
> >>>> +VIRTIO_NET_RSS_HASH_TYPE_TCPv4 
> >>>> | \
> >>>> +VIRTIO_NET_RSS_HASH_TYPE_UDPv4 
> >>>> | \
> >>>> +VIRTIO_NET_RSS_HASH_TYPE_IPv6 | 
> >>>> \
> >>>> +VIRTIO_NET_RSS_HASH_TYPE_TCPv6 
> >>>> | \
> >>>> +VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
> >>>
> >>> Let's explain why
> >>>
> >>> #define VIRTIO_NET_HASH_REPORT_IPv6_EX 7
> >>> #define VIRTIO_NET_HASH_REPORT_TCPv6_EX8
> >>> #define VIRTIO_NET_HASH_REPORT_UDPv6_EX9
> >>>
> >>> are missed here.
> >>
> >> Because they require parsing IPv6 options and I'm not sure how many we
> >> need to parse. QEMU's eBPF program has a hard-coded limit of 30 options;
> >> it has some explanation for this limit, but it does not seem definitive
> >> either:
> >> https://gitlab.com/qemu-project/qemu/-/commit/f3fa412de28ae3cb31d38811d30a77e4e20456cc#6ec48fc8af2f802e92f5127425e845c4c213ff60_0_165
> >>
> >
> > How about the usersapce datapath RSS in Qemu? (We probably don't need
> > to align with eBPF RSS as it's just a reference implementation)
>
> The userspace datapath RSS has no limit.
>
> The reference implementation is the userspace datapath. The eBPF program
>   is intended to bring real performance benefit to Windows guests in
> contrary.
>
> The userspace implementation does its best to provide defined RSS
> capabilities but may not be performant. Parsing all IPv6 options have a
> performance implication, but it is fine because it is not intended to be
> performant in the first place.
>
> The performance problem is inherent to the userspace implementation,
> which adds an extra overhead to the datapath. The eBPF program on the
> other hand does not incur such overhead because it replaces the existing
> steering algorithm (automq) instead of adding another layer. Hence the
> eBPF program can be practical.
>
> That said, it is not that important to align with the userspace and eBPF
> RSS in QEMU because they are still experimental anyway; the eBPF RSS has
> potential to become a practical implementation but it is still in
> development. The libvirt integration for the eBPF RSS is still not
> complete, and we occasionally add fixes for RSS and hash reporting
> without backporting to the stable branch.
>
> I'm adding interfaces to negotiate hash types rather for the future
> extensibility. The specification may gain more hash types in the future
> and other vhost backends may have a different set of hash types
> supported. Figuring out how to deal with different sets of supported
> hash typs is essential for both the kernel a

Re: [PATCH net-next v10 03/10] tun: Allow steering eBPF program to fall back

2025-03-16 Thread Jason Wang
On Thu, Mar 13, 2025 at 3:01 PM Akihiko Odaki  wrote:
>
> This clarifies a steering eBPF program takes precedence over the other
> steering algorithms.
>
> Signed-off-by: Akihiko Odaki 
> ---

I think we should elaborate more on the advantages of *not* making
this implicit.

Or what's the advantages of having RSS has a priority than eBPF.

Thanks




Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-17 Thread Jason Wang
On Mon, Mar 17, 2025 at 3:07 PM Akihiko Odaki  wrote:
>
> On 2025/03/17 10:12, Jason Wang wrote:
> > On Wed, Mar 12, 2025 at 1:03 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/03/12 11:35, Jason Wang wrote:
> >>> On Tue, Mar 11, 2025 at 2:11 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> On 2025/03/11 9:38, Jason Wang wrote:
> >>>>> On Mon, Mar 10, 2025 at 3:45 PM Akihiko Odaki 
> >>>>>  wrote:
> >>>>>>
> >>>>>> On 2025/03/10 12:55, Jason Wang wrote:
> >>>>>>> On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki 
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>> Hash reporting
> >>>>>>>> ==
> >>>>>>>>
> >>>>>>>> Allow the guest to reuse the hash value to make receive steering
> >>>>>>>> consistent between the host and guest, and to save hash computation.
> >>>>>>>>
> >>>>>>>> RSS
> >>>>>>>> ===
> >>>>>>>>
> >>>>>>>> RSS is a receive steering algorithm that can be negotiated to use 
> >>>>>>>> with
> >>>>>>>> virtio_net. Conventionally the hash calculation was done by the VMM.
> >>>>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>>>> purpose of RSS.
> >>>>>>>>
> >>>>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>>>> restrictive nature of eBPF steering program.
> >>>>>>>>
> >>>>>>>> Introduce the code to perform RSS to the kernel in order to overcome
> >>>>>>>> thse challenges. An alternative solution is to extend the eBPF 
> >>>>>>>> steering
> >>>>>>>> program so that it will be able to report to the userspace, but I 
> >>>>>>>> didn't
> >>>>>>>> opt for it because extending the current mechanism of eBPF steering
> >>>>>>>> program as is because it relies on legacy context rewriting, and
> >>>>>>>> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >>>>>>>> the other relevant virtualization APIs such as KVM and vhost_net are
> >>>>>>>> UAPIs.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Akihiko Odaki 
> >>>>>>>> Tested-by: Lei Yang 
> >>>>>>>> ---
> >>>>>>>>  Documentation/networking/tuntap.rst |   7 ++
> >>>>>>>>  drivers/net/Kconfig |   1 +
> >>>>>>>>  drivers/net/tap.c   |  68 ++-
> >>>>>>>>  drivers/net/tun.c   |  98 +-
> >>>>>>>>  drivers/net/tun_vnet.h  | 159 
> >>>>>>>> ++--
> >>>>>>>>  include/linux/if_tap.h  |   2 +
> >>>>>>>>  include/linux/skbuff.h  |   3 +
> >>>>>>>>  include/uapi/linux/if_tun.h |  75 +
> >>>>>>>>  net/core/skbuff.c   |   4 +
> >>>>>>>>  9 files changed, 386 insertions(+), 31 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/Documentation/networking/tuntap.rst 
> >>>>>>>> b/Documentation/networking/tuntap.rst
> >>>>>>>> index 
> >>>>>>>> 4d7087f727be5e37dfbf5066a9e9c872cc98898d..86b4ae8caa8ad062c1e558920be42ce0d4217465
> >>>>>>>>  100644
> >>>>>>>> --- a/Documentation/networking/tuntap.rst
> >>>>>>>> +++ b/Documentation/networking/tuntap.rst
> >>>>>>>> @@ -206,6 +206,13 @@ enable is true we enable it, otherwise we 
> >>>>>>>> disable it::
> >>>>>>>>return ioctl(fd, TUNSETQUEUE, (void *)&ifr)

Re: [PATCH net-next v11 00/10] tun: Introduce virtio-net hashing feature

2025-03-17 Thread Jason Wang
On Mon, Mar 17, 2025 at 6:58 PM Akihiko Odaki  wrote:
>
> virtio-net have two usage of hashes: one is RSS and another is hash
> reporting. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF.
>
> Introduce the code to compute hashes to the kernel in order to overcome
> thse challenges.
>
> An alternative solution is to extend the eBPF steering program so that it
> will be able to report to the userspace, but it is based on context
> rewrites, which is in feature freeze. We can adopt kfuncs, but they will
> not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
> and vhost_net).
>
> The patches for QEMU to use this new feature was submitted as RFC and
> is available at:
> https://patchew.org/QEMU/20250313-hash-v4-0-c75c494b4...@daynix.com/
>
> This work was presented at LPC 2024:
> https://lpc.events/event/18/contributions/1963/
>
> V1 -> V2:
>   Changed to introduce a new BPF program type.
>
> Signed-off-by: Akihiko Odaki 
> ---
> Changes in v11:
> - Added the missing code to free vnet_hash in patch
>   "tap: Introduce virtio-net hash feature".
> - Link to v10: 
> https://lore.kernel.org/r/20250313-rss-v10-0-3185d73a9...@daynix.com
>

We only have 2 or 3 points that need to be sorted out. Let's hold on
to the iteration until we had an agreement.

Thanks




Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-18 Thread Jason Wang
On Tue, Mar 18, 2025 at 6:10 PM Akihiko Odaki  wrote:
>
> On 2025/03/18 9:15, Jason Wang wrote:
> > On Mon, Mar 17, 2025 at 3:07 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/03/17 10:12, Jason Wang wrote:
> >>> On Wed, Mar 12, 2025 at 1:03 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> On 2025/03/12 11:35, Jason Wang wrote:
> >>>>> On Tue, Mar 11, 2025 at 2:11 PM Akihiko Odaki 
> >>>>>  wrote:
> >>>>>>
> >>>>>> On 2025/03/11 9:38, Jason Wang wrote:
> >>>>>>> On Mon, Mar 10, 2025 at 3:45 PM Akihiko Odaki 
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>> On 2025/03/10 12:55, Jason Wang wrote:
> >>>>>>>>> On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki 
> >>>>>>>>>  wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hash reporting
> >>>>>>>>>> ==
> >>>>>>>>>>
> >>>>>>>>>> Allow the guest to reuse the hash value to make receive steering
> >>>>>>>>>> consistent between the host and guest, and to save hash 
> >>>>>>>>>> computation.
> >>>>>>>>>>
> >>>>>>>>>> RSS
> >>>>>>>>>> ===
> >>>>>>>>>>
> >>>>>>>>>> RSS is a receive steering algorithm that can be negotiated to use 
> >>>>>>>>>> with
> >>>>>>>>>> virtio_net. Conventionally the hash calculation was done by the 
> >>>>>>>>>> VMM.
> >>>>>>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>>>>>> purpose of RSS.
> >>>>>>>>>>
> >>>>>>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>>>>>> restrictive nature of eBPF steering program.
> >>>>>>>>>>
> >>>>>>>>>> Introduce the code to perform RSS to the kernel in order to 
> >>>>>>>>>> overcome
> >>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF 
> >>>>>>>>>> steering
> >>>>>>>>>> program so that it will be able to report to the userspace, but I 
> >>>>>>>>>> didn't
> >>>>>>>>>> opt for it because extending the current mechanism of eBPF steering
> >>>>>>>>>> program as is because it relies on legacy context rewriting, and
> >>>>>>>>>> introducing kfunc-based eBPF will result in non-UAPI dependency 
> >>>>>>>>>> while
> >>>>>>>>>> the other relevant virtualization APIs such as KVM and vhost_net 
> >>>>>>>>>> are
> >>>>>>>>>> UAPIs.
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Akihiko Odaki 
> >>>>>>>>>> Tested-by: Lei Yang 
> >>>>>>>>>> ---
> >>>>>>>>>>   Documentation/networking/tuntap.rst |   7 ++
> >>>>>>>>>>   drivers/net/Kconfig |   1 +
> >>>>>>>>>>   drivers/net/tap.c   |  68 ++-
> >>>>>>>>>>   drivers/net/tun.c   |  98 
> >>>>>>>>>> +-
> >>>>>>>>>>   drivers/net/tun_vnet.h  | 159 
> >>>>>>>>>> ++--
> >>>>>>>>>>   include/linux/if_tap.h  |   2 +
> >>>>>>>>>>   include/linux/skbuff.h  |   3 +
> >>>>>>>>>>   include/uapi/linux/if_tun.h |  75 +
> >>>>>>>>>>   net/core/skbuff.c   |   4 +
> >>>>&g

Re: [PATCH net-next v7 3/6] tun: Introduce virtio-net hash feature

2025-03-03 Thread Jason Wang
On Fri, Feb 28, 2025 at 3:59 PM Akihiko Odaki  wrote:
>
> Hash reporting
> --
>
> Allow the guest to reuse the hash value to make receive steering
> consistent between the host and guest, and to save hash computation.
>
> RSS
> ---
>
> RSS is a receive steering algorithm that can be negotiated to use with
> virtio_net. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF steering program.
>
> Introduce the code to perform RSS to the kernel in order to overcome
> thse challenges. An alternative solution is to extend the eBPF steering
> program so that it will be able to report to the userspace, but I didn't
> opt for it because extending the current mechanism of eBPF steering
> program as is because it relies on legacy context rewriting, and
> introducing kfunc-based eBPF will result in non-UAPI dependency while
> the other relevant virtualization APIs such as KVM and vhost_net are
> UAPIs.
>
> Signed-off-by: Akihiko Odaki 
> ---
>  Documentation/networking/tuntap.rst |   7 ++
>  drivers/net/Kconfig |   1 +
>  drivers/net/tap.c   |  62 -
>  drivers/net/tun.c   |  89 ++
>  drivers/net/tun_vnet.h  | 180 
> +---
>  include/linux/if_tap.h  |   2 +
>  include/linux/skbuff.h  |   3 +
>  include/uapi/linux/if_tun.h |  75 +++
>  net/core/skbuff.c   |   4 +
>  9 files changed, 390 insertions(+), 33 deletions(-)
>
> diff --git a/Documentation/networking/tuntap.rst 
> b/Documentation/networking/tuntap.rst
> index 
> 4d7087f727be5e37dfbf5066a9e9c872cc98898d..86b4ae8caa8ad062c1e558920be42ce0d4217465
>  100644
> --- a/Documentation/networking/tuntap.rst
> +++ b/Documentation/networking/tuntap.rst
> @@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
>return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
>}
>
> +3.4 Reference
> +-
> +
> +``linux/if_tun.h`` defines the interface described below:
> +
> +.. kernel-doc:: include/uapi/linux/if_tun.h
> +
>  Universal TUN/TAP device driver Frequently Asked Question
>  =
>
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 
> 1fd5acdc73c6af0e1a861867039c3624fc618e25..aecfd244dd83585fea2c5b815dcd787c58166c28
>  100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -395,6 +395,7 @@ config TUN
> tristate "Universal TUN/TAP device driver support"
> depends on INET
> select CRC32
> +   select SKB_EXTENSIONS
> help
>   TUN/TAP provides packet reception and transmission for user space
>   programs.  It can be viewed as a simple Point-to-Point or Ethernet
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index 
> d4ece538f1b23789ca60caa6232690e4d0a4d14a..06c6df153a30a4acbcaff8619b0b4229eb638664
>  100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -49,6 +49,10 @@ struct major_info {
> struct list_head next;
>  };
>
> +struct tap_skb_cb {
> +   struct virtio_net_hash hash;
> +};
> +
>  #define GOODCOPY_LEN 128
>
>  static const struct proto_ops tap_socket_ops;
> @@ -179,6 +183,22 @@ static void tap_put_queue(struct tap_queue *q)
> sock_put(&q->sk);
>  }
>
> +static struct tap_skb_cb *tap_skb_cb(const struct sk_buff *skb)
> +{
> +   BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct tap_skb_cb));
> +   return (struct tap_skb_cb *)skb->cb;
> +}
> +
> +static struct virtio_net_hash *tap_add_hash(struct sk_buff *skb)
> +{
> +   return &tap_skb_cb(skb)->hash;
> +}
> +
> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
> +{
> +   return &tap_skb_cb(skb)->hash;
> +}
> +
>  /*
>   * Select a queue based on the rxq of the device on which this packet
>   * arrived. If the incoming device is not mq, calculate a flow hash
> @@ -189,6 +209,7 @@ static void tap_put_queue(struct tap_queue *q)
>  static struct tap_queue *tap_get_queue(struct tap_dev *tap,
>struct sk_buff *skb)
>  {
> +   struct flow_keys_basic keys_basic;
> struct tap_queue *queue = NULL;
> /* Access to taps array is protected by rcu, but access to numvtaps
>  * isn't. Below we use it to lookup a queue, but treat it as a hint
> @@ -196,17 +217,43 @@ static struct tap_queue *tap_get_queue(struct tap_dev 
> *tap,
>  * racing against queue removal.
>  */
> int numvtaps = READ_ONCE(tap->numvtaps);
> +   struct tun_vnet_hash_container *vnet_hash = 
> rcu_dereference(tap->vnet_hash);
> __u32 rxq;
>
> +   *tap_skb_cb(skb) = (struct tap_skb_cb) {
>

Re: [PATCH net-next v9 6/6] vhost/net: Support VIRTIO_NET_F_HASH_REPORT

2025-03-10 Thread Jason Wang
On Mon, Mar 10, 2025 at 3:04 PM Akihiko Odaki  wrote:
>
> On 2025/03/10 13:43, Jason Wang wrote:
> > On Fri, Mar 7, 2025 at 7:02 PM Akihiko Odaki  
> > wrote:
> >>
> >> VIRTIO_NET_F_HASH_REPORT allows to report hash values calculated on the
> >> host. When VHOST_NET_F_VIRTIO_NET_HDR is employed, it will report no
> >> hash values (i.e., the hash_report member is always set to
> >> VIRTIO_NET_HASH_REPORT_NONE). Otherwise, the values reported by the
> >> underlying socket will be reported.
> >>
> >> VIRTIO_NET_F_HASH_REPORT requires VIRTIO_F_VERSION_1.
> >>
> >> Signed-off-by: Akihiko Odaki 
> >> Tested-by: Lei Yang 
> >> ---
> >>   drivers/vhost/net.c | 49 
> >> +
> >>   1 file changed, 29 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >> index 
> >> b9b9e9d40951856d881d77ac74331d914473cd56..16b241b44f89820a42c302f3586ea6bb5e0d4289
> >>  100644
> >> --- a/drivers/vhost/net.c
> >> +++ b/drivers/vhost/net.c
> >> @@ -73,6 +73,7 @@ enum {
> >>  VHOST_NET_FEATURES = VHOST_FEATURES |
> >>   (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
> >>   (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
> >> +(1ULL << VIRTIO_NET_F_HASH_REPORT) |
> >>   (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
> >>   (1ULL << VIRTIO_F_RING_RESET)
> >>   };
> >> @@ -1097,9 +1098,11 @@ static void handle_rx(struct vhost_net *net)
> >>  .msg_controllen = 0,
> >>  .msg_flags = MSG_DONTWAIT,
> >>  };
> >> -   struct virtio_net_hdr hdr = {
> >> -   .flags = 0,
> >> -   .gso_type = VIRTIO_NET_HDR_GSO_NONE
> >> +   struct virtio_net_hdr_v1_hash hdr = {
> >> +   .hdr = {
> >> +   .flags = 0,
> >> +   .gso_type = VIRTIO_NET_HDR_GSO_NONE
> >> +   }
> >>  };
> >>  size_t total_len = 0;
> >>  int err, mergeable;
> >> @@ -1110,7 +1113,6 @@ static void handle_rx(struct vhost_net *net)
> >>  bool set_num_buffers;
> >>  struct socket *sock;
> >>  struct iov_iter fixup;
> >> -   __virtio16 num_buffers;
> >>  int recv_pkts = 0;
> >>
> >>  mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_RX);
> >> @@ -1191,30 +1193,30 @@ static void handle_rx(struct vhost_net *net)
> >>  vhost_discard_vq_desc(vq, headcount);
> >>  continue;
> >>  }
> >> +   hdr.hdr.num_buffers = cpu_to_vhost16(vq, headcount);
> >>  /* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
> >>  if (unlikely(vhost_hlen)) {
> >> -   if (copy_to_iter(&hdr, sizeof(hdr),
> >> -&fixup) != sizeof(hdr)) {
> >> +   if (copy_to_iter(&hdr, vhost_hlen,
> >> +&fixup) != vhost_hlen) {
> >>  vq_err(vq, "Unable to write vnet_hdr "
> >> "at addr %p\n", vq->iov->iov_base);
> >>  goto out;
> >
> > Is this an "issue" specific to RSS/HASH? If it's not, we need a separate 
> > patch.
> >
> > Honestly, I'm not sure if it's too late to fix this.
>
> There is nothing wrong with the current implementation.

Note that I meant the vhost_hlen part, and the current code is tricky.

The comment said:

"""
/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
"""

So it tries to only offer virtio_net_hdr even if vhost_hlen is the set
to mrg_rxbuf len.

And this patch changes this behaviour.

Thanks

> The current
> implementation fills the header with zero except num_buffers, which it
> fills some real value. This functionality is working fine with
> VIRTIO_NET_F_MRG_RXBUF and VIRTIO_F_VERSION_1, which change the header size.
>
> Now I'm adding VIRTIO_NET_F_HASH_REPORT and it adds the hash_report
> field, which also needs to be initialized with zero, so I'm making sure
> vhost_net will also initialize it.
>
> Regards,
> Akihiko Odaki
>
> >
> > Others look fine.
> >
> > Thanks
> >
>




Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-10 Thread Jason Wang
On Mon, Mar 10, 2025 at 3:45 PM Akihiko Odaki  wrote:
>
> On 2025/03/10 12:55, Jason Wang wrote:
> > On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki  
> > wrote:
> >>
> >> Hash reporting
> >> ==
> >>
> >> Allow the guest to reuse the hash value to make receive steering
> >> consistent between the host and guest, and to save hash computation.
> >>
> >> RSS
> >> ===
> >>
> >> RSS is a receive steering algorithm that can be negotiated to use with
> >> virtio_net. Conventionally the hash calculation was done by the VMM.
> >> However, computing the hash after the queue was chosen defeats the
> >> purpose of RSS.
> >>
> >> Another approach is to use eBPF steering program. This approach has
> >> another downside: it cannot report the calculated hash due to the
> >> restrictive nature of eBPF steering program.
> >>
> >> Introduce the code to perform RSS to the kernel in order to overcome
> >> thse challenges. An alternative solution is to extend the eBPF steering
> >> program so that it will be able to report to the userspace, but I didn't
> >> opt for it because extending the current mechanism of eBPF steering
> >> program as is because it relies on legacy context rewriting, and
> >> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >> the other relevant virtualization APIs such as KVM and vhost_net are
> >> UAPIs.
> >>
> >> Signed-off-by: Akihiko Odaki 
> >> Tested-by: Lei Yang 
> >> ---
> >>   Documentation/networking/tuntap.rst |   7 ++
> >>   drivers/net/Kconfig |   1 +
> >>   drivers/net/tap.c   |  68 ++-
> >>   drivers/net/tun.c   |  98 +-
> >>   drivers/net/tun_vnet.h  | 159 
> >> ++--
> >>   include/linux/if_tap.h  |   2 +
> >>   include/linux/skbuff.h  |   3 +
> >>   include/uapi/linux/if_tun.h |  75 +
> >>   net/core/skbuff.c   |   4 +
> >>   9 files changed, 386 insertions(+), 31 deletions(-)
> >>
> >> diff --git a/Documentation/networking/tuntap.rst 
> >> b/Documentation/networking/tuntap.rst
> >> index 
> >> 4d7087f727be5e37dfbf5066a9e9c872cc98898d..86b4ae8caa8ad062c1e558920be42ce0d4217465
> >>  100644
> >> --- a/Documentation/networking/tuntap.rst
> >> +++ b/Documentation/networking/tuntap.rst
> >> @@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
> >> return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
> >> }
> >>
> >> +3.4 Reference
> >> +-
> >> +
> >> +``linux/if_tun.h`` defines the interface described below:
> >> +
> >> +.. kernel-doc:: include/uapi/linux/if_tun.h
> >> +
> >>   Universal TUN/TAP device driver Frequently Asked Question
> >>   =
> >>
> >> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> >> index 
> >> 1fd5acdc73c6af0e1a861867039c3624fc618e25..aecfd244dd83585fea2c5b815dcd787c58166c28
> >>  100644
> >> --- a/drivers/net/Kconfig
> >> +++ b/drivers/net/Kconfig
> >> @@ -395,6 +395,7 @@ config TUN
> >>  tristate "Universal TUN/TAP device driver support"
> >>  depends on INET
> >>  select CRC32
> >> +   select SKB_EXTENSIONS
> >>  help
> >>TUN/TAP provides packet reception and transmission for user 
> >> space
> >>programs.  It can be viewed as a simple Point-to-Point or 
> >> Ethernet
> >> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >> index 
> >> d4ece538f1b23789ca60caa6232690e4d0a4d14a..9428b63ec27e7f92e78a78afcb5e24383862c00d
> >>  100644
> >> --- a/drivers/net/tap.c
> >> +++ b/drivers/net/tap.c
> >> @@ -49,6 +49,10 @@ struct major_info {
> >>  struct list_head next;
> >>   };
> >>
> >> +struct tap_skb_cb {
> >> +   struct virtio_net_hash hash;
> >> +};
> >> +
> >>   #define GOODCOPY_LEN 128
> >>
> >>   static const struct proto_ops tap_socket_ops;
> >> @@ -179,6 +183,22 @@ static void tap_put_queue(struct tap_queue *q)
> >>  sock_put(&q->sk);
> >&g

Re: [PATCH net-next v9 1/6] virtio_net: Add functions for hashing

2025-03-10 Thread Jason Wang
On Mon, Mar 10, 2025 at 2:53 PM Akihiko Odaki  wrote:
>
> On 2025/03/10 12:55, Jason Wang wrote:
> > On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki  
> > wrote:
> >>
> >> They are useful to implement VIRTIO_NET_F_RSS and
> >> VIRTIO_NET_F_HASH_REPORT.
> >>
> >> Signed-off-by: Akihiko Odaki 
> >> Tested-by: Lei Yang 
> >> ---
> >>   include/linux/virtio_net.h | 188 
> >> +
> >>   1 file changed, 188 insertions(+)
> >>
> >> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> >> index 
> >> 02a9f4dc594d02372a6c1850cd600eff9d000d8d..426f33b4b82440d61b2af9fdc4c0b0d4c571b2c5
> >>  100644
> >> --- a/include/linux/virtio_net.h
> >> +++ b/include/linux/virtio_net.h
> >> @@ -9,6 +9,194 @@
> >>   #include 
> >>   #include 
> >>
> >> +struct virtio_net_hash {
> >> +   u32 value;
> >> +   u16 report;
> >> +};
> >> +
> >> +struct virtio_net_toeplitz_state {
> >> +   u32 hash;
> >> +   const u32 *key;
> >> +};
> >> +
> >> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
> >> +VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
> >> +VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
> >> +VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
> >> +VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
> >> +VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
> >
> > Let's explain why
> >
> > #define VIRTIO_NET_HASH_REPORT_IPv6_EX 7
> > #define VIRTIO_NET_HASH_REPORT_TCPv6_EX8
> > #define VIRTIO_NET_HASH_REPORT_UDPv6_EX9
> >
> > are missed here.
>
> Because they require parsing IPv6 options and I'm not sure how many we
> need to parse. QEMU's eBPF program has a hard-coded limit of 30 options;
> it has some explanation for this limit, but it does not seem definitive
> either:
> https://gitlab.com/qemu-project/qemu/-/commit/f3fa412de28ae3cb31d38811d30a77e4e20456cc#6ec48fc8af2f802e92f5127425e845c4c213ff60_0_165
>

How about the usersapce datapath RSS in Qemu? (We probably don't need
to align with eBPF RSS as it's just a reference implementation)

> In this patch series, I add an ioctl to query capability instead; it
> allows me leaving those hash types unimplemented and is crucial to
> assure extensibility for future additions of hash types anyway. Anyone
> who find these hash types useful can implement in the future.

Yes, but we need to make sure no userspace visible behaviour changes
after migration.

>
> >
> > And explain how we could maintain migration compatibility
> >
> > 1) Does those three work for userspace datapath in Qemu? If yes,
> > migration will be broken.
>
> They work for userspace datapath so my RFC patch series for QEMU uses
> TUNGETVNETHASHCAP to prevent breaking migration:
> https://patchew.org/QEMU/20240915-hash-v3-0-79cb08d28...@daynix.com/
>

Ok, let's mention this in the cover letter. Another interesting thing
is the migration from 10.0 to 9.0.

> This patch series first adds configuration options for users to choose
> hash types. QEMU then automatically picks one implementation from the
> following (the earlier one is the more preferred):
> 1) The hash capability of vhost hardware
> 2) The hash capability I'm proposing here
> 3) The eBPF program
> 4) The pure userspace implementation
>
> This decision depends on the following:
> - The required hash types; supported ones are queried for 1) and 2)
> - Whether vhost is enabled or not and what vhost backend is used
> - Whether hash reporting is enabled; 3) is incompatible with this
>
> The network device will not be realized if no implementation satisfies
> the requirements.

This makes sense, let's add this in the cover letter.

>
> > 2) once we support those three in the future. For example, is the qemu
> > expected to probe this via TUNGETVNETHASHCAP in the destination and
> > fail the migration?
>
> QEMU is expected to use TUNGETVNETHASHCAP, but it can selectively enable
> hash types with TUNSETVNETHASH to keep migration working.
>
> In summary, this patch series provides a sufficient facility for the
> userspace to make extensibility and migration compatible;
> TUNGETVNETHASHCAP exposes all of the kernel capabilities and
> TUNSETVNETHASH allows the userspace to limit them.
>
> Regards,
> Akihiko Odaki

Fine.

Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-10 Thread Jason Wang
On Mon, Mar 10, 2025 at 3:59 PM Akihiko Odaki  wrote:
>
> On 2025/03/10 12:55, Jason Wang wrote:
> > On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki  
> > wrote:
> >>
> >> Hash reporting
> >> ==
> >>
> >> Allow the guest to reuse the hash value to make receive steering
> >> consistent between the host and guest, and to save hash computation.
> >>
> >> RSS
> >> ===
> >>
> >> RSS is a receive steering algorithm that can be negotiated to use with
> >> virtio_net. Conventionally the hash calculation was done by the VMM.
> >> However, computing the hash after the queue was chosen defeats the
> >> purpose of RSS.
> >>
> >> Another approach is to use eBPF steering program. This approach has
> >> another downside: it cannot report the calculated hash due to the
> >> restrictive nature of eBPF steering program.
> >>
> >> Introduce the code to perform RSS to the kernel in order to overcome
> >> thse challenges. An alternative solution is to extend the eBPF steering
> >> program so that it will be able to report to the userspace, but I didn't
> >> opt for it because extending the current mechanism of eBPF steering
> >> program as is because it relies on legacy context rewriting, and
> >> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >> the other relevant virtualization APIs such as KVM and vhost_net are
> >> UAPIs.
> >>
> >> Signed-off-by: Akihiko Odaki 
> >> Tested-by: Lei Yang 
> >> ---
> >>   Documentation/networking/tuntap.rst |   7 ++
> >>   drivers/net/Kconfig |   1 +
> >>   drivers/net/tap.c   |  68 ++-
> >>   drivers/net/tun.c   |  98 +-
> >>   drivers/net/tun_vnet.h  | 159 
> >> ++--
> >>   include/linux/if_tap.h  |   2 +
> >>   include/linux/skbuff.h  |   3 +
> >>   include/uapi/linux/if_tun.h |  75 +
> >>   net/core/skbuff.c   |   4 +
> >>   9 files changed, 386 insertions(+), 31 deletions(-)
> >>
> >> diff --git a/Documentation/networking/tuntap.rst 
> >> b/Documentation/networking/tuntap.rst
> >> index 
> >> 4d7087f727be5e37dfbf5066a9e9c872cc98898d..86b4ae8caa8ad062c1e558920be42ce0d4217465
> >>  100644
> >> --- a/Documentation/networking/tuntap.rst
> >> +++ b/Documentation/networking/tuntap.rst
> >> @@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
> >> return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
> >> }
> >>

[...]

> >>
> >> diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
> >> index 
> >> 58b9ac7a5fc4084c789fe94fe36b5f8631bf1fa4..8e7d51fb0b4742cef56e7c5ad778b156cc654bed
> >>  100644
> >> --- a/drivers/net/tun_vnet.h
> >> +++ b/drivers/net/tun_vnet.h
> >> @@ -6,6 +6,16 @@
> >>   #define TUN_VNET_LE 0x8000
> >>   #define TUN_VNET_BE 0x4000
> >>
> >> +typedef struct virtio_net_hash *(*tun_vnet_hash_add)(struct sk_buff *);
> >> +typedef const struct virtio_net_hash *(*tun_vnet_hash_find)(const struct 
> >> sk_buff *);
> >> +
> >> +struct tun_vnet_hash_container {
> >> +   struct tun_vnet_hash common;
> >
> > I'd rename this as hash.
>
> Everything in this structure is about hash. "common" represents its
> feature well.
>
> I see a few alternative options though I don't prefer them either; they
> make the code verbose and I don't think they are worthwhile:
> 1. Rename tun_vnet_hash to tun_vnet_hash_common.
> 2. Prefix the other fields with "hash_" for consistency.

Or use different structures, one for hash_report another is for rss.

>
> >
> >> +   struct tun_vnet_hash_rss rss;
> >> +   u32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
> >> +   u16 rss_indirection_table[];
> >> +};
> >
> > Besides the separated ioctl, I'd split this structure into rss and
> > hash part as well.

Like this.

Thanks




Re: [PATCH net-next v9 6/6] vhost/net: Support VIRTIO_NET_F_HASH_REPORT

2025-03-09 Thread Jason Wang
On Fri, Mar 7, 2025 at 7:02 PM Akihiko Odaki  wrote:
>
> VIRTIO_NET_F_HASH_REPORT allows to report hash values calculated on the
> host. When VHOST_NET_F_VIRTIO_NET_HDR is employed, it will report no
> hash values (i.e., the hash_report member is always set to
> VIRTIO_NET_HASH_REPORT_NONE). Otherwise, the values reported by the
> underlying socket will be reported.
>
> VIRTIO_NET_F_HASH_REPORT requires VIRTIO_F_VERSION_1.
>
> Signed-off-by: Akihiko Odaki 
> Tested-by: Lei Yang 
> ---
>  drivers/vhost/net.c | 49 +
>  1 file changed, 29 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 
> b9b9e9d40951856d881d77ac74331d914473cd56..16b241b44f89820a42c302f3586ea6bb5e0d4289
>  100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -73,6 +73,7 @@ enum {
> VHOST_NET_FEATURES = VHOST_FEATURES |
>  (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
>  (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
> +(1ULL << VIRTIO_NET_F_HASH_REPORT) |
>  (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
>  (1ULL << VIRTIO_F_RING_RESET)
>  };
> @@ -1097,9 +1098,11 @@ static void handle_rx(struct vhost_net *net)
> .msg_controllen = 0,
> .msg_flags = MSG_DONTWAIT,
> };
> -   struct virtio_net_hdr hdr = {
> -   .flags = 0,
> -   .gso_type = VIRTIO_NET_HDR_GSO_NONE
> +   struct virtio_net_hdr_v1_hash hdr = {
> +   .hdr = {
> +   .flags = 0,
> +   .gso_type = VIRTIO_NET_HDR_GSO_NONE
> +   }
> };
> size_t total_len = 0;
> int err, mergeable;
> @@ -1110,7 +1113,6 @@ static void handle_rx(struct vhost_net *net)
> bool set_num_buffers;
> struct socket *sock;
> struct iov_iter fixup;
> -   __virtio16 num_buffers;
> int recv_pkts = 0;
>
> mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_RX);
> @@ -1191,30 +1193,30 @@ static void handle_rx(struct vhost_net *net)
> vhost_discard_vq_desc(vq, headcount);
> continue;
> }
> +   hdr.hdr.num_buffers = cpu_to_vhost16(vq, headcount);
> /* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
> if (unlikely(vhost_hlen)) {
> -   if (copy_to_iter(&hdr, sizeof(hdr),
> -&fixup) != sizeof(hdr)) {
> +   if (copy_to_iter(&hdr, vhost_hlen,
> +&fixup) != vhost_hlen) {
> vq_err(vq, "Unable to write vnet_hdr "
>"at addr %p\n", vq->iov->iov_base);
> goto out;

Is this an "issue" specific to RSS/HASH? If it's not, we need a separate patch.

Honestly, I'm not sure if it's too late to fix this.

Others look fine.

Thanks




Re: [PATCH net-next v9 1/6] virtio_net: Add functions for hashing

2025-03-09 Thread Jason Wang
On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki  wrote:
>
> They are useful to implement VIRTIO_NET_F_RSS and
> VIRTIO_NET_F_HASH_REPORT.
>
> Signed-off-by: Akihiko Odaki 
> Tested-by: Lei Yang 
> ---
>  include/linux/virtio_net.h | 188 
> +
>  1 file changed, 188 insertions(+)
>
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index 
> 02a9f4dc594d02372a6c1850cd600eff9d000d8d..426f33b4b82440d61b2af9fdc4c0b0d4c571b2c5
>  100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -9,6 +9,194 @@
>  #include 
>  #include 
>
> +struct virtio_net_hash {
> +   u32 value;
> +   u16 report;
> +};
> +
> +struct virtio_net_toeplitz_state {
> +   u32 hash;
> +   const u32 *key;
> +};
> +
> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
> +VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
> +VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
> +VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
> +VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
> +VIRTIO_NET_RSS_HASH_TYPE_UDPv6)

Let's explain why

#define VIRTIO_NET_HASH_REPORT_IPv6_EX 7
#define VIRTIO_NET_HASH_REPORT_TCPv6_EX8
#define VIRTIO_NET_HASH_REPORT_UDPv6_EX9

are missed here.

And explain how we could maintain migration compatibility

1) Does those three work for userspace datapath in Qemu? If yes,
migration will be broken.
2) once we support those three in the future. For example, is the qemu
expected to probe this via TUNGETVNETHASHCAP in the destination and
fail the migration?

Thanks



> +
> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
> +
> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
> +{
> +   while (len >= sizeof(*input)) {
> +   *input = be32_to_cpu((__force __be32)*input);
> +   input++;
> +   len -= sizeof(*input);
> +   }
> +}
> +
> +static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state 
> *state,
> +   const __be32 *input, size_t len)
> +{
> +   while (len >= sizeof(*input)) {
> +   for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
> +   u32 i = ffs(map);
> +
> +   state->hash ^= state->key[0] << (32 - i) |
> +  (u32)((u64)state->key[1] >> i);
> +   }
> +
> +   state->key++;
> +   input++;
> +   len -= sizeof(*input);
> +   }
> +}
> +
> +static inline u8 virtio_net_hash_key_length(u32 types)
> +{
> +   size_t len = 0;
> +
> +   if (types & VIRTIO_NET_HASH_REPORT_IPv4)
> +   len = max(len,
> + sizeof(struct flow_dissector_key_ipv4_addrs));
> +
> +   if (types &
> +   (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
> +   len = max(len,
> + sizeof(struct flow_dissector_key_ipv4_addrs) +
> + sizeof(struct flow_dissector_key_ports));
> +
> +   if (types & VIRTIO_NET_HASH_REPORT_IPv6)
> +   len = max(len,
> + sizeof(struct flow_dissector_key_ipv6_addrs));
> +
> +   if (types &
> +   (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
> +   len = max(len,
> + sizeof(struct flow_dissector_key_ipv6_addrs) +
> + sizeof(struct flow_dissector_key_ports));
> +
> +   return len + sizeof(u32);
> +}
> +
> +static inline u32 virtio_net_hash_report(u32 types,
> +const struct flow_keys_basic *keys)
> +{
> +   switch (keys->basic.n_proto) {
> +   case cpu_to_be16(ETH_P_IP):
> +   if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> +   if (keys->basic.ip_proto == IPPROTO_TCP &&
> +   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
> +   return VIRTIO_NET_HASH_REPORT_TCPv4;
> +
> +   if (keys->basic.ip_proto == IPPROTO_UDP &&
> +   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
> +   return VIRTIO_NET_HASH_REPORT_UDPv4;
> +   }
> +
> +   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
> +   return VIRTIO_NET_HASH_REPORT_IPv4;
> +
> +   return VIRTIO_NET_HASH_REPORT_NONE;
> +
> +   case cpu_to_be16(ETH_P_IPV6):
> +   if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> +   if (keys->basic.ip_proto == IPPROTO_TCP &&
> +   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
> +   return VIRTIO_NET_HASH_REPORT_TCPv6;
> +
>

Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-09 Thread Jason Wang
On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki  wrote:
>
> Hash reporting
> ==
>
> Allow the guest to reuse the hash value to make receive steering
> consistent between the host and guest, and to save hash computation.
>
> RSS
> ===
>
> RSS is a receive steering algorithm that can be negotiated to use with
> virtio_net. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF steering program.
>
> Introduce the code to perform RSS to the kernel in order to overcome
> thse challenges. An alternative solution is to extend the eBPF steering
> program so that it will be able to report to the userspace, but I didn't
> opt for it because extending the current mechanism of eBPF steering
> program as is because it relies on legacy context rewriting, and
> introducing kfunc-based eBPF will result in non-UAPI dependency while
> the other relevant virtualization APIs such as KVM and vhost_net are
> UAPIs.
>
> Signed-off-by: Akihiko Odaki 
> Tested-by: Lei Yang 
> ---
>  Documentation/networking/tuntap.rst |   7 ++
>  drivers/net/Kconfig |   1 +
>  drivers/net/tap.c   |  68 ++-
>  drivers/net/tun.c   |  98 +-
>  drivers/net/tun_vnet.h  | 159 
> ++--
>  include/linux/if_tap.h  |   2 +
>  include/linux/skbuff.h  |   3 +
>  include/uapi/linux/if_tun.h |  75 +
>  net/core/skbuff.c   |   4 +
>  9 files changed, 386 insertions(+), 31 deletions(-)
>
> diff --git a/Documentation/networking/tuntap.rst 
> b/Documentation/networking/tuntap.rst
> index 
> 4d7087f727be5e37dfbf5066a9e9c872cc98898d..86b4ae8caa8ad062c1e558920be42ce0d4217465
>  100644
> --- a/Documentation/networking/tuntap.rst
> +++ b/Documentation/networking/tuntap.rst
> @@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
>return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
>}
>
> +3.4 Reference
> +-
> +
> +``linux/if_tun.h`` defines the interface described below:
> +
> +.. kernel-doc:: include/uapi/linux/if_tun.h
> +
>  Universal TUN/TAP device driver Frequently Asked Question
>  =
>
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 
> 1fd5acdc73c6af0e1a861867039c3624fc618e25..aecfd244dd83585fea2c5b815dcd787c58166c28
>  100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -395,6 +395,7 @@ config TUN
> tristate "Universal TUN/TAP device driver support"
> depends on INET
> select CRC32
> +   select SKB_EXTENSIONS
> help
>   TUN/TAP provides packet reception and transmission for user space
>   programs.  It can be viewed as a simple Point-to-Point or Ethernet
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index 
> d4ece538f1b23789ca60caa6232690e4d0a4d14a..9428b63ec27e7f92e78a78afcb5e24383862c00d
>  100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -49,6 +49,10 @@ struct major_info {
> struct list_head next;
>  };
>
> +struct tap_skb_cb {
> +   struct virtio_net_hash hash;
> +};
> +
>  #define GOODCOPY_LEN 128
>
>  static const struct proto_ops tap_socket_ops;
> @@ -179,6 +183,22 @@ static void tap_put_queue(struct tap_queue *q)
> sock_put(&q->sk);
>  }
>
> +static struct tap_skb_cb *tap_skb_cb(const struct sk_buff *skb)
> +{
> +   BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct tap_skb_cb));
> +   return (struct tap_skb_cb *)skb->cb;
> +}
> +
> +static struct virtio_net_hash *tap_add_hash(struct sk_buff *skb)
> +{
> +   return &tap_skb_cb(skb)->hash;
> +}
> +
> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
> +{
> +   return &tap_skb_cb(skb)->hash;
> +}
> +
>  /*
>   * Select a queue based on the rxq of the device on which this packet
>   * arrived. If the incoming device is not mq, calculate a flow hash
> @@ -189,6 +209,7 @@ static void tap_put_queue(struct tap_queue *q)
>  static struct tap_queue *tap_get_queue(struct tap_dev *tap,
>struct sk_buff *skb)
>  {
> +   struct flow_keys_basic keys_basic;
> struct tap_queue *queue = NULL;
> /* Access to taps array is protected by rcu, but access to numvtaps
>  * isn't. Below we use it to lookup a queue, but treat it as a hint
> @@ -196,17 +217,47 @@ static struct tap_queue *tap_get_queue(struct tap_dev 
> *tap,
>  * racing against queue removal.
>  */
> int numvtaps = READ_ONCE(tap->numvtaps);
> +   struct tun_vnet_hash_container *vnet_hash = 
> rcu_dereference(tap->vnet_hash);
> __u32 rxq;
>
> +   *tap_skb_cb(

Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-09 Thread Jason Wang
On Mon, Mar 10, 2025 at 11:55 AM Jason Wang  wrote:
>
> On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki  wrote:
> >
> > Hash reporting
> > ==
> >
> > Allow the guest to reuse the hash value to make receive steering
> > consistent between the host and guest, and to save hash computation.
> >
> > RSS
> > ===
> >
> > RSS is a receive steering algorithm that can be negotiated to use with
> > virtio_net. Conventionally the hash calculation was done by the VMM.
> > However, computing the hash after the queue was chosen defeats the
> > purpose of RSS.
> >
> > Another approach is to use eBPF steering program. This approach has
> > another downside: it cannot report the calculated hash due to the
> > restrictive nature of eBPF steering program.
> >
> > Introduce the code to perform RSS to the kernel in order to overcome
> > thse challenges. An alternative solution is to extend the eBPF steering
> > program so that it will be able to report to the userspace, but I didn't
> > opt for it because extending the current mechanism of eBPF steering
> > program as is because it relies on legacy context rewriting, and
> > introducing kfunc-based eBPF will result in non-UAPI dependency while
> > the other relevant virtualization APIs such as KVM and vhost_net are
> > UAPIs.
> >
> > Signed-off-by: Akihiko Odaki 
> > Tested-by: Lei Yang 
> > ---
> >  Documentation/networking/tuntap.rst |   7 ++
> >  drivers/net/Kconfig |   1 +
> >  drivers/net/tap.c   |  68 ++-
> >  drivers/net/tun.c   |  98 +-
> >  drivers/net/tun_vnet.h  | 159 
> > ++--
> >  include/linux/if_tap.h  |   2 +
> >  include/linux/skbuff.h  |   3 +
> >  include/uapi/linux/if_tun.h |  75 +
> >  net/core/skbuff.c   |   4 +
> >  9 files changed, 386 insertions(+), 31 deletions(-)

[...]

> > + *
> > + * The %TUN_VNET_HASH_REPORT flag set with this ioctl will be effective 
> > only
> > + * after calling the %TUNSETVNETHDRSZ ioctl with a number greater than or 
> > equal
> > + * to the size of &struct virtio_net_hdr_v1_hash.
>
> So you had a dependency check already for vnet hdr len. I'd still
> suggest to split this into rss and hash as they are separated
> features. Then we can use separate data structure for them instead of
> a container struct.
>

Besides this, I think we still need to add new bits to TUNGETIFF to
let userspace know about the new ability.

Thanks




Re: [PATCH net-next v9 5/6] selftest: tun: Add tests for virtio-net hashing

2025-03-09 Thread Jason Wang
On Fri, Mar 7, 2025 at 7:02 PM Akihiko Odaki  wrote:
>
> The added tests confirm tun can perform RSS and hash reporting, and
> reject invalid configurations for them.

Let's be more verbose here. E.g what's the network topology used here.

>
> Signed-off-by: Akihiko Odaki 
> Tested-by: Lei Yang 
> ---
>  tools/testing/selftests/net/Makefile |   2 +-
>  tools/testing/selftests/net/tun.c| 584 
> ++-
>  2 files changed, 576 insertions(+), 10 deletions(-)
>
> diff --git a/tools/testing/selftests/net/Makefile 
> b/tools/testing/selftests/net/Makefile
> index 
> 73ee88d6b043004be23b444de667a1d99a6045de..9772f691a9a011d99212df32463cdb930cf0a1a0
>  100644
> --- a/tools/testing/selftests/net/Makefile
> +++ b/tools/testing/selftests/net/Makefile
> @@ -123,6 +123,6 @@ $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma
>  $(OUTPUT)/tcp_mmap: LDLIBS += -lpthread -lcrypto
>  $(OUTPUT)/tcp_inq: LDLIBS += -lpthread
>  $(OUTPUT)/bind_bhash: LDLIBS += -lpthread
> -$(OUTPUT)/io_uring_zerocopy_tx: CFLAGS += -I../../../include/
> +$(OUTPUT)/io_uring_zerocopy_tx $(OUTPUT)/tun: CFLAGS += -I../../../include/
>
>  include bpf.mk
> diff --git a/tools/testing/selftests/net/tun.c 
> b/tools/testing/selftests/net/tun.c
> index 
> 463dd98f2b80b1bdcb398cee43c834e7dc5cf784..acadeea7194eaea9416a605b47f99f7a5f1f80cd
>  100644
> --- a/tools/testing/selftests/net/tun.c
> +++ b/tools/testing/selftests/net/tun.c
> @@ -2,21 +2,38 @@
>
>  #define _GNU_SOURCE
>
> +#include 
>  #include 
>  #include 
> +#include 

Is this needed?

> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
> -#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
>  #include 
> +#include 
>  #include 
>  #include 
> -#include 
> -#include 
> +#include 
> +#include 
> +#include 
> +#include 
>
>  #include "../kselftest_harness.h"
>
> +#define TUN_HWADDR_SOURCE { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00 }
> +#define TUN_HWADDR_DEST { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 }
> +#define TUN_IPADDR_SOURCE htonl((172 << 24) | (17 << 16) | 0)
> +#define TUN_IPADDR_DEST htonl((172 << 24) | (17 << 16) | 1)
> +
>  static int tun_attach(int fd, char *dev)
>  {
> struct ifreq ifr;
> @@ -39,7 +56,7 @@ static int tun_detach(int fd, char *dev)
> return ioctl(fd, TUNSETQUEUE, (void *) &ifr);
>  }
>
> -static int tun_alloc(char *dev)
> +static int tun_alloc(char *dev, short flags)
>  {
> struct ifreq ifr;
> int fd, err;
> @@ -52,7 +69,8 @@ static int tun_alloc(char *dev)
>
> memset(&ifr, 0, sizeof(ifr));
> strcpy(ifr.ifr_name, dev);
> -   ifr.ifr_flags = IFF_TAP | IFF_NAPI | IFF_MULTI_QUEUE;
> +   ifr.ifr_flags = flags | IFF_TAP | IFF_NAPI | IFF_NO_PI |
> +   IFF_MULTI_QUEUE;
>
> err = ioctl(fd, TUNSETIFF, (void *) &ifr);
> if (err < 0) {
> @@ -64,6 +82,40 @@ static int tun_alloc(char *dev)
> return fd;
>  }
>
> +static bool tun_add_to_bridge(int local_fd, const char *name)
> +{

I wonder if a packet socket is more convenient here.

Thanks




Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-11 Thread Jason Wang
On Tue, Mar 11, 2025 at 2:11 PM Akihiko Odaki  wrote:
>
> On 2025/03/11 9:38, Jason Wang wrote:
> > On Mon, Mar 10, 2025 at 3:45 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/03/10 12:55, Jason Wang wrote:
> >>> On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> Hash reporting
> >>>> ==
> >>>>
> >>>> Allow the guest to reuse the hash value to make receive steering
> >>>> consistent between the host and guest, and to save hash computation.
> >>>>
> >>>> RSS
> >>>> ===
> >>>>
> >>>> RSS is a receive steering algorithm that can be negotiated to use with
> >>>> virtio_net. Conventionally the hash calculation was done by the VMM.
> >>>> However, computing the hash after the queue was chosen defeats the
> >>>> purpose of RSS.
> >>>>
> >>>> Another approach is to use eBPF steering program. This approach has
> >>>> another downside: it cannot report the calculated hash due to the
> >>>> restrictive nature of eBPF steering program.
> >>>>
> >>>> Introduce the code to perform RSS to the kernel in order to overcome
> >>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>> program so that it will be able to report to the userspace, but I didn't
> >>>> opt for it because extending the current mechanism of eBPF steering
> >>>> program as is because it relies on legacy context rewriting, and
> >>>> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >>>> the other relevant virtualization APIs such as KVM and vhost_net are
> >>>> UAPIs.
> >>>>
> >>>> Signed-off-by: Akihiko Odaki 
> >>>> Tested-by: Lei Yang 
> >>>> ---
> >>>>Documentation/networking/tuntap.rst |   7 ++
> >>>>drivers/net/Kconfig |   1 +
> >>>>drivers/net/tap.c   |  68 ++-
> >>>>drivers/net/tun.c   |  98 +-
> >>>>drivers/net/tun_vnet.h  | 159 
> >>>> ++--
> >>>>include/linux/if_tap.h  |   2 +
> >>>>include/linux/skbuff.h  |   3 +
> >>>>include/uapi/linux/if_tun.h |  75 +
> >>>>net/core/skbuff.c   |   4 +
> >>>>9 files changed, 386 insertions(+), 31 deletions(-)
> >>>>
> >>>> diff --git a/Documentation/networking/tuntap.rst 
> >>>> b/Documentation/networking/tuntap.rst
> >>>> index 
> >>>> 4d7087f727be5e37dfbf5066a9e9c872cc98898d..86b4ae8caa8ad062c1e558920be42ce0d4217465
> >>>>  100644
> >>>> --- a/Documentation/networking/tuntap.rst
> >>>> +++ b/Documentation/networking/tuntap.rst
> >>>> @@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable 
> >>>> it::
> >>>>  return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
> >>>>  }
> >>>>

[...]

> >>>> +static inline long tun_vnet_ioctl_sethash(struct 
> >>>> tun_vnet_hash_container __rcu **hashp,
> >>>> + bool can_rss, void __user 
> >>>> *argp)
> >>>
> >>> So again, can_rss seems to be tricky. Looking at its caller, it tires
> >>> to make eBPF and RSS mutually exclusive. I still don't understand why
> >>> we need this. Allow eBPF program to override some of the path seems to
> >>> be common practice.
> >>>
> >>> What's more, we didn't try (or even can't) to make automq and eBPF to
> >>> be mutually exclusive. So I still didn't see what we gain from this
> >>> and it complicates the codes and may lead to ambiguous uAPI/behaviour.
> >>
> >> automq and eBPF are mutually exclusive; automq is disabled when an eBPF
> >> steering program is set so I followed the example here.
> >
> > I meant from the view of uAPI, the kernel doesn't or can't reject eBPF
> > while using automq.
>  > >>
> >> We don't even have an interface for eBPF to let it fall back to another
> >> alogirhtm.
> >
> > 

Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-11 Thread Jason Wang
On Tue, Mar 11, 2025 at 2:17 PM Akihiko Odaki  wrote:
>
> On 2025/03/11 9:38, Jason Wang wrote:
> > On Mon, Mar 10, 2025 at 3:45 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/03/10 12:55, Jason Wang wrote:
> >>> On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> Hash reporting
> >>>> ==
> >>>>
> >>>> Allow the guest to reuse the hash value to make receive steering
> >>>> consistent between the host and guest, and to save hash computation.
> >>>>
> >>>> RSS
> >>>> ===
> >>>>
> >>>> RSS is a receive steering algorithm that can be negotiated to use with
> >>>> virtio_net. Conventionally the hash calculation was done by the VMM.
> >>>> However, computing the hash after the queue was chosen defeats the
> >>>> purpose of RSS.
> >>>>
> >>>> Another approach is to use eBPF steering program. This approach has
> >>>> another downside: it cannot report the calculated hash due to the
> >>>> restrictive nature of eBPF steering program.
> >>>>
> >>>> Introduce the code to perform RSS to the kernel in order to overcome
> >>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>> program so that it will be able to report to the userspace, but I didn't
> >>>> opt for it because extending the current mechanism of eBPF steering
> >>>> program as is because it relies on legacy context rewriting, and
> >>>> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >>>> the other relevant virtualization APIs such as KVM and vhost_net are
> >>>> UAPIs.
> >>>>
> >>>> Signed-off-by: Akihiko Odaki 
> >>>> Tested-by: Lei Yang 
> >>>> ---
> >>>>Documentation/networking/tuntap.rst |   7 ++
> >>>>drivers/net/Kconfig |   1 +
> >>>>drivers/net/tap.c   |  68 ++-
> >>>>drivers/net/tun.c   |  98 +-
> >>>>drivers/net/tun_vnet.h  | 159 
> >>>> ++--
> >>>>include/linux/if_tap.h  |   2 +
> >>>>include/linux/skbuff.h  |   3 +
> >>>>include/uapi/linux/if_tun.h |  75 +
> >>>>net/core/skbuff.c   |   4 +
> >>>>9 files changed, 386 insertions(+), 31 deletions(-)
> >>>>

[...]

> >>> Let's has a consistent name for this and the uapi to be consistent
> >>> with TUNSETIFF/TUNGETIFF. Probably TUNSETVNETHASH and
> >>> tun_vnet_ioctl_gethash().
> >>
> >> They have different semantics so they should have different names.
> >> TUNGETIFF reports the value currently set while TUNGETVNETHASHCAP
> >> reports the value that can be set later.
> >
> > I'm not sure I will get here. I meant a symmetric name
> >
> > TUNSETVNETHASH and TUNVETVNETHASH.
>
> TUNGETVNETHASHCAP does not correspond to TUNGETIFF. The correspondence
> of ioctl names is as follows:
> TUNGETFEATURES - TUNGETVNETHASHCAP

TUNGETFEATURES returns the value set from TUNSETIFF. This differs from
TUNGETVNETHASHCAP semantic which just return the capabilities.

+static inline long tun_vnet_ioctl_gethashcap(void __user *argp)
+{
+   static const struct tun_vnet_hash cap = {
+   .flags = TUN_VNET_HASH_REPORT | TUN_VNET_HASH_RSS,
+   .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
+   };
+
+   return copy_to_user(argp, &cap, sizeof(cap)) ? -EFAULT : 0;
+}

TUNGETFEATURES doesn't' help too much for non-persist TAP as userspace
knows what value it set before.

> TUNSETIFF - TUNSETVNETHASH
> TUNGETIFF - no corresponding ioctl for the virtio-net hash features

And this sounds odd and a hint for a incomplete uAPI as userspace
needs to know knowing what can set before doing TUNSETVNETHASH.

>
> Regards,
> Akihiko Odaki
>

Thanks




Re: [PATCH net-next v9 6/6] vhost/net: Support VIRTIO_NET_F_HASH_REPORT

2025-03-11 Thread Jason Wang
On Tue, Mar 11, 2025 at 2:24 PM Akihiko Odaki  wrote:
>
> On 2025/03/11 9:42, Jason Wang wrote:
> > On Mon, Mar 10, 2025 at 3:04 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/03/10 13:43, Jason Wang wrote:
> >>> On Fri, Mar 7, 2025 at 7:02 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> VIRTIO_NET_F_HASH_REPORT allows to report hash values calculated on the
> >>>> host. When VHOST_NET_F_VIRTIO_NET_HDR is employed, it will report no
> >>>> hash values (i.e., the hash_report member is always set to
> >>>> VIRTIO_NET_HASH_REPORT_NONE). Otherwise, the values reported by the
> >>>> underlying socket will be reported.
> >>>>
> >>>> VIRTIO_NET_F_HASH_REPORT requires VIRTIO_F_VERSION_1.
> >>>>
> >>>> Signed-off-by: Akihiko Odaki 
> >>>> Tested-by: Lei Yang 
> >>>> ---
> >>>>drivers/vhost/net.c | 49 
> >>>> +
> >>>>1 file changed, 29 insertions(+), 20 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >>>> index 
> >>>> b9b9e9d40951856d881d77ac74331d914473cd56..16b241b44f89820a42c302f3586ea6bb5e0d4289
> >>>>  100644
> >>>> --- a/drivers/vhost/net.c
> >>>> +++ b/drivers/vhost/net.c
> >>>> @@ -73,6 +73,7 @@ enum {
> >>>>   VHOST_NET_FEATURES = VHOST_FEATURES |
> >>>>(1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
> >>>>(1ULL << VIRTIO_NET_F_MRG_RXBUF) |
> >>>> +(1ULL << VIRTIO_NET_F_HASH_REPORT) |
> >>>>(1ULL << VIRTIO_F_ACCESS_PLATFORM) |
> >>>>(1ULL << VIRTIO_F_RING_RESET)
> >>>>};
> >>>> @@ -1097,9 +1098,11 @@ static void handle_rx(struct vhost_net *net)
> >>>>   .msg_controllen = 0,
> >>>>   .msg_flags = MSG_DONTWAIT,
> >>>>   };
> >>>> -   struct virtio_net_hdr hdr = {
> >>>> -   .flags = 0,
> >>>> -   .gso_type = VIRTIO_NET_HDR_GSO_NONE
> >>>> +   struct virtio_net_hdr_v1_hash hdr = {
> >>>> +   .hdr = {
> >>>> +   .flags = 0,
> >>>> +   .gso_type = VIRTIO_NET_HDR_GSO_NONE
> >>>> +   }
> >>>>   };
> >>>>   size_t total_len = 0;
> >>>>   int err, mergeable;
> >>>> @@ -1110,7 +1113,6 @@ static void handle_rx(struct vhost_net *net)
> >>>>   bool set_num_buffers;
> >>>>   struct socket *sock;
> >>>>   struct iov_iter fixup;
> >>>> -   __virtio16 num_buffers;
> >>>>   int recv_pkts = 0;
> >>>>
> >>>>   mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_RX);
> >>>> @@ -1191,30 +1193,30 @@ static void handle_rx(struct vhost_net *net)
> >>>>   vhost_discard_vq_desc(vq, headcount);
> >>>>   continue;
> >>>>   }
> >>>> +   hdr.hdr.num_buffers = cpu_to_vhost16(vq, headcount);
> >>>>   /* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR 
> >>>> */
> >>>>   if (unlikely(vhost_hlen)) {
> >>>> -   if (copy_to_iter(&hdr, sizeof(hdr),
> >>>> -&fixup) != sizeof(hdr)) {
> >>>> +   if (copy_to_iter(&hdr, vhost_hlen,
> >>>> +&fixup) != vhost_hlen) {
> >>>>   vq_err(vq, "Unable to write vnet_hdr "
> >>>>  "at addr %p\n", 
> >>>> vq->iov->iov_base);
> >>>>   goto out;
> >>>
> >>> Is this an "issue" specific to RSS/HASH? If it's not, we need a separate 
> >>> patch.
> >>>
> >>> Honestly, I'm not sure if it's too late to fix this.
> >>
> >&

Re: [PATCH net-next v2] tun: Pad virtio headers

2025-02-20 Thread Jason Wang
On Thu, Feb 20, 2025 at 4:45 PM Michael S. Tsirkin  wrote:
>
> On Thu, Feb 20, 2025 at 08:58:38AM +0100, Paolo Abeni wrote:
> > Hi,
> >
> > On 2/15/25 7:04 AM, Akihiko Odaki wrote:
> > > tun simply advances iov_iter when it needs to pad virtio header,
> > > which leaves the garbage in the buffer as is. This will become
> > > especially problematic when tun starts to allow enabling the hash
> > > reporting feature; even if the feature is enabled, the packet may lack a
> > > hash value and may contain a hole in the virtio header because the
> > > packet arrived before the feature gets enabled or does not contain the
> > > header fields to be hashed. If the hole is not filled with zero, it is
> > > impossible to tell if the packet lacks a hash value.
> >
> > Should virtio starting sending packets only after feature negotiation?
> > In other words, can the above happen without another bug somewhere else?
>
>
> Not if this is connected with a guest with the standard virtio driver, no.
> The issue is that tun has no concept of feature negotiation,
> and we don't know who uses the vnet header feature, or why.
>
> > I guess the following question is mostly for Jason and Michael: could be
> > possible (/would it make any sense) to use a virtio_net_hdr `flags` bit
> > to explicitly signal the hash fields presence? i.e. making the actual
> > virtio_net_hdr size 'dynamic'.
>
> But it is dynamic - that is why we have TUNSETVNETHDRSZ.

Yes, tun currently only recognizes a subset of the whole virtio-net header.

Thanks




Re: [PATCH net-next v9 3/6] tun: Introduce virtio-net hash feature

2025-03-23 Thread Jason Wang
On Fri, Mar 21, 2025 at 1:57 PM Akihiko Odaki  wrote:
>
> On 2025/03/21 10:13, Jason Wang wrote:
> > On Thu, Mar 20, 2025 at 1:33 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/03/20 10:31, Jason Wang wrote:
> >>> On Wed, Mar 19, 2025 at 1:29 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> On 2025/03/19 9:58, Jason Wang wrote:
> >>>>> On Tue, Mar 18, 2025 at 6:10 PM Akihiko Odaki 
> >>>>>  wrote:
> >>>>>>
> >>>>>> On 2025/03/18 9:15, Jason Wang wrote:
> >>>>>>> On Mon, Mar 17, 2025 at 3:07 PM Akihiko Odaki 
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>> On 2025/03/17 10:12, Jason Wang wrote:
> >>>>>>>>> On Wed, Mar 12, 2025 at 1:03 PM Akihiko Odaki 
> >>>>>>>>>  wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 2025/03/12 11:35, Jason Wang wrote:
> >>>>>>>>>>> On Tue, Mar 11, 2025 at 2:11 PM Akihiko Odaki 
> >>>>>>>>>>>  wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 2025/03/11 9:38, Jason Wang wrote:
> >>>>>>>>>>>>> On Mon, Mar 10, 2025 at 3:45 PM Akihiko Odaki 
> >>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 2025/03/10 12:55, Jason Wang wrote:
> >>>>>>>>>>>>>>> On Fri, Mar 7, 2025 at 7:01 PM Akihiko Odaki 
> >>>>>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hash reporting
> >>>>>>>>>>>>>>>> ==
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Allow the guest to reuse the hash value to make receive 
> >>>>>>>>>>>>>>>> steering
> >>>>>>>>>>>>>>>> consistent between the host and guest, and to save hash 
> >>>>>>>>>>>>>>>> computation.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> RSS
> >>>>>>>>>>>>>>>> ===
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> RSS is a receive steering algorithm that can be negotiated 
> >>>>>>>>>>>>>>>> to use with
> >>>>>>>>>>>>>>>> virtio_net. Conventionally the hash calculation was done by 
> >>>>>>>>>>>>>>>> the VMM.
> >>>>>>>>>>>>>>>> However, computing the hash after the queue was chosen 
> >>>>>>>>>>>>>>>> defeats the
> >>>>>>>>>>>>>>>> purpose of RSS.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Another approach is to use eBPF steering program. This 
> >>>>>>>>>>>>>>>> approach has
> >>>>>>>>>>>>>>>> another downside: it cannot report the calculated hash due 
> >>>>>>>>>>>>>>>> to the
> >>>>>>>>>>>>>>>> restrictive nature of eBPF steering program.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Introduce the code to perform RSS to the kernel in order to 
> >>>>>>>>>>>>>>>> overcome
> >>>>>>>>>>>>>>>> thse challenges. An alternative solution is to extend the 
> >>>>>>>>>>>>>>>> eBPF steering
> >>>>>>>>>>>>>>>> program so that it will be able to report to the userspace, 
> >>>>>>>>>>>>>>

Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing

2025-06-02 Thread Jason Wang
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki  wrote:
>
> They are useful to implement VIRTIO_NET_F_RSS and
> VIRTIO_NET_F_HASH_REPORT.
>
> Signed-off-by: Akihiko Odaki 
> Tested-by: Lei Yang 
> ---
>  include/linux/virtio_net.h | 188 
> +
>  1 file changed, 188 insertions(+)
>
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index 02a9f4dc594d..426f33b4b824 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -9,6 +9,194 @@
>  #include 
>  #include 
>
> +struct virtio_net_hash {
> +   u32 value;
> +   u16 report;
> +};
> +
> +struct virtio_net_toeplitz_state {
> +   u32 hash;
> +   const u32 *key;
> +};
> +
> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
> +VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
> +VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
> +VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
> +VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
> +VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
> +
> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
> +
> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
> +{
> +   while (len >= sizeof(*input)) {
> +   *input = be32_to_cpu((__force __be32)*input);
> +   input++;
> +   len -= sizeof(*input);
> +   }
> +}
> +
> +static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state 
> *state,
> +   const __be32 *input, size_t len)
> +{
> +   while (len >= sizeof(*input)) {
> +   for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
> +   u32 i = ffs(map);
> +
> +   state->hash ^= state->key[0] << (32 - i) |
> +  (u32)((u64)state->key[1] >> i);
> +   }
> +
> +   state->key++;
> +   input++;
> +   len -= sizeof(*input);
> +   }
> +}
> +
> +static inline u8 virtio_net_hash_key_length(u32 types)
> +{
> +   size_t len = 0;
> +
> +   if (types & VIRTIO_NET_HASH_REPORT_IPv4)
> +   len = max(len,
> + sizeof(struct flow_dissector_key_ipv4_addrs));
> +
> +   if (types &
> +   (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
> +   len = max(len,
> + sizeof(struct flow_dissector_key_ipv4_addrs) +
> + sizeof(struct flow_dissector_key_ports));
> +
> +   if (types & VIRTIO_NET_HASH_REPORT_IPv6)
> +   len = max(len,
> + sizeof(struct flow_dissector_key_ipv6_addrs));
> +
> +   if (types &
> +   (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
> +   len = max(len,
> + sizeof(struct flow_dissector_key_ipv6_addrs) +
> + sizeof(struct flow_dissector_key_ports));
> +
> +   return len + sizeof(u32);
> +}
> +
> +static inline u32 virtio_net_hash_report(u32 types,
> +const struct flow_keys_basic *keys)
> +{
> +   switch (keys->basic.n_proto) {
> +   case cpu_to_be16(ETH_P_IP):
> +   if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> +   if (keys->basic.ip_proto == IPPROTO_TCP &&
> +   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
> +   return VIRTIO_NET_HASH_REPORT_TCPv4;
> +
> +   if (keys->basic.ip_proto == IPPROTO_UDP &&
> +   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
> +   return VIRTIO_NET_HASH_REPORT_UDPv4;
> +   }
> +
> +   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
> +   return VIRTIO_NET_HASH_REPORT_IPv4;
> +
> +   return VIRTIO_NET_HASH_REPORT_NONE;
> +
> +   case cpu_to_be16(ETH_P_IPV6):
> +   if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> +   if (keys->basic.ip_proto == IPPROTO_TCP &&
> +   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
> +   return VIRTIO_NET_HASH_REPORT_TCPv6;
> +
> +   if (keys->basic.ip_proto == IPPROTO_UDP &&
> +   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
> +   return VIRTIO_NET_HASH_REPORT_UDPv6;
> +   }
> +
> +   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
> +   return VIRTIO_NET_HASH_REPORT_IPv6;
> +
> +   return VIRTIO_NET_HASH_REPORT_NONE;
> +
> +   default:
> +   return VIRTIO_NET_HASH_REPORT_NONE;
> +   }
> +}
> +
> +static inline void virtio_net_hash_rss(const stru

Re: [PATCH net-next v12 02/10] net: flow_dissector: Export flow_keys_dissector_symmetric

2025-06-02 Thread Jason Wang
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki  wrote:
>
> flow_keys_dissector_symmetric is useful to derive a symmetric hash
> and to know its source such as IPv4, IPv6, TCP, and UDP.
>
> Signed-off-by: Akihiko Odaki 
> Tested-by: Lei Yang 
> ---

Acked-by: Jason Wang 

Thanks




Re: [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code

2025-06-23 Thread Jason Wang
On Fri, Jun 20, 2025 at 1:01 AM Akihiko Odaki  wrote:
>
> On 2025/06/17 12:39, Jason Wang wrote:
> > On Fri, Jun 6, 2025 at 5:27 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/06/06 10:01, Jason Wang wrote:
> >>> On Thu, Jun 5, 2025 at 4:18 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> On 2025/06/05 11:46, Jason Wang wrote:
> >>>>> On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki  
> >>>>> wrote:
> >>>>>>
> >>>>>> On 2025/06/04 10:53, Jason Wang wrote:
> >>>>>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki 
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>> Add common code required for the features being added to TUN and TAP.
> >>>>>>>> They will be enabled for each of them in following patches.
> >>>>>>>>
> >>>>>>>> Added Features
> >>>>>>>> ==
> >>>>>>>>
> >>>>>>>> Hash reporting
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>> Allow the guest to reuse the hash value to make receive steering
> >>>>>>>> consistent between the host and guest, and to save hash computation.
> >>>>>>>>
> >>>>>>>> Receive Side Scaling (RSS)
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>> RSS is a receive steering algorithm that can be negotiated to use 
> >>>>>>>> with
> >>>>>>>> virtio_net. Conventionally the hash calculation was done by the VMM.
> >>>>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>>>> purpose of RSS.
> >>>>>>>>
> >>>>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>>>> restrictive nature of eBPF steering program.
> >>>>>>>>
> >>>>>>>> Introduce the code to perform RSS to the kernel in order to overcome
> >>>>>>>> thse challenges. An alternative solution is to extend the eBPF 
> >>>>>>>> steering
> >>>>>>>> program so that it will be able to report to the userspace, but I 
> >>>>>>>> didn't
> >>>>>>>> opt for it because extending the current mechanism of eBPF steering
> >>>>>>>> program as is because it relies on legacy context rewriting, and
> >>>>>>>> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >>>>>>>> the other relevant virtualization APIs such as KVM and vhost_net are
> >>>>>>>> UAPIs.
> >>>>>>>>
> >>>>>>>> Added ioctls
> >>>>>>>> 
> >>>>>>>>
> >>>>>>>> They are designed to make extensibility and VM migration compatible.
> >>>>>>>> This change only adds the implementation and does not expose them to
> >>>>>>>> the userspace.
> >>>>>>>>
> >>>>>>>> TUNGETVNETHASHTYPES
> >>>>>>>> ---
> >>>>>>>>
> >>>>>>>> This ioctl tells supported hash types. It is useful to check if a VM 
> >>>>>>>> can
> >>>>>>>> be migrated to the current host.
> >>>>>>>>
> >>>>>>>> TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
> >>>>>>>> 
> >>>>>>>>
> >>>>>>>> These ioctls configures a steering algorithm and, if needed, hash
> >>>>>>>> reporting.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Akihiko Odaki 
> >>>>>>>> Tested-by: Lei Yang 
> >>>>>>>> ---
> >>>>>>>>  drivers/net/tap.c   |  10 ++-
> >&

Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing

2025-06-23 Thread Jason Wang
On Mon, Jun 23, 2025 at 1:40 AM Yuri Benditovich
 wrote:
>
> > Yuri, can you help to clarify this?
>
> I see here several questions:
> 1. Whether it is ok for the device not to indicate support for XXX_EX hash 
> type?
> - I think, yes (strictly speaking, it was better to test that before
> submitting the patches )
> 2. Is it possible that the guest will enable some XXX_EX hash type if
> the device does not indicate that it is supported?
> - No (I think this is part of the spec)

There's another question, is the device allowed to fallback to
VIRTIO_NET_HASH_TYPE_IPv6 if it fails to parse extensions?

> 3. What to do if we migrate between systems with different
> capabilities of hash support/reporting/whatever
> - IMO, at this moment such case should be excluded and only mechanism
> we have for that is the compatible machine version
> - in some future the change of device capabilities can be communicated
> to the driver and _probably_ the driver might be able to communicate
> the change of device capabilities to the OS

Are you suggesting implementing all hash types? Note that Akihiko
raises the issue that in the actual implementation there should be a
limitation of the maximum number of options. If such a limitation is
different between src and dst, the difference could be noticed by the
guest.

> 4. Does it make sense to have fine configuration of hash types mask
> via command-line?
> - IMO, no. This would require the user to have too much knowledge
> about RSS internals
>
> Please let me know if I missed something.
>

Thanks




Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing

2025-06-16 Thread Jason Wang
On Fri, Jun 6, 2025 at 5:10 PM Akihiko Odaki  wrote:
>
> On 2025/06/06 9:48, Jason Wang wrote:
> > On Thu, Jun 5, 2025 at 3:58 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/06/05 10:53, Jason Wang wrote:
> >>> On Wed, Jun 4, 2025 at 3:20 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> On 2025/06/04 10:18, Jason Wang wrote:
> >>>>> On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki  
> >>>>> wrote:
> >>>>>>
> >>>>>> On 2025/06/03 12:19, Jason Wang wrote:
> >>>>>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki 
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>> They are useful to implement VIRTIO_NET_F_RSS and
> >>>>>>>> VIRTIO_NET_F_HASH_REPORT.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Akihiko Odaki 
> >>>>>>>> Tested-by: Lei Yang 
> >>>>>>>> ---
> >>>>>>>>  include/linux/virtio_net.h | 188 
> >>>>>>>> +
> >>>>>>>>  1 file changed, 188 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> >>>>>>>> index 02a9f4dc594d..426f33b4b824 100644
> >>>>>>>> --- a/include/linux/virtio_net.h
> >>>>>>>> +++ b/include/linux/virtio_net.h
> >>>>>>>> @@ -9,6 +9,194 @@
> >>>>>>>>  #include 
> >>>>>>>>  #include 
> >>>>>>>>
> >>>>>>>> +struct virtio_net_hash {
> >>>>>>>> +   u32 value;
> >>>>>>>> +   u16 report;
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>> +struct virtio_net_toeplitz_state {
> >>>>>>>> +   u32 hash;
> >>>>>>>> +   const u32 *key;
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES 
> >>>>>>>> (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
> >>>>>>>> +
> >>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
> >>>>>>>> +
> >>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
> >>>>>>>> +
> >>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
> >>>>>>>> +
> >>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
> >>>>>>>> +
> >>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
> >>>>>>>> +
> >>>>>>>> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
> >>>>>>>> +
> >>>>>>>> +static inline void virtio_net_toeplitz_convert_key(u32 *input, 
> >>>>>>>> size_t len)
> >>>>>>>> +{
> >>>>>>>> +   while (len >= sizeof(*input)) {
> >>>>>>>> +   *input = be32_to_cpu((__force __be32)*input);
> >>>>>>>> +   input++;
> >>>>>>>> +   len -= sizeof(*input);
> >>>>>>>> +   }
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static inline void virtio_net_toeplitz_calc(struct 
> >>>>>>>> virtio_net_toeplitz_state *state,
> >>>>>>>> +   const __be32 *input, 
> >>>>>>>> size_t len)
> >>>>>>>> +{
> >>>>>>>> +   while (len >= sizeof(*input)) {
> >>>>>>>> +   for (u32 map = be32_to_cpu(*input); map; map &= (map 
> >>>>>>>> - 1)) {
> >>>>>>>> +   u32 i = ffs(map);
> >>>>>>>> +
> >>>>>>>> +   state->hash ^= 

Re: [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code

2025-06-16 Thread Jason Wang
On Fri, Jun 6, 2025 at 5:27 PM Akihiko Odaki  wrote:
>
> On 2025/06/06 10:01, Jason Wang wrote:
> > On Thu, Jun 5, 2025 at 4:18 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/06/05 11:46, Jason Wang wrote:
> >>> On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> On 2025/06/04 10:53, Jason Wang wrote:
> >>>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki 
> >>>>>  wrote:
> >>>>>>
> >>>>>> Add common code required for the features being added to TUN and TAP.
> >>>>>> They will be enabled for each of them in following patches.
> >>>>>>
> >>>>>> Added Features
> >>>>>> ==
> >>>>>>
> >>>>>> Hash reporting
> >>>>>> --
> >>>>>>
> >>>>>> Allow the guest to reuse the hash value to make receive steering
> >>>>>> consistent between the host and guest, and to save hash computation.
> >>>>>>
> >>>>>> Receive Side Scaling (RSS)
> >>>>>> --
> >>>>>>
> >>>>>> RSS is a receive steering algorithm that can be negotiated to use with
> >>>>>> virtio_net. Conventionally the hash calculation was done by the VMM.
> >>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>> purpose of RSS.
> >>>>>>
> >>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>> restrictive nature of eBPF steering program.
> >>>>>>
> >>>>>> Introduce the code to perform RSS to the kernel in order to overcome
> >>>>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>>>> program so that it will be able to report to the userspace, but I 
> >>>>>> didn't
> >>>>>> opt for it because extending the current mechanism of eBPF steering
> >>>>>> program as is because it relies on legacy context rewriting, and
> >>>>>> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >>>>>> the other relevant virtualization APIs such as KVM and vhost_net are
> >>>>>> UAPIs.
> >>>>>>
> >>>>>> Added ioctls
> >>>>>> 
> >>>>>>
> >>>>>> They are designed to make extensibility and VM migration compatible.
> >>>>>> This change only adds the implementation and does not expose them to
> >>>>>> the userspace.
> >>>>>>
> >>>>>> TUNGETVNETHASHTYPES
> >>>>>> ---
> >>>>>>
> >>>>>> This ioctl tells supported hash types. It is useful to check if a VM 
> >>>>>> can
> >>>>>> be migrated to the current host.
> >>>>>>
> >>>>>> TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
> >>>>>> 
> >>>>>>
> >>>>>> These ioctls configures a steering algorithm and, if needed, hash
> >>>>>> reporting.
> >>>>>>
> >>>>>> Signed-off-by: Akihiko Odaki 
> >>>>>> Tested-by: Lei Yang 
> >>>>>> ---
> >>>>>> drivers/net/tap.c   |  10 ++-
> >>>>>> drivers/net/tun.c   |  12 +++-
> >>>>>> drivers/net/tun_vnet.h  | 165 
> >>>>>> +---
> >>>>>> include/uapi/linux/if_tun.h |  71 +++
> >>>>>> 4 files changed, 244 insertions(+), 14 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >>>>>> index d4ece538f1b2..25c60ff2d3f2 100644
> >>>>>> --- a/drivers/net/tap.c
> >>>>>> +++ b/drivers/net/tap.c
> >>>>>> @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
> 

Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing

2025-06-05 Thread Jason Wang
On Thu, Jun 5, 2025 at 3:58 PM Akihiko Odaki  wrote:
>
> On 2025/06/05 10:53, Jason Wang wrote:
> > On Wed, Jun 4, 2025 at 3:20 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/06/04 10:18, Jason Wang wrote:
> >>> On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> On 2025/06/03 12:19, Jason Wang wrote:
> >>>>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki 
> >>>>>  wrote:
> >>>>>>
> >>>>>> They are useful to implement VIRTIO_NET_F_RSS and
> >>>>>> VIRTIO_NET_F_HASH_REPORT.
> >>>>>>
> >>>>>> Signed-off-by: Akihiko Odaki 
> >>>>>> Tested-by: Lei Yang 
> >>>>>> ---
> >>>>>> include/linux/virtio_net.h | 188 
> >>>>>> +
> >>>>>> 1 file changed, 188 insertions(+)
> >>>>>>
> >>>>>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> >>>>>> index 02a9f4dc594d..426f33b4b824 100644
> >>>>>> --- a/include/linux/virtio_net.h
> >>>>>> +++ b/include/linux/virtio_net.h
> >>>>>> @@ -9,6 +9,194 @@
> >>>>>> #include 
> >>>>>> #include 
> >>>>>>
> >>>>>> +struct virtio_net_hash {
> >>>>>> +   u32 value;
> >>>>>> +   u16 report;
> >>>>>> +};
> >>>>>> +
> >>>>>> +struct virtio_net_toeplitz_state {
> >>>>>> +   u32 hash;
> >>>>>> +   const u32 *key;
> >>>>>> +};
> >>>>>> +
> >>>>>> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES 
> >>>>>> (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
> >>>>>> +
> >>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
> >>>>>> +
> >>>>>> VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
> >>>>>> +VIRTIO_NET_RSS_HASH_TYPE_IPv6 
> >>>>>> | \
> >>>>>> +
> >>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
> >>>>>> +
> >>>>>> VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
> >>>>>> +
> >>>>>> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
> >>>>>> +
> >>>>>> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t 
> >>>>>> len)
> >>>>>> +{
> >>>>>> +   while (len >= sizeof(*input)) {
> >>>>>> +   *input = be32_to_cpu((__force __be32)*input);
> >>>>>> +   input++;
> >>>>>> +   len -= sizeof(*input);
> >>>>>> +   }
> >>>>>> +}
> >>>>>> +
> >>>>>> +static inline void virtio_net_toeplitz_calc(struct 
> >>>>>> virtio_net_toeplitz_state *state,
> >>>>>> +   const __be32 *input, 
> >>>>>> size_t len)
> >>>>>> +{
> >>>>>> +   while (len >= sizeof(*input)) {
> >>>>>> +   for (u32 map = be32_to_cpu(*input); map; map &= (map - 
> >>>>>> 1)) {
> >>>>>> +   u32 i = ffs(map);
> >>>>>> +
> >>>>>> +   state->hash ^= state->key[0] << (32 - i) |
> >>>>>> +  (u32)((u64)state->key[1] >> i);
> >>>>>> +   }
> >>>>>> +
> >>>>>> +   state->key++;
> >>>>>> +   input++;
> >>>>>> +   len -= sizeof(*input);
> >>>>>> +   }
> >>>>>> +}
> >>>>>> +
> >>>>>> +static inline u8 virtio_net_hash_key_length(u32 types)
> >>>>>> +{
> >>>>>> +   size_t len = 0;
> >>>>>> +
> >>

Re: [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code

2025-06-05 Thread Jason Wang
On Thu, Jun 5, 2025 at 4:18 PM Akihiko Odaki  wrote:
>
> On 2025/06/05 11:46, Jason Wang wrote:
> > On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/06/04 10:53, Jason Wang wrote:
> >>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> Add common code required for the features being added to TUN and TAP.
> >>>> They will be enabled for each of them in following patches.
> >>>>
> >>>> Added Features
> >>>> ==
> >>>>
> >>>> Hash reporting
> >>>> --
> >>>>
> >>>> Allow the guest to reuse the hash value to make receive steering
> >>>> consistent between the host and guest, and to save hash computation.
> >>>>
> >>>> Receive Side Scaling (RSS)
> >>>> --
> >>>>
> >>>> RSS is a receive steering algorithm that can be negotiated to use with
> >>>> virtio_net. Conventionally the hash calculation was done by the VMM.
> >>>> However, computing the hash after the queue was chosen defeats the
> >>>> purpose of RSS.
> >>>>
> >>>> Another approach is to use eBPF steering program. This approach has
> >>>> another downside: it cannot report the calculated hash due to the
> >>>> restrictive nature of eBPF steering program.
> >>>>
> >>>> Introduce the code to perform RSS to the kernel in order to overcome
> >>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>> program so that it will be able to report to the userspace, but I didn't
> >>>> opt for it because extending the current mechanism of eBPF steering
> >>>> program as is because it relies on legacy context rewriting, and
> >>>> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >>>> the other relevant virtualization APIs such as KVM and vhost_net are
> >>>> UAPIs.
> >>>>
> >>>> Added ioctls
> >>>> 
> >>>>
> >>>> They are designed to make extensibility and VM migration compatible.
> >>>> This change only adds the implementation and does not expose them to
> >>>> the userspace.
> >>>>
> >>>> TUNGETVNETHASHTYPES
> >>>> ---
> >>>>
> >>>> This ioctl tells supported hash types. It is useful to check if a VM can
> >>>> be migrated to the current host.
> >>>>
> >>>> TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
> >>>> 
> >>>>
> >>>> These ioctls configures a steering algorithm and, if needed, hash
> >>>> reporting.
> >>>>
> >>>> Signed-off-by: Akihiko Odaki 
> >>>> Tested-by: Lei Yang 
> >>>> ---
> >>>>drivers/net/tap.c   |  10 ++-
> >>>>drivers/net/tun.c   |  12 +++-
> >>>>drivers/net/tun_vnet.h  | 165 
> >>>> +---
> >>>>include/uapi/linux/if_tun.h |  71 +++
> >>>>4 files changed, 244 insertions(+), 14 deletions(-)
> >>>>
> >>>> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >>>> index d4ece538f1b2..25c60ff2d3f2 100644
> >>>> --- a/drivers/net/tap.c
> >>>> +++ b/drivers/net/tap.c
> >>>> @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
> >>>>   sock_put(&q->sk);
> >>>>}
> >>>>
> >>>> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff 
> >>>> *skb)
> >>>> +{
> >>>> +   return NULL;
> >>>> +}
> >>>> +
> >>>>/*
> >>>> * Select a queue based on the rxq of the device on which this packet
> >>>> * arrived. If the incoming device is not mq, calculate a flow hash
> >>>> @@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>>>   int total;
> >>>>
> >>>>   if (q->flags & IFF_VNET_HDR) {
> >>&

Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing

2025-06-23 Thread Jason Wang
On Mon, Jun 23, 2025 at 10:28 PM Yuri Benditovich
 wrote:
>
> On Mon, Jun 23, 2025 at 11:07 AM Jason Wang  wrote:
> >
> > On Mon, Jun 23, 2025 at 1:40 AM Yuri Benditovich
> >  wrote:
> > >
> > > > Yuri, can you help to clarify this?
> > >
> > > I see here several questions:
> > > 1. Whether it is ok for the device not to indicate support for XXX_EX 
> > > hash type?
> > > - I think, yes (strictly speaking, it was better to test that before
> > > submitting the patches )
> > > 2. Is it possible that the guest will enable some XXX_EX hash type if
> > > the device does not indicate that it is supported?
> > > - No (I think this is part of the spec)
> >
> > There's another question, is the device allowed to fallback to
> > VIRTIO_NET_HASH_TYPE_IPv6 if it fails to parse extensions?
> MSFT expectations for that are at
> https://learn.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types
> If I read them correctly, the answer is "no"

Ok, so I guess it implies the implementation should be ready to deal
with arbitrary length of ipv6 options.

> BTW, my personal opinion is that placing all these things with hash
> calculations into kernel instead of ebpf does not make too much sense.

If I remember correctly, we tried to enable it via eBPF, but failed
due to the rejection of eBPF maintainers.

Maybe we can revisit the idea. But anyhow the hardcoded logic might
still be useful as eBPF is not guaranteed to work in all cases.

Thanks

>
> >
> > > 3. What to do if we migrate between systems with different
> > > capabilities of hash support/reporting/whatever
> > > - IMO, at this moment such case should be excluded and only mechanism
> > > we have for that is the compatible machine version
> > > - in some future the change of device capabilities can be communicated
> > > to the driver and _probably_ the driver might be able to communicate
> > > the change of device capabilities to the OS
> >
> > Are you suggesting implementing all hash types? Note that Akihiko
> > raises the issue that in the actual implementation there should be a
> > limitation of the maximum number of options. If such a limitation is
> > different between src and dst, the difference could be noticed by the
> > guest.
> >
> > > 4. Does it make sense to have fine configuration of hash types mask
> > > via command-line?
> > > - IMO, no. This would require the user to have too much knowledge
> > > about RSS internals
> > >
> > > Please let me know if I missed something.
> > >
> >
> > Thanks
> >
>




Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing

2025-06-03 Thread Jason Wang
On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki  wrote:
>
> On 2025/06/03 12:19, Jason Wang wrote:
> > On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki  
> > wrote:
> >>
> >> They are useful to implement VIRTIO_NET_F_RSS and
> >> VIRTIO_NET_F_HASH_REPORT.
> >>
> >> Signed-off-by: Akihiko Odaki 
> >> Tested-by: Lei Yang 
> >> ---
> >>   include/linux/virtio_net.h | 188 
> >> +
> >>   1 file changed, 188 insertions(+)
> >>
> >> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> >> index 02a9f4dc594d..426f33b4b824 100644
> >> --- a/include/linux/virtio_net.h
> >> +++ b/include/linux/virtio_net.h
> >> @@ -9,6 +9,194 @@
> >>   #include 
> >>   #include 
> >>
> >> +struct virtio_net_hash {
> >> +   u32 value;
> >> +   u16 report;
> >> +};
> >> +
> >> +struct virtio_net_toeplitz_state {
> >> +   u32 hash;
> >> +   const u32 *key;
> >> +};
> >> +
> >> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
> >> +VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
> >> +VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
> >> +VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
> >> +VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
> >> +VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
> >> +
> >> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
> >> +
> >> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
> >> +{
> >> +   while (len >= sizeof(*input)) {
> >> +   *input = be32_to_cpu((__force __be32)*input);
> >> +   input++;
> >> +   len -= sizeof(*input);
> >> +   }
> >> +}
> >> +
> >> +static inline void virtio_net_toeplitz_calc(struct 
> >> virtio_net_toeplitz_state *state,
> >> +   const __be32 *input, size_t 
> >> len)
> >> +{
> >> +   while (len >= sizeof(*input)) {
> >> +   for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) 
> >> {
> >> +   u32 i = ffs(map);
> >> +
> >> +   state->hash ^= state->key[0] << (32 - i) |
> >> +  (u32)((u64)state->key[1] >> i);
> >> +   }
> >> +
> >> +   state->key++;
> >> +   input++;
> >> +   len -= sizeof(*input);
> >> +   }
> >> +}
> >> +
> >> +static inline u8 virtio_net_hash_key_length(u32 types)
> >> +{
> >> +   size_t len = 0;
> >> +
> >> +   if (types & VIRTIO_NET_HASH_REPORT_IPv4)
> >> +   len = max(len,
> >> + sizeof(struct flow_dissector_key_ipv4_addrs));
> >> +
> >> +   if (types &
> >> +   (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
> >> +   len = max(len,
> >> + sizeof(struct flow_dissector_key_ipv4_addrs) +
> >> + sizeof(struct flow_dissector_key_ports));
> >> +
> >> +   if (types & VIRTIO_NET_HASH_REPORT_IPv6)
> >> +   len = max(len,
> >> + sizeof(struct flow_dissector_key_ipv6_addrs));
> >> +
> >> +   if (types &
> >> +   (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
> >> +   len = max(len,
> >> + sizeof(struct flow_dissector_key_ipv6_addrs) +
> >> + sizeof(struct flow_dissector_key_ports));
> >> +
> >> +   return len + sizeof(u32);
> >> +}
> >> +
> >> +static inline u32 virtio_net_hash_report(u32 types,
> >> +const struct flow_keys_basic 
> >> *keys)
> >> +{
> >> +   switch (keys->basic.n_proto) {
> >> +   case cpu_to_be16(ETH_P_IP):
> >> +   if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
> >> +   if (keys->basic.ip_proto == IPPROTO_TCP &&
> >>

Re: [PATCH net-next v12 03/10] tun: Allow steering eBPF program to fall back

2025-06-03 Thread Jason Wang
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki  wrote:
>
> This clarifies a steering eBPF program takes precedence over the other
> steering algorithms.

Let's give an example on the use case for this.

>
> Signed-off-by: Akihiko Odaki 
> ---
>  Documentation/networking/tuntap.rst |  7 +++
>  drivers/net/tun.c   | 28 +---
>  include/uapi/linux/if_tun.h |  9 +
>  3 files changed, 33 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/networking/tuntap.rst 
> b/Documentation/networking/tuntap.rst
> index 4d7087f727be..86b4ae8caa8a 100644
> --- a/Documentation/networking/tuntap.rst
> +++ b/Documentation/networking/tuntap.rst
> @@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
>return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
>}
>
> +3.4 Reference
> +-
> +
> +``linux/if_tun.h`` defines the interface described below:
> +
> +.. kernel-doc:: include/uapi/linux/if_tun.h
> +
>  Universal TUN/TAP device driver Frequently Asked Question
>  =
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index d8f4d3e996a7..9133ab9ed3f5 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -476,21 +476,29 @@ static u16 tun_automq_select_queue(struct tun_struct 
> *tun, struct sk_buff *skb)
> return txq;
>  }
>
> -static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
> +static bool tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff 
> *skb,
> + u16 *ret)
>  {
> struct tun_prog *prog;
> u32 numqueues;
> -   u16 ret = 0;
> +   u32 prog_ret;
> +
> +   prog = rcu_dereference(tun->steering_prog);
> +   if (!prog)
> +   return false;
>
> numqueues = READ_ONCE(tun->numqueues);
> -   if (!numqueues)
> -   return 0;
> +   if (!numqueues) {
> +   *ret = 0;
> +   return true;
> +   }
>
> -   prog = rcu_dereference(tun->steering_prog);
> -   if (prog)
> -   ret = bpf_prog_run_clear_cb(prog->prog, skb);
> +   prog_ret = bpf_prog_run_clear_cb(prog->prog, skb);
> +   if (prog_ret == TUN_STEERINGEBPF_FALLBACK)
> +   return false;

This seems to break the uAPI. So I think we need a new ioctl to enable
the behaviour

>
> -   return ret % numqueues;
> +   *ret = (u16)prog_ret % numqueues;
> +   return true;
>  }
>
>  static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
> @@ -500,9 +508,7 @@ static u16 tun_select_queue(struct net_device *dev, 
> struct sk_buff *skb,
> u16 ret;
>
> rcu_read_lock();
> -   if (rcu_dereference(tun->steering_prog))
> -   ret = tun_ebpf_select_queue(tun, skb);
> -   else
> +   if (!tun_ebpf_select_queue(tun, skb, &ret))
> ret = tun_automq_select_queue(tun, skb);
> rcu_read_unlock();
>
> diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> index 287cdc81c939..980de74724fc 100644
> --- a/include/uapi/linux/if_tun.h
> +++ b/include/uapi/linux/if_tun.h
> @@ -115,4 +115,13 @@ struct tun_filter {
> __u8   addr[][ETH_ALEN];
>  };
>
> +/**
> + * define TUN_STEERINGEBPF_FALLBACK - A steering eBPF return value to fall 
> back
> + *
> + * A steering eBPF program may return this value to fall back to the steering
> + * algorithm that should have been used if the program was not set. This 
> allows
> + * selectively overriding the steering decision.
> + */
> +#define TUN_STEERINGEBPF_FALLBACK -1

Not a native speaker, consider it works more like XDP_PASS, would it
be better to use "TUN_STERRING_PASS"?

Thanks

> +
>  #endif /* _UAPI__IF_TUN_H */
>
> --
> 2.49.0
>




Re: [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code

2025-06-03 Thread Jason Wang
On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki  wrote:
>
> Add common code required for the features being added to TUN and TAP.
> They will be enabled for each of them in following patches.
>
> Added Features
> ==
>
> Hash reporting
> --
>
> Allow the guest to reuse the hash value to make receive steering
> consistent between the host and guest, and to save hash computation.
>
> Receive Side Scaling (RSS)
> --
>
> RSS is a receive steering algorithm that can be negotiated to use with
> virtio_net. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF steering program.
>
> Introduce the code to perform RSS to the kernel in order to overcome
> thse challenges. An alternative solution is to extend the eBPF steering
> program so that it will be able to report to the userspace, but I didn't
> opt for it because extending the current mechanism of eBPF steering
> program as is because it relies on legacy context rewriting, and
> introducing kfunc-based eBPF will result in non-UAPI dependency while
> the other relevant virtualization APIs such as KVM and vhost_net are
> UAPIs.
>
> Added ioctls
> 
>
> They are designed to make extensibility and VM migration compatible.
> This change only adds the implementation and does not expose them to
> the userspace.
>
> TUNGETVNETHASHTYPES
> ---
>
> This ioctl tells supported hash types. It is useful to check if a VM can
> be migrated to the current host.
>
> TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
> 
>
> These ioctls configures a steering algorithm and, if needed, hash
> reporting.
>
> Signed-off-by: Akihiko Odaki 
> Tested-by: Lei Yang 
> ---
>  drivers/net/tap.c   |  10 ++-
>  drivers/net/tun.c   |  12 +++-
>  drivers/net/tun_vnet.h  | 165 
> +---
>  include/uapi/linux/if_tun.h |  71 +++
>  4 files changed, 244 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index d4ece538f1b2..25c60ff2d3f2 100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
> sock_put(&q->sk);
>  }
>
> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
> +{
> +   return NULL;
> +}
> +
>  /*
>   * Select a queue based on the rxq of the device on which this packet
>   * arrived. If the incoming device is not mq, calculate a flow hash
> @@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
> int total;
>
> if (q->flags & IFF_VNET_HDR) {
> -   struct virtio_net_hdr vnet_hdr;
> +   struct virtio_net_hdr_v1_hash vnet_hdr;
>
> vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
>
> -   ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
> +   ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
> +   tap_find_hash, &vnet_hdr);
> if (ret)
> return ret;
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 9133ab9ed3f5..03d47799e9bd 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -451,6 +451,11 @@ static inline void tun_flow_save_rps_rxhash(struct 
> tun_flow_entry *e, u32 hash)
> e->rps_rxhash = hash;
>  }
>
> +static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
> +{
> +   return NULL;
> +}
> +
>  /* We try to identify a flow through its rxhash. The reason that
>   * we do not check rxq no. is because some cards(e.g 82599), chooses
>   * the rxq based on the txq where the last packet of the flow comes. As
> @@ -1993,7 +1998,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
> ssize_t ret;
>
> if (tun->flags & IFF_VNET_HDR) {
> -   struct virtio_net_hdr gso = { 0 };
> +   struct virtio_net_hdr_v1_hash gso = { 0 };
>
> vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
> ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
> @@ -2046,9 +2051,10 @@ static ssize_t tun_put_user(struct tun_struct *tun,
> }
>
> if (vnet_hdr_sz) {
> -   struct virtio_net_hdr gso;
> +   struct virtio_net_hdr_v1_hash gso;
>
> -   ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
> +   ret = tun_vnet_hdr_from_skb(vnet_hdr_sz, tun->flags, tun->dev,
> +   skb, tun_find_hash, &gso);
> if (ret)
> return ret;
>
> diff 

Re: [PATCH net-next v12 01/10] virtio_net: Add functions for hashing

2025-06-04 Thread Jason Wang
On Wed, Jun 4, 2025 at 3:20 PM Akihiko Odaki  wrote:
>
> On 2025/06/04 10:18, Jason Wang wrote:
> > On Tue, Jun 3, 2025 at 1:31 PM Akihiko Odaki  
> > wrote:
> >>
> >> On 2025/06/03 12:19, Jason Wang wrote:
> >>> On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki  
> >>> wrote:
> >>>>
> >>>> They are useful to implement VIRTIO_NET_F_RSS and
> >>>> VIRTIO_NET_F_HASH_REPORT.
> >>>>
> >>>> Signed-off-by: Akihiko Odaki 
> >>>> Tested-by: Lei Yang 
> >>>> ---
> >>>>include/linux/virtio_net.h | 188 
> >>>> +
> >>>>1 file changed, 188 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> >>>> index 02a9f4dc594d..426f33b4b824 100644
> >>>> --- a/include/linux/virtio_net.h
> >>>> +++ b/include/linux/virtio_net.h
> >>>> @@ -9,6 +9,194 @@
> >>>>#include 
> >>>>#include 
> >>>>
> >>>> +struct virtio_net_hash {
> >>>> +   u32 value;
> >>>> +   u16 report;
> >>>> +};
> >>>> +
> >>>> +struct virtio_net_toeplitz_state {
> >>>> +   u32 hash;
> >>>> +   const u32 *key;
> >>>> +};
> >>>> +
> >>>> +#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 
> >>>> | \
> >>>> +VIRTIO_NET_RSS_HASH_TYPE_TCPv4 
> >>>> | \
> >>>> +VIRTIO_NET_RSS_HASH_TYPE_UDPv4 
> >>>> | \
> >>>> +VIRTIO_NET_RSS_HASH_TYPE_IPv6 | 
> >>>> \
> >>>> +VIRTIO_NET_RSS_HASH_TYPE_TCPv6 
> >>>> | \
> >>>> +VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
> >>>> +
> >>>> +#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
> >>>> +
> >>>> +static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t 
> >>>> len)
> >>>> +{
> >>>> +   while (len >= sizeof(*input)) {
> >>>> +   *input = be32_to_cpu((__force __be32)*input);
> >>>> +   input++;
> >>>> +   len -= sizeof(*input);
> >>>> +   }
> >>>> +}
> >>>> +
> >>>> +static inline void virtio_net_toeplitz_calc(struct 
> >>>> virtio_net_toeplitz_state *state,
> >>>> +   const __be32 *input, size_t 
> >>>> len)
> >>>> +{
> >>>> +   while (len >= sizeof(*input)) {
> >>>> +   for (u32 map = be32_to_cpu(*input); map; map &= (map - 
> >>>> 1)) {
> >>>> +   u32 i = ffs(map);
> >>>> +
> >>>> +   state->hash ^= state->key[0] << (32 - i) |
> >>>> +  (u32)((u64)state->key[1] >> i);
> >>>> +   }
> >>>> +
> >>>> +   state->key++;
> >>>> +   input++;
> >>>> +   len -= sizeof(*input);
> >>>> +   }
> >>>> +}
> >>>> +
> >>>> +static inline u8 virtio_net_hash_key_length(u32 types)
> >>>> +{
> >>>> +   size_t len = 0;
> >>>> +
> >>>> +   if (types & VIRTIO_NET_HASH_REPORT_IPv4)
> >>>> +   len = max(len,
> >>>> + sizeof(struct flow_dissector_key_ipv4_addrs));
> >>>> +
> >>>> +   if (types &
> >>>> +   (VIRTIO_NET_HASH_REPORT_TCPv4 | 
> >>>> VIRTIO_NET_HASH_REPORT_UDPv4))
> >>>> +   len = max(len,
> >>>> + sizeof(struct flow_dissector_key_ipv4_addrs) +
> >>>> + sizeof(struct flow_dissector_key_ports));
> >>>> +
> >>>> +   if (types & VIRTIO_NET_HASH_REPORT_IPv6)
> >>>> +   len = max(len,
> >>>> +

Re: [PATCH net-next v12 03/10] tun: Allow steering eBPF program to fall back

2025-06-04 Thread Jason Wang
On Wed, Jun 4, 2025 at 3:24 PM Akihiko Odaki  wrote:
>
> On 2025/06/04 10:27, Jason Wang wrote:
> > On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki  
> > wrote:
> >>
> >> This clarifies a steering eBPF program takes precedence over the other
> >> steering algorithms.
> >
> > Let's give an example on the use case for this.
> >
> >>
> >> Signed-off-by: Akihiko Odaki 
> >> ---
> >>   Documentation/networking/tuntap.rst |  7 +++
> >>   drivers/net/tun.c   | 28 +---
> >>   include/uapi/linux/if_tun.h |  9 +
> >>   3 files changed, 33 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/Documentation/networking/tuntap.rst 
> >> b/Documentation/networking/tuntap.rst
> >> index 4d7087f727be..86b4ae8caa8a 100644
> >> --- a/Documentation/networking/tuntap.rst
> >> +++ b/Documentation/networking/tuntap.rst
> >> @@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
> >> return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
> >> }
> >>
> >> +3.4 Reference
> >> +-
> >> +
> >> +``linux/if_tun.h`` defines the interface described below:
> >> +
> >> +.. kernel-doc:: include/uapi/linux/if_tun.h
> >> +
> >>   Universal TUN/TAP device driver Frequently Asked Question
> >>   =
> >>
> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >> index d8f4d3e996a7..9133ab9ed3f5 100644
> >> --- a/drivers/net/tun.c
> >> +++ b/drivers/net/tun.c
> >> @@ -476,21 +476,29 @@ static u16 tun_automq_select_queue(struct tun_struct 
> >> *tun, struct sk_buff *skb)
> >>  return txq;
> >>   }
> >>
> >> -static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff 
> >> *skb)
> >> +static bool tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff 
> >> *skb,
> >> + u16 *ret)
> >>   {
> >>  struct tun_prog *prog;
> >>  u32 numqueues;
> >> -   u16 ret = 0;
> >> +   u32 prog_ret;
> >> +
> >> +   prog = rcu_dereference(tun->steering_prog);
> >> +   if (!prog)
> >> +   return false;
> >>
> >>  numqueues = READ_ONCE(tun->numqueues);
> >> -   if (!numqueues)
> >> -   return 0;
> >> +   if (!numqueues) {
> >> +   *ret = 0;
> >> +   return true;
> >> +   }
> >>
> >> -   prog = rcu_dereference(tun->steering_prog);
> >> -   if (prog)
> >> -   ret = bpf_prog_run_clear_cb(prog->prog, skb);
> >> +   prog_ret = bpf_prog_run_clear_cb(prog->prog, skb);
> >> +   if (prog_ret == TUN_STEERINGEBPF_FALLBACK)
> >> +   return false;
> >
> > This seems to break the uAPI. So I think we need a new ioctl to enable
> > the behaviour
>
> I assumed it is fine to repurpose one of the 32-bit integer values since
> 32-bit integer is too big to specify the queue number, but it may not be
> fine. I don't have a concrete use case either.
>
> Perhaps it is safer to note that TUNSETSTEERINGEBPF takes precedence
> over TUNSETVNETRSS to allow such an extension in the future (but without
> implementing one now).

Yes, that's my original point. Let's start from something that is
simpler and safer.

Thanks




Re: [PATCH net-next v12 04/10] tun: Add common virtio-net hash feature code

2025-06-04 Thread Jason Wang
On Wed, Jun 4, 2025 at 4:42 PM Akihiko Odaki  wrote:
>
> On 2025/06/04 10:53, Jason Wang wrote:
> > On Fri, May 30, 2025 at 12:50 PM Akihiko Odaki  
> > wrote:
> >>
> >> Add common code required for the features being added to TUN and TAP.
> >> They will be enabled for each of them in following patches.
> >>
> >> Added Features
> >> ==
> >>
> >> Hash reporting
> >> --
> >>
> >> Allow the guest to reuse the hash value to make receive steering
> >> consistent between the host and guest, and to save hash computation.
> >>
> >> Receive Side Scaling (RSS)
> >> --
> >>
> >> RSS is a receive steering algorithm that can be negotiated to use with
> >> virtio_net. Conventionally the hash calculation was done by the VMM.
> >> However, computing the hash after the queue was chosen defeats the
> >> purpose of RSS.
> >>
> >> Another approach is to use eBPF steering program. This approach has
> >> another downside: it cannot report the calculated hash due to the
> >> restrictive nature of eBPF steering program.
> >>
> >> Introduce the code to perform RSS to the kernel in order to overcome
> >> thse challenges. An alternative solution is to extend the eBPF steering
> >> program so that it will be able to report to the userspace, but I didn't
> >> opt for it because extending the current mechanism of eBPF steering
> >> program as is because it relies on legacy context rewriting, and
> >> introducing kfunc-based eBPF will result in non-UAPI dependency while
> >> the other relevant virtualization APIs such as KVM and vhost_net are
> >> UAPIs.
> >>
> >> Added ioctls
> >> 
> >>
> >> They are designed to make extensibility and VM migration compatible.
> >> This change only adds the implementation and does not expose them to
> >> the userspace.
> >>
> >> TUNGETVNETHASHTYPES
> >> ---
> >>
> >> This ioctl tells supported hash types. It is useful to check if a VM can
> >> be migrated to the current host.
> >>
> >> TUNSETVNETREPORTINGAUTOMQ, TUNSETVNETREPORTINGRSS, and TUNSETVNETRSS
> >> 
> >>
> >> These ioctls configures a steering algorithm and, if needed, hash
> >> reporting.
> >>
> >> Signed-off-by: Akihiko Odaki 
> >> Tested-by: Lei Yang 
> >> ---
> >>   drivers/net/tap.c   |  10 ++-
> >>   drivers/net/tun.c   |  12 +++-
> >>   drivers/net/tun_vnet.h  | 165 
> >> +---
> >>   include/uapi/linux/if_tun.h |  71 +++
> >>   4 files changed, 244 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> >> index d4ece538f1b2..25c60ff2d3f2 100644
> >> --- a/drivers/net/tap.c
> >> +++ b/drivers/net/tap.c
> >> @@ -179,6 +179,11 @@ static void tap_put_queue(struct tap_queue *q)
> >>  sock_put(&q->sk);
> >>   }
> >>
> >> +static const struct virtio_net_hash *tap_find_hash(const struct sk_buff 
> >> *skb)
> >> +{
> >> +   return NULL;
> >> +}
> >> +
> >>   /*
> >>* Select a queue based on the rxq of the device on which this packet
> >>* arrived. If the incoming device is not mq, calculate a flow hash
> >> @@ -711,11 +716,12 @@ static ssize_t tap_put_user(struct tap_queue *q,
> >>  int total;
> >>
> >>  if (q->flags & IFF_VNET_HDR) {
> >> -   struct virtio_net_hdr vnet_hdr;
> >> +   struct virtio_net_hdr_v1_hash vnet_hdr;
> >>
> >>  vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
> >>
> >> -   ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, 
> >> &vnet_hdr);
> >> +   ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, 
> >> skb,
> >> +   tap_find_hash, &vnet_hdr);
> >>  if (ret)
> >>  return ret;
> >>
> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >> index 9133ab9ed3f5..03d47799e9bd 100644
> >> --- a/drivers/net/tun.c
> >> +++ b/drivers/net/tun.c