[PATCH RFC v3 0/9] tun: Introduce virtio-net hashing feature

2024-09-14 Thread Akihiko Odaki
virtio-net has two uses for hashes: RSS and hash reporting.
Conventionally, the hash calculation was done by the VMM. However,
computing the hash after the queue has been chosen defeats the purpose
of RSS.

Another approach is to use an eBPF steering program. This approach has
a different downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Introduce code to compute hashes in the kernel to overcome these
challenges.

An alternative solution is to extend the eBPF steering program so that it
can report the hash to userspace, but that program is based on context
rewrites, which are in feature freeze. We could adopt kfuncs, but they
would not be UAPIs. We opt for an ioctl to align with other relevant
UAPIs (KVM and vhost_net).

QEMU patched to use this new feature is available at:
https://github.com/daynix/qemu/tree/akihikodaki/rss2

The QEMU patches will soon be submitted upstream as an RFC too.

This work will be presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/

V1 -> V2:
  Changed to introduce a new BPF program type.

Signed-off-by: Akihiko Odaki 
---
Changes in v3:
- Reverted to adding an ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
  "tun: Introduce virtio-net hash reporting feature" and
  "tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
  RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: 
https://lore.kernel.org/r/20231015141644.260646-1-akihiko.od...@daynix.com

---
Akihiko Odaki (9):
  skbuff: Introduce SKB_EXT_TUN_VNET_HASH
  virtio_net: Add functions for hashing
  net: flow_dissector: Export flow_keys_dissector_symmetric
  tap: Pad virtio header with zero
  tun: Pad virtio header with zero
  tun: Introduce virtio-net hash reporting feature
  tun: Introduce virtio-net RSS
  selftest: tun: Add tests for virtio-net hashing
  vhost/net: Support VIRTIO_NET_F_HASH_REPORT

 Documentation/networking/tuntap.rst  |   7 +
 drivers/net/Kconfig  |   1 +
 drivers/net/tap.c|   2 +-
 drivers/net/tun.c| 255 --
 drivers/vhost/net.c  |  16 +-
 include/linux/skbuff.h   |  10 +
 include/linux/virtio_net.h   | 198 +++
 include/net/flow_dissector.h |   1 +
 include/uapi/linux/if_tun.h  |  71 
 net/core/flow_dissector.c|   3 +-
 net/core/skbuff.c|   3 +
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/tun.c| 666 ++-
 13 files changed, 1195 insertions(+), 40 deletions(-)
---
base-commit: 46a0057a5853cbdb58211c19e89badc6fd50
change-id: 20240403-rss-e737d89efa77

Best regards,
-- 
Akihiko Odaki 




[PATCH RFC v3 1/9] skbuff: Introduce SKB_EXT_TUN_VNET_HASH

2024-09-14 Thread Akihiko Odaki
This new extension will be used by tun to carry the hash values and
types to report with virtio-net headers.

Signed-off-by: Akihiko Odaki 
---
 include/linux/skbuff.h | 10 ++
 net/core/skbuff.c  |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 29c3ea5b6e93..17cee21c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -334,6 +334,13 @@ struct tc_skb_ext {
 };
 #endif
 
+#if IS_ENABLED(CONFIG_TUN)
+struct tun_vnet_hash_ext {
+   u32 value;
+   u16 report;
+};
+#endif
+
 struct sk_buff_head {
/* These two members must be first to match sk_buff. */
struct_group_tagged(sk_buff_list, list,
@@ -4718,6 +4725,9 @@ enum skb_ext_id {
 #endif
 #if IS_ENABLED(CONFIG_MCTP_FLOWS)
SKB_EXT_MCTP,
+#endif
+#if IS_ENABLED(CONFIG_TUN)
+   SKB_EXT_TUN_VNET_HASH,
 #endif
SKB_EXT_NUM, /* must be last */
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 83f8cd8aa2d1..ce34523fd8de 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4979,6 +4979,9 @@ static const u8 skb_ext_type_len[] = {
 #if IS_ENABLED(CONFIG_MCTP_FLOWS)
[SKB_EXT_MCTP] = SKB_EXT_CHUNKSIZEOF(struct mctp_flow),
 #endif
+#if IS_ENABLED(CONFIG_TUN)
+   [SKB_EXT_TUN_VNET_HASH] = SKB_EXT_CHUNKSIZEOF(struct tun_vnet_hash_ext),
+#endif
 };
 
 static __always_inline unsigned int skb_ext_total_length(void)

-- 
2.46.0




[PATCH RFC v3 2/9] virtio_net: Add functions for hashing

2024-09-14 Thread Akihiko Odaki
These functions are useful for implementing VIRTIO_NET_F_RSS and
VIRTIO_NET_F_HASH_REPORT.

Signed-off-by: Akihiko Odaki 
---
 include/linux/virtio_net.h | 198 +
 1 file changed, 198 insertions(+)
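
For readers who have not met the Toeplitz hash that virtio-net RSS is built on, here is a minimal userspace sketch. It assumes the textbook bit-at-a-time formulation and is not the helper this patch adds; the name toeplitz_hash() and its byte-array interface are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative userspace Toeplitz hash (hypothetical helper, not kernel code).
 * "key" must hold at least len + 4 bytes; "input" is in network byte order.
 */
static uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *input,
			      size_t len)
{
	/* The sliding window starts as the first 32 bits of the key. */
	uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
			  ((uint32_t)key[2] << 8) | (uint32_t)key[3];
	uint32_t hash = 0;

	for (size_t i = 0; i < len; i++) {
		for (int bit = 7; bit >= 0; bit--) {
			/* XOR in the window for every set input bit, MSB first. */
			if (input[i] & (1u << bit))
				hash ^= window;

			/* Shift the window left and pull in the next key bit. */
			window = (window << 1) | ((key[i + 4] >> bit) & 1u);
		}
	}

	return hash;
}
```

For a TCPv4 packet the input would be the source address, destination address, source port and destination port, concatenated in network byte order; an IPv4-only hash uses just the two addresses.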

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 6c395a2600e8..7ee2e2f2625a 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -9,6 +9,183 @@
 #include 
 #include 
 
+struct virtio_net_hash {
+   u32 value;
+   u16 report;
+};
+
+struct virtio_net_toeplitz_state {
+   u32 hash;
+   u32 key_buffer;
+   const __be32 *key;
+};
+
+#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
+
+#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
+
+static inline void virtio_net_toeplitz(struct virtio_net_toeplitz_state *state,
+  const __be32 *input, size_t len)
+{
+   u32 key;
+
+   while (len) {
+   state->key++;
+   key = be32_to_cpu(*state->key);
+
+   for (u32 bit = BIT(31); bit; bit >>= 1) {
+   if (be32_to_cpu(*input) & bit)
+   state->hash ^= state->key_buffer;
+
+   state->key_buffer =
+   (state->key_buffer << 1) | !!(key & bit);
+   }
+
+   input++;
+   len--;
+   }
+}
+
+static inline u8 virtio_net_hash_key_length(u32 types)
+{
+   size_t len = 0;
+
+   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs));
+
+   if (types &
+   (VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs));
+
+   if (types &
+   (VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   return 4 + len;
+}
+
+static inline u32 virtio_net_hash_report(u32 types,
+struct flow_dissector_key_basic key)
+{
+   switch (key.n_proto) {
+   case htons(ETH_P_IP):
+   if (key.ip_proto == IPPROTO_TCP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
+   return VIRTIO_NET_HASH_REPORT_TCPv4;
+
+   if (key.ip_proto == IPPROTO_UDP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
+   return VIRTIO_NET_HASH_REPORT_UDPv4;
+
+   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
+   return VIRTIO_NET_HASH_REPORT_IPv4;
+
+   return VIRTIO_NET_HASH_REPORT_NONE;
+
+   case htons(ETH_P_IPV6):
+   if (key.ip_proto == IPPROTO_TCP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
+   return VIRTIO_NET_HASH_REPORT_TCPv6;
+
+   if (key.ip_proto == IPPROTO_UDP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
+   return VIRTIO_NET_HASH_REPORT_UDPv6;
+
+   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
+   return VIRTIO_NET_HASH_REPORT_IPv6;
+
+   return VIRTIO_NET_HASH_REPORT_NONE;
+
+   default:
+   return VIRTIO_NET_HASH_REPORT_NONE;
+   }
+}
+
+static inline bool virtio_net_hash_rss(const struct sk_buff *skb,
+  u32 types, const __be32 *key,
+  struct virtio_net_hash *hash)
+{
+   u16 report;
+   struct virtio_net_toeplitz_state toeplitz_state = {
+   .key_buffer = be32_to_cpu(*key),
+   .key = key
+   };
+   struct flow_keys flow;
+
+   if (!skb_flow_dissect_flow_keys(skb, &flow, 0))
+   return false;
+
+   report = virtio_net_hash_report(types, flow.basic);
+
+   switch (report) {
+   case VIRTIO_NET_HASH_REPORT_IPv4:
+   virtio_net_toeplitz(&toeplitz_state,
+   (__be32 *)&flow.addrs.v4addrs,
+

[PATCH RFC v3 3/9] net: flow_dissector: Export flow_keys_dissector_symmetric

2024-09-14 Thread Akihiko Odaki
flow_keys_dissector_symmetric is useful for deriving a symmetric hash
and for identifying what it was computed from, such as IPv4, IPv6, TCP,
and UDP.

Signed-off-by: Akihiko Odaki 
---
 include/net/flow_dissector.h | 1 +
 net/core/flow_dissector.c| 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index ced79dc8e856..d01c1ec77b7d 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -423,6 +423,7 @@ __be32 flow_get_u32_src(const struct flow_keys *flow);
 __be32 flow_get_u32_dst(const struct flow_keys *flow);
 
 extern struct flow_dissector flow_keys_dissector;
+extern struct flow_dissector flow_keys_dissector_symmetric;
 extern struct flow_dissector flow_keys_basic_dissector;
 
 /* struct flow_keys_digest:
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 0e638a37aa09..9822988f2d49 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1852,7 +1852,8 @@ void make_flow_keys_digest(struct flow_keys_digest *digest,
 }
 EXPORT_SYMBOL(make_flow_keys_digest);
 
-static struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
+struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
+EXPORT_SYMBOL(flow_keys_dissector_symmetric);
 
 u32 __skb_get_hash_symmetric_net(const struct net *net, const struct sk_buff *skb)
 {

-- 
2.46.0




[PATCH RFC v3 4/9] tap: Pad virtio header with zero

2024-09-14 Thread Akihiko Odaki
tap used to simply advance the iov_iter when it needed to pad the virtio
header. This left whatever garbage was in the buffer untouched, so the
reader could not tell whether the header was padded or contained real
data.

In theory, a user of tap can zero the buffer before calling read() to
avoid this problem, but leaving garbage in the buffer is awkward anyway,
so zero the padding in tap.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
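
To make the difference visible from the reader's side, here is a small userspace model of the two behaviours. It is only a sketch: put_vnet_hdr() and tail_is_zero() are hypothetical helpers, and a plain byte buffer stands in for the iov_iter that the kernel actually writes through.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BASE_HDR_LEN 10 /* sizeof(struct virtio_net_hdr) */

/*
 * Hypothetical model of the header copy: write the base header into a
 * slot of vnet_hdr_len bytes, then either skip the tail (old behaviour)
 * or zero it (new behaviour).
 */
static void put_vnet_hdr(uint8_t *buf, size_t vnet_hdr_len,
			 const uint8_t base[BASE_HDR_LEN], int zero_pad)
{
	memcpy(buf, base, BASE_HDR_LEN);
	if (zero_pad)
		memset(buf + BASE_HDR_LEN, 0, vnet_hdr_len - BASE_HDR_LEN);
}

/* Returns 1 if every byte after the base header is zero, 0 otherwise. */
static int tail_is_zero(const uint8_t *buf, size_t vnet_hdr_len)
{
	for (size_t i = BASE_HDR_LEN; i < vnet_hdr_len; i++) {
		if (buf[i])
			return 0;
	}

	return 1;
}
```

With the old behaviour a reader that reuses a buffer sees stale bytes in the tail; with zero padding the tail is predictable, so fields that later land there can be told apart from padding.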

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 77574f7a3bd4..ba044302ccc6 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -813,7 +813,7 @@ static ssize_t tap_put_user(struct tap_queue *q,
sizeof(vnet_hdr))
return -EFAULT;
 
-   iov_iter_advance(iter, vnet_hdr_len - sizeof(vnet_hdr));
+   iov_iter_zero(vnet_hdr_len - sizeof(vnet_hdr), iter);
}
total = vnet_hdr_len;
total += skb->len;

-- 
2.46.0




[PATCH RFC v3 5/9] tun: Pad virtio header with zero

2024-09-14 Thread Akihiko Odaki
tun used to simply advance the iov_iter when it needed to pad the virtio
header. This left whatever garbage was in the buffer untouched, so the
reader could not tell whether the header was padded or contained real
data.

In theory, a user of tun can zero the buffer before calling read() to
avoid this problem, but leaving garbage in the buffer is awkward anyway,
so zero the padding in tun.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tun.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 1d06c560c5e6..9d93ab9ee58f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2073,7 +2073,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
if (unlikely(copy_to_iter(&gso, sizeof(gso), iter) !=
 sizeof(gso)))
return -EFAULT;
-   iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso));
+   iov_iter_zero(vnet_hdr_sz - sizeof(gso), iter);
}
 
ret = copy_to_iter(xdp_frame->data, size, iter) + vnet_hdr_sz;
@@ -2146,7 +2146,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
if (copy_to_iter(&gso, sizeof(gso), iter) != sizeof(gso))
return -EFAULT;
 
-   iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso));
+   iov_iter_zero(vnet_hdr_sz - sizeof(gso), iter);
}
 
if (vlan_hlen) {

-- 
2.46.0




[PATCH RFC v3 6/9] tun: Introduce virtio-net hash reporting feature

2024-09-14 Thread Akihiko Odaki
Allow the guest to reuse the hash value to make receive steering
consistent between the host and guest, and to save hash computation.

Signed-off-by: Akihiko Odaki 
---
 Documentation/networking/tuntap.rst |   7 ++
 drivers/net/Kconfig |   1 +
 drivers/net/tun.c   | 146 +++-
 include/uapi/linux/if_tun.h |  44 +++
 4 files changed, 180 insertions(+), 18 deletions(-)
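
As a rough picture of what a consumer does with the reported hash, the sketch below pulls hash_value and hash_report out of a 20-byte header laid out like struct virtio_net_hdr_v1_hash. It assumes a little-endian header; struct rx_hash and parse_rx_hash() are hypothetical names, not part of this series.

```c
#include <stdint.h>

#define HASH_VALUE_OFF   12 /* follows the 12-byte virtio_net_hdr_v1 */
#define HASH_REPORT_OFF  16
#define HASH_REPORT_NONE 0  /* VIRTIO_NET_HASH_REPORT_NONE */

/* Hypothetical container for the two fields of interest. */
struct rx_hash {
	uint32_t value;
	uint16_t report;
};

/*
 * Decode the hash fields from a little-endian header.
 * Returns 1 if the header carries a usable hash, 0 otherwise.
 */
static int parse_rx_hash(const uint8_t hdr[20], struct rx_hash *out)
{
	out->value = (uint32_t)hdr[HASH_VALUE_OFF] |
		     ((uint32_t)hdr[HASH_VALUE_OFF + 1] << 8) |
		     ((uint32_t)hdr[HASH_VALUE_OFF + 2] << 16) |
		     ((uint32_t)hdr[HASH_VALUE_OFF + 3] << 24);
	out->report = (uint16_t)(hdr[HASH_REPORT_OFF] |
				 (hdr[HASH_REPORT_OFF + 1] << 8));

	return out->report != HASH_REPORT_NONE;
}
```

A report of zero means no hash was computed for the packet, which is also what a zero-padded header reads as.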

diff --git a/Documentation/networking/tuntap.rst b/Documentation/networking/tuntap.rst
index 4d7087f727be..86b4ae8caa8a 100644
--- a/Documentation/networking/tuntap.rst
+++ b/Documentation/networking/tuntap.rst
@@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
   return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
   }
 
+3.4 Reference
+-------------
+
+``linux/if_tun.h`` defines the interface described below:
+
+.. kernel-doc:: include/uapi/linux/if_tun.h
+
 Universal TUN/TAP device driver Frequently Asked Question
 =
 
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 9920b3a68ed1..e2a7bd703550 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
tristate "Universal TUN/TAP device driver support"
depends on INET
select CRC32
+   select SKB_EXTENSIONS
help
  TUN/TAP provides packet reception and transmission for user space
  programs.  It can be viewed as a simple Point-to-Point or Ethernet
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9d93ab9ee58f..b8fcd71becac 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -173,6 +173,10 @@ struct tun_prog {
struct bpf_prog *prog;
 };
 
+struct tun_vnet_hash_container {
+   struct tun_vnet_hash common;
+};
+
 /* Since the socket were moved to tun_file, to preserve the behavior of persist
  * device, socket filter, sndbuf and vnet header size were restore when the
  * file were attached to a persist device.
@@ -210,6 +214,7 @@ struct tun_struct {
struct bpf_prog __rcu *xdp_prog;
struct tun_prog __rcu *steering_prog;
struct tun_prog __rcu *filter_prog;
+   struct tun_vnet_hash_container __rcu *vnet_hash;
struct ethtool_link_ksettings link_ksettings;
/* init args */
struct file *file;
@@ -221,6 +226,11 @@ struct veth {
__be16 h_vlan_TCI;
 };
 
+static const struct tun_vnet_hash tun_vnet_hash_cap = {
+   .flags = TUN_VNET_HASH_REPORT,
+   .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
+};
+
 static void tun_flow_init(struct tun_struct *tun);
 static void tun_flow_uninit(struct tun_struct *tun);
 
@@ -322,10 +332,17 @@ static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
if (get_user(be, argp))
return -EFAULT;
 
-   if (be)
+   if (be) {
+   struct tun_vnet_hash_container *vnet_hash = rtnl_dereference(tun->vnet_hash);
+
+   if (!(tun->flags & TUN_VNET_LE) &&
+   vnet_hash && (vnet_hash->common.flags & TUN_VNET_HASH_REPORT))
+   return -EBUSY;
+
tun->flags |= TUN_VNET_BE;
-   else
+   } else {
tun->flags &= ~TUN_VNET_BE;
+   }
 
return 0;
 }
@@ -522,14 +539,20 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
  * the userspace application move between processors, we may get a
  * different rxq no. here.
  */
-static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
+static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb,
+  const struct tun_vnet_hash_container *vnet_hash)
 {
+   struct tun_vnet_hash_ext *ext;
+   struct flow_keys keys;
struct tun_flow_entry *e;
u32 txq, numqueues;
 
numqueues = READ_ONCE(tun->numqueues);
 
-   txq = __skb_get_hash_symmetric(skb);
+   memset(&keys, 0, sizeof(keys));
+   skb_flow_dissect(skb, &flow_keys_dissector_symmetric, &keys, 0);
+
+   txq = flow_hash_from_keys(&keys);
e = tun_flow_find(&tun->flows[tun_hashfn(txq)], txq);
if (e) {
tun_flow_save_rps_rxhash(e, txq);
@@ -538,6 +561,16 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
txq = reciprocal_scale(txq, numqueues);
}
 
+   if (vnet_hash && (vnet_hash->common.flags & TUN_VNET_HASH_REPORT)) {
+   ext = skb_ext_add(skb, SKB_EXT_TUN_VNET_HASH);
+   if (ext) {
+   u32 types = vnet_hash->common.types;
+
+   ext->report = virtio_net_hash_report(types, keys.basic);
+   ext->value = skb->l4_hash ? skb->hash : txq;
+   }
+   }
+
return txq;
 }

[PATCH RFC v3 7/9] tun: Introduce virtio-net RSS

2024-09-14 Thread Akihiko Odaki
RSS is a receive steering algorithm that can be negotiated for use with
virtio_net. Conventionally, the hash calculation was done by the VMM.
However, computing the hash after the queue has been chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
a different downside: it cannot report the calculated hash due to the
restrictive nature of the eBPF steering program.

Introduce code to perform RSS in the kernel to overcome these
challenges. An alternative solution is to extend the eBPF steering
program so that it can report the hash to userspace, but I did not opt
for it: extending the current eBPF steering mechanism as is would rely
on legacy context rewriting, while introducing kfunc-based eBPF would
create a non-UAPI dependency even though the other relevant
virtualization APIs, such as KVM and vhost_net, are UAPIs.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tun.c   | 119 +++-
 include/uapi/linux/if_tun.h |  27 ++
 2 files changed, 133 insertions(+), 13 deletions(-)
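
The queue selection that RSS performs once a hash is available can be sketched in a few lines of userspace C. This is a simplified model rather than the tun code itself: struct rss_cfg and rss_select_queue() are hypothetical, and the "classified" flag stands in for whether the packet could be hashed at all.

```c
#include <stdint.h>

/* Hypothetical RSS configuration, for illustration only. */
struct rss_cfg {
	uint16_t indirection_table_mask; /* table length - 1, power of two */
	uint16_t unclassified_queue;     /* used when no hash is available */
	const uint16_t *indirection_table;
};

/*
 * Map a hash to a queue: mask the hash into the indirection table, then
 * clamp the result to the number of queues currently attached.
 */
static uint16_t rss_select_queue(const struct rss_cfg *cfg, uint32_t numqueues,
				 int classified, uint32_t hash)
{
	uint16_t txq;

	if (!numqueues)
		return 0;

	if (!classified)
		return cfg->unclassified_queue % numqueues;

	txq = cfg->indirection_table[hash & cfg->indirection_table_mask];

	return txq % numqueues;
}
```

The final modulo matters because the table may name queues that are not attached at the moment the packet arrives.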

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b8fcd71becac..5a429b391144 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -175,6 +175,9 @@ struct tun_prog {
 
 struct tun_vnet_hash_container {
struct tun_vnet_hash common;
+   struct tun_vnet_hash_rss rss;
+   __be32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
+   u16 rss_indirection_table[];
 };
 
 /* Since the socket were moved to tun_file, to preserve the behavior of persist
@@ -227,7 +230,7 @@ struct veth {
 };
 
 static const struct tun_vnet_hash tun_vnet_hash_cap = {
-   .flags = TUN_VNET_HASH_REPORT,
+   .flags = TUN_VNET_HASH_REPORT | TUN_VNET_HASH_RSS,
.types = VIRTIO_NET_SUPPORTED_HASH_TYPES
 };
 
@@ -591,6 +594,36 @@ static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
return ret % numqueues;
 }
 
+static u16 tun_vnet_rss_select_queue(struct tun_struct *tun,
+struct sk_buff *skb,
+const struct tun_vnet_hash_container *vnet_hash)
+{
+   struct tun_vnet_hash_ext *ext;
+   struct virtio_net_hash hash;
+   u32 numqueues = READ_ONCE(tun->numqueues);
+   u16 txq, index;
+
+   if (!numqueues)
+   return 0;
+
+   if (!virtio_net_hash_rss(skb, vnet_hash->common.types, vnet_hash->rss_key,
+&hash))
+   return vnet_hash->rss.unclassified_queue % numqueues;
+
+   if (vnet_hash->common.flags & TUN_VNET_HASH_REPORT) {
+   ext = skb_ext_add(skb, SKB_EXT_TUN_VNET_HASH);
+   if (ext) {
+   ext->value = hash.value;
+   ext->report = hash.report;
+   }
+   }
+
+   index = hash.value & vnet_hash->rss.indirection_table_mask;
+   txq = READ_ONCE(vnet_hash->rss_indirection_table[index]);
+
+   return txq % numqueues;
+}
+
 static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
struct net_device *sb_dev)
 {
@@ -603,7 +636,10 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
} else {
struct tun_vnet_hash_container *vnet_hash = rcu_dereference(tun->vnet_hash);
 
-   ret = tun_automq_select_queue(tun, skb, vnet_hash);
+   if (vnet_hash && (vnet_hash->common.flags & TUN_VNET_HASH_RSS))
+   ret = tun_vnet_rss_select_queue(tun, skb, vnet_hash);
+   else
+   ret = tun_automq_select_queue(tun, skb, vnet_hash);
}
rcu_read_unlock();
 
@@ -3085,13 +3121,9 @@ static int tun_set_queue(struct file *file, struct ifreq *ifr)
 }
 
 static int tun_set_ebpf(struct tun_struct *tun, struct tun_prog __rcu **prog_p,
-   void __user *data)
+   int fd)
 {
struct bpf_prog *prog;
-   int fd;
-
-   if (copy_from_user(&fd, data, sizeof(fd)))
-   return -EFAULT;
 
if (fd == -1) {
prog = NULL;
@@ -3157,6 +3189,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
int ifindex;
int sndbuf;
int vnet_hdr_sz;
+   int fd;
int le;
int ret;
bool do_notify = false;
@@ -3460,11 +3493,27 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
break;
 
case TUNSETSTEERINGEBPF:
-   ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
+   if (get_user(fd, (int __user *)argp)) {
+   ret = -EFAULT;
+   break;
+   }
+
+   vnet_hash = rtnl_dereference(tun->vnet_hash);
+   if (fd != -1 && vnet_hash && (vnet_hash->common.fla

[PATCH RFC v3 8/9] selftest: tun: Add tests for virtio-net hashing

2024-09-14 Thread Akihiko Odaki
The added tests confirm tun can perform RSS and hash reporting, and
reject invalid configurations for them.

Signed-off-by: Akihiko Odaki 
---
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/tun.c| 666 ++-
 2 files changed, 660 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 8eaffd7a641c..5629e68bf69d 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -109,6 +109,6 @@ $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma
 $(OUTPUT)/tcp_mmap: LDLIBS += -lpthread -lcrypto
 $(OUTPUT)/tcp_inq: LDLIBS += -lpthread
 $(OUTPUT)/bind_bhash: LDLIBS += -lpthread
-$(OUTPUT)/io_uring_zerocopy_tx: CFLAGS += -I../../../include/
+$(OUTPUT)/io_uring_zerocopy_tx $(OUTPUT)/tun: CFLAGS += -I../../../include/
 
 include bpf.mk
diff --git a/tools/testing/selftests/net/tun.c b/tools/testing/selftests/net/tun.c
index fa83918b62d1..f46affa39d5c 100644
--- a/tools/testing/selftests/net/tun.c
+++ b/tools/testing/selftests/net/tun.c
@@ -2,21 +2,37 @@
 
 #define _GNU_SOURCE
 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
-#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
 #include 
+#include 
 #include 
 #include 
-#include 
-#include 
+#include 
+#include 
+#include 
+#include 
 
 #include "../kselftest_harness.h"
 
+#define TUN_HWADDR_SOURCE { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00 }
+#define TUN_HWADDR_DEST { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 }
+#define TUN_IPADDR_SOURCE htonl((172 << 24) | (17 << 16) | 0)
+#define TUN_IPADDR_DEST htonl((172 << 24) | (17 << 16) | 1)
+
 static int tun_attach(int fd, char *dev)
 {
struct ifreq ifr;
@@ -39,7 +55,7 @@ static int tun_detach(int fd, char *dev)
return ioctl(fd, TUNSETQUEUE, (void *) &ifr);
 }
 
-static int tun_alloc(char *dev)
+static int tun_alloc(char *dev, short flags)
 {
struct ifreq ifr;
int fd, err;
@@ -52,7 +68,8 @@ static int tun_alloc(char *dev)
 
memset(&ifr, 0, sizeof(ifr));
strcpy(ifr.ifr_name, dev);
-   ifr.ifr_flags = IFF_TAP | IFF_NAPI | IFF_MULTI_QUEUE;
+   ifr.ifr_flags = flags | IFF_TAP | IFF_NAPI | IFF_NO_PI |
+   IFF_MULTI_QUEUE;
 
err = ioctl(fd, TUNSETIFF, (void *) &ifr);
if (err < 0) {
@@ -64,6 +81,40 @@ static int tun_alloc(char *dev)
return fd;
 }
 
+static bool tun_add_to_bridge(int local_fd, const char *name)
+{
+   struct ifreq ifreq = {
+   .ifr_name = "xbridge",
+   .ifr_ifindex = if_nametoindex(name)
+   };
+
+   if (!ifreq.ifr_ifindex) {
+   perror("if_nametoindex");
+   return false;
+   }
+
+   if (ioctl(local_fd, SIOCBRADDIF, &ifreq)) {
+   perror("SIOCBRADDIF");
+   return false;
+   }
+
+   return true;
+}
+
+static bool tun_set_flags(int local_fd, const char *name, short flags)
+{
+   struct ifreq ifreq = { .ifr_flags = flags };
+
+   strcpy(ifreq.ifr_name, name);
+
+   if (ioctl(local_fd, SIOCSIFFLAGS, &ifreq)) {
+   perror("SIOCSIFFLAGS");
+   return false;
+   }
+
+   return true;
+}
+
 static int tun_delete(char *dev)
 {
struct {
@@ -102,6 +153,159 @@ static int tun_delete(char *dev)
return ret;
 }
 
+static uint32_t tun_sum(const void *buf, size_t len)
+{
+   const uint16_t *sbuf = buf;
+   uint32_t sum = 0;
+
+   while (len > 1) {
+   sum += *sbuf++;
+   len -= 2;
+   }
+
+   if (len)
+   sum += *(uint8_t *)sbuf;
+
+   return sum;
+}
+
+static uint16_t tun_build_ip_check(uint32_t sum)
+{
+   return ~((sum & 0xffff) + (sum >> 16));
+}
+
+static uint32_t tun_build_ip_pseudo_sum(const void *iphdr)
+{
+   uint16_t tot_len = ntohs(((struct iphdr *)iphdr)->tot_len);
+
+   return tun_sum((char *)iphdr + offsetof(struct iphdr, saddr), 8) +
+  htons(((struct iphdr *)iphdr)->protocol) +
+  htons(tot_len - sizeof(struct iphdr));
+}
+
+static uint32_t tun_build_ipv6_pseudo_sum(const void *ipv6hdr)
+{
+   return tun_sum((char *)ipv6hdr + offsetof(struct ipv6hdr, saddr), 32) +
+  ((struct ipv6hdr *)ipv6hdr)->payload_len +
+  htons(((struct ipv6hdr *)ipv6hdr)->nexthdr);
+}
+
+static void tun_build_ethhdr(struct ethhdr *ethhdr, uint16_t proto)
+{
+   *ethhdr = (struct ethhdr) {
+   .h_dest = TUN_HWADDR_DEST,
+   .h_source = TUN_HWADDR_SOURCE,
+   .h_proto = htons(proto)
+   };
+}
+
+static void tun_build_iphdr(void *dest, uint16_t len, uint8_t protocol)
+{
+   struct iphdr iphdr = {
+   .ihl = sizeof(iphdr) / 4,
+   .version = 

[PATCH RFC v3 9/9] vhost/net: Support VIRTIO_NET_F_HASH_REPORT

2024-09-14 Thread Akihiko Odaki
VIRTIO_NET_F_HASH_REPORT allows reporting hash values calculated on the
host. When VHOST_NET_F_VIRTIO_NET_HDR is employed, no hash values are
reported (i.e., the hash_report member is always set to
VIRTIO_NET_HASH_REPORT_NONE). Otherwise, the values reported by the
underlying socket are passed through.

VIRTIO_NET_F_HASH_REPORT requires VIRTIO_F_VERSION_1.

Signed-off-by: Akihiko Odaki 
---
 drivers/vhost/net.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)
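
The two rules this patch adds, the feature dependency and the header length choice, can be modelled in userspace as below. This is a sketch under stated assumptions: the F_* macros are local stand-ins for the virtio feature bits, the lengths are the sizes of the corresponding virtio header structs written as literals, and features_valid() and vnet_hdr_len() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define F_MRG_RXBUF   (1ULL << 15) /* VIRTIO_NET_F_MRG_RXBUF */
#define F_VERSION_1   (1ULL << 32) /* VIRTIO_F_VERSION_1 */
#define F_HASH_REPORT (1ULL << 57) /* VIRTIO_NET_F_HASH_REPORT */

/* Hash reporting is only acceptable together with VERSION_1. */
static int features_valid(uint64_t features)
{
	return !(features & F_HASH_REPORT) || (features & F_VERSION_1);
}

/* Pick the virtio-net header length implied by the feature set. */
static size_t vnet_hdr_len(uint64_t features)
{
	if (features & F_HASH_REPORT)
		return 20; /* struct virtio_net_hdr_v1_hash */

	if (features & (F_MRG_RXBUF | F_VERSION_1))
		return 12; /* struct virtio_net_hdr_mrg_rxbuf */

	return 10; /* struct virtio_net_hdr */
}
```

Checking the dependency before computing the length keeps a legacy-layout device from ever being handed the 20-byte header.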

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f16279351db5..ec1167a782ec 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -73,6 +73,7 @@ enum {
VHOST_NET_FEATURES = VHOST_FEATURES |
 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+(1ULL << VIRTIO_NET_F_HASH_REPORT) |
 (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
 (1ULL << VIRTIO_F_RING_RESET)
 };
@@ -1604,10 +1605,13 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
size_t vhost_hlen, sock_hlen, hdr_len;
int i;
 
-   hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
-  (1ULL << VIRTIO_F_VERSION_1))) ?
-   sizeof(struct virtio_net_hdr_mrg_rxbuf) :
-   sizeof(struct virtio_net_hdr);
+   if (features & (1ULL << VIRTIO_NET_F_HASH_REPORT))
+   hdr_len = sizeof(struct virtio_net_hdr_v1_hash);
+   else if (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+(1ULL << VIRTIO_F_VERSION_1)))
+   hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+   else
+   hdr_len = sizeof(struct virtio_net_hdr);
if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
/* vhost provides vnet_hdr */
vhost_hlen = hdr_len;
@@ -1688,6 +1692,10 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
return -EFAULT;
if (features & ~VHOST_NET_FEATURES)
return -EOPNOTSUPP;
+   if ((features & ((1ULL << VIRTIO_F_VERSION_1) |
+(1ULL << VIRTIO_NET_F_HASH_REPORT))) ==
+   (1ULL << VIRTIO_NET_F_HASH_REPORT))
+   return -EINVAL;
return vhost_net_set_features(n, features);
case VHOST_GET_BACKEND_FEATURES:
features = VHOST_NET_BACKEND_FEATURES;

-- 
2.46.0




[PATCH RFC v5 02/10] skbuff: Introduce SKB_EXT_TUN_VNET_HASH

2024-10-08 Thread Akihiko Odaki
This new extension will be used by tun to carry the hash values and
types to report with virtio-net headers.

Signed-off-by: Akihiko Odaki 
---
 include/linux/skbuff.h | 3 +++
 net/core/skbuff.c  | 4 
 2 files changed, 7 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 29c3ea5b6e93..a361c4150144 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4718,6 +4718,9 @@ enum skb_ext_id {
 #endif
 #if IS_ENABLED(CONFIG_MCTP_FLOWS)
SKB_EXT_MCTP,
+#endif
+#if IS_ENABLED(CONFIG_TUN)
+   SKB_EXT_TUN_VNET_HASH,
 #endif
SKB_EXT_NUM, /* must be last */
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 83f8cd8aa2d1..f0bf94cf458b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -64,6 +64,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -4979,6 +4980,9 @@ static const u8 skb_ext_type_len[] = {
 #if IS_ENABLED(CONFIG_MCTP_FLOWS)
[SKB_EXT_MCTP] = SKB_EXT_CHUNKSIZEOF(struct mctp_flow),
 #endif
+#if IS_ENABLED(CONFIG_TUN)
+   [SKB_EXT_TUN_VNET_HASH] = SKB_EXT_CHUNKSIZEOF(struct virtio_net_hash),
+#endif
 };
 
 static __always_inline unsigned int skb_ext_total_length(void)

-- 
2.46.2




[PATCH RFC v5 00/10] tun: Introduce virtio-net hashing feature

2024-10-08 Thread Akihiko Odaki
virtio-net has two uses for hashes: RSS and hash reporting.
Conventionally, the hash calculation was done by the VMM. However,
computing the hash after the queue has been chosen defeats the purpose
of RSS.

Another approach is to use an eBPF steering program. This approach has
a different downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Introduce code to compute hashes in the kernel to overcome these
challenges.

An alternative solution is to extend the eBPF steering program so that it
can report the hash to userspace, but that program is based on context
rewrites, which are in feature freeze. We could adopt kfuncs, but they
would not be UAPIs. We opt for an ioctl to align with other relevant
UAPIs (KVM and vhost_net).

The patches for QEMU to use this new feature were submitted as an RFC and
are available at:
https://patchew.org/QEMU/20240915-hash-v3-0-79cb08d28...@daynix.com/

This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/

V1 -> V2:
  Changed to introduce a new BPF program type.

Signed-off-by: Akihiko Odaki 
---
Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
  https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324cab4
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
  have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: 
https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0...@daynix.com

Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
  "tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured that vnet_hash is freed when destroying tun_struct.
- Link to v3: 
https://lore.kernel.org/r/20240915-rss-v3-0-c630015db...@daynix.com

Changes in v3:
- Reverted to adding an ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
  "tun: Introduce virtio-net hash reporting feature" and
  "tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
  RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: 
https://lore.kernel.org/r/20231015141644.260646-1-akihiko.od...@daynix.com

---
Akihiko Odaki (10):
  virtio_net: Add functions for hashing
  skbuff: Introduce SKB_EXT_TUN_VNET_HASH
  net: flow_dissector: Export flow_keys_dissector_symmetric
  tun: Unify vnet implementation
  tun: Pad virtio header with zero
  tun: Introduce virtio-net hash reporting feature
  tun: Introduce virtio-net RSS
  selftest: tun: Test vnet ioctls without device
  selftest: tun: Add tests for virtio-net hashing
  vhost/net: Support VIRTIO_NET_F_HASH_REPORT

 Documentation/networking/tuntap.rst  |   7 +
 MAINTAINERS  |   1 +
 drivers/net/Kconfig  |   1 +
 drivers/net/tap.c| 218 
 drivers/net/tun.c| 293 ++--
 drivers/net/tun_vnet.h   | 342 +++
 drivers/vhost/net.c  |  16 +-
 include/linux/if_tap.h   |   2 +
 include/linux/skbuff.h   |   3 +
 include/linux/virtio_net.h   | 188 +++
 include/net/flow_dissector.h |   1 +
 include/uapi/linux/if_tun.h  |  75 +
 net/core/flow_dissector.c|   3 +-
 net/core/skbuff.c|   4 +
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/tun.c| 630 ++-
 16 files changed, 1430 insertions(+), 356 deletions(-)
---
base-commit: 752ebcbe87aceeb6334e846a466116197711a982
change-id: 20240403-rss-e737d89efa77

Best regards,
-- 
Akihiko Odaki 




[PATCH RFC v5 01/10] virtio_net: Add functions for hashing

2024-10-08 Thread Akihiko Odaki
They are useful to implement VIRTIO_NET_F_RSS and
VIRTIO_NET_F_HASH_REPORT.

Signed-off-by: Akihiko Odaki 
---
 include/linux/virtio_net.h | 188 +
 1 file changed, 188 insertions(+)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 276ca543ef44..6f192bb9ba1d 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -9,6 +9,194 @@
 #include 
 #include 
 
+struct virtio_net_hash {
+   u32 value;
+   u16 report;
+};
+
+struct virtio_net_toeplitz_state {
+   u32 hash;
+   const u32 *key;
+};
+
+#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
+
+#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
+
+static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
+{
+   while (len >= sizeof(*input)) {
+   *input = be32_to_cpu((__force __be32)*input);
+   input++;
+   len -= sizeof(*input);
+   }
+}
+
+static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
+   const __be32 *input, size_t len)
+{
+   while (len >= sizeof(*input)) {
+   for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
+   u32 i = ffs(map);
+
+   state->hash ^= state->key[0] << (32 - i) |
+  (u32)((u64)state->key[1] >> i);
+   }
+
+   state->key++;
+   input++;
+   len -= sizeof(*input);
+   }
+}
+
+static inline u8 virtio_net_hash_key_length(u32 types)
+{
+   size_t len = 0;
+
+   if (types & VIRTIO_NET_HASH_REPORT_IPv4)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs));
+
+   if (types &
+   (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   if (types & VIRTIO_NET_HASH_REPORT_IPv6)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs));
+
+   if (types &
+   (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   return len + 4;
+}
+
+static inline u32 virtio_net_hash_report(u32 types,
+const struct flow_keys_basic *keys)
+{
+   switch (keys->basic.n_proto) {
+   case cpu_to_be16(ETH_P_IP):
+   if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
+   if (keys->basic.ip_proto == IPPROTO_TCP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
+   return VIRTIO_NET_HASH_REPORT_TCPv4;
+
+   if (keys->basic.ip_proto == IPPROTO_UDP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
+   return VIRTIO_NET_HASH_REPORT_UDPv4;
+   }
+
+   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
+   return VIRTIO_NET_HASH_REPORT_IPv4;
+
+   return VIRTIO_NET_HASH_REPORT_NONE;
+
+   case cpu_to_be16(ETH_P_IPV6):
+   if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
+   if (keys->basic.ip_proto == IPPROTO_TCP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
+   return VIRTIO_NET_HASH_REPORT_TCPv6;
+
+   if (keys->basic.ip_proto == IPPROTO_UDP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
+   return VIRTIO_NET_HASH_REPORT_UDPv6;
+   }
+
+   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
+   return VIRTIO_NET_HASH_REPORT_IPv6;
+
+   return VIRTIO_NET_HASH_REPORT_NONE;
+
+   default:
+   return VIRTIO_NET_HASH_REPORT_NONE;
+   }
+}
+
+static inline void virtio_net_hash_rss(const struct sk_buff *skb,
+  u32 types, const u32 *key,
+  struct virtio_net_hash *hash)
+{
+   struct virtio_net_toeplitz_state toep
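
For readers unfamiliar with the Toeplitz hash that virtio_net_toeplitz_calc()
computes, the following is a minimal userspace sketch of the textbook,
bit-serial form. It is not the code from this patch, which consumes the input
one 32-bit word at a time; toeplitz_hash() is a hypothetical name. The key is
assumed to be at least 4 bytes longer than the input, which seems to be why
virtio_net_hash_key_length() adds 4 to the dissected key size.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bit-serial Toeplitz hash (textbook form, not the patch's word-wise loop).
 * key must hold at least len + 4 bytes.
 */
static uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *input,
			      size_t len)
{
	/* Window holding the 32 key bits aligned with the current input bit. */
	uint32_t window = (uint32_t)key[0] << 24 | (uint32_t)key[1] << 16 |
			  (uint32_t)key[2] << 8 | key[3];
	uint32_t hash = 0;

	for (size_t i = 0; i < len; i++) {
		for (int bit = 7; bit >= 0; bit--) {
			/* Each set input bit XORs the current window in. */
			if (input[i] & (1u << bit))
				hash ^= window;
			/* Slide the window left by one key bit. */
			window = window << 1 | ((key[i + 4] >> bit) & 1);
		}
	}

	return hash;
}
```

Because the hash is a XOR of key windows selected by the set input bits, it
is linear: the hash of a ^ b equals the hash of a XORed with the hash of b.
The patch appears to exploit the same property by visiting only the set bits
of each input word.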

[PATCH RFC v5 03/10] net: flow_dissector: Export flow_keys_dissector_symmetric

2024-10-08 Thread Akihiko Odaki
flow_keys_dissector_symmetric is useful to derive a symmetric hash
and to know its source such as IPv4, IPv6, TCP, and UDP.

Signed-off-by: Akihiko Odaki 
---
 include/net/flow_dissector.h | 1 +
 net/core/flow_dissector.c| 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index ced79dc8e856..d01c1ec77b7d 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -423,6 +423,7 @@ __be32 flow_get_u32_src(const struct flow_keys *flow);
 __be32 flow_get_u32_dst(const struct flow_keys *flow);
 
 extern struct flow_dissector flow_keys_dissector;
+extern struct flow_dissector flow_keys_dissector_symmetric;
 extern struct flow_dissector flow_keys_basic_dissector;
 
 /* struct flow_keys_digest:
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 0e638a37aa09..9822988f2d49 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1852,7 +1852,8 @@ void make_flow_keys_digest(struct flow_keys_digest *digest,
 }
 EXPORT_SYMBOL(make_flow_keys_digest);
 
-static struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
+struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
+EXPORT_SYMBOL(flow_keys_dissector_symmetric);
 
 u32 __skb_get_hash_symmetric_net(const struct net *net, const struct sk_buff *skb)
 {

-- 
2.46.2
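
As background on what "symmetric" means here: a symmetric hash gives the same
value for both directions of a flow. The sketch below illustrates the idea in
userspace by ordering the two endpoints before hashing. It is not the
kernel's flow dissector code; struct flow4, fnv1a_words() and
flow4_hash_symmetric() are hypothetical names, and FNV-1a is only a stand-in
for the kernel's flow hash.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical IPv4 flow tuple in host byte order; not a kernel structure. */
struct flow4 {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* FNV-1a over 32-bit words; a stand-in for the kernel's flow hash. */
static uint32_t fnv1a_words(const uint32_t *words, size_t n)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < n; i++) {
		for (int shift = 0; shift < 32; shift += 8) {
			h ^= (words[i] >> shift) & 0xff;
			h *= 16777619u;
		}
	}

	return h;
}

static uint32_t flow4_hash_symmetric(const struct flow4 *f)
{
	uint32_t lo_addr = f->saddr, hi_addr = f->daddr;
	uint16_t lo_port = f->sport, hi_port = f->dport;
	uint32_t words[3];

	/* Order the endpoints so that A->B and B->A produce the same input. */
	if (lo_addr > hi_addr || (lo_addr == hi_addr && lo_port > hi_port)) {
		lo_addr = f->daddr;
		hi_addr = f->saddr;
		lo_port = f->dport;
		hi_port = f->sport;
	}

	words[0] = lo_addr;
	words[1] = hi_addr;
	words[2] = (uint32_t)lo_port << 16 | hi_port;

	return fnv1a_words(words, 3);
}
```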




[PATCH RFC v5 04/10] tun: Unify vnet implementation

2024-10-08 Thread Akihiko Odaki
Both tun and tap expose the same set of virtio-net-related features.
Unify their implementations to ease future changes.

Signed-off-by: Akihiko Odaki 
---
 MAINTAINERS|   1 +
 drivers/net/tap.c  | 172 ++--
 drivers/net/tun.c  | 208 -
 drivers/net/tun_vnet.h | 181 ++
 4 files changed, 238 insertions(+), 324 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index cc40a9d9b8cd..209b4e1cccb1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23338,6 +23338,7 @@ F:  Documentation/networking/tuntap.rst
 F: arch/um/os-Linux/drivers/
 F: drivers/net/tap.c
 F: drivers/net/tun.c
+F: drivers/net/tun_vnet.h
 
 TURBOCHANNEL SUBSYSTEM
 M: "Maciej W. Rozycki" 
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 77574f7a3bd4..9a34ceed0c2c 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -26,74 +26,9 @@
 #include 
 #include 
 
-#define TAP_IFFEATURES (IFF_VNET_HDR | IFF_MULTI_QUEUE)
-
-#define TAP_VNET_LE 0x8000
-#define TAP_VNET_BE 0x4000
-
-#ifdef CONFIG_TUN_VNET_CROSS_LE
-static inline bool tap_legacy_is_little_endian(struct tap_queue *q)
-{
-   return q->flags & TAP_VNET_BE ? false :
-   virtio_legacy_is_little_endian();
-}
-
-static long tap_get_vnet_be(struct tap_queue *q, int __user *sp)
-{
-   int s = !!(q->flags & TAP_VNET_BE);
-
-   if (put_user(s, sp))
-   return -EFAULT;
-
-   return 0;
-}
-
-static long tap_set_vnet_be(struct tap_queue *q, int __user *sp)
-{
-   int s;
-
-   if (get_user(s, sp))
-   return -EFAULT;
-
-   if (s)
-   q->flags |= TAP_VNET_BE;
-   else
-   q->flags &= ~TAP_VNET_BE;
-
-   return 0;
-}
-#else
-static inline bool tap_legacy_is_little_endian(struct tap_queue *q)
-{
-   return virtio_legacy_is_little_endian();
-}
-
-static long tap_get_vnet_be(struct tap_queue *q, int __user *argp)
-{
-   return -EINVAL;
-}
+#include "tun_vnet.h"
 
-static long tap_set_vnet_be(struct tap_queue *q, int __user *argp)
-{
-   return -EINVAL;
-}
-#endif /* CONFIG_TUN_VNET_CROSS_LE */
-
-static inline bool tap_is_little_endian(struct tap_queue *q)
-{
-   return q->flags & TAP_VNET_LE ||
-   tap_legacy_is_little_endian(q);
-}
-
-static inline u16 tap16_to_cpu(struct tap_queue *q, __virtio16 val)
-{
-   return __virtio16_to_cpu(tap_is_little_endian(q), val);
-}
-
-static inline __virtio16 cpu_to_tap16(struct tap_queue *q, u16 val)
-{
-   return __cpu_to_virtio16(tap_is_little_endian(q), val);
-}
+#define TAP_IFFEATURES (IFF_VNET_HDR | IFF_MULTI_QUEUE)
 
 static struct proto tap_proto = {
.name = "tap",
@@ -641,10 +576,10 @@ static ssize_t tap_get_user(struct tap_queue *q, void *msg_control,
struct sk_buff *skb;
struct tap_dev *tap;
unsigned long total_len = iov_iter_count(from);
-   unsigned long len = total_len;
+   unsigned long len;
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
-   int vnet_hdr_len = 0;
+   int hdr_len;
int copylen = 0;
int depth;
bool zerocopy = false;
@@ -652,38 +587,20 @@ static ssize_t tap_get_user(struct tap_queue *q, void *msg_control,
enum skb_drop_reason drop_reason;
 
if (q->flags & IFF_VNET_HDR) {
-   vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
-
-   err = -EINVAL;
-   if (len < vnet_hdr_len)
-   goto err;
-   len -= vnet_hdr_len;
-
-   err = -EFAULT;
-   if (!copy_from_iter_full(&vnet_hdr, sizeof(vnet_hdr), from))
-   goto err;
-   iov_iter_advance(from, vnet_hdr_len - sizeof(vnet_hdr));
-   if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-tap16_to_cpu(q, vnet_hdr.csum_start) +
-tap16_to_cpu(q, vnet_hdr.csum_offset) + 2 >
-tap16_to_cpu(q, vnet_hdr.hdr_len))
-   vnet_hdr.hdr_len = cpu_to_tap16(q,
-tap16_to_cpu(q, vnet_hdr.csum_start) +
-tap16_to_cpu(q, vnet_hdr.csum_offset) + 2);
-   err = -EINVAL;
-   if (tap16_to_cpu(q, vnet_hdr.hdr_len) > len)
+   hdr_len = tun_vnet_hdr_get(READ_ONCE(q->vnet_hdr_sz), q->flags, from, &vnet_hdr);
+   if (hdr_len < 0) {
+   err = hdr_len;
goto err;
+   }
+   } else {
+   hdr_len = 0;
}
 
-   err = -EINVAL;
-   if (unlikely(len < ETH_HLEN))
-   goto err;
-
+   len = iov_iter_count(from);
if (msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY)) {
   

[PATCH RFC v5 06/10] tun: Introduce virtio-net hash reporting feature

2024-10-08 Thread Akihiko Odaki
Allow the guest to reuse the hash value to make receive steering
consistent between the host and guest, and to save hash computation.

Signed-off-by: Akihiko Odaki 
---
 Documentation/networking/tuntap.rst |   7 +++
 drivers/net/Kconfig |   1 +
 drivers/net/tap.c   |  45 ++--
 drivers/net/tun.c   |  46 
 drivers/net/tun_vnet.h  | 102 +++-
 include/linux/if_tap.h  |   2 +
 include/uapi/linux/if_tun.h |  48 +
 7 files changed, 223 insertions(+), 28 deletions(-)

diff --git a/Documentation/networking/tuntap.rst b/Documentation/networking/tuntap.rst
index 4d7087f727be..86b4ae8caa8a 100644
--- a/Documentation/networking/tuntap.rst
+++ b/Documentation/networking/tuntap.rst
@@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
   return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
   }
 
+3.4 Reference
+-------------
+
+``linux/if_tun.h`` defines the interface described below:
+
+.. kernel-doc:: include/uapi/linux/if_tun.h
+
 Universal TUN/TAP device driver Frequently Asked Question
 =========================================================
 
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 9920b3a68ed1..e2a7bd703550 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
tristate "Universal TUN/TAP device driver support"
depends on INET
select CRC32
+   select SKB_EXTENSIONS
help
  TUN/TAP provides packet reception and transmission for user space
  programs.  It can be viewed as a simple Point-to-Point or Ethernet
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 9a34ceed0c2c..5e2fbe63ca47 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -179,6 +179,16 @@ static void tap_put_queue(struct tap_queue *q)
sock_put(&q->sk);
 }
 
+static struct virtio_net_hash *tap_add_hash(struct sk_buff *skb)
+{
+   return (struct virtio_net_hash *)skb->cb;
+}
+
+static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
+{
+   return (const struct virtio_net_hash *)skb->cb;
+}
+
 /*
  * Select a queue based on the rxq of the device on which this packet
  * arrived. If the incoming device is not mq, calculate a flow hash
@@ -189,6 +199,7 @@ static void tap_put_queue(struct tap_queue *q)
 static struct tap_queue *tap_get_queue(struct tap_dev *tap,
   struct sk_buff *skb)
 {
+   struct flow_keys_basic keys_basic;
struct tap_queue *queue = NULL;
/* Access to taps array is protected by rcu, but access to numvtaps
 * isn't. Below we use it to lookup a queue, but treat it as a hint
@@ -198,15 +209,32 @@ static struct tap_queue *tap_get_queue(struct tap_dev *tap,
int numvtaps = READ_ONCE(tap->numvtaps);
__u32 rxq;
 
+   *tap_add_hash(skb) = (struct virtio_net_hash) { .report = VIRTIO_NET_HASH_REPORT_NONE };
+
if (!numvtaps)
goto out;
 
if (numvtaps == 1)
goto single;
 
+   if (!skb->l4_hash && !skb->sw_hash) {
+   struct flow_keys keys;
+
+   skb_flow_dissect_flow_keys(skb, &keys, FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+   rxq = flow_hash_from_keys(&keys);
+   keys_basic = (struct flow_keys_basic) {
+   .control = keys.control,
+   .basic = keys.basic
+   };
+   } else {
+   skb_flow_dissect_flow_keys_basic(NULL, skb, &keys_basic, NULL, 0, 0, 0,
+FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+   rxq = skb->hash;
+   }
+
/* Check if we can use flow to select a queue */
-   rxq = skb_get_hash(skb);
if (rxq) {
+   tun_vnet_hash_report(&tap->vnet_hash, skb, &keys_basic, rxq, tap_add_hash);
queue = rcu_dereference(tap->taps[rxq % numvtaps]);
goto out;
}
@@ -713,15 +741,16 @@ static ssize_t tap_put_user(struct tap_queue *q,
int total;
 
if (q->flags & IFF_VNET_HDR) {
-   struct virtio_net_hdr vnet_hdr;
+   struct virtio_net_hdr_v1_hash vnet_hdr;
 
vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
 
-   ret = tun_vnet_hdr_from_skb(q->flags, NULL, skb, &vnet_hdr);
+   ret = tun_vnet_hdr_from_skb(vnet_hdr_len, q->flags, NULL, skb,
+   tap_find_hash, &vnet_hdr);
if (ret < 0)
goto done;
 
-   ret = tun_vnet_hdr_put(vnet_hdr_len, iter, &vnet_hdr);
+   ret = tun_vnet_hdr_put(vnet_hdr_len, iter, &vnet_hdr, ret);
if (ret < 0)
  

[PATCH RFC v5 05/10] tun: Pad virtio header with zero

2024-10-08 Thread Akihiko Odaki
tun used to simply advance iov_iter when it needed to pad the virtio
header, which left whatever garbage was in the buffer untouched. This
becomes especially problematic once tun allows enabling the hash
reporting feature: even if the feature is enabled, a packet may lack a
hash value and leave a hole in the virtio header, either because the
packet arrived before the feature was enabled or because it does not
contain the header fields to be hashed. If the hole is not filled with
zero, it is impossible to tell whether the packet lacks a hash value.

In theory, a user of tun can fill the buffer with zero before calling
read() to avoid such a problem, but leaving garbage in the buffer is
awkward anyway, so fill the buffer in tun.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tun_vnet.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
index 7c7f3f6d85e9..c40bde0fdf8c 100644
--- a/drivers/net/tun_vnet.h
+++ b/drivers/net/tun_vnet.h
@@ -138,7 +138,8 @@ static inline int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
if (copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr))
return -EFAULT;
 
-   iov_iter_advance(iter, sz - sizeof(*hdr));
+   if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
+   return -EFAULT;
 
return 0;
 }

-- 
2.46.2
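
The difference between advancing past the padding and zeroing it can be shown
with a small userspace analogue. This is only a sketch: hdr_put_padded() is a
hypothetical helper working on a plain byte buffer, whereas the patch itself
operates on an iov_iter with iov_iter_zero().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy hdr_len bytes of header into a slot of sz bytes and zero the rest
 * of the slot instead of skipping over it. Returns 0 on success, or -1 if
 * the slot is too small. Hypothetical userspace analogue of the fix.
 */
static int hdr_put_padded(uint8_t *slot, size_t sz,
			  const void *hdr, size_t hdr_len)
{
	if (sz < hdr_len)
		return -1;

	memcpy(slot, hdr, hdr_len);
	/* Without this memset, stale bytes would remain in the padding. */
	memset(slot + hdr_len, 0, sz - hdr_len);

	return 0;
}
```

With the padding zeroed, a reader that expects a larger header sees zero in
the fields that were not written, which it can interpret as "no hash
reported" rather than as arbitrary data.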




[PATCH RFC v5 07/10] tun: Introduce virtio-net RSS

2024-10-08 Thread Akihiko Odaki
RSS is a receive steering algorithm that can be negotiated for use with
virtio_net. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF steering programs.

Introduce the code to perform RSS to the kernel in order to overcome
these challenges. An alternative solution is to extend the eBPF steering
program so that it will be able to report to the userspace, but I didn't
opt for it because the current mechanism of the eBPF steering program
relies on legacy context rewriting, and introducing kfunc-based eBPF
would result in a non-UAPI dependency while the other relevant
virtualization APIs such as KVM and vhost_net are UAPIs.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tap.c   | 11 +-
 drivers/net/tun.c   | 57 ---
 drivers/net/tun_vnet.h  | 96 +
 include/linux/if_tap.h  |  4 +-
 include/uapi/linux/if_tun.h | 27 +
 5 files changed, 169 insertions(+), 26 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 5e2fbe63ca47..a58b83285af4 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -207,6 +207,7 @@ static struct tap_queue *tap_get_queue(struct tap_dev *tap,
 * racing against queue removal.
 */
int numvtaps = READ_ONCE(tap->numvtaps);
+   struct tun_vnet_hash_container *vnet_hash = rcu_dereference(tap->vnet_hash);
__u32 rxq;
 
*tap_add_hash(skb) = (struct virtio_net_hash) { .report = VIRTIO_NET_HASH_REPORT_NONE };
@@ -217,6 +218,12 @@ static struct tap_queue *tap_get_queue(struct tap_dev *tap,
if (numvtaps == 1)
goto single;
 
+   if (vnet_hash && (vnet_hash->common.flags & TUN_VNET_HASH_RSS)) {
+   rxq = tun_vnet_rss_select_queue(numvtaps, vnet_hash, skb, tap_add_hash);
+   queue = rcu_dereference(tap->taps[rxq]);
+   goto out;
+   }
+
if (!skb->l4_hash && !skb->sw_hash) {
struct flow_keys keys;
 
@@ -234,7 +241,7 @@ static struct tap_queue *tap_get_queue(struct tap_dev *tap,
 
/* Check if we can use flow to select a queue */
if (rxq) {
-   tun_vnet_hash_report(&tap->vnet_hash, skb, &keys_basic, rxq, tap_add_hash);
+   tun_vnet_hash_report(vnet_hash, skb, &keys_basic, rxq, tap_add_hash);
queue = rcu_dereference(tap->taps[rxq % numvtaps]);
goto out;
}
@@ -1058,7 +1065,7 @@ static long tap_ioctl(struct file *file, unsigned int cmd,
tap = rtnl_dereference(q->tap);
ret = tun_vnet_ioctl(&q->vnet_hdr_sz, &q->flags,
 tap ? &tap->vnet_hash : NULL, -EINVAL,
-cmd, sp);
+true, cmd, sp);
rtnl_unlock();
return ret;
}
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 27308417b834..18528568aed7 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -209,7 +209,7 @@ struct tun_struct {
struct bpf_prog __rcu *xdp_prog;
struct tun_prog __rcu *steering_prog;
struct tun_prog __rcu *filter_prog;
-   struct tun_vnet_hash vnet_hash;
+   struct tun_vnet_hash_container __rcu *vnet_hash;
struct ethtool_link_ksettings link_ksettings;
/* init args */
struct file *file;
@@ -468,7 +468,9 @@ static const struct virtio_net_hash *tun_find_hash(const struct sk_buff *skb)
  * the userspace application move between processors, we may get a
  * different rxq no. here.
  */
-static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
+static u16 tun_automq_select_queue(struct tun_struct *tun,
+  const struct tun_vnet_hash_container *vnet_hash,
+  struct sk_buff *skb)
 {
struct flow_keys keys;
struct flow_keys_basic keys_basic;
@@ -493,7 +495,7 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
.control = keys.control,
.basic = keys.basic
};
-   tun_vnet_hash_report(&tun->vnet_hash, skb, &keys_basic, skb->l4_hash ? skb->hash : txq,
+   tun_vnet_hash_report(vnet_hash, skb, &keys_basic, skb->l4_hash ? skb->hash : txq,
 tun_add_hash);
 
return txq;
@@ -523,10 +525,17 @@ static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
u16 ret;
 
rcu_read_lock();
-   if (rcu_dereference(tun->steering_prog))
+   if (rcu_derefer
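
For orientation, the queue selection step of RSS as described in the virtio
specification can be sketched as below. This is a simplified illustration and
not the tun_vnet_rss_select_queue() implementation from this patch: struct
rss_config and rss_select_queue() are hypothetical, and falling back to queue
0 for an out-of-range table entry is an assumption made only for this sketch.

```c
#include <stdint.h>

#define HASH_REPORT_NONE 0 /* mirrors VIRTIO_NET_HASH_REPORT_NONE */

/* Hypothetical, simplified RSS configuration. */
struct rss_config {
	uint16_t indirection_table_mask; /* table length - 1, power of two */
	uint16_t unclassified_queue;     /* used when no hash was computed */
	const uint16_t *indirection_table;
};

static uint16_t rss_select_queue(const struct rss_config *cfg,
				 uint16_t num_queues,
				 uint32_t hash, uint16_t report)
{
	uint16_t queue;

	if (report == HASH_REPORT_NONE)
		queue = cfg->unclassified_queue;
	else
		queue = cfg->indirection_table[hash &
					       cfg->indirection_table_mask];

	/* Assumption: never return an index beyond the attached queues. */
	return queue < num_queues ? queue : 0;
}
```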

[PATCH RFC v5 08/10] selftest: tun: Test vnet ioctls without device

2024-10-08 Thread Akihiko Odaki
Ensure that vnet ioctls result in EBADFD when the underlying device is
deleted.

Signed-off-by: Akihiko Odaki 
---
 tools/testing/selftests/net/tun.c | 74 +++
 1 file changed, 74 insertions(+)

diff --git a/tools/testing/selftests/net/tun.c b/tools/testing/selftests/net/tun.c
index fa83918b62d1..463dd98f2b80 100644
--- a/tools/testing/selftests/net/tun.c
+++ b/tools/testing/selftests/net/tun.c
@@ -159,4 +159,78 @@ TEST_F(tun, reattach_close_delete) {
EXPECT_EQ(tun_delete(self->ifname), 0);
 }
 
+FIXTURE(tun_deleted)
+{
+   char ifname[IFNAMSIZ];
+   int fd;
+};
+
+FIXTURE_SETUP(tun_deleted)
+{
+   self->ifname[0] = 0;
+   self->fd = tun_alloc(self->ifname);
+   ASSERT_LE(0, self->fd);
+
+   ASSERT_EQ(0, tun_delete(self->ifname))
+   EXPECT_EQ(0, close(self->fd));
+}
+
+FIXTURE_TEARDOWN(tun_deleted)
+{
+   EXPECT_EQ(0, close(self->fd));
+}
+
+TEST_F(tun_deleted, getvnethdrsz)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNGETVNETHDRSZ));
+   EXPECT_EQ(EBADFD, errno);
+}
+
+TEST_F(tun_deleted, setvnethdrsz)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNSETVNETHDRSZ));
+   EXPECT_EQ(EBADFD, errno);
+}
+
+TEST_F(tun_deleted, getvnetle)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNGETVNETLE));
+   EXPECT_EQ(EBADFD, errno);
+}
+
+TEST_F(tun_deleted, setvnetle)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNSETVNETLE));
+   EXPECT_EQ(EBADFD, errno);
+}
+
+TEST_F(tun_deleted, getvnetbe)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNGETVNETBE));
+   EXPECT_EQ(EBADFD, errno);
+}
+
+TEST_F(tun_deleted, setvnetbe)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNSETVNETBE));
+   EXPECT_EQ(EBADFD, errno);
+}
+
+TEST_F(tun_deleted, getvnethashcap)
+{
+   struct tun_vnet_hash cap;
+   int i = ioctl(self->fd, TUNGETVNETHASHCAP, &cap);
+
+   if (i == -1 && errno == EBADFD)
+   SKIP(return, "TUNGETVNETHASHCAP not supported");
+
+   EXPECT_EQ(0, i);
+}
+
+TEST_F(tun_deleted, setvnethash)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNSETVNETHASH));
+   EXPECT_EQ(EBADFD, errno);
+}
+
 TEST_HARNESS_MAIN

-- 
2.46.2




[PATCH RFC v5 10/10] vhost/net: Support VIRTIO_NET_F_HASH_REPORT

2024-10-08 Thread Akihiko Odaki
VIRTIO_NET_F_HASH_REPORT allows reporting hash values calculated on the
host. When VHOST_NET_F_VIRTIO_NET_HDR is employed, no hash values are
reported (i.e., the hash_report member is always set to
VIRTIO_NET_HASH_REPORT_NONE). Otherwise, the values reported by the
underlying socket will be reported.

VIRTIO_NET_F_HASH_REPORT requires VIRTIO_F_VERSION_1.

Signed-off-by: Akihiko Odaki 
---
 drivers/vhost/net.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f16279351db5..ec1167a782ec 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -73,6 +73,7 @@ enum {
VHOST_NET_FEATURES = VHOST_FEATURES |
 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+(1ULL << VIRTIO_NET_F_HASH_REPORT) |
 (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
 (1ULL << VIRTIO_F_RING_RESET)
 };
@@ -1604,10 +1605,13 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
size_t vhost_hlen, sock_hlen, hdr_len;
int i;
 
-   hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
-  (1ULL << VIRTIO_F_VERSION_1))) ?
-   sizeof(struct virtio_net_hdr_mrg_rxbuf) :
-   sizeof(struct virtio_net_hdr);
+   if (features & (1ULL << VIRTIO_NET_F_HASH_REPORT))
+   hdr_len = sizeof(struct virtio_net_hdr_v1_hash);
+   else if (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+(1ULL << VIRTIO_F_VERSION_1)))
+   hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+   else
+   hdr_len = sizeof(struct virtio_net_hdr);
if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
/* vhost provides vnet_hdr */
vhost_hlen = hdr_len;
@@ -1688,6 +1692,10 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
return -EFAULT;
if (features & ~VHOST_NET_FEATURES)
return -EOPNOTSUPP;
+   if ((features & ((1ULL << VIRTIO_F_VERSION_1) |
+(1ULL << VIRTIO_NET_F_HASH_REPORT))) ==
+   (1ULL << VIRTIO_NET_F_HASH_REPORT))
+   return -EINVAL;
return vhost_net_set_features(n, features);
case VHOST_GET_BACKEND_FEATURES:
features = VHOST_NET_BACKEND_FEATURES;

-- 
2.46.2
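
The header length selection and the VIRTIO_F_VERSION_1 requirement from this
patch can be summarized in a small userspace sketch. The feature bit numbers
and header sizes below follow the virtio specification; vnet_hdr_len() and
the shortened macro names are hypothetical, and returning 0 for an invalid
combination stands in for the -EINVAL that the ioctl returns.

```c
#include <stddef.h>
#include <stdint.h>

/* Feature bit numbers as defined by the virtio specification. */
#define F_MRG_RXBUF   15 /* VIRTIO_NET_F_MRG_RXBUF */
#define F_VERSION_1   32 /* VIRTIO_F_VERSION_1 */
#define F_HASH_REPORT 57 /* VIRTIO_NET_F_HASH_REPORT */

/* Sizes of virtio_net_hdr, virtio_net_hdr_mrg_rxbuf, virtio_net_hdr_v1_hash. */
enum { HDR_BASIC = 10, HDR_MRG_RXBUF = 12, HDR_V1_HASH = 20 };

/* Returns the header length, or 0 if the feature combination is invalid. */
static size_t vnet_hdr_len(uint64_t features)
{
	const uint64_t hash = 1ULL << F_HASH_REPORT;
	const uint64_t v1 = 1ULL << F_VERSION_1;
	const uint64_t mrg = 1ULL << F_MRG_RXBUF;

	/* Hash reporting is only valid together with VERSION_1. */
	if ((features & hash) && !(features & v1))
		return 0;
	if (features & hash)
		return HDR_V1_HASH;
	if (features & (mrg | v1))
		return HDR_MRG_RXBUF;

	return HDR_BASIC;
}
```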




[PATCH RFC v5 09/10] selftest: tun: Add tests for virtio-net hashing

2024-10-08 Thread Akihiko Odaki
The added tests confirm tun can perform RSS and hash reporting, and
reject invalid configurations for them.

Signed-off-by: Akihiko Odaki 
---
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/tun.c| 558 ++-
 2 files changed, 551 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 9d5aa817411b..8e2ab5068171 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -110,6 +110,6 @@ $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma
 $(OUTPUT)/tcp_mmap: LDLIBS += -lpthread -lcrypto
 $(OUTPUT)/tcp_inq: LDLIBS += -lpthread
 $(OUTPUT)/bind_bhash: LDLIBS += -lpthread
-$(OUTPUT)/io_uring_zerocopy_tx: CFLAGS += -I../../../include/
+$(OUTPUT)/io_uring_zerocopy_tx $(OUTPUT)/tun: CFLAGS += -I../../../include/
 
 include bpf.mk
diff --git a/tools/testing/selftests/net/tun.c b/tools/testing/selftests/net/tun.c
index 463dd98f2b80..ac3858744841 100644
--- a/tools/testing/selftests/net/tun.c
+++ b/tools/testing/selftests/net/tun.c
@@ -2,21 +2,37 @@
 
 #define _GNU_SOURCE
 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
-#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
 #include 
+#include 
 #include 
 #include 
-#include 
-#include 
+#include 
+#include 
+#include 
+#include 
 
 #include "../kselftest_harness.h"
 
+#define TUN_HWADDR_SOURCE { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00 }
+#define TUN_HWADDR_DEST { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 }
+#define TUN_IPADDR_SOURCE htonl((172 << 24) | (17 << 16) | 0)
+#define TUN_IPADDR_DEST htonl((172 << 24) | (17 << 16) | 1)
+
 static int tun_attach(int fd, char *dev)
 {
struct ifreq ifr;
@@ -39,7 +55,7 @@ static int tun_detach(int fd, char *dev)
return ioctl(fd, TUNSETQUEUE, (void *) &ifr);
 }
 
-static int tun_alloc(char *dev)
+static int tun_alloc(char *dev, short flags)
 {
struct ifreq ifr;
int fd, err;
@@ -52,7 +68,8 @@ static int tun_alloc(char *dev)
 
memset(&ifr, 0, sizeof(ifr));
strcpy(ifr.ifr_name, dev);
-   ifr.ifr_flags = IFF_TAP | IFF_NAPI | IFF_MULTI_QUEUE;
+   ifr.ifr_flags = flags | IFF_TAP | IFF_NAPI | IFF_NO_PI |
+   IFF_MULTI_QUEUE;
 
err = ioctl(fd, TUNSETIFF, (void *) &ifr);
if (err < 0) {
@@ -64,6 +81,40 @@ static int tun_alloc(char *dev)
return fd;
 }
 
+static bool tun_add_to_bridge(int local_fd, const char *name)
+{
+   struct ifreq ifreq = {
+   .ifr_name = "xbridge",
+   .ifr_ifindex = if_nametoindex(name)
+   };
+
+   if (!ifreq.ifr_ifindex) {
+   perror("if_nametoindex");
+   return false;
+   }
+
+   if (ioctl(local_fd, SIOCBRADDIF, &ifreq)) {
+   perror("SIOCBRADDIF");
+   return false;
+   }
+
+   return true;
+}
+
+static bool tun_set_flags(int local_fd, const char *name, short flags)
+{
+   struct ifreq ifreq = { .ifr_flags = flags };
+
+   strcpy(ifreq.ifr_name, name);
+
+   if (ioctl(local_fd, SIOCSIFFLAGS, &ifreq)) {
+   perror("SIOCSIFFLAGS");
+   return false;
+   }
+
+   return true;
+}
+
 static int tun_delete(char *dev)
 {
struct {
@@ -102,6 +153,159 @@ static int tun_delete(char *dev)
return ret;
 }
 
+static uint32_t tun_sum(const void *buf, size_t len)
+{
+   const uint16_t *sbuf = buf;
+   uint32_t sum = 0;
+
+   while (len > 1) {
+   sum += *sbuf++;
+   len -= 2;
+   }
+
+   if (len)
+   sum += *(uint8_t *)sbuf;
+
+   return sum;
+}
+
+static uint16_t tun_build_ip_check(uint32_t sum)
+{
+   return ~((sum & 0xffff) + (sum >> 16));
+}
+
+static uint32_t tun_build_ip_pseudo_sum(const void *iphdr)
+{
+   uint16_t tot_len = ntohs(((struct iphdr *)iphdr)->tot_len);
+
+   return tun_sum((char *)iphdr + offsetof(struct iphdr, saddr), 8) +
+  htons(((struct iphdr *)iphdr)->protocol) +
+  htons(tot_len - sizeof(struct iphdr));
+}
+
+static uint32_t tun_build_ipv6_pseudo_sum(const void *ipv6hdr)
+{
+   return tun_sum((char *)ipv6hdr + offsetof(struct ipv6hdr, saddr), 32) +
+  ((struct ipv6hdr *)ipv6hdr)->payload_len +
+  htons(((struct ipv6hdr *)ipv6hdr)->nexthdr);
+}
+
+static void tun_build_ethhdr(struct ethhdr *ethhdr, uint16_t proto)
+{
+   *ethhdr = (struct ethhdr) {
+   .h_dest = TUN_HWADDR_DEST,
+   .h_source = TUN_HWADDR_SOURCE,
+   .h_proto = htons(proto)
+   };
+}
+
+static void tun_build_iphdr(void *dest, uint16_t len, uint8_t protocol)
+{
+   struct iphdr iphdr = {
+   .ihl = sizeof(iphdr) / 4,
+   .version = 
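
The tun_sum() and tun_build_ip_check() helpers in the selftest above build
the Internet checksum (RFC 1071) used by the IPv4, TCP and UDP headers of the
test packets. A standalone sketch of the same technique follows. It differs
from the selftest in that it reads the buffer bytewise as big-endian words
and folds carries in a loop; inet_checksum() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over a byte buffer, returned in host order. */
static uint16_t inet_checksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	/* Add up 16-bit big-endian words. */
	while (len > 1) {
		sum += (uint32_t)buf[0] << 8 | buf[1];
		buf += 2;
		len -= 2;
	}

	/* A trailing odd byte is padded with zero on the right. */
	if (len)
		sum += (uint32_t)buf[0] << 8;

	/* Fold the carries back in until the sum fits in 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Running the function over a header whose checksum field already holds the
correct value yields 0, which is how a receiver verifies the header.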

Re: [PATCH RFC v5 04/10] tun: Unify vnet implementation

2024-10-12 Thread Akihiko Odaki

On 2024/10/09 22:55, Willem de Bruijn wrote:

Akihiko Odaki wrote:

Both tun and tap expose the same set of virtio-net-related features.
Unify their implementations to ease future changes.

Signed-off-by: Akihiko Odaki 
---
  MAINTAINERS|   1 +
  drivers/net/tap.c  | 172 ++--
  drivers/net/tun.c  | 208 -
  drivers/net/tun_vnet.h | 181 ++


Same point: should not be in a header.

Also: I've looked into deduplicating code between the various tun, tap
and packet socket code as well.

In general it's a good idea. The main counterargument is that such a
break in continuity also breaks backporting fixes to stable. So the
benefit must outweigh that cost.

In this case, the benefits in terms of LoC are rather modest. Not sure
it's worth it.

Even more importantly: are the two code paths that you deduplicate
exactly identical? Often in the past the two subtly diverged over
time, e.g., due to new features added only to one of the two.


I find extracting the virtio_net-related code into functions is 
beneficial. For example, tun_get_user() is a big function and extracting 
the virtio_net-related code into tun_vnet_hdr_get() will ease 
understanding tun_get_user() when you are not interested in virtio_net. 
If virtio_net is your interest, you can look at this group of functions 
to figure out how they interact with each other.


Currently, the extracted code is almost identical for tun and tap so 
they can share it. We can copy the code back (but keep functions as 
semantic units) if they diverge in the future.




If so, call out any behavioral changes to either as a result of
deduplicating explicitly.


This adds an error message for GSO failure, which was missing for tap. I 
will note that in the next version.




Re: [PATCH RFC v4 0/9] tun: Introduce virtio-net hashing feature

2024-10-01 Thread Akihiko Odaki

On 2024/10/02 1:31, Stephen Hemminger wrote:

On Tue, 1 Oct 2024 14:54:29 +0900
Akihiko Odaki  wrote:


On 2024/09/30 0:33, Stephen Hemminger wrote:

On Sun, 29 Sep 2024 16:10:47 +0900
Akihiko Odaki  wrote:
   

On 2024/09/29 11:07, Jason Wang wrote:

On Fri, Sep 27, 2024 at 3:51 PM Akihiko Odaki  wrote:


On 2024/09/27 13:31, Jason Wang wrote:

On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki  wrote:


On 2024/09/25 12:30, Jason Wang wrote:

On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki  wrote:


virtio-net has two uses of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Introduce the code to compute hashes to the kernel in order to overcome
these challenges.

An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which is in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt to ioctl to align with other relevant UAPIs (KVM
and vhost_net).
 


I wonder if we could clone the skb and reuse some to store the hash,
then the steering eBPF program can access these fields without
introducing full RSS in the kernel?


I don't get how cloning the skb can solve the issue.

We can certainly implement Toeplitz function in the kernel or even with
tc-bpf to store a hash value that can be used for eBPF steering program
and virtio hash reporting. However we don't have a means of storing a
hash type, which is specific to virtio hash reporting and lacks a
corresponding skb field.
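
For reference, the Toeplitz function mentioned above can be sketched in
userspace C as follows. This is only an illustration under assumptions: the
name toeplitz_hash() and its byte-oriented interface are made up for this
example and are not the code from this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative userspace Toeplitz hash (hypothetical helper, not taken
 * from the patch series). The key must be at least len + 4 bytes long.
 */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
			      const uint8_t *data, size_t len)
{
	uint32_t hash = 0;
	uint32_t window;
	size_t i;
	int bit;

	assert(key_len >= len + 4);

	/* The window starts as the first 32 bits of the key. */
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
		 ((uint32_t)key[2] << 8) | (uint32_t)key[3];

	for (i = 0; i < len; i++) {
		for (bit = 7; bit >= 0; bit--) {
			/* XOR in the current window for every set input bit. */
			if (data[i] & (1u << bit))
				hash ^= window;

			/* Slide the window left by one key bit. */
			window = (window << 1) |
				 ((uint32_t)(key[i + 4] >> bit) & 1u);
		}
	}

	return hash;
}
```

The function yields only the 32-bit hash value. The hash type (which header
fields were fed into it) is a separate piece of information, which is the
part that lacks a corresponding skb field.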


I may miss something but looking at sk_filter_is_valid_access(). It
looks to me we can make use of skb->cb[0..4]?


I didn't opt to use cb. Below is the rationale:

cb is for tail call so it means we reuse the field for a different
purpose. The context rewrite allows adding a field without increasing
the size of the underlying storage (the real sk_buff) so we should add a
new field instead of reusing an existing field to avoid confusion.

We are however no longer allowed to add a new field. In my
understanding, this is because it is a UAPI, and eBPF maintainers found
it difficult to maintain its stability.

Reusing cb for hash reporting is a workaround to avoid having a new
field, but it does not solve the underlying problem (i.e., keeping eBPF
as stable as UAPI is unreasonably hard). In my opinion, adding an ioctl
is a reasonable option to keep the API as stable as other virtualization
UAPIs while respecting the underlying intention of the context rewrite
feature freeze.


Fair enough.

Btw, I remember DPDK implements tuntap RSS via eBPF as well (probably
via cls or other). It might be worth checking whether we missed
anything here.


Thanks for the information. I wonder why they used cls instead of
a steering program. Perhaps it is due to compatibility with macvtap
and ipvtap, which don't support steering programs.

Their RSS implementation looks cleaner so I will improve my RSS
implementation accordingly.
  


DPDK needs to support flow rules. The specific case is where packets
are classified by a flow, then RSS is done across a subset of the queues.
The support for flow in TUN driver is more academic than useful,
I fixed it for current BPF, but doubt anyone is using it really.

A full steering program would be good, but would require much more
complexity to take a general set of flow rules then communicate that
to the steering program.
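
The case described above (RSS spread across a subset of the queues) can be
pictured with an indirection table that lists only the queues in the subset.
The sketch below is illustrative only; struct rss_ctx and rss_pick_queue()
are hypothetical names and do not come from DPDK or from this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical RSS context; the table length must be a power of two. */
struct rss_ctx {
	const uint16_t *table;	/* indirection table of queue indices */
	uint32_t mask;		/* table length - 1 */
};

/* Map a hash to a queue, clamping to the number of active queues. */
static uint16_t rss_pick_queue(const struct rss_ctx *ctx, uint32_t hash,
			       uint32_t numqueues)
{
	assert(numqueues > 0);

	return ctx->table[hash & ctx->mask] % numqueues;
}
```

With a table of { 2, 3, 2, 3 } on a four-queue device, every hash lands on
queue 2 or 3, so a flow rule that selects this context confines matching
packets to that subset.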
   


It reminded me of RSS context and flow filter. Some physical NICs
support using a dedicated RSS context for packets matched by a flow
filter, and virtio is also gaining corresponding features.

RSS context: https://github.com/oasis-tcs/virtio-spec/issues/178
Flow filter: https://github.com/oasis-tcs/virtio-spec/issues/179

I considered the possibility of supporting these features with tc
instead of adding ioctls to tuntap, but it does not seem appropriate for
the virtualization use case.

In a virtualization use case, tuntap is configured according to requests
of guests, and the code processing these requests needs to have minimal
permissions for security. This goal is achieved by passing a file
descriptor that represents a tuntap from a privileged process (e.g.,
libvirt) to the process handling guest requests (e.g., QEMU).

However, tc is configured with rtnetlink, which does not seem to have an
interface to delegate a permission for one particular device to another
process.
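
The file descriptor delegation described above is usually done with
SCM_RIGHTS over a UNIX domain socket. A minimal sketch follows; send_fd()
and recv_fd() are hypothetical helper names, and how libvirt and QEMU
actually exchange the descriptor is not shown here.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a UNIX domain socket. Returns 0 on success. */
static int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;	/* forces correct alignment of buf */
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	assert(fd >= 0);

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor. Returns the new descriptor or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
	return fd;
}
```

The receiver can then issue ioctls on that one descriptor without holding
any broader network administration privilege, which is the property that
rtnetlink-based tc configuration does not seem to offer.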

For now I'll continue working on the current approach that is based on
ioctl and lacks RSS context and flow filter features. Eventually they
are also likely to require new ioctls if they are to be supported with
vhost_net.


The DPDK flow handling (rte_flow) was started by Mellanox and many of
the 

Re: [PATCH RFC v5 06/10] tun: Introduce virtio-net hash reporting feature

2024-10-12 Thread Akihiko Odaki

On 2024/10/09 17:05, Jason Wang wrote:

On Tue, Oct 8, 2024 at 2:55 PM Akihiko Odaki  wrote:


Allow the guest to reuse the hash value to make receive steering
consistent between the host and guest, and to save hash computation.

Signed-off-by: Akihiko Odaki 


I wonder if this would cause overhead when hash reporting is not enabled?


It only adds two branches in the data path. The first one is in
tun_vnet_hash_report(), which decides whether to add the hash value to
sk_buff. The second one is in tun_vnet_hdr_from_skb(), which decides
whether to report the added hash value.





---
  Documentation/networking/tuntap.rst |   7 +++
  drivers/net/Kconfig |   1 +
  drivers/net/tap.c   |  45 ++--


Title should be for tap as well or is this just for tun?


It is also for tap. I will update the title in v6.




  drivers/net/tun.c   |  46 
  drivers/net/tun_vnet.h  | 102 +++-
  include/linux/if_tap.h  |   2 +
  include/uapi/linux/if_tun.h |  48 +
  7 files changed, 223 insertions(+), 28 deletions(-)

diff --git a/Documentation/networking/tuntap.rst 
b/Documentation/networking/tuntap.rst
index 4d7087f727be..86b4ae8caa8a 100644
--- a/Documentation/networking/tuntap.rst
+++ b/Documentation/networking/tuntap.rst
@@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
}

+3.4 Reference
+-
+
+``linux/if_tun.h`` defines the interface described below:
+
+.. kernel-doc:: include/uapi/linux/if_tun.h
+
  Universal TUN/TAP device driver Frequently Asked Question
  =

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 9920b3a68ed1..e2a7bd703550 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
 tristate "Universal TUN/TAP device driver support"
 depends on INET
 select CRC32
+   select SKB_EXTENSIONS


Then we need this for macvtap at least as well?


 help
   TUN/TAP provides packet reception and transmission for user space
   programs.  It can be viewed as a simple Point-to-Point or Ethernet
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 9a34ceed0c2c..5e2fbe63ca47 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -179,6 +179,16 @@ static void tap_put_queue(struct tap_queue *q)
 sock_put(&q->sk);
  }

+static struct virtio_net_hash *tap_add_hash(struct sk_buff *skb)
+{
+   return (struct virtio_net_hash *)skb->cb;


Any reason that tap uses skb->cb but not skb extensions? (And is it
safe to use that without cloning?)


tun adds virtio_net_hash to a skb in ndo_select_queue(), but it does not 
immediately put it into its ptr_ring; instead ndo_start_xmit() does so. 
It is hard to ensure that nobody modifies skb->cb between the two calls.


The situation is different for tap. tap_handle_frame() adds
virtio_net_hash to a skb and immediately adds it to its ptr_ring, so
nobody should touch it in between.





+}
+
+static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
+{
+   return (const struct virtio_net_hash *)skb->cb;
+}
+
  /*
   * Select a queue based on the rxq of the device on which this packet
   * arrived. If the incoming device is not mq, calculate a flow hash
@@ -189,6 +199,7 @@ static void tap_put_queue(struct tap_queue *q)
  static struct tap_queue *tap_get_queue(struct tap_dev *tap,
struct sk_buff *skb)
  {
+   struct flow_keys_basic keys_basic;
 struct tap_queue *queue = NULL;
 /* Access to taps array is protected by rcu, but access to numvtaps
  * isn't. Below we use it to lookup a queue, but treat it as a hint
@@ -198,15 +209,32 @@ static struct tap_queue *tap_get_queue(struct tap_dev 
*tap,
 int numvtaps = READ_ONCE(tap->numvtaps);
 __u32 rxq;

+   *tap_add_hash(skb) = (struct virtio_net_hash) { .report = 
VIRTIO_NET_HASH_REPORT_NONE };
+
 if (!numvtaps)
 goto out;

 if (numvtaps == 1)
 goto single;

+   if (!skb->l4_hash && !skb->sw_hash) {
+   struct flow_keys keys;
+
+   skb_flow_dissect_flow_keys(skb, &keys, 
FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+   rxq = flow_hash_from_keys(&keys);
+   keys_basic = (struct flow_keys_basic) {
+   .control = keys.control,
+   .basic = keys.basic
+   };
+   } else {
+   skb_flow_dissect_flow_keys_basic(NULL, skb, &keys_basic, NULL, 
0, 0, 0,
+
FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+   rxq = skb->hash;
+   }
+
 /* Check if we can use 

Re: [PATCH RFC v5 07/10] tun: Introduce virtio-net RSS

2024-10-12 Thread Akihiko Odaki

On 2024/10/09 17:14, Jason Wang wrote:

On Tue, Oct 8, 2024 at 2:55 PM Akihiko Odaki  wrote:


RSS is a receive steering algorithm that can be negotiated for use with
virtio_net. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of the eBPF steering program.

Introduce code to perform RSS in the kernel in order to overcome
these challenges. An alternative solution is to extend the eBPF steering
program so that it will be able to report to the userspace, but I didn't
opt for it because the current mechanism of the eBPF steering program
relies on legacy context rewriting, and introducing kfunc-based eBPF
would result in a non-UAPI dependency while the other relevant
virtualization APIs such as KVM and vhost_net are UAPIs.

Signed-off-by: Akihiko Odaki 
---
  drivers/net/tap.c   | 11 +-
  drivers/net/tun.c   | 57 ---
  drivers/net/tun_vnet.h  | 96 +
  include/linux/if_tap.h  |  4 +-
  include/uapi/linux/if_tun.h | 27 +
  5 files changed, 169 insertions(+), 26 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 5e2fbe63ca47..a58b83285af4 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -207,6 +207,7 @@ static struct tap_queue *tap_get_queue(struct tap_dev *tap,
  * racing against queue removal.
  */
 int numvtaps = READ_ONCE(tap->numvtaps);
+   struct tun_vnet_hash_container *vnet_hash = 
rcu_dereference(tap->vnet_hash);
 __u32 rxq;

 *tap_add_hash(skb) = (struct virtio_net_hash) { .report = 
VIRTIO_NET_HASH_REPORT_NONE };
@@ -217,6 +218,12 @@ static struct tap_queue *tap_get_queue(struct tap_dev *tap,
 if (numvtaps == 1)
 goto single;

+   if (vnet_hash && (vnet_hash->common.flags & TUN_VNET_HASH_RSS)) {
+   rxq = tun_vnet_rss_select_queue(numvtaps, vnet_hash, skb, 
tap_add_hash);
+   queue = rcu_dereference(tap->taps[rxq]);
+   goto out;
+   }
+
 if (!skb->l4_hash && !skb->sw_hash) {
 struct flow_keys keys;

@@ -234,7 +241,7 @@ static struct tap_queue *tap_get_queue(struct tap_dev *tap,

 /* Check if we can use flow to select a queue */
 if (rxq) {
-   tun_vnet_hash_report(&tap->vnet_hash, skb, &keys_basic, rxq, 
tap_add_hash);
+   tun_vnet_hash_report(vnet_hash, skb, &keys_basic, rxq, 
tap_add_hash);
 queue = rcu_dereference(tap->taps[rxq % numvtaps]);
 goto out;
 }
@@ -1058,7 +1065,7 @@ static long tap_ioctl(struct file *file, unsigned int cmd,
 tap = rtnl_dereference(q->tap);
 ret = tun_vnet_ioctl(&q->vnet_hdr_sz, &q->flags,
  tap ? &tap->vnet_hash : NULL, -EINVAL,
-cmd, sp);
+true, cmd, sp);
 rtnl_unlock();
 return ret;
 }
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 27308417b834..18528568aed7 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -209,7 +209,7 @@ struct tun_struct {
 struct bpf_prog __rcu *xdp_prog;
 struct tun_prog __rcu *steering_prog;
 struct tun_prog __rcu *filter_prog;
-   struct tun_vnet_hash vnet_hash;
+   struct tun_vnet_hash_container __rcu *vnet_hash;
 struct ethtool_link_ksettings link_ksettings;
 /* init args */
 struct file *file;
@@ -468,7 +468,9 @@ static const struct virtio_net_hash *tun_find_hash(const 
struct sk_buff *skb)
   * the userspace application move between processors, we may get a
   * different rxq no. here.
   */
-static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
+static u16 tun_automq_select_queue(struct tun_struct *tun,
+  const struct tun_vnet_hash_container 
*vnet_hash,
+  struct sk_buff *skb)
  {
 struct flow_keys keys;
 struct flow_keys_basic keys_basic;
@@ -493,7 +495,7 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, 
struct sk_buff *skb)
 .control = keys.control,
 .basic = keys.basic
 };
-   tun_vnet_hash_report(&tun->vnet_hash, skb, &keys_basic, skb->l4_hash ? 
skb->hash : txq,
+   tun_vnet_hash_report(vnet_hash, skb, &keys_basic, skb->l4_hash ? 
skb->hash : txq,
  tun_add_hash);

 return txq;
@@ -523,10 +525,17 @@ static u16 tun_select_queue(struct net_device *dev, 
struct sk_buf

Re: [PATCH RFC v5 01/10] virtio_net: Add functions for hashing

2024-10-12 Thread Akihiko Odaki

On 2024/10/09 22:51, Willem de Bruijn wrote:

Akihiko Odaki wrote:

They are useful to implement VIRTIO_NET_F_RSS and
VIRTIO_NET_F_HASH_REPORT.

Signed-off-by: Akihiko Odaki 
---
  include/linux/virtio_net.h | 188 +


No need for these to be in header files


I naively followed prior examples in this file. Do you have an 
alternative idea?




Re: [PATCH RFC v3 0/9] tun: Introduce virtio-net hashing feature

2024-09-23 Thread Akihiko Odaki

On 2024/09/15 21:48, Stephen Hemminger wrote:

On Sun, 15 Sep 2024 10:17:39 +0900
Akihiko Odaki  wrote:


virtio-net has two uses of hashes: one is RSS and the other is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Introduce code to compute the hashes in the kernel in order to overcome
these challenges.

An alternative solution is to extend the eBPF steering program so that it
will be able to report to the userspace, but it is based on context
rewrites, which are in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt for an ioctl to align with other relevant UAPIs (KVM
and vhost_net).


This will be useful for DPDK. But there still are cases where custom
flow rules are needed, i.e., the RSS happens after other TC rules.
It would be good if skbedit supported RSS as an option.


Hi,

It is nice to hear about a use case other than QEMU or virtualization. I 
implemented RSS as tuntap ioctl because:

- It is easier to configure for the user of tuntap (e.g., QEMU)
- It implements hash reporting, which is specific to tuntap.

You can still add skbedit if you want to override RSS for some packets
with a filter. Please tell me if it is not sufficient for your use case.


Regards,
Akihiko Odaki



Re: [PATCH RFC v3 2/9] virtio_net: Add functions for hashing

2024-09-23 Thread Akihiko Odaki

On 2024/09/18 14:50, Willem de Bruijn wrote:

Akihiko Odaki wrote:

They are useful to implement VIRTIO_NET_F_RSS and
VIRTIO_NET_F_HASH_REPORT.

Signed-off-by: Akihiko Odaki 
---
  include/linux/virtio_net.h | 198 +
  1 file changed, 198 insertions(+)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 6c395a2600e8..7ee2e2f2625a 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -9,6 +9,183 @@
  #include 
  #include 
  
+struct virtio_net_hash {

+   u32 value;
+   u16 report;
+};
+
+struct virtio_net_toeplitz_state {
+   u32 hash;
+   u32 key_buffer;
+   const __be32 *key;
+};
+
+#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
+
+#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
+
+static inline void virtio_net_toeplitz(struct virtio_net_toeplitz_state *state,
+  const __be32 *input, size_t len)
+{
+   u32 key;
+
+   while (len) {
+   state->key++;
+   key = be32_to_cpu(*state->key);
+
+   for (u32 bit = BIT(31); bit; bit >>= 1) {
+   if (be32_to_cpu(*input) & bit)
+   state->hash ^= state->key_buffer;
+
+   state->key_buffer =
+   (state->key_buffer << 1) | !!(key & bit);
+   }
+
+   input++;
+   len--;
+   }
+}
+
+static inline u8 virtio_net_hash_key_length(u32 types)
+{
+   size_t len = 0;
+
+   if (types & VIRTIO_NET_HASH_REPORT_IPv4)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs));
+
+   if (types &
+   (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   if (types & VIRTIO_NET_HASH_REPORT_IPv6)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs));
+
+   if (types &
+   (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   return 4 + len;


Avoid raw constants like this 4. What field does it capture?


It is: sizeof_field(struct virtio_net_toeplitz_state, key_buffer)
I'll replace it in v4.
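
As an illustration of the replacement, here is a userspace sketch. The
sizeof_field() definition below is a stand-in for the kernel macro, the
struct mirrors the one quoted above with fixed-width userspace types, and
toeplitz_key_length() is a hypothetical helper, not the function from the
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's sizeof_field() macro. */
#define sizeof_field(TYPE, MEMBER) sizeof(((TYPE *)0)->MEMBER)

/* Mirrors the layout quoted above, using userspace integer types. */
struct virtio_net_toeplitz_state {
	uint32_t hash;
	uint32_t key_buffer;
	const uint32_t *key;
};

/* Hypothetical helper: key bytes needed to hash input_len bytes. */
static size_t toeplitz_key_length(size_t input_len)
{
	/* The key buffer is preloaded with one word before any input. */
	return sizeof_field(struct virtio_net_toeplitz_state, key_buffer) +
	       input_len;
}
```

For the longest input (32 bytes of IPv6 addresses plus 4 bytes of ports) this
gives 40, which matches VIRTIO_NET_RSS_MAX_KEY_SIZE.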



Instead of working from shortest to longest and using max, how about
the inverse and return as soon as a match is found.


I think it is less error-prone to use max() as it does not require
sorting the numbers. The compiler should properly optimize it in the way
you suggested.
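
To compare the two styles concretely, here is a userspace sketch. The flag
names and length constants are local to this example (the bit positions
follow the virtio RSS hash types; the lengths are the usual address and port
pair sizes), and both functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Flag bits local to this sketch. */
enum {
	EX_IPV4  = 1u << 0,
	EX_TCPV4 = 1u << 1,
	EX_UDPV4 = 1u << 2,
	EX_IPV6  = 1u << 3,
	EX_TCPV6 = 1u << 4,
	EX_UDPV6 = 1u << 5,
};

/* Input sizes in bytes: address pairs and the port pair. */
#define EX_LEN_V4_ADDRS	8
#define EX_LEN_V6_ADDRS	32
#define EX_LEN_PORTS	4

static size_t ex_max(size_t a, size_t b)
{
	return a > b ? a : b;
}

/* max() style: the checks can appear in any order. */
static size_t input_len_max(unsigned int types)
{
	size_t len = 0;

	if (types & EX_IPV4)
		len = ex_max(len, EX_LEN_V4_ADDRS);
	if (types & (EX_TCPV4 | EX_UDPV4))
		len = ex_max(len, EX_LEN_V4_ADDRS + EX_LEN_PORTS);
	if (types & EX_IPV6)
		len = ex_max(len, EX_LEN_V6_ADDRS);
	if (types & (EX_TCPV6 | EX_UDPV6))
		len = ex_max(len, EX_LEN_V6_ADDRS + EX_LEN_PORTS);

	return len;
}

/* Early-return style: the checks must be sorted from longest to shortest. */
static size_t input_len_early(unsigned int types)
{
	if (types & (EX_TCPV6 | EX_UDPV6))
		return EX_LEN_V6_ADDRS + EX_LEN_PORTS;
	if (types & EX_IPV6)
		return EX_LEN_V6_ADDRS;
	if (types & (EX_TCPV4 | EX_UDPV4))
		return EX_LEN_V4_ADDRS + EX_LEN_PORTS;
	if (types & EX_IPV4)
		return EX_LEN_V4_ADDRS;

	return 0;
}
```

Both return the same value for every combination of types. The early-return
form stays correct only as long as the checks remain ordered by length.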





+}
+
+static inline u32 virtio_net_hash_report(u32 types,
+struct flow_dissector_key_basic key)
+{
+   switch (key.n_proto) {
+   case htons(ETH_P_IP):
+   if (key.ip_proto == IPPROTO_TCP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
+   return VIRTIO_NET_HASH_REPORT_TCPv4;
+
+   if (key.ip_proto == IPPROTO_UDP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
+   return VIRTIO_NET_HASH_REPORT_UDPv4;
+
+   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
+   return VIRTIO_NET_HASH_REPORT_IPv4;
+
+   return VIRTIO_NET_HASH_REPORT_NONE;
+
+   case htons(ETH_P_IPV6):
+   if (key.ip_proto == IPPROTO_TCP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
+   return VIRTIO_NET_HASH_REPORT_TCPv6;
+
+   if (key.ip_proto == IPPROTO_UDP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
+   return VIRTIO_NET_HASH_REPORT_UDPv6;
+
+   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
+   return VIRTIO_NET_HASH_REPORT_IPv6;
+
+   return VIRTIO_NET_HASH_REPORT_NONE;
+
+   default:
+   return VIRTIO_NET_HASH_REPORT_NONE;
+   }
+}
+
+static inline bool virtio_net_hash_rss(const struct sk_buff *skb,
+  u32 types, const __be32 *key,
+  struct virtio_net_hash *hash)
+{
+   u16 report;


nit: move below the 

Re: [PATCH RFC v3 6/9] tun: Introduce virtio-net hash reporting feature

2024-09-23 Thread Akihiko Odaki

On 2024/09/18 15:17, Willem de Bruijn wrote:

Akihiko Odaki wrote:

Allow the guest to reuse the hash value to make receive steering
consistent between the host and guest, and to save hash computation.

Signed-off-by: Akihiko Odaki 
---
  Documentation/networking/tuntap.rst |   7 ++
  drivers/net/Kconfig |   1 +
  drivers/net/tun.c   | 146 +++-
  include/uapi/linux/if_tun.h |  44 +++
  4 files changed, 180 insertions(+), 18 deletions(-)

diff --git a/Documentation/networking/tuntap.rst 
b/Documentation/networking/tuntap.rst
index 4d7087f727be..86b4ae8caa8a 100644
--- a/Documentation/networking/tuntap.rst
+++ b/Documentation/networking/tuntap.rst
@@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
}
  
+3.4 Reference

+-
+
+``linux/if_tun.h`` defines the interface described below:
+
+.. kernel-doc:: include/uapi/linux/if_tun.h
+
  Universal TUN/TAP device driver Frequently Asked Question
  =
  
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig

index 9920b3a68ed1..e2a7bd703550 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
tristate "Universal TUN/TAP device driver support"
depends on INET
select CRC32
+   select SKB_EXTENSIONS
help
  TUN/TAP provides packet reception and transmission for user space
  programs.  It can be viewed as a simple Point-to-Point or Ethernet
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9d93ab9ee58f..b8fcd71becac 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -173,6 +173,10 @@ struct tun_prog {
struct bpf_prog *prog;
  };
  
+struct tun_vnet_hash_container {

+   struct tun_vnet_hash common;
+};
+
  /* Since the socket were moved to tun_file, to preserve the behavior of 
persist
   * device, socket filter, sndbuf and vnet header size were restore when the
   * file were attached to a persist device.
@@ -210,6 +214,7 @@ struct tun_struct {
struct bpf_prog __rcu *xdp_prog;
struct tun_prog __rcu *steering_prog;
struct tun_prog __rcu *filter_prog;
+   struct tun_vnet_hash_container __rcu *vnet_hash;


This is just

+struct tun_vnet_hash {
+   u32 value;
+   u16 report;
+};

Can just be fields in the struct directly.


I will change to store struct tun_vnet_hash directly.



Also, only one bit really used for report, so probably can be
condensed further.


It is more than one bit; the report types are defined as follows:
#define VIRTIO_NET_HASH_REPORT_NONE     0
#define VIRTIO_NET_HASH_REPORT_IPv4     1
#define VIRTIO_NET_HASH_REPORT_TCPv4    2
#define VIRTIO_NET_HASH_REPORT_UDPv4    3
#define VIRTIO_NET_HASH_REPORT_IPv6     4
#define VIRTIO_NET_HASH_REPORT_TCPv6    5
#define VIRTIO_NET_HASH_REPORT_UDPv6    6
#define VIRTIO_NET_HASH_REPORT_IPv6_EX  7
#define VIRTIO_NET_HASH_REPORT_TCPv6_EX 8
#define VIRTIO_NET_HASH_REPORT_UDPv6_EX 9
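
As a quick check of how far the field could be condensed, the sketch below
counts the bits needed for the largest report value. bits_needed() is a
hypothetical helper written for this example only.

```c
#include <assert.h>

/* Largest report type in the list above (UDPv6_EX). */
#define EX_HASH_REPORT_MAX 9

/* Number of bits needed to represent every value in 0..max. */
static unsigned int bits_needed(unsigned int max)
{
	unsigned int bits = 0;

	while (max) {
		bits++;
		max >>= 1;
	}

	return bits;
}
```

bits_needed(EX_HASH_REPORT_MAX) is 4, so the report type needs at least
four bits rather than one.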




struct ethtool_link_ksettings link_ksettings;
/* init args */
struct file *file;
@@ -221,6 +226,11 @@ struct veth {
__be16 h_vlan_TCI;
  };
  
+static const struct tun_vnet_hash tun_vnet_hash_cap = {

+   .flags = TUN_VNET_HASH_REPORT,
+   .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
+};
+
  static void tun_flow_init(struct tun_struct *tun);
  static void tun_flow_uninit(struct tun_struct *tun);
  
@@ -322,10 +332,17 @@ static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)

if (get_user(be, argp))
return -EFAULT;
  
-	if (be)

+   if (be) {
+   struct tun_vnet_hash_container *vnet_hash = 
rtnl_dereference(tun->vnet_hash);
+
+   if (!(tun->flags & TUN_VNET_LE) &&
+   vnet_hash && (vnet_hash->flags & TUN_VNET_HASH_REPORT))
+   return -EBUSY;
+


Doesn't be here imply !tun->flags & TUN_VNET_LE? Same again below.


Unfortunately no. TUN_VNET_LE and TUN_VNET_BE can be set at the same 
time, and TUN_VNET_LE is enforced in such a case.
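
A simplified sketch of that precedence, with made-up flag values and a
hypothetical helper name (the real tun code also depends on kernel
configuration, which is ignored here):

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values local to this sketch; the real TUN_VNET_* values differ. */
#define EX_VNET_LE 0x1u
#define EX_VNET_BE 0x2u

/*
 * Returns true if vnet headers are little-endian. LE wins when both
 * flags are set; BE applies only when LE is clear.
 */
static bool ex_vnet_is_little_endian(unsigned int flags, bool legacy_le)
{
	if (flags & EX_VNET_LE)
		return true;
	if (flags & EX_VNET_BE)
		return false;

	return legacy_le;
}
```

This is why setting BE alone does not imply that LE is clear: the check on
TUN_VNET_LE has to be made explicitly.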





tun->flags |= TUN_VNET_BE;
-   else
+   } else {
tun->flags &= ~TUN_VNET_BE;
+   }
  
  	return 0;

  }
@@ -522,14 +539,20 @@ static inline void tun_flow_save_rps_rxhash(struct 
tun_flow_entry *e, u32 hash)
   * the userspace application move between processors, we may get a
   * different rxq no. here.
   */
-static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
+static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb,
+  const struct tun_vnet_hash_container 
*vnet_hash)
  {
+   struct tun_vnet_hash_ext *ext;
+   struct flow_ke

Re: [PATCH RFC v3 7/9] tun: Introduce virtio-net RSS

2024-09-24 Thread Akihiko Odaki




On 2024/09/24 10:56, Akihiko Odaki wrote:

On 2024/09/18 15:28, Willem de Bruijn wrote:

Akihiko Odaki wrote:

RSS is a receive steering algorithm that can be negotiated for use with
virtio_net. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of the eBPF steering program.

Introduce code to perform RSS in the kernel in order to overcome
these challenges. An alternative solution is to extend the eBPF steering
program so that it will be able to report to the userspace, but I didn't
opt for it because the current mechanism of the eBPF steering program
relies on legacy context rewriting, and introducing kfunc-based eBPF
would result in a non-UAPI dependency while the other relevant
virtualization APIs such as KVM and vhost_net are UAPIs.

Signed-off-by: Akihiko Odaki 
---
  drivers/net/tun.c           | 119 +++-
  include/uapi/linux/if_tun.h |  27 ++
  2 files changed, 133 insertions(+), 13 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b8fcd71becac..5a429b391144 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -175,6 +175,9 @@ struct tun_prog {
  struct tun_vnet_hash_container {
  struct tun_vnet_hash common;
+    struct tun_vnet_hash_rss rss;
+    __be32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
+    u16 rss_indirection_table[];
  };
  /* Since the socket were moved to tun_file, to preserve the 
behavior of persist

@@ -227,7 +230,7 @@ struct veth {
  };
  static const struct tun_vnet_hash tun_vnet_hash_cap = {
-    .flags = TUN_VNET_HASH_REPORT,
+    .flags = TUN_VNET_HASH_REPORT | TUN_VNET_HASH_RSS,
  .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
  };
@@ -591,6 +594,36 @@ static u16 tun_ebpf_select_queue(struct 
tun_struct *tun, struct sk_buff *skb)

  return ret % numqueues;
  }
+static u16 tun_vnet_rss_select_queue(struct tun_struct *tun,
+ struct sk_buff *skb,
+ const struct tun_vnet_hash_container *vnet_hash)
+{
+    struct tun_vnet_hash_ext *ext;
+    struct virtio_net_hash hash;
+    u32 numqueues = READ_ONCE(tun->numqueues);
+    u16 txq, index;
+
+    if (!numqueues)
+    return 0;
+
+    if (!virtio_net_hash_rss(skb, vnet_hash->common.types, 
vnet_hash->rss_key,

+ &hash))
+    return vnet_hash->rss.unclassified_queue % numqueues;
+
+    if (vnet_hash->common.flags & TUN_VNET_HASH_REPORT) {
+    ext = skb_ext_add(skb, SKB_EXT_TUN_VNET_HASH);
+    if (ext) {
+    ext->value = hash.value;
+    ext->report = hash.report;
+    }
+    }
+
+    index = hash.value & vnet_hash->rss.indirection_table_mask;
+    txq = READ_ONCE(vnet_hash->rss_indirection_table[index]);
+
+    return txq % numqueues;
+}
+
  static u16 tun_select_queue(struct net_device *dev, struct sk_buff 
*skb,

  struct net_device *sb_dev)
  {
@@ -603,7 +636,10 @@ static u16 tun_select_queue(struct net_device 
*dev, struct sk_buff *skb,

  } else {
  struct tun_vnet_hash_container *vnet_hash = 
rcu_dereference(tun->vnet_hash);

-    ret = tun_automq_select_queue(tun, skb, vnet_hash);
+    if (vnet_hash && (vnet_hash->common.flags & TUN_VNET_HASH_RSS))
+    ret = tun_vnet_rss_select_queue(tun, skb, vnet_hash);
+    else
+    ret = tun_automq_select_queue(tun, skb, vnet_hash);
  }
  rcu_read_unlock();
@@ -3085,13 +3121,9 @@ static int tun_set_queue(struct file *file, 
struct ifreq *ifr)

  }
  static int tun_set_ebpf(struct tun_struct *tun, struct tun_prog 
__rcu **prog_p,

-    void __user *data)
+    int fd)
  {
  struct bpf_prog *prog;
-    int fd;
-
-    if (copy_from_user(&fd, data, sizeof(fd)))
-    return -EFAULT;
  if (fd == -1) {
  prog = NULL;
@@ -3157,6 +3189,7 @@ static long __tun_chr_ioctl(struct file *file, 
unsigned int cmd,

  int ifindex;
  int sndbuf;
  int vnet_hdr_sz;
+    int fd;
  int le;
  int ret;
  bool do_notify = false;
@@ -3460,11 +3493,27 @@ static long __tun_chr_ioctl(struct file 
*file, unsigned int cmd,

  break;
  case TUNSETSTEERINGEBPF:
-    ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
+    if (get_user(fd, (int __user *)argp)) {
+    ret = -EFAULT;
+    break;
+    }
+
+    vnet_hash = rtnl_dereference(tun->vnet_hash);
+    if (fd != -1 && vnet_hash && (vnet_hash->common.flags & 
TUN_VNET_HASH_RSS)) {

+    ret = -EBUSY;
+    break;
+    }
+
+    ret = tun_set_ebpf(tun, &tun->steering_prog, fd);
  break;
  case TUNSETFILTEREBPF:
-    ret = tun

Re: [PATCH RFC v3 7/9] tun: Introduce virtio-net RSS

2024-09-24 Thread Akihiko Odaki

On 2024/09/18 15:28, Willem de Bruijn wrote:

Akihiko Odaki wrote:

RSS is a receive steering algorithm that can be negotiated for use with
virtio_net. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of the eBPF steering program.

Introduce code to perform RSS in the kernel in order to overcome
these challenges. An alternative solution is to extend the eBPF steering
program so that it will be able to report to the userspace, but I didn't
opt for it because the current mechanism of the eBPF steering program
relies on legacy context rewriting, and introducing kfunc-based eBPF
would result in a non-UAPI dependency while the other relevant
virtualization APIs such as KVM and vhost_net are UAPIs.

Signed-off-by: Akihiko Odaki 
---
  drivers/net/tun.c   | 119 +++-
  include/uapi/linux/if_tun.h |  27 ++
  2 files changed, 133 insertions(+), 13 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b8fcd71becac..5a429b391144 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -175,6 +175,9 @@ struct tun_prog {
  
  struct tun_vnet_hash_container {

struct tun_vnet_hash common;
+   struct tun_vnet_hash_rss rss;
+   __be32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
+   u16 rss_indirection_table[];
  };
  
  /* Since the socket were moved to tun_file, to preserve the behavior of persist

@@ -227,7 +230,7 @@ struct veth {
  };
  
  static const struct tun_vnet_hash tun_vnet_hash_cap = {

-   .flags = TUN_VNET_HASH_REPORT,
+   .flags = TUN_VNET_HASH_REPORT | TUN_VNET_HASH_RSS,
.types = VIRTIO_NET_SUPPORTED_HASH_TYPES
  };
  
@@ -591,6 +594,36 @@ static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)

return ret % numqueues;
  }
  
+static u16 tun_vnet_rss_select_queue(struct tun_struct *tun,

+struct sk_buff *skb,
+const struct tun_vnet_hash_container 
*vnet_hash)
+{
+   struct tun_vnet_hash_ext *ext;
+   struct virtio_net_hash hash;
+   u32 numqueues = READ_ONCE(tun->numqueues);
+   u16 txq, index;
+
+   if (!numqueues)
+   return 0;
+
+   if (!virtio_net_hash_rss(skb, vnet_hash->common.types, 
vnet_hash->rss_key,
+&hash))
+   return vnet_hash->rss.unclassified_queue % numqueues;
+
+   if (vnet_hash->common.flags & TUN_VNET_HASH_REPORT) {
+   ext = skb_ext_add(skb, SKB_EXT_TUN_VNET_HASH);
+   if (ext) {
+   ext->value = hash.value;
+   ext->report = hash.report;
+   }
+   }
+
+   index = hash.value & vnet_hash->rss.indirection_table_mask;
+   txq = READ_ONCE(vnet_hash->rss_indirection_table[index]);
+
+   return txq % numqueues;
+}
+
  static u16 tun_select_queue(struct net_device *dev, struct sk_buff *skb,
struct net_device *sb_dev)
  {
@@ -603,7 +636,10 @@ static u16 tun_select_queue(struct net_device *dev, struct 
sk_buff *skb,
} else {
struct tun_vnet_hash_container *vnet_hash = 
rcu_dereference(tun->vnet_hash);
  
-		ret = tun_automq_select_queue(tun, skb, vnet_hash);

+   if (vnet_hash && (vnet_hash->common.flags & TUN_VNET_HASH_RSS))
+   ret = tun_vnet_rss_select_queue(tun, skb, vnet_hash);
+   else
+   ret = tun_automq_select_queue(tun, skb, vnet_hash);
}
rcu_read_unlock();
  
@@ -3085,13 +3121,9 @@ static int tun_set_queue(struct file *file, struct ifreq *ifr)

  }
  
  static int tun_set_ebpf(struct tun_struct *tun, struct tun_prog __rcu **prog_p,

-   void __user *data)
+   int fd)
  {
struct bpf_prog *prog;
-   int fd;
-
-   if (copy_from_user(&fd, data, sizeof(fd)))
-   return -EFAULT;
  
  	if (fd == -1) {

prog = NULL;
@@ -3157,6 +3189,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned 
int cmd,
int ifindex;
int sndbuf;
int vnet_hdr_sz;
+   int fd;
int le;
int ret;
bool do_notify = false;
@@ -3460,11 +3493,27 @@ static long __tun_chr_ioctl(struct file *file, unsigned 
int cmd,
break;
  
  	case TUNSETSTEERINGEBPF:

-   ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
+   if (get_user(fd, (int __user *)argp)) {
+   ret = -EFAULT;
+   break;
+   }
+
+   vnet_hash = rtnl_dereference(tun->vnet_

[PATCH RFC v4 0/9] tun: Introduce virtio-net hashing feature

2024-09-24 Thread Akihiko Odaki
virtio-net has two uses for hashes: one is RSS and the other is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Introduce the code to compute hashes in the kernel in order to overcome
these challenges.

An alternative solution is to extend the eBPF steering program so that it
can report the hash to userspace, but it is based on context
rewrites, which are in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt for an ioctl to align with other relevant UAPIs (KVM
and vhost_net).

The patches for QEMU to use this new feature were submitted as RFC and
are available at:
https://patchew.org/QEMU/20240915-hash-v3-0-79cb08d28...@daynix.com/

This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/

V1 -> V2:
  Changed to introduce a new BPF program type.

Signed-off-by: Akihiko Odaki 
---
Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
  "tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: 
https://lore.kernel.org/r/20240915-rss-v3-0-c630015db...@daynix.com

Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
  "tun: Introduce virtio-net hash reporting feature" and
  "tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
  RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: 
https://lore.kernel.org/r/20231015141644.260646-1-akihiko.od...@daynix.com

---
Akihiko Odaki (9):
  skbuff: Introduce SKB_EXT_TUN_VNET_HASH
  virtio_net: Add functions for hashing
  net: flow_dissector: Export flow_keys_dissector_symmetric
  tap: Pad virtio header with zero
  tun: Pad virtio header with zero
  tun: Introduce virtio-net hash reporting feature
  tun: Introduce virtio-net RSS
  selftest: tun: Add tests for virtio-net hashing
  vhost/net: Support VIRTIO_NET_F_HASH_REPORT

 Documentation/networking/tuntap.rst  |   7 +
 drivers/net/Kconfig  |   1 +
 drivers/net/tap.c|   2 +-
 drivers/net/tun.c| 255 --
 drivers/vhost/net.c  |  16 +-
 include/linux/if_tun.h   |   5 +
 include/linux/skbuff.h   |   3 +
 include/linux/virtio_net.h   | 174 +
 include/net/flow_dissector.h |   1 +
 include/uapi/linux/if_tun.h  |  71 
 net/core/flow_dissector.c|   3 +-
 net/core/skbuff.c|   4 +
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/tun.c| 666 ++-
 14 files changed, 1170 insertions(+), 40 deletions(-)
---
base-commit: 752ebcbe87aceeb6334e846a466116197711a982
change-id: 20240403-rss-e737d89efa77

Best regards,
-- 
Akihiko Odaki 
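
To make the RSS steering step discussed in this cover letter concrete, here is a minimal userspace sketch in C. It is not the code from the series: `struct rss_config` and `rss_select_queue()` are hypothetical names, and the indirection-table lookup is assumed to follow the usual virtio-net scheme (mask the hash, look up the table, fall back to an unclassified queue when no hash is available).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of an RSS configuration. The field names are
 * illustrative and are not the structures added by this series. */
struct rss_config {
	uint16_t indirection_table_mask;   /* table length minus one (power of two) */
	uint16_t unclassified_queue;       /* queue for packets without a hash */
	const uint16_t *indirection_table; /* maps masked hash to a queue index */
};

/* Pick a receive queue. A hash_report of 0 means no hash could be computed. */
static uint16_t rss_select_queue(const struct rss_config *cfg, uint32_t hash,
				 uint16_t hash_report, uint32_t numqueues)
{
	uint16_t queue;

	assert(cfg && cfg->indirection_table);

	if (!numqueues)
		return 0;

	if (!hash_report)
		return cfg->unclassified_queue % numqueues;

	queue = cfg->indirection_table[hash & cfg->indirection_table_mask];

	/* Clamp so a stale table entry can never index past the live queues. */
	return queue % numqueues;
}
```

The modulo on the result reflects the concern, noted in the v3 changelog, that the queue index must not overflow the number of attached queues.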




[PATCH RFC v4 2/9] virtio_net: Add functions for hashing

2024-09-24 Thread Akihiko Odaki
They are useful to implement VIRTIO_NET_F_RSS and
VIRTIO_NET_F_HASH_REPORT.

Signed-off-by: Akihiko Odaki 
---
 include/linux/virtio_net.h | 174 +
 1 file changed, 174 insertions(+)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 276ca543ef44..f7a4149efb3e 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -9,6 +9,180 @@
 #include 
 #include 
 
+struct virtio_net_hash {
+   u32 value;
+   u16 report;
+};
+
+struct virtio_net_toeplitz_state {
+   u32 hash;
+   u32 key_buffer;
+   const __be32 *key;
+};
+
+#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
+
+#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
+
+static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
+   const __be32 *input, size_t len)
+{
+   u32 key;
+
+   while (len) {
+   state->key++;
+   key = be32_to_cpu(*state->key);
+
+   for (u32 bit = BIT(31); bit; bit >>= 1) {
+   if (be32_to_cpu(*input) & bit)
+   state->hash ^= state->key_buffer;
+
+   state->key_buffer =
+   (state->key_buffer << 1) | !!(key & bit);
+   }
+
+   input++;
+   len--;
+   }
+}
+
+static inline u8 virtio_net_hash_key_length(u32 types)
+{
+   size_t len = 0;
+
+   if (types & VIRTIO_NET_HASH_REPORT_IPv4)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs));
+
+   if (types &
+   (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   if (types & VIRTIO_NET_HASH_REPORT_IPv6)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs));
+
+   if (types &
+   (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   return sizeof_field(struct virtio_net_toeplitz_state, key_buffer) + len;
+}
+
+static inline u32 virtio_net_hash_report(u32 types,
+struct flow_dissector_key_basic key)
+{
+   switch (key.n_proto) {
+   case cpu_to_be16(ETH_P_IP):
+   if (key.ip_proto == IPPROTO_TCP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
+   return VIRTIO_NET_HASH_REPORT_TCPv4;
+
+   if (key.ip_proto == IPPROTO_UDP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
+   return VIRTIO_NET_HASH_REPORT_UDPv4;
+
+   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
+   return VIRTIO_NET_HASH_REPORT_IPv4;
+
+   return VIRTIO_NET_HASH_REPORT_NONE;
+
+   case cpu_to_be16(ETH_P_IPV6):
+   if (key.ip_proto == IPPROTO_TCP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
+   return VIRTIO_NET_HASH_REPORT_TCPv6;
+
+   if (key.ip_proto == IPPROTO_UDP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
+   return VIRTIO_NET_HASH_REPORT_UDPv6;
+
+   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
+   return VIRTIO_NET_HASH_REPORT_IPv6;
+
+   return VIRTIO_NET_HASH_REPORT_NONE;
+
+   default:
+   return VIRTIO_NET_HASH_REPORT_NONE;
+   }
+}
+
+static inline void virtio_net_hash_rss(const struct sk_buff *skb,
+  u32 types, const __be32 *key,
+  struct virtio_net_hash *hash)
+{
+   struct virtio_net_toeplitz_state toeplitz_state = {
+   .key_buffer = be32_to_cpu(*key),
+   .key = key
+   };
+   struct flow_keys flow;
+   u16 report;
+
+   if (!skb_flow_dissect_flow_keys(skb, &flow, 0)) {
+   hash->report = VIRTIO_NET_HASH_REPORT_NONE;
+   return;
+   }
+
+   report = virtio_net_hash_report(types, flow.basic);
+
+   switch (report) {
+   case VIRTIO_NET_HASH_RE

[PATCH RFC v4 3/9] net: flow_dissector: Export flow_keys_dissector_symmetric

2024-09-24 Thread Akihiko Odaki
flow_keys_dissector_symmetric is useful to derive a symmetric hash
and to know its source such as IPv4, IPv6, TCP, and UDP.

Signed-off-by: Akihiko Odaki 
---
 include/net/flow_dissector.h | 1 +
 net/core/flow_dissector.c| 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index ced79dc8e856..d01c1ec77b7d 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -423,6 +423,7 @@ __be32 flow_get_u32_src(const struct flow_keys *flow);
 __be32 flow_get_u32_dst(const struct flow_keys *flow);
 
 extern struct flow_dissector flow_keys_dissector;
+extern struct flow_dissector flow_keys_dissector_symmetric;
 extern struct flow_dissector flow_keys_basic_dissector;
 
 /* struct flow_keys_digest:
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 0e638a37aa09..9822988f2d49 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1852,7 +1852,8 @@ void make_flow_keys_digest(struct flow_keys_digest *digest,
 }
 EXPORT_SYMBOL(make_flow_keys_digest);
 
-static struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
+struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
+EXPORT_SYMBOL(flow_keys_dissector_symmetric);
 
 u32 __skb_get_hash_symmetric_net(const struct net *net, const struct sk_buff *skb)
 {

-- 
2.46.0
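
The "source" of a hash mentioned in this patch (IPv4, IPv6, TCP, UDP) corresponds to a virtio-net hash report type. The sketch below is a userspace illustration of the precedence used by virtio_net_hash_report() in patch 2/9: TCP first, then UDP, then the bare IP type. The constant names are shortened local stand-ins (an assumption of this sketch), with values taken from the virtio specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the enabled-hash-type bits. */
#define TYPE_IPV4  (1u << 0)
#define TYPE_TCPV4 (1u << 1)
#define TYPE_UDPV4 (1u << 2)
#define TYPE_IPV6  (1u << 3)
#define TYPE_TCPV6 (1u << 4)
#define TYPE_UDPV6 (1u << 5)
#define TYPE_ALL   0x3fu

/* Local stand-ins for the hash report values. */
enum {
	REPORT_NONE = 0, REPORT_IPV4, REPORT_TCPV4, REPORT_UDPV4,
	REPORT_IPV6, REPORT_TCPV6, REPORT_UDPV6
};

#define PROTO_TCP 6
#define PROTO_UDP 17

/* Return the most specific report type that is enabled in 'types'. */
static uint16_t hash_report_for(uint32_t types, bool is_ipv6, uint8_t ip_proto)
{
	assert(!(types & ~TYPE_ALL));

	if (ip_proto == PROTO_TCP &&
	    (types & (is_ipv6 ? TYPE_TCPV6 : TYPE_TCPV4)))
		return is_ipv6 ? REPORT_TCPV6 : REPORT_TCPV4;

	if (ip_proto == PROTO_UDP &&
	    (types & (is_ipv6 ? TYPE_UDPV6 : TYPE_UDPV4)))
		return is_ipv6 ? REPORT_UDPV6 : REPORT_UDPV4;

	if (types & (is_ipv6 ? TYPE_IPV6 : TYPE_IPV4))
		return is_ipv6 ? REPORT_IPV6 : REPORT_IPV4;

	return REPORT_NONE;
}
```

A TCP packet therefore falls back to the IP-only report when the TCP type is not enabled, and to "none" when neither is.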




[PATCH RFC v4 1/9] skbuff: Introduce SKB_EXT_TUN_VNET_HASH

2024-09-24 Thread Akihiko Odaki
This new extension will be used by tun to carry the hash values and
types to report with virtio-net headers.

Signed-off-by: Akihiko Odaki 
---
 include/linux/if_tun.h | 5 +
 include/linux/skbuff.h | 3 +++
 net/core/skbuff.c  | 4 
 3 files changed, 12 insertions(+)

diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 043d442994b0..47034aede329 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -9,6 +9,11 @@
 #include 
 #include 
 
+struct tun_vnet_hash_ext {
+   u32 value;
+   u16 report;
+};
+
 #define TUN_XDP_FLAG 0x1UL
 
 #define TUN_MSG_UBUF 1
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 29c3ea5b6e93..a361c4150144 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4718,6 +4718,9 @@ enum skb_ext_id {
 #endif
 #if IS_ENABLED(CONFIG_MCTP_FLOWS)
SKB_EXT_MCTP,
+#endif
+#if IS_ENABLED(CONFIG_TUN)
+   SKB_EXT_TUN_VNET_HASH,
 #endif
SKB_EXT_NUM, /* must be last */
 };
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 83f8cd8aa2d1..997d79d5612c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -60,6 +60,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -4979,6 +4980,9 @@ static const u8 skb_ext_type_len[] = {
 #if IS_ENABLED(CONFIG_MCTP_FLOWS)
[SKB_EXT_MCTP] = SKB_EXT_CHUNKSIZEOF(struct mctp_flow),
 #endif
+#if IS_ENABLED(CONFIG_TUN)
+   [SKB_EXT_TUN_VNET_HASH] = SKB_EXT_CHUNKSIZEOF(struct tun_vnet_hash_ext),
+#endif
 };
 
 static __always_inline unsigned int skb_ext_total_length(void)

-- 
2.46.0




[PATCH RFC v4 6/9] tun: Introduce virtio-net hash reporting feature

2024-09-24 Thread Akihiko Odaki
Allow the guest to reuse the hash value to make receive steering
consistent between the host and guest, and to save hash computation.

Signed-off-by: Akihiko Odaki 
---
 Documentation/networking/tuntap.rst |   7 +++
 drivers/net/Kconfig |   1 +
 drivers/net/tun.c   | 117 +++-
 include/uapi/linux/if_tun.h |  44 ++
 4 files changed, 155 insertions(+), 14 deletions(-)

diff --git a/Documentation/networking/tuntap.rst b/Documentation/networking/tuntap.rst
index 4d7087f727be..86b4ae8caa8a 100644
--- a/Documentation/networking/tuntap.rst
+++ b/Documentation/networking/tuntap.rst
@@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
   return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
   }
 
+3.4 Reference
+-------------
+
+``linux/if_tun.h`` defines the interface described below:
+
+.. kernel-doc:: include/uapi/linux/if_tun.h
+
 Universal TUN/TAP device driver Frequently Asked Question
 =========================================================
 
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 9920b3a68ed1..e2a7bd703550 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
tristate "Universal TUN/TAP device driver support"
depends on INET
select CRC32
+   select SKB_EXTENSIONS
help
  TUN/TAP provides packet reception and transmission for user space
  programs.  It can be viewed as a simple Point-to-Point or Ethernet
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9d93ab9ee58f..986e4a5bf04d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -173,6 +173,10 @@ struct tun_prog {
struct bpf_prog *prog;
 };
 
+struct tun_vnet_hash_container {
+   struct tun_vnet_hash common;
+};
+
 /* Since the socket were moved to tun_file, to preserve the behavior of persist
  * device, socket filter, sndbuf and vnet header size were restore when the
  * file were attached to a persist device.
@@ -210,6 +214,7 @@ struct tun_struct {
struct bpf_prog __rcu *xdp_prog;
struct tun_prog __rcu *steering_prog;
struct tun_prog __rcu *filter_prog;
+   struct tun_vnet_hash vnet_hash;
struct ethtool_link_ksettings link_ksettings;
/* init args */
struct file *file;
@@ -221,6 +226,11 @@ struct veth {
__be16 h_vlan_TCI;
 };
 
+static const struct tun_vnet_hash tun_vnet_hash_cap = {
+   .flags = TUN_VNET_HASH_REPORT,
+   .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
+};
+
 static void tun_flow_init(struct tun_struct *tun);
 static void tun_flow_uninit(struct tun_struct *tun);
 
@@ -322,10 +332,15 @@ static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
if (get_user(be, argp))
return -EFAULT;
 
-   if (be)
+   if (be) {
+   if (!(tun->flags & TUN_VNET_LE) &&
+   (tun->vnet_hash.flags & TUN_VNET_HASH_REPORT))
+   return -EBUSY;
+
tun->flags |= TUN_VNET_BE;
-   else
+   } else {
tun->flags &= ~TUN_VNET_BE;
+   }
 
return 0;
 }
@@ -524,12 +539,17 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
  */
 static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
 {
+   struct tun_vnet_hash_ext *ext;
+   struct flow_keys keys;
struct tun_flow_entry *e;
u32 txq, numqueues;
 
numqueues = READ_ONCE(tun->numqueues);
 
-   txq = __skb_get_hash_symmetric(skb);
+   memset(&keys, 0, sizeof(keys));
+   skb_flow_dissect(skb, &flow_keys_dissector_symmetric, &keys, 0);
+
+   txq = flow_hash_from_keys(&keys);
e = tun_flow_find(&tun->flows[tun_hashfn(txq)], txq);
if (e) {
tun_flow_save_rps_rxhash(e, txq);
@@ -538,6 +558,16 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
txq = reciprocal_scale(txq, numqueues);
}
 
+   if (tun->vnet_hash.flags & TUN_VNET_HASH_REPORT) {
+   ext = skb_ext_add(skb, SKB_EXT_TUN_VNET_HASH);
+   if (ext) {
+   u32 types = tun->vnet_hash.types;
+
+   ext->report = virtio_net_hash_report(types, keys.basic);
+   ext->value = skb->l4_hash ? skb->hash : txq;
+   }
+   }
+
return txq;
 }
 
@@ -2120,33 +2150,58 @@ static ssize_t tun_put_user(struct tun_struct *tun,
}
 
if (vnet_hdr_sz) {
-   struct virtio_net_hdr gso;
+   struct tun_vnet_hash_ext *ext;
+   size_t vnet_hdr_content_sz = sizeof(struct virtio_net_hdr);
+   union {
+   struct virtio_net_hdr hdr;
+   struct virtio_net_hdr_v1_hash hdr_v1_hash;

[PATCH RFC v4 4/9] tap: Pad virtio header with zero

2024-09-24 Thread Akihiko Odaki
tap used to simply advance iov_iter when it needed to pad the virtio
header, which left the garbage in the buffer as is. This is especially
problematic once tap allows enabling the hash reporting
feature; even if the feature is enabled, the packet may lack a hash
value and may contain a hole in the virtio header because the packet
arrived before the feature was enabled or does not contain the
header fields to be hashed. If the hole is not filled with zero, it is
impossible to tell if the packet lacks a hash value.

In theory, a user of tap can fill the buffer with zero before calling
read() to avoid such a problem, but leaving the garbage in the buffer is
awkward anyway, so fill the buffer in tap.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 77574f7a3bd4..ba044302ccc6 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -813,7 +813,7 @@ static ssize_t tap_put_user(struct tap_queue *q,
sizeof(vnet_hdr))
return -EFAULT;
 
-   iov_iter_advance(iter, vnet_hdr_len - sizeof(vnet_hdr));
+   iov_iter_zero(vnet_hdr_len - sizeof(vnet_hdr), iter);
}
total = vnet_hdr_len;
total += skb->len;

-- 
2.46.0
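
The effect of switching from iov_iter_advance() to iov_iter_zero() can be illustrated from the reader's side. The following is a userspace sketch, not kernel code: `struct vnet_hdr_hash` is a hypothetical local mirror of the 20-byte hash-reporting header, and the two `put_hdr_*()` helpers merely imitate "skip the padding" versus "zero the padding".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirror of the 20-byte hash-reporting header layout. */
struct vnet_hdr_hash {
	uint8_t  flags;
	uint8_t  gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
	uint16_t num_buffers;
	uint32_t hash_value;
	uint16_t hash_report;
	uint16_t padding;
};

#define BASE_HDR_LEN 10 /* size of the base header without extensions */

/* Old behaviour: copy the base header and skip the rest of the slot. */
static void put_hdr_advance(uint8_t *buf, const uint8_t *base)
{
	memcpy(buf, base, BASE_HDR_LEN);
}

/* New behaviour: copy the base header and zero the rest of the slot. */
static void put_hdr_zero(uint8_t *buf, size_t slot_len, const uint8_t *base)
{
	assert(slot_len >= BASE_HDR_LEN);
	memcpy(buf, base, BASE_HDR_LEN);
	memset(buf + BASE_HDR_LEN, 0, slot_len - BASE_HDR_LEN);
}

/* What a reader sees in the hash_report field of the slot. */
static uint16_t read_hash_report(const uint8_t *buf)
{
	struct vnet_hdr_hash hdr;

	memcpy(&hdr, buf, sizeof(hdr));
	return hdr.hash_report;
}
```

With a buffer that previously held other data, the "advance" variant leaves a stale hash_report behind, while the "zero" variant always yields 0, which a reader can interpret as "no hash".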




[PATCH RFC v4 5/9] tun: Pad virtio header with zero

2024-09-24 Thread Akihiko Odaki
tun used to simply advance iov_iter when it needed to pad the virtio
header, which left the garbage in the buffer as is. This is especially
problematic once tun allows enabling the hash reporting
feature; even if the feature is enabled, the packet may lack a hash
value and may contain a hole in the virtio header because the packet
arrived before the feature was enabled or does not contain the
header fields to be hashed. If the hole is not filled with zero, it is
impossible to tell if the packet lacks a hash value.

In theory, a user of tun can fill the buffer with zero before calling
read() to avoid such a problem, but leaving the garbage in the buffer is
awkward anyway, so fill the buffer in tun.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tun.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 1d06c560c5e6..9d93ab9ee58f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2073,7 +2073,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
if (unlikely(copy_to_iter(&gso, sizeof(gso), iter) !=
 sizeof(gso)))
return -EFAULT;
-   iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso));
+   iov_iter_zero(vnet_hdr_sz - sizeof(gso), iter);
}
 
ret = copy_to_iter(xdp_frame->data, size, iter) + vnet_hdr_sz;
@@ -2146,7 +2146,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
if (copy_to_iter(&gso, sizeof(gso), iter) != sizeof(gso))
return -EFAULT;
 
-   iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso));
+   iov_iter_zero(vnet_hdr_sz - sizeof(gso), iter);
}
 
if (vlan_hlen) {

-- 
2.46.0




[PATCH RFC v4 8/9] selftest: tun: Add tests for virtio-net hashing

2024-09-24 Thread Akihiko Odaki
The added tests confirm tun can perform RSS and hash reporting, and
reject invalid configurations for them.

Signed-off-by: Akihiko Odaki 
---
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/tun.c| 666 ++-
 2 files changed, 660 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 9d5aa817411b..8e2ab5068171 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -110,6 +110,6 @@ $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma
 $(OUTPUT)/tcp_mmap: LDLIBS += -lpthread -lcrypto
 $(OUTPUT)/tcp_inq: LDLIBS += -lpthread
 $(OUTPUT)/bind_bhash: LDLIBS += -lpthread
-$(OUTPUT)/io_uring_zerocopy_tx: CFLAGS += -I../../../include/
+$(OUTPUT)/io_uring_zerocopy_tx $(OUTPUT)/tun: CFLAGS += -I../../../include/
 
 include bpf.mk
diff --git a/tools/testing/selftests/net/tun.c b/tools/testing/selftests/net/tun.c
index fa83918b62d1..f46affa39d5c 100644
--- a/tools/testing/selftests/net/tun.c
+++ b/tools/testing/selftests/net/tun.c
@@ -2,21 +2,37 @@
 
 #define _GNU_SOURCE
 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
-#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
 #include 
+#include 
 #include 
 #include 
-#include 
-#include 
+#include 
+#include 
+#include 
+#include 
 
 #include "../kselftest_harness.h"
 
+#define TUN_HWADDR_SOURCE { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00 }
+#define TUN_HWADDR_DEST { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 }
+#define TUN_IPADDR_SOURCE htonl((172 << 24) | (17 << 16) | 0)
+#define TUN_IPADDR_DEST htonl((172 << 24) | (17 << 16) | 1)
+
 static int tun_attach(int fd, char *dev)
 {
struct ifreq ifr;
@@ -39,7 +55,7 @@ static int tun_detach(int fd, char *dev)
return ioctl(fd, TUNSETQUEUE, (void *) &ifr);
 }
 
-static int tun_alloc(char *dev)
+static int tun_alloc(char *dev, short flags)
 {
struct ifreq ifr;
int fd, err;
@@ -52,7 +68,8 @@ static int tun_alloc(char *dev)
 
memset(&ifr, 0, sizeof(ifr));
strcpy(ifr.ifr_name, dev);
-   ifr.ifr_flags = IFF_TAP | IFF_NAPI | IFF_MULTI_QUEUE;
+   ifr.ifr_flags = flags | IFF_TAP | IFF_NAPI | IFF_NO_PI |
+   IFF_MULTI_QUEUE;
 
err = ioctl(fd, TUNSETIFF, (void *) &ifr);
if (err < 0) {
@@ -64,6 +81,40 @@ static int tun_alloc(char *dev)
return fd;
 }
 
+static bool tun_add_to_bridge(int local_fd, const char *name)
+{
+   struct ifreq ifreq = {
+   .ifr_name = "xbridge",
+   .ifr_ifindex = if_nametoindex(name)
+   };
+
+   if (!ifreq.ifr_ifindex) {
+   perror("if_nametoindex");
+   return false;
+   }
+
+   if (ioctl(local_fd, SIOCBRADDIF, &ifreq)) {
+   perror("SIOCBRADDIF");
+   return false;
+   }
+
+   return true;
+}
+
+static bool tun_set_flags(int local_fd, const char *name, short flags)
+{
+   struct ifreq ifreq = { .ifr_flags = flags };
+
+   strcpy(ifreq.ifr_name, name);
+
+   if (ioctl(local_fd, SIOCSIFFLAGS, &ifreq)) {
+   perror("SIOCSIFFLAGS");
+   return false;
+   }
+
+   return true;
+}
+
 static int tun_delete(char *dev)
 {
struct {
@@ -102,6 +153,159 @@ static int tun_delete(char *dev)
return ret;
 }
 
+static uint32_t tun_sum(const void *buf, size_t len)
+{
+   const uint16_t *sbuf = buf;
+   uint32_t sum = 0;
+
+   while (len > 1) {
+   sum += *sbuf++;
+   len -= 2;
+   }
+
+   if (len)
+   sum += *(uint8_t *)sbuf;
+
+   return sum;
+}
+
+static uint16_t tun_build_ip_check(uint32_t sum)
+{
+   return ~((sum & 0xffff) + (sum >> 16));
+}
+
+static uint32_t tun_build_ip_pseudo_sum(const void *iphdr)
+{
+   uint16_t tot_len = ntohs(((struct iphdr *)iphdr)->tot_len);
+
+   return tun_sum((char *)iphdr + offsetof(struct iphdr, saddr), 8) +
+  htons(((struct iphdr *)iphdr)->protocol) +
+  htons(tot_len - sizeof(struct iphdr));
+}
+
+static uint32_t tun_build_ipv6_pseudo_sum(const void *ipv6hdr)
+{
+   return tun_sum((char *)ipv6hdr + offsetof(struct ipv6hdr, saddr), 32) +
+  ((struct ipv6hdr *)ipv6hdr)->payload_len +
+  htons(((struct ipv6hdr *)ipv6hdr)->nexthdr);
+}
+
+static void tun_build_ethhdr(struct ethhdr *ethhdr, uint16_t proto)
+{
+   *ethhdr = (struct ethhdr) {
+   .h_dest = TUN_HWADDR_DEST,
+   .h_source = TUN_HWADDR_SOURCE,
+   .h_proto = htons(proto)
+   };
+}
+
+static void tun_build_iphdr(void *dest, uint16_t len, uint8_t protocol)
+{
+   struct iphdr iphdr = {
+   .ihl = sizeof(iphdr) / 4,
+   .version = 

[PATCH RFC v4 7/9] tun: Introduce virtio-net RSS

2024-09-24 Thread Akihiko Odaki
RSS is a receive steering algorithm that can be negotiated for use with
virtio_net. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of the eBPF steering program.

Introduce the code to perform RSS in the kernel in order to overcome
these challenges. An alternative solution is to extend the eBPF steering
program so that it can report the hash to userspace, but I didn't
opt for it because the current mechanism of the eBPF steering
program relies on legacy context rewriting, and
introducing kfunc-based eBPF would result in a non-UAPI dependency while
the other relevant virtualization APIs such as KVM and vhost_net are
UAPIs.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tun.c   | 158 ++--
 include/uapi/linux/if_tun.h |  27 
 2 files changed, 163 insertions(+), 22 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 986e4a5bf04d..680eb4561a7f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -175,6 +175,9 @@ struct tun_prog {
 
 struct tun_vnet_hash_container {
struct tun_vnet_hash common;
+   struct tun_vnet_hash_rss rss;
+   __be32 rss_key[VIRTIO_NET_RSS_MAX_KEY_SIZE];
+   u16 rss_indirection_table[];
 };
 
 /* Since the socket were moved to tun_file, to preserve the behavior of persist
@@ -214,7 +217,7 @@ struct tun_struct {
struct bpf_prog __rcu *xdp_prog;
struct tun_prog __rcu *steering_prog;
struct tun_prog __rcu *filter_prog;
-   struct tun_vnet_hash vnet_hash;
+   struct tun_vnet_hash_container __rcu *vnet_hash;
struct ethtool_link_ksettings link_ksettings;
/* init args */
struct file *file;
@@ -227,7 +230,7 @@ struct veth {
 };
 
 static const struct tun_vnet_hash tun_vnet_hash_cap = {
-   .flags = TUN_VNET_HASH_REPORT,
+   .flags = TUN_VNET_HASH_REPORT | TUN_VNET_HASH_RSS,
.types = VIRTIO_NET_SUPPORTED_HASH_TYPES
 };
 
@@ -333,8 +336,10 @@ static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
return -EFAULT;
 
if (be) {
+   struct tun_vnet_hash_container *vnet_hash = rtnl_dereference(tun->vnet_hash);
+
if (!(tun->flags & TUN_VNET_LE) &&
-   (tun->vnet_hash.flags & TUN_VNET_HASH_REPORT))
+   vnet_hash && (vnet_hash->flags & TUN_VNET_HASH_REPORT))
return -EBUSY;
 
tun->flags |= TUN_VNET_BE;
@@ -537,7 +542,8 @@ static inline void tun_flow_save_rps_rxhash(struct tun_flow_entry *e, u32 hash)
  * the userspace application move between processors, we may get a
  * different rxq no. here.
  */
-static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
+static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb,
+  const struct tun_vnet_hash_container *vnet_hash)
 {
struct tun_vnet_hash_ext *ext;
struct flow_keys keys;
@@ -558,10 +564,10 @@ static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
txq = reciprocal_scale(txq, numqueues);
}
 
-   if (tun->vnet_hash.flags & TUN_VNET_HASH_REPORT) {
+   if (vnet_hash && (vnet_hash->common.flags & TUN_VNET_HASH_REPORT)) {
ext = skb_ext_add(skb, SKB_EXT_TUN_VNET_HASH);
if (ext) {
-   u32 types = tun->vnet_hash.types;
+   u32 types = vnet_hash->common.types;
 
ext->report = virtio_net_hash_report(types, keys.basic);
ext->value = skb->l4_hash ? skb->hash : txq;
@@ -588,6 +594,37 @@ static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
return ret % numqueues;
 }
 
+static u16 tun_vnet_rss_select_queue(struct tun_struct *tun,
+struct sk_buff *skb,
+const struct tun_vnet_hash_container *vnet_hash)
+{
+   struct tun_vnet_hash_ext *ext;
+   struct virtio_net_hash hash;
+   u32 numqueues = READ_ONCE(tun->numqueues);
+   u16 txq, index;
+
+   if (!numqueues)
+   return 0;
+
+   virtio_net_hash_rss(skb, vnet_hash->common.types, vnet_hash->rss_key, &hash);
+
+   if (!hash.report)
+   return vnet_hash->rss.unclassified_queue % numqueues;
+
+   if (vnet_hash->common.flags & TUN_VNET_HASH_REPORT) {
+   ext = skb_ext_add(skb, SKB_EXT_TUN_VNET_HASH);
+   if (ext) {
+   ext->value = hash.value;
+   ext->r

[PATCH RFC v4 9/9] vhost/net: Support VIRTIO_NET_F_HASH_REPORT

2024-09-24 Thread Akihiko Odaki
VIRTIO_NET_F_HASH_REPORT allows reporting hash values calculated on the
host. When VHOST_NET_F_VIRTIO_NET_HDR is employed, it will report no
hash values (i.e., the hash_report member is always set to
VIRTIO_NET_HASH_REPORT_NONE). Otherwise, the values reported by the
underlying socket will be reported.

VIRTIO_NET_F_HASH_REPORT requires VIRTIO_F_VERSION_1.

Signed-off-by: Akihiko Odaki 
---
 drivers/vhost/net.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f16279351db5..ec1167a782ec 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -73,6 +73,7 @@ enum {
VHOST_NET_FEATURES = VHOST_FEATURES |
 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+(1ULL << VIRTIO_NET_F_HASH_REPORT) |
 (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
 (1ULL << VIRTIO_F_RING_RESET)
 };
@@ -1604,10 +1605,13 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
size_t vhost_hlen, sock_hlen, hdr_len;
int i;
 
-   hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
-  (1ULL << VIRTIO_F_VERSION_1))) ?
-   sizeof(struct virtio_net_hdr_mrg_rxbuf) :
-   sizeof(struct virtio_net_hdr);
+   if (features & (1ULL << VIRTIO_NET_F_HASH_REPORT))
+   hdr_len = sizeof(struct virtio_net_hdr_v1_hash);
+   else if (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+(1ULL << VIRTIO_F_VERSION_1)))
+   hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+   else
+   hdr_len = sizeof(struct virtio_net_hdr);
if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
/* vhost provides vnet_hdr */
vhost_hlen = hdr_len;
@@ -1688,6 +1692,10 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
return -EFAULT;
if (features & ~VHOST_NET_FEATURES)
return -EOPNOTSUPP;
+   if ((features & ((1ULL << VIRTIO_F_VERSION_1) |
+(1ULL << VIRTIO_NET_F_HASH_REPORT))) ==
+   (1ULL << VIRTIO_NET_F_HASH_REPORT))
+   return -EINVAL;
return vhost_net_set_features(n, features);
case VHOST_GET_BACKEND_FEATURES:
features = VHOST_NET_BACKEND_FEATURES;

-- 
2.46.0
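
The two rules this patch introduces, header length selection and the requirement that VIRTIO_NET_F_HASH_REPORT comes with VIRTIO_F_VERSION_1, can be restated as a small userspace sketch. The function names `vnet_hdr_len()` and `features_valid()` are hypothetical; the feature bit numbers and the header sizes (10, 12 and 20 bytes) are the ones defined by the virtio specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define F_MRG_RXBUF   (1ULL << 15) /* VIRTIO_NET_F_MRG_RXBUF */
#define F_VERSION_1   (1ULL << 32) /* VIRTIO_F_VERSION_1 */
#define F_HASH_REPORT (1ULL << 57) /* VIRTIO_NET_F_HASH_REPORT */

/* Hash reporting is only meaningful together with VERSION_1. */
static bool features_valid(uint64_t features)
{
	return !(features & F_HASH_REPORT) || !!(features & F_VERSION_1);
}

/* Size of the virtio-net header implied by the negotiated features. */
static size_t vnet_hdr_len(uint64_t features)
{
	assert(features_valid(features));

	if (features & F_HASH_REPORT)
		return 20; /* header with hash_value and hash_report */
	if (features & (F_MRG_RXBUF | F_VERSION_1))
		return 12; /* header with num_buffers */
	return 10;         /* legacy header */
}
```

Checking validity before computing the length mirrors the order in the patch, where the ioctl rejects the invalid combination with -EINVAL before vhost_net_set_features() runs.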




Re: [PATCH RFC v4 7/9] tun: Introduce virtio-net RSS

2024-09-26 Thread Akihiko Odaki

On 2024/09/24 22:05, Simon Horman wrote:

On Tue, Sep 24, 2024 at 11:01:12AM +0200, Akihiko Odaki wrote:

RSS is a receive steering algorithm that can be negotiated for use with
virtio_net. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of the eBPF steering program.

Introduce the code to perform RSS in the kernel in order to overcome
these challenges. An alternative solution is to extend the eBPF steering
program so that it can report the hash to userspace, but I didn't
opt for it because the current mechanism of the eBPF steering
program relies on legacy context rewriting, and
introducing kfunc-based eBPF would result in a non-UAPI dependency while
the other relevant virtualization APIs such as KVM and vhost_net are
UAPIs.

Signed-off-by: Akihiko Odaki 


...


diff --git a/drivers/net/tun.c b/drivers/net/tun.c


...


@@ -333,8 +336,10 @@ static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
return -EFAULT;
  
  	if (be) {

+   struct tun_vnet_hash_container *vnet_hash = rtnl_dereference(tun->vnet_hash);
+
if (!(tun->flags & TUN_VNET_LE) &&
-   (tun->vnet_hash.flags & TUN_VNET_HASH_REPORT))
+   vnet_hash && (vnet_hash->flags & TUN_VNET_HASH_REPORT))


Hi Odaki-san,

I am wondering if the above should be vnet_hash->common.flags?
I am seeing this:

../drivers/net/tun.c:342:44: error: ‘struct tun_vnet_hash_container’ has no member named ‘flags’
   342 | vnet_hash && (vnet_hash->flags & TUN_VNET_HASH_REPORT))

...


You are right. I didn't notice this error because I was testing 
without CONFIG_TUN_VNET_CROSS_LE; I'll test with that configuration and 
submit a new version with the fix.


Regards,
Akihiko Odaki



Re: [PATCH RFC v4 0/9] tun: Introduce virtio-net hashing feature

2024-09-26 Thread Akihiko Odaki

On 2024/09/25 12:30, Jason Wang wrote:

On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki  wrote:


virtio-net has two uses for hashes: one is RSS and the other is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Introduce the code to compute hashes in the kernel in order to overcome
these challenges.

An alternative solution is to extend the eBPF steering program so that it
can report the hash to userspace, but it is based on context
rewrites, which are in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt for an ioctl to align with other relevant UAPIs (KVM
and vhost_net).



I wonder if we could clone the skb and reuse some fields to store the hash,
so that the steering eBPF program can access these fields without
introducing full RSS in the kernel?


I don't get how cloning the skb can solve the issue.

We can certainly implement the Toeplitz function in the kernel or even with 
tc-bpf to store a hash value that can be used for the eBPF steering program 
and virtio hash reporting. However, we don't have a means of storing a 
hash type, which is specific to virtio hash reporting and lacks a 
corresponding skb field.
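
As an illustration of the Toeplitz function mentioned above, here is a minimal byte-oriented userspace sketch in C. It is a textbook formulation and not the word-oriented virtio_net_toeplitz_calc() from patch 2/9; the name `toeplitz_hash()` and its signature are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toeplitz hash: for every set bit of the input (most significant bit first),
 * XOR in the 32-bit window of the key that starts at that bit position. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
			      const uint8_t *input, size_t input_len)
{
	uint32_t hash = 0;
	uint32_t window;
	size_t next = 4; /* index of the next key byte to shift into the window */

	/* The window slides one bit per input bit and needs 32 bits of lookahead. */
	assert(key_len >= input_len + 4);

	window = (uint32_t)key[0] << 24 | (uint32_t)key[1] << 16 |
		 (uint32_t)key[2] << 8 | (uint32_t)key[3];

	for (size_t i = 0; i < input_len; i++) {
		uint8_t key_byte = key[next++];

		for (int bit = 7; bit >= 0; bit--) {
			if (input[i] & (1u << bit))
				hash ^= window;

			/* Slide the window left by one bit of the key. */
			window = (window << 1) | ((key_byte >> bit) & 1u);
		}
	}

	return hash;
}
```

Because the function is linear over XOR, the hash of an input with only its first bit set is simply the first 32 bits of the key. The function returns only the hash value, which is the point made above: the hash type still needs somewhere to live.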


Regards,
Akihiko Odaki



Re: [PATCH RFC v4 0/9] tun: Introduce virtio-net hashing feature

2024-09-27 Thread Akihiko Odaki

On 2024/09/27 13:31, Jason Wang wrote:

On Fri, Sep 27, 2024 at 10:11 AM Akihiko Odaki  wrote:


On 2024/09/25 12:30, Jason Wang wrote:

On Tue, Sep 24, 2024 at 5:01 PM Akihiko Odaki  wrote:


virtio-net has two uses for hashes: one is RSS and the other is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Introduce the code to compute hashes in the kernel in order to overcome
these challenges.

An alternative solution is to extend the eBPF steering program so that it
can report the hash to userspace, but it is based on context
rewrites, which are in feature freeze. We can adopt kfuncs, but they will
not be UAPIs. We opt for an ioctl to align with other relevant UAPIs (KVM
and vhost_net).



I wonder if we could clone the skb and reuse some to store the hash,
then the steering eBPF program can access these fields without
introducing full RSS in the kernel?


I don't get how cloning the skb can solve the issue.

We can certainly implement Toeplitz function in the kernel or even with
tc-bpf to store a hash value that can be used for eBPF steering program
and virtio hash reporting. However we don't have a means of storing a
hash type, which is specific to virtio hash reporting and lacks a
corresponding skb field.


I may be missing something, but looking at sk_filter_is_valid_access(), it
looks to me like we can make use of skb->cb[0..4]?


I opted not to use cb. Below is the rationale:

cb is meant for tail calls, so using it here would mean reusing the field
for a different purpose. The context rewrite allows adding a field without
increasing the size of the underlying storage (the real sk_buff), so we
should add a new field instead of reusing an existing one to avoid
confusion.

However, we are no longer allowed to add a new field. In my understanding,
this is because it is a UAPI, and the eBPF maintainers found it difficult
to maintain its stability.

Reusing cb for hash reporting is a workaround to avoid adding a new field,
but it does not solve the underlying problem (i.e., keeping eBPF as stable
as a UAPI is unreasonably hard). In my opinion, adding an ioctl is a
reasonable option to keep the API as stable as other virtualization UAPIs
while respecting the underlying intention of the context rewrite feature
freeze.


Regards,
Akihiko Odaki



Re: [PATCH RFC v4 0/9] tun: Introduce virtio-net hashing feature

2024-09-29 Thread Akihiko Odaki

On 2024/09/29 11:07, Jason Wang wrote:

Fair enough.

Btw, I remember DPDK implements tuntap RSS via eBPF as well (probably
via cls or something similar). It might be worth checking whether we are
missing anything here.


Thanks for the information. I wonder why they used cls instead of a
steering program. Perhaps it is for compatibility with macvtap and
ipvtap, which don't support steering programs.

Their RSS implementation looks cleaner, so I will improve my RSS
implementation accordingly.


Regards,
Akihiko Odaki



Re: [PATCH RFC v4 0/9] tun: Introduce virtio-net hashing feature

2024-09-30 Thread Akihiko Odaki

On 2024/09/30 0:33, Stephen Hemminger wrote:

DPDK needs to support flow rules. The specific case is where packets
are classified by a flow, then RSS is done across a subset of the queues.
The support for flow in TUN driver is more academic than useful,
I fixed it for current BPF, but doubt anyone is using it really.

A full steering program would be good, but would require much more
complexity to take a general set of flow rules then communicate that
to the steering program.



It reminded me of RSS contexts and flow filters. Some physical NICs
support using a dedicated RSS context for packets matched by a flow
filter, and virtio is also gaining corresponding features.


RSS context: https://github.com/oasis-tcs/virtio-spec/issues/178
Flow filter: https://github.com/oasis-tcs/virtio-spec/issues/179

I considered supporting these features with tc instead of adding ioctls
to tuntap, but that does not seem appropriate for the virtualization use
case.

In a virtualization use case, tuntap is configured according to requests
from guests, and the code processing these requests needs to have minimal
permissions for security. This goal is achieved by passing a file
descriptor that represents a tuntap from a privileged process (e.g.,
libvirt) to the process handling guest requests (e.g., QEMU).
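
One common way to hand an already-open file descriptor to a less
privileged process is an SCM_RIGHTS control message over a Unix domain
socket. The sketch below illustrates that mechanism only; whether a given
libvirt/QEMU setup uses this path or plain descriptor inheritance is an
assumption here, and send_fd()/recv_fd() are hypothetical helper names.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send one file descriptor over a connected Unix domain socket.
 * Returns 0 on success, -1 on failure. */
int send_fd(int sock, int fd)
{
	char dummy = 0;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces suitable alignment */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;	/* at least one byte of payload is needed */
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns the new descriptor or -1. */
int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver can then issue ioctls on the descriptor it was given without
holding any privilege over other network devices, which is the property
the ioctl-based interface relies on.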


However, tc is configured with rtnetlink, which does not seem to have an 
interface to delegate a permission for one particular device to another 
process.


For now I'll continue working on the current approach that is based on 
ioctl and lacks RSS context and flow filter features. Eventually they 
are also likely to require new ioctls if they are to be supported with 
vhost_net.


Regards,
Akihiko Odaki



Re: [PATCH RFC v3 2/9] virtio_net: Add functions for hashing

2024-09-19 Thread Akihiko Odaki

On 2024/09/16 10:01, gur.st...@huawei.com wrote:

+
+static inline void virtio_net_toeplitz(struct virtio_net_toeplitz_state *state,
+  const __be32 *input, size_t len)

The function calculates a hash value but its name does not make it
clear. Consider adding a 'calc'.

+{
+   u32 key;
+
+   while (len) {
+   state->key++;
+   key = be32_to_cpu(*state->key);

You perform be32_to_cpu to support both CPU endiannesses.
If you follow it with an unconditional swab32, you could run the
following loop over a more natural 0 to 31, always referring to bit 0
and avoiding !!(key & bit):

	key = swab32(be32_to_cpu(*state->key));
	for (i = 0; i < 32; i++, key >>= 1) {
		if (be32_to_cpu(*input) & 1)
			state->hash ^= state->key_buffer;
		state->key_buffer = (state->key_buffer << 1) | (key & 1);
	}



Correcting myself: in the previous version 'input' was tested against the
same bit. The advantage is less clear now, since it replaces !! with an
extra shift. However, since little-endian CPUs are more common, the
combination swab32(be32_to_cpu(x)) will actually become a nop.
A similar tactic may be applied to 'input' by assigning it to a local
variable. This may produce a more efficient version, but not necessarily
one that is easier to understand.

	key = bswap32(be32_to_cpu(*state->key));
	for (u32 bit = BIT(31); bit; bit >>= 1, key >>= 1) {
		if (be32_to_cpu(*input) & bit)
			state->hash ^= state->key_buffer;
		state->key_buffer =
			(state->key_buffer << 1) | (key & 1);
	}


This unfortunately does not work. swab32() works at *byte*-level but we 
need to reverse the order of *bits*. bitrev32() is what we need, but it 
cannot eliminate be32_to_cpu().
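
The difference between the two operations can be illustrated with a small
userspace sketch. byte_swap32() and bit_reverse32() below are hypothetical
stand-ins written for this example; they are meant to behave like the
kernel's swab32() and bitrev32() respectively, not to reproduce their
implementations.

```c
#include <stdint.h>

/* Reverse the order of the four bytes; bits inside each byte keep their
 * order. This is the byte-level operation. */
uint32_t byte_swap32(uint32_t x)
{
	return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
	       ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Reverse the order of all 32 bits. This is the bit-level operation. */
uint32_t bit_reverse32(uint32_t x)
{
	uint32_t r = 0;

	for (int i = 0; i < 32; i++) {
		r = (r << 1) | (x & 1);
		x >>= 1;
	}

	return r;
}
```

For the input 0x00000001, byte_swap32() yields 0x01000000 while
bit_reverse32() yields 0x80000000. Only the latter moves the most
significant bit to position 0, which is what a loop that always tests
bit 0 would need.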


Regards,
Akihiko Odaki



[PATCH net-next v5 0/7] tun: Unify vnet implementation

2025-02-04 Thread Akihiko Odaki
When I implemented virtio's hash-related features for tun/tap [1],
I found that tun/tap does not fill the entire region reserved for the
virtio header, leaving an uninitialized hole in the middle of the buffer
after read()/recvmsg().

This series fills the uninitialized hole. More concretely, the
num_buffers field will be initialized with 1, and the other fields will
be initialized with 0. Setting the num_buffers field to 1 is mandated by
virtio 1.0 [2].
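
The intended buffer contents can be sketched in userspace C as below. This
is an illustration, not the kernel change: vnet_hdr_region_init() and the
two macros are hypothetical names, the offsets assume the layout of struct
virtio_net_hdr_v1 (ten bytes of legacy fields followed by a 16-bit
num_buffers), and the header is assumed to be little-endian.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of struct virtio_net_hdr_v1: flags (1 byte), gso_type (1),
 * hdr_len (2), gso_size (2), csum_start (2), csum_offset (2),
 * num_buffers (2). */
#define VNET_HDR_V1_SIZE	12
#define VNET_NUM_BUFFERS_OFF	10

/*
 * Hypothetical helper: zero the whole region reserved for the virtio
 * header and set num_buffers to 1 (little-endian).
 * Returns 0 on success, -1 if the region cannot hold num_buffers.
 */
int vnet_hdr_region_init(uint8_t *buf, size_t region_sz)
{
	if (region_sz < VNET_HDR_V1_SIZE)
		return -1;

	/* No byte of the reserved region is left uninitialized. */
	memset(buf, 0, region_sz);

	/* num_buffers = 1, stored least significant byte first. */
	buf[VNET_NUM_BUFFERS_OFF] = 1;
	buf[VNET_NUM_BUFFERS_OFF + 1] = 0;

	return 0;
}
```

A device configured for a big-endian header would need the two
num_buffers bytes in the opposite order; that case is left out of the
sketch.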

The change to virtio header is preceded by another change that refactors
tun and tap to unify their virtio-related code.

[1]: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df0...@daynix.com
[2]: https://lore.kernel.org/r/20241227084256-mutt-send-email-...@kernel.org/

Signed-off-by: Akihiko Odaki 
---
Changes in v5:
- s/vnet_hdr_len_sz/vnet_hdr_sz/ for patch "tun: Decouple vnet handling"
  (Willem de Bruijn)
- Changed to inline vnet implementations to TUN and TAP.
- Dropped patch "tun: Avoid double-tracking iov_iter length changes" and
  "tap: Avoid double-tracking iov_iter length changes".
- Link to v4: https://lore.kernel.org/r/20250120-tun-v4-0-ee81dda03...@daynix.com

Changes in v4:
- s/sz/vnet_hdr_len_sz/ for patch "tun: Decouple vnet handling"
  (Willem de Bruijn)
- Reverted to add CONFIG_TUN_VNET.
- Link to v3: https://lore.kernel.org/r/20250116-tun-v3-0-c6b2871e9...@daynix.com

Changes in v3:
- Dropped changes to fill the vnet header.
- Split patch "tun: Unify vnet implementation".
- Reverted spurious changes in patch "tun: Unify vnet implementation".
- Merged tun_vnet.c into TAP.
- Link to v2: https://lore.kernel.org/r/20250109-tun-v2-0-388d7d5a2...@daynix.com

Changes in v2:
- Fixed num_buffers endian.
- Link to v1: https://lore.kernel.org/r/20250108-tun-v1-0-67d784b34...@daynix.com

---
Akihiko Odaki (7):
  tun: Refactor CONFIG_TUN_VNET_CROSS_LE
  tun: Keep hdr_len in tun_get_user()
  tun: Decouple vnet from tun_struct
  tun: Decouple vnet handling
  tun: Extract the vnet handling code
  tap: Keep hdr_len in tap_get_user()
  tap: Use tun's vnet-related code

 MAINTAINERS|   2 +-
 drivers/net/tap.c  | 168 ++
 drivers/net/tun.c  | 193 ++---
 drivers/net/tun_vnet.h | 184 ++
 4 files changed, 231 insertions(+), 316 deletions(-)
---
base-commit: a32e14f8aef69b42826cf0998b068a43d486a9e9
change-id: 20241230-tun-66e10a49b0c7

Best regards,
-- 
Akihiko Odaki 




[PATCH net-next v5 3/7] tun: Decouple vnet from tun_struct

2025-02-04 Thread Akihiko Odaki
Decouple vnet-related functions from tun_struct so that we can reuse
them for tap in the future.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Willem de Bruijn 
---
 drivers/net/tun.c | 51 ++-
 1 file changed, 26 insertions(+), 25 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9d4aabc3b63c8f9baab82d7ab2bba567e9075484..8ddd4b352f0307e52cdff75254b5ac1d259d51f8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -298,16 +298,16 @@ static bool tun_napi_frags_enabled(const struct tun_file *tfile)
return tfile->napi_frags_enabled;
 }
 
-static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
+static inline bool tun_legacy_is_little_endian(unsigned int flags)
 {
return !(IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
-(tun->flags & TUN_VNET_BE)) &&
+(flags & TUN_VNET_BE)) &&
virtio_legacy_is_little_endian();
 }
 
-static long tun_get_vnet_be(struct tun_struct *tun, int __user *argp)
+static long tun_get_vnet_be(unsigned int flags, int __user *argp)
 {
-   int be = !!(tun->flags & TUN_VNET_BE);
+   int be = !!(flags & TUN_VNET_BE);
 
if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
return -EINVAL;
@@ -318,7 +318,7 @@ static long tun_get_vnet_be(struct tun_struct *tun, int __user *argp)
return 0;
 }
 
-static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
+static long tun_set_vnet_be(unsigned int *flags, int __user *argp)
 {
int be;
 
@@ -329,27 +329,26 @@ static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
return -EFAULT;
 
if (be)
-   tun->flags |= TUN_VNET_BE;
+   *flags |= TUN_VNET_BE;
else
-   tun->flags &= ~TUN_VNET_BE;
+   *flags &= ~TUN_VNET_BE;
 
return 0;
 }
 
-static inline bool tun_is_little_endian(struct tun_struct *tun)
+static inline bool tun_is_little_endian(unsigned int flags)
 {
-   return tun->flags & TUN_VNET_LE ||
-   tun_legacy_is_little_endian(tun);
+   return flags & TUN_VNET_LE || tun_legacy_is_little_endian(flags);
 }
 
-static inline u16 tun16_to_cpu(struct tun_struct *tun, __virtio16 val)
+static inline u16 tun16_to_cpu(unsigned int flags, __virtio16 val)
 {
-   return __virtio16_to_cpu(tun_is_little_endian(tun), val);
+   return __virtio16_to_cpu(tun_is_little_endian(flags), val);
 }
 
-static inline __virtio16 cpu_to_tun16(struct tun_struct *tun, u16 val)
+static inline __virtio16 cpu_to_tun16(unsigned int flags, u16 val)
 {
-   return __cpu_to_virtio16(tun_is_little_endian(tun), val);
+   return __cpu_to_virtio16(tun_is_little_endian(flags), val);
 }
 
 static inline u32 tun_hashfn(u32 rxhash)
@@ -1765,6 +1764,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 
if (tun->flags & IFF_VNET_HDR) {
int vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
+   int flags = tun->flags;
 
if (len < vnet_hdr_sz)
return -EINVAL;
@@ -1773,11 +1773,11 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
if (!copy_from_iter_full(&gso, sizeof(gso), from))
return -EFAULT;
 
-   hdr_len = tun16_to_cpu(tun, gso.hdr_len);
+   hdr_len = tun16_to_cpu(flags, gso.hdr_len);
 
if (gso.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-   hdr_len = max(tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2, hdr_len);
-   gso.hdr_len = cpu_to_tun16(tun, hdr_len);
+   hdr_len = max(tun16_to_cpu(flags, gso.csum_start) + tun16_to_cpu(flags, gso.csum_offset) + 2, hdr_len);
+   gso.hdr_len = cpu_to_tun16(flags, hdr_len);
+   gso.hdr_len = cpu_to_tun16(flags, hdr_len);
}
 
if (hdr_len > len)
@@ -1856,7 +1856,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
}
}
 
-   if (virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun))) {
+   if (virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun->flags))) {
atomic_long_inc(&tun->rx_frame_errors);
err = -EINVAL;
goto free_skb;
@@ -2110,23 +2110,24 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 
if (vnet_hdr_sz) {
struct virtio_net_hdr gso;
+   int flags = tun->flags;
 
if (iov_iter_count(iter) < vnet_hdr_sz)
return -EINVAL;
 
if (virtio_net_hdr_from_skb(skb, &gso,
-   tun_is_little_endian(tun), true,
+   tun_is_little_end

[PATCH net-next v5 2/7] tun: Keep hdr_len in tun_get_user()

2025-02-04 Thread Akihiko Odaki
hdr_len is repeatedly used so keep it in a local variable.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Willem de Bruijn 
---
 drivers/net/tun.c | 24 +++-
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 452fc5104260fe7ff5fdd5cedc5d2647cbe35c79..9d4aabc3b63c8f9baab82d7ab2bba567e9075484 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1746,6 +1746,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
struct virtio_net_hdr gso = { 0 };
int good_linear;
int copylen;
+   int hdr_len = 0;
bool zerocopy = false;
int err;
u32 rxhash = 0;
@@ -1772,19 +1773,21 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
if (!copy_from_iter_full(&gso, sizeof(gso), from))
return -EFAULT;
 
-   if ((gso.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-   tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2 > tun16_to_cpu(tun, gso.hdr_len))
-   gso.hdr_len = cpu_to_tun16(tun, tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2);
+   hdr_len = tun16_to_cpu(tun, gso.hdr_len);
 
-   if (tun16_to_cpu(tun, gso.hdr_len) > len)
+   if (gso.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+   hdr_len = max(tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2, hdr_len);
+   gso.hdr_len = cpu_to_tun16(tun, hdr_len);
+   }
+
+   if (hdr_len > len)
return -EINVAL;
iov_iter_advance(from, vnet_hdr_sz - sizeof(gso));
}
 
if ((tun->flags & TUN_TYPE_MASK) == IFF_TAP) {
align += NET_IP_ALIGN;
-   if (unlikely(len < ETH_HLEN ||
-(gso.hdr_len && tun16_to_cpu(tun, gso.hdr_len) < ETH_HLEN)))
+   if (unlikely(len < ETH_HLEN || (hdr_len && hdr_len < ETH_HLEN)))
return -EINVAL;
}
 
@@ -1797,9 +1800,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 * enough room for skb expand head in case it is used.
 * The rest of the buffer is mapped from userspace.
 */
-   copylen = gso.hdr_len ? tun16_to_cpu(tun, gso.hdr_len) : GOODCOPY_LEN;
-   if (copylen > good_linear)
-   copylen = good_linear;
+   copylen = min(hdr_len ? hdr_len : GOODCOPY_LEN, good_linear);
linear = copylen;
iov_iter_advance(&i, copylen);
if (iov_iter_npages(&i, INT_MAX) <= MAX_SKB_FRAGS)
@@ -1820,10 +1821,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
} else {
if (!zerocopy) {
copylen = len;
-   if (tun16_to_cpu(tun, gso.hdr_len) > good_linear)
-   linear = good_linear;
-   else
-   linear = tun16_to_cpu(tun, gso.hdr_len);
+   linear = min(hdr_len, good_linear);
}
 
if (frags) {

-- 
2.48.1




[PATCH net-next v5 1/7] tun: Refactor CONFIG_TUN_VNET_CROSS_LE

2025-02-04 Thread Akihiko Odaki
Check IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) to save some lines and make
future changes easier.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Willem de Bruijn 
---
 drivers/net/tun.c | 26 --
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e816aaba8e5f2ed06f8832f79553b6c976e75bb8..452fc5104260fe7ff5fdd5cedc5d2647cbe35c79 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -298,10 +298,10 @@ static bool tun_napi_frags_enabled(const struct tun_file *tfile)
return tfile->napi_frags_enabled;
 }
 
-#ifdef CONFIG_TUN_VNET_CROSS_LE
 static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
 {
-   return tun->flags & TUN_VNET_BE ? false :
+   return !(IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
+(tun->flags & TUN_VNET_BE)) &&
virtio_legacy_is_little_endian();
 }
 
@@ -309,6 +309,9 @@ static long tun_get_vnet_be(struct tun_struct *tun, int __user *argp)
 {
int be = !!(tun->flags & TUN_VNET_BE);
 
+   if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
+   return -EINVAL;
+
if (put_user(be, argp))
return -EFAULT;
 
@@ -319,6 +322,9 @@ static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
 {
int be;
 
+   if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
+   return -EINVAL;
+
if (get_user(be, argp))
return -EFAULT;
 
@@ -329,22 +335,6 @@ static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
 
return 0;
 }
-#else
-static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
-{
-   return virtio_legacy_is_little_endian();
-}
-
-static long tun_get_vnet_be(struct tun_struct *tun, int __user *argp)
-{
-   return -EINVAL;
-}
-
-static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
-{
-   return -EINVAL;
-}
-#endif /* CONFIG_TUN_VNET_CROSS_LE */
 
 static inline bool tun_is_little_endian(struct tun_struct *tun)
 {

-- 
2.48.1




[PATCH net-next v5 5/7] tun: Extract the vnet handling code

2025-02-04 Thread Akihiko Odaki
The vnet handling code will be reused by tap.

Signed-off-by: Akihiko Odaki 
---
 MAINTAINERS|   2 +-
 drivers/net/tun.c  | 179 +--
 drivers/net/tun_vnet.h | 184 +
 3 files changed, 187 insertions(+), 178 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 910305c11e8a882da5b49ce5bd55011b93f28c32..bc32b7e23c79ab80b19c8207f14c5e51a47ec89f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23902,7 +23902,7 @@ W:  http://vtun.sourceforge.net/tun
 F: Documentation/networking/tuntap.rst
 F: arch/um/os-Linux/drivers/
 F: drivers/net/tap.c
-F: drivers/net/tun.c
+F: drivers/net/tun*
 
 TURBOCHANNEL SUBSYSTEM
 M: "Maciej W. Rozycki" 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 5bd1c21032ed673ba8e39dd5a488cce11599855b..b14231a743915c2851eaae49d757b763ec4a8841 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -83,6 +83,8 @@
 #include 
 #include 
 
+#include "tun_vnet.h"
+
 static void tun_default_link_ksettings(struct net_device *dev,
   struct ethtool_link_ksettings *cmd);
 
@@ -94,9 +96,6 @@ static void tun_default_link_ksettings(struct net_device *dev,
  * overload it to mean fasync when stored there.
  */
 #define TUN_FASYNC IFF_ATTACH_QUEUE
-/* High bits in flags field are unused. */
-#define TUN_VNET_LE 0x8000
-#define TUN_VNET_BE 0x4000
 
 #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
  IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)
@@ -298,180 +297,6 @@ static bool tun_napi_frags_enabled(const struct tun_file *tfile)
return tfile->napi_frags_enabled;
 }
 
-static inline bool tun_legacy_is_little_endian(unsigned int flags)
-{
-   return !(IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
-(flags & TUN_VNET_BE)) &&
-   virtio_legacy_is_little_endian();
-}
-
-static long tun_get_vnet_be(unsigned int flags, int __user *argp)
-{
-   int be = !!(flags & TUN_VNET_BE);
-
-   if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
-   return -EINVAL;
-
-   if (put_user(be, argp))
-   return -EFAULT;
-
-   return 0;
-}
-
-static long tun_set_vnet_be(unsigned int *flags, int __user *argp)
-{
-   int be;
-
-   if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
-   return -EINVAL;
-
-   if (get_user(be, argp))
-   return -EFAULT;
-
-   if (be)
-   *flags |= TUN_VNET_BE;
-   else
-   *flags &= ~TUN_VNET_BE;
-
-   return 0;
-}
-
-static inline bool tun_is_little_endian(unsigned int flags)
-{
-   return flags & TUN_VNET_LE || tun_legacy_is_little_endian(flags);
-}
-
-static inline u16 tun16_to_cpu(unsigned int flags, __virtio16 val)
-{
-   return __virtio16_to_cpu(tun_is_little_endian(flags), val);
-}
-
-static inline __virtio16 cpu_to_tun16(unsigned int flags, u16 val)
-{
-   return __cpu_to_virtio16(tun_is_little_endian(flags), val);
-}
-
-static long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
-  unsigned int cmd, int __user *sp)
-{
-   int s;
-
-   switch (cmd) {
-   case TUNGETVNETHDRSZ:
-   s = *vnet_hdr_sz;
-   if (put_user(s, sp))
-   return -EFAULT;
-   return 0;
-
-   case TUNSETVNETHDRSZ:
-   if (get_user(s, sp))
-   return -EFAULT;
-   if (s < (int)sizeof(struct virtio_net_hdr))
-   return -EINVAL;
-
-   *vnet_hdr_sz = s;
-   return 0;
-
-   case TUNGETVNETLE:
-   s = !!(*flags & TUN_VNET_LE);
-   if (put_user(s, sp))
-   return -EFAULT;
-   return 0;
-
-   case TUNSETVNETLE:
-   if (get_user(s, sp))
-   return -EFAULT;
-   if (s)
-   *flags |= TUN_VNET_LE;
-   else
-   *flags &= ~TUN_VNET_LE;
-   return 0;
-
-   case TUNGETVNETBE:
-   return tun_get_vnet_be(*flags, sp);
-
-   case TUNSETVNETBE:
-   return tun_set_vnet_be(flags, sp);
-
-   default:
-   return -EINVAL;
-   }
-}
-
-static int tun_vnet_hdr_get(int sz, unsigned int flags, struct iov_iter *from,
-   struct virtio_net_hdr *hdr)
-{
-   u16 hdr_len;
-
-   if (iov_iter_count(from) < sz)
-   return -EINVAL;
-
-   if (!copy_from_iter_full(hdr, sizeof(*hdr), from))
-   return -EFAULT;
-
-   hdr_len = tun16_to_cpu(flags, hdr->hdr_len);
-
-   if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-   hdr_len = max(tun16_to_cpu(flags, hdr->csum_start) + 
tun16_to_cpu(flags, hdr->csum_offset) + 2, hdr_len)

[PATCH net-next v5 4/7] tun: Decouple vnet handling

2025-02-04 Thread Akihiko Odaki
Decouple the vnet handling code so that we can reuse it for tap.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Willem de Bruijn 
---
 drivers/net/tun.c | 237 --
 1 file changed, 139 insertions(+), 98 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 8ddd4b352f0307e52cdff75254b5ac1d259d51f8..5bd1c21032ed673ba8e39dd5a488cce11599855b 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -351,6 +351,127 @@ static inline __virtio16 cpu_to_tun16(unsigned int flags, u16 val)
return __cpu_to_virtio16(tun_is_little_endian(flags), val);
 }
 
+static long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
+  unsigned int cmd, int __user *sp)
+{
+   int s;
+
+   switch (cmd) {
+   case TUNGETVNETHDRSZ:
+   s = *vnet_hdr_sz;
+   if (put_user(s, sp))
+   return -EFAULT;
+   return 0;
+
+   case TUNSETVNETHDRSZ:
+   if (get_user(s, sp))
+   return -EFAULT;
+   if (s < (int)sizeof(struct virtio_net_hdr))
+   return -EINVAL;
+
+   *vnet_hdr_sz = s;
+   return 0;
+
+   case TUNGETVNETLE:
+   s = !!(*flags & TUN_VNET_LE);
+   if (put_user(s, sp))
+   return -EFAULT;
+   return 0;
+
+   case TUNSETVNETLE:
+   if (get_user(s, sp))
+   return -EFAULT;
+   if (s)
+   *flags |= TUN_VNET_LE;
+   else
+   *flags &= ~TUN_VNET_LE;
+   return 0;
+
+   case TUNGETVNETBE:
+   return tun_get_vnet_be(*flags, sp);
+
+   case TUNSETVNETBE:
+   return tun_set_vnet_be(flags, sp);
+
+   default:
+   return -EINVAL;
+   }
+}
+
+static int tun_vnet_hdr_get(int sz, unsigned int flags, struct iov_iter *from,
+   struct virtio_net_hdr *hdr)
+{
+   u16 hdr_len;
+
+   if (iov_iter_count(from) < sz)
+   return -EINVAL;
+
+   if (!copy_from_iter_full(hdr, sizeof(*hdr), from))
+   return -EFAULT;
+
+   hdr_len = tun16_to_cpu(flags, hdr->hdr_len);
+
+   if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+   hdr_len = max(tun16_to_cpu(flags, hdr->csum_start) + tun16_to_cpu(flags, hdr->csum_offset) + 2, hdr_len);
+   hdr->hdr_len = cpu_to_tun16(flags, hdr_len);
+   }
+
+   if (hdr_len > iov_iter_count(from))
+   return -EINVAL;
+
+   iov_iter_advance(from, sz - sizeof(*hdr));
+
+   return hdr_len;
+}
+
+static int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
+   const struct virtio_net_hdr *hdr)
+{
+   if (unlikely(iov_iter_count(iter) < sz))
+   return -EINVAL;
+
+   if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
+   return -EFAULT;
+
+   iov_iter_advance(iter, sz - sizeof(*hdr));
+
+   return 0;
+}
+
+static int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
+  const struct virtio_net_hdr *hdr)
+{
+   return virtio_net_hdr_to_skb(skb, hdr, tun_is_little_endian(flags));
+}
+
+static int tun_vnet_hdr_from_skb(unsigned int flags,
+const struct net_device *dev,
+const struct sk_buff *skb,
+struct virtio_net_hdr *hdr)
+{
+   int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
+
+   if (virtio_net_hdr_from_skb(skb, hdr,
+   tun_is_little_endian(flags), true,
+   vlan_hlen)) {
+   struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+   if (net_ratelimit()) {
+   netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
+  sinfo->gso_type, tun16_to_cpu(flags, hdr->gso_size),
+  tun16_to_cpu(flags, hdr->hdr_len));
+   print_hex_dump(KERN_ERR, "tun: ",
+  DUMP_PREFIX_NONE,
+  16, 1, skb->head,
+  min(tun16_to_cpu(flags, hdr->hdr_len), 64), true);
+   }
+   WARN_ON_ONCE(1);
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static inline u32 tun_hashfn(u32 rxhash)
 {
return rxhash & TUN_MASK_FLOW_ENTRIES;
@@ -1764,25 +1885,12 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 
if (tun->flags & IFF_VNET_HDR) {
int vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
-   int flags = tun->flags;
-
-   if (le

[PATCH net-next v5 7/7] tap: Use tun's vnet-related code

2025-02-04 Thread Akihiko Odaki
tun and tap implement the same vnet-related features, so reuse the code.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tap.c | 152 ++
 1 file changed, 16 insertions(+), 136 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index c55c432bac48d395aebc9ceeaa74f7d07e25af4c..40b077aa639be03cf5a6e9a85734833b289f6b86 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -26,74 +26,9 @@
 #include 
 #include 
 
-#define TAP_IFFEATURES (IFF_VNET_HDR | IFF_MULTI_QUEUE)
-
-#define TAP_VNET_LE 0x8000
-#define TAP_VNET_BE 0x4000
-
-#ifdef CONFIG_TUN_VNET_CROSS_LE
-static inline bool tap_legacy_is_little_endian(struct tap_queue *q)
-{
-   return q->flags & TAP_VNET_BE ? false :
-   virtio_legacy_is_little_endian();
-}
-
-static long tap_get_vnet_be(struct tap_queue *q, int __user *sp)
-{
-   int s = !!(q->flags & TAP_VNET_BE);
-
-   if (put_user(s, sp))
-   return -EFAULT;
-
-   return 0;
-}
-
-static long tap_set_vnet_be(struct tap_queue *q, int __user *sp)
-{
-   int s;
-
-   if (get_user(s, sp))
-   return -EFAULT;
-
-   if (s)
-   q->flags |= TAP_VNET_BE;
-   else
-   q->flags &= ~TAP_VNET_BE;
-
-   return 0;
-}
-#else
-static inline bool tap_legacy_is_little_endian(struct tap_queue *q)
-{
-   return virtio_legacy_is_little_endian();
-}
-
-static long tap_get_vnet_be(struct tap_queue *q, int __user *argp)
-{
-   return -EINVAL;
-}
-
-static long tap_set_vnet_be(struct tap_queue *q, int __user *argp)
-{
-   return -EINVAL;
-}
-#endif /* CONFIG_TUN_VNET_CROSS_LE */
-
-static inline bool tap_is_little_endian(struct tap_queue *q)
-{
-   return q->flags & TAP_VNET_LE ||
-   tap_legacy_is_little_endian(q);
-}
-
-static inline u16 tap16_to_cpu(struct tap_queue *q, __virtio16 val)
-{
-   return __virtio16_to_cpu(tap_is_little_endian(q), val);
-}
+#include "tun_vnet.h"
 
-static inline __virtio16 cpu_to_tap16(struct tap_queue *q, u16 val)
-{
-   return __cpu_to_virtio16(tap_is_little_endian(q), val);
-}
+#define TAP_IFFEATURES (IFF_VNET_HDR | IFF_MULTI_QUEUE)
 
 static struct proto tap_proto = {
.name = "tap",
@@ -655,25 +590,13 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
if (q->flags & IFF_VNET_HDR) {
vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
 
-   err = -EINVAL;
-   if (len < vnet_hdr_len)
+   hdr_len = tun_vnet_hdr_get(vnet_hdr_len, q->flags, from, &vnet_hdr);
+   if (hdr_len < 0) {
+   err = hdr_len;
goto err;
-   len -= vnet_hdr_len;
-
-   err = -EFAULT;
-   if (!copy_from_iter_full(&vnet_hdr, sizeof(vnet_hdr), from))
-   goto err;
-   iov_iter_advance(from, vnet_hdr_len - sizeof(vnet_hdr));
-   hdr_len = tap16_to_cpu(q, vnet_hdr.hdr_len);
-   if (vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-   hdr_len = max(tap16_to_cpu(q, vnet_hdr.csum_start) +
- tap16_to_cpu(q, vnet_hdr.csum_offset) + 2,
- hdr_len);
-   vnet_hdr.hdr_len = cpu_to_tap16(q, hdr_len);
}
-   err = -EINVAL;
-   if (tap16_to_cpu(q, vnet_hdr.hdr_len) > len)
-   goto err;
+
+   len -= vnet_hdr_len;
}
 
err = -EINVAL;
@@ -729,8 +652,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
skb->dev = tap->dev;
 
if (vnet_hdr_len) {
-   err = virtio_net_hdr_to_skb(skb, &vnet_hdr,
-   tap_is_little_endian(q));
+   err = tun_vnet_hdr_to_skb(q->flags, skb, &vnet_hdr);
if (err) {
rcu_read_unlock();
drop_reason = SKB_DROP_REASON_DEV_HDR;
@@ -793,23 +715,17 @@ static ssize_t tap_put_user(struct tap_queue *q,
int total;
 
if (q->flags & IFF_VNET_HDR) {
-   int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
struct virtio_net_hdr vnet_hdr;
 
vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
-   if (iov_iter_count(iter) < vnet_hdr_len)
-   return -EINVAL;
-
-   if (virtio_net_hdr_from_skb(skb, &vnet_hdr,
-   tap_is_little_endian(q), true,
-   vlan_hlen))
-   BUG();
 
-   if (copy_to_iter(&vnet_hdr, sizeof(vnet_hdr), iter) !=
-   sizeof(vnet_hdr))
-   return -EFAULT;
+

[PATCH net-next v5 6/7] tap: Keep hdr_len in tap_get_user()

2025-02-04 Thread Akihiko Odaki
hdr_len is repeatedly used so keep it in a local variable.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tap.c | 30 +-
 1 file changed, 13 insertions(+), 17 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 
5aa41d5f7765a6dcf185bccd3cba2299bad89398..c55c432bac48d395aebc9ceeaa74f7d07e25af4c
 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -645,6 +645,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
int vnet_hdr_len = 0;
+   int hdr_len = 0;
int copylen = 0;
int depth;
bool zerocopy = false;
@@ -663,13 +664,13 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
if (!copy_from_iter_full(&vnet_hdr, sizeof(vnet_hdr), from))
goto err;
iov_iter_advance(from, vnet_hdr_len - sizeof(vnet_hdr));
-   if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-tap16_to_cpu(q, vnet_hdr.csum_start) +
-tap16_to_cpu(q, vnet_hdr.csum_offset) + 2 >
-tap16_to_cpu(q, vnet_hdr.hdr_len))
-   vnet_hdr.hdr_len = cpu_to_tap16(q,
-tap16_to_cpu(q, vnet_hdr.csum_start) +
-tap16_to_cpu(q, vnet_hdr.csum_offset) + 2);
+   hdr_len = tap16_to_cpu(q, vnet_hdr.hdr_len);
+   if (vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+   hdr_len = max(tap16_to_cpu(q, vnet_hdr.csum_start) +
+ tap16_to_cpu(q, vnet_hdr.csum_offset) + 2,
+ hdr_len);
+   vnet_hdr.hdr_len = cpu_to_tap16(q, hdr_len);
+   }
err = -EINVAL;
if (tap16_to_cpu(q, vnet_hdr.hdr_len) > len)
goto err;
@@ -682,11 +683,8 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
if (msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY)) {
struct iov_iter i;
 
-   copylen = vnet_hdr.hdr_len ?
-   tap16_to_cpu(q, vnet_hdr.hdr_len) : GOODCOPY_LEN;
-   if (copylen > good_linear)
-   copylen = good_linear;
-   else if (copylen < ETH_HLEN)
+   copylen = min(hdr_len ? hdr_len : GOODCOPY_LEN, good_linear);
+   if (copylen < ETH_HLEN)
copylen = ETH_HLEN;
linear = copylen;
i = *from;
@@ -697,11 +695,9 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
 
if (!zerocopy) {
copylen = len;
-   linear = tap16_to_cpu(q, vnet_hdr.hdr_len);
-   if (linear > good_linear)
-   linear = good_linear;
-   else if (linear < ETH_HLEN)
-   linear = ETH_HLEN;
+   linear = min(hdr_len, good_linear);
+   if (copylen < ETH_HLEN)
+   copylen = ETH_HLEN;
}
 
skb = tap_alloc_skb(&q->sk, TAP_RESERVE, copylen,

-- 
2.48.1




Re: [PATCH net-next v5 1/7] tun: Refactor CONFIG_TUN_VNET_CROSS_LE

2025-02-05 Thread Akihiko Odaki

On 2025/02/06 6:06, Willem de Bruijn wrote:

Akihiko Odaki wrote:

Check IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) to save some lines and make
future changes easier.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Willem de Bruijn 
---
  drivers/net/tun.c | 26 --
  1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 
e816aaba8e5f2ed06f8832f79553b6c976e75bb8..452fc5104260fe7ff5fdd5cedc5d2647cbe35c79
 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -298,10 +298,10 @@ static bool tun_napi_frags_enabled(const struct tun_file 
*tfile)
return tfile->napi_frags_enabled;
  }
  
-#ifdef CONFIG_TUN_VNET_CROSS_LE

  static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
  {
-   return tun->flags & TUN_VNET_BE ? false :
+   return !(IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
+(tun->flags & TUN_VNET_BE)) &&
virtio_legacy_is_little_endian();


Since I have other comments to the series:

Can we make this a bit simpler to the reader, by splitting the test:

 if (IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) && tun->flags & TUN_VNET_BE)
 return false;

 return virtio_legacy_is_little_endian();



I kept it all in one expression to show how different variables are reduced 
into one bool value, but I agree it is too complicated.


I'm adding a new variable to simplify this. The return statement will 
look like: return !be && virtio_legacy_is_little_endian();


It means: for tun, whether the legacy format is little-endian is 
determined by the tun-specific big-endian flag and virtio's common 
logic.
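
The reduction described above can be sketched in plain userspace C. This is only an illustration, not the kernel code: the flag values are placeholders, and the two bool parameters are hypothetical stand-ins for IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) and virtio_legacy_is_little_endian(), which cannot be called outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder flag bits for this sketch only; not the kernel's values. */
#define SKETCH_VNET_BE (1u << 0)
#define SKETCH_VNET_LE (1u << 1)

/*
 * cross_le_enabled models IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE).
 * host_is_le models virtio_legacy_is_little_endian().
 */
static bool sketch_legacy_is_little_endian(unsigned int flags,
                                           bool cross_le_enabled,
                                           bool host_is_le)
{
	/* The big-endian override only counts when the config option is on. */
	bool be = cross_le_enabled && (flags & SKETCH_VNET_BE);

	return !be && host_is_le;
}

static bool sketch_is_little_endian(unsigned int flags,
                                    bool cross_le_enabled,
                                    bool host_is_le)
{
	/* An explicit little-endian flag wins over the legacy decision. */
	return (flags & SKETCH_VNET_LE) ||
	       sketch_legacy_is_little_endian(flags, cross_le_enabled,
	                                      host_is_le);
}

/* A few rows of the truth table, written as assertions. */
static void sketch_self_check(void)
{
	assert(sketch_legacy_is_little_endian(0, true, true));
	assert(!sketch_legacy_is_little_endian(SKETCH_VNET_BE, true, true));
	/* With the option disabled, the big-endian flag is ignored. */
	assert(sketch_legacy_is_little_endian(SKETCH_VNET_BE, false, true));
	assert(!sketch_legacy_is_little_endian(0, true, false));
}
```

Naming the intermediate value `be` keeps the config check and the flag test together, so the return statement reads as a single condition.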




Re: [PATCH net-next v5 6/7] tap: Keep hdr_len in tap_get_user()

2025-02-05 Thread Akihiko Odaki

On 2025/02/06 6:21, Willem de Bruijn wrote:

Akihiko Odaki wrote:

hdr_len is repeatedly used so keep it in a local variable.

Signed-off-by: Akihiko Odaki 



@@ -682,11 +683,8 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
if (msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY)) {
struct iov_iter i;
  
-		copylen = vnet_hdr.hdr_len ?

-   tap16_to_cpu(q, vnet_hdr.hdr_len) : GOODCOPY_LEN;
-   if (copylen > good_linear)
-   copylen = good_linear;
-   else if (copylen < ETH_HLEN)
+   copylen = min(hdr_len ? hdr_len : GOODCOPY_LEN, good_linear);
+   if (copylen < ETH_HLEN)
copylen = ETH_HLEN;


I forgot earlier: this can also use a single-line statement

 copylen = max(copylen, ETH_HLEN);

And perhaps easiest to follow is

 copylen = hdr_len ?: GOODCOPY_LEN;
 copylen = min(copylen, good_linear);
 copylen = max(copylen, ETH_HLEN);


I introduced the min() usage as it now neatly fits in a line, but I 
found even clamp() fits so I'll use it in the next version:

copylen = clamp(hdr_len ?: GOODCOPY_LEN, ETH_HLEN, good_linear);

Please tell me if you prefer hdr_len ?: GOODCOPY_LEN on a separate line:
copylen = hdr_len ?: GOODCOPY_LEN;
copylen = clamp(copylen, ETH_HLEN, good_linear);




linear = copylen;
i = *from;
@@ -697,11 +695,9 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
  
  	if (!zerocopy) {

copylen = len;
-   linear = tap16_to_cpu(q, vnet_hdr.hdr_len);
-   if (linear > good_linear)
-   linear = good_linear;
-   else if (linear < ETH_HLEN)
-   linear = ETH_HLEN;
+   linear = min(hdr_len, good_linear);
+   if (copylen < ETH_HLEN)
+   copylen = ETH_HLEN;

Same



I realized I mistakenly replaced linear with copylen here. Using clamp() 
will remove redundant variable references and fix the bug.
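
As a rough userspace illustration of why clamp() avoids this kind of slip: each result is computed from a single expression, so there is no second statement in which the wrong variable can be named. The helper names are hypothetical, sketch_clamp() is a stand-in for the kernel's clamp() macro, the GOODCOPY_LEN value is an assumption made for the example, and good_linear is assumed to be at least ETH_HLEN.

```c
#include <assert.h>

#define SKETCH_ETH_HLEN     14   /* Ethernet header length in bytes */
#define SKETCH_GOODCOPY_LEN 128  /* assumed value, for illustration only */

/* Stand-in for the kernel's clamp(); requires lo <= hi. */
static int sketch_clamp(int val, int lo, int hi)
{
	assert(lo <= hi);
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/* Zerocopy path: fall back to GOODCOPY_LEN when hdr_len is zero. */
static int sketch_zerocopy_copylen(int hdr_len, int good_linear)
{
	return sketch_clamp(hdr_len ? hdr_len : SKETCH_GOODCOPY_LEN,
	                    SKETCH_ETH_HLEN, good_linear);
}

/* Copy path: linear is bounded below by ETH_HLEN, above by good_linear. */
static int sketch_linear(int hdr_len, int good_linear)
{
	return sketch_clamp(hdr_len, SKETCH_ETH_HLEN, good_linear);
}
```

With good_linear set to 1000, sketch_linear(0, 1000) yields 14 and sketch_linear(64, 1000) yields 64, which is the behavior of the original if/else chain on linear.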




Re: [PATCH net-next v5 5/7] tun: Extract the vnet handling code

2025-02-05 Thread Akihiko Odaki

On 2025/02/06 6:12, Willem de Bruijn wrote:

Akihiko Odaki wrote:

The vnet handling code will be reused by tap.

Signed-off-by: Akihiko Odaki 
---
  MAINTAINERS|   2 +-
  drivers/net/tun.c  | 179 +--
  drivers/net/tun_vnet.h | 184 +
  3 files changed, 187 insertions(+), 178 deletions(-)



-static inline bool tun_legacy_is_little_endian(unsigned int flags)
-{
-   return !(IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
-(flags & TUN_VNET_BE)) &&
-   virtio_legacy_is_little_endian();
-}



+static inline bool tun_vnet_legacy_is_little_endian(unsigned int flags)
+{
+   return !(IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
+(flags & TUN_VNET_BE)) &&
+   virtio_legacy_is_little_endian();
+}


In general LGTM. But why did you rename functions while moving them?
Please add an explanation in the commit message for any non-obvious
changes like that.


I renamed them to clarify they are in a distinct, decoupled part of 
code. It was obvious in the previous version as they are static 
functions contained in a translation unit, but now they are part of a 
header file so I'm clarifying that with this rename. I will add this 
explanation to the commit message.




[PATCH net-next v6 2/7] tun: Keep hdr_len in tun_get_user()

2025-02-06 Thread Akihiko Odaki
hdr_len is repeatedly used so keep it in a local variable.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Willem de Bruijn 
---
 drivers/net/tun.c | 24 +++-
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 
4b189cbd28e63ec6325073d9a7678f4210bff3e1..c204c1c0d75bc7d336ec315099a5a60d5d70ea82
 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1747,6 +1747,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
struct virtio_net_hdr gso = { 0 };
int good_linear;
int copylen;
+   int hdr_len = 0;
bool zerocopy = false;
int err;
u32 rxhash = 0;
@@ -1773,19 +1774,21 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
if (!copy_from_iter_full(&gso, sizeof(gso), from))
return -EFAULT;
 
-   if ((gso.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-   tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2 > tun16_to_cpu(tun, gso.hdr_len))
-   gso.hdr_len = cpu_to_tun16(tun, tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2);
+   hdr_len = tun16_to_cpu(tun, gso.hdr_len);
 
-   if (tun16_to_cpu(tun, gso.hdr_len) > len)
+   if (gso.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+   hdr_len = max(tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2, hdr_len);
+   gso.hdr_len = cpu_to_tun16(tun, hdr_len);
+   }
+
+   if (hdr_len > len)
return -EINVAL;
iov_iter_advance(from, vnet_hdr_sz - sizeof(gso));
}
 
if ((tun->flags & TUN_TYPE_MASK) == IFF_TAP) {
align += NET_IP_ALIGN;
-   if (unlikely(len < ETH_HLEN ||
-(gso.hdr_len && tun16_to_cpu(tun, gso.hdr_len) < ETH_HLEN)))
+   if (unlikely(len < ETH_HLEN || (hdr_len && hdr_len < ETH_HLEN)))
return -EINVAL;
}
 
@@ -1798,9 +1801,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
 * enough room for skb expand head in case it is used.
 * The rest of the buffer is mapped from userspace.
 */
-   copylen = gso.hdr_len ? tun16_to_cpu(tun, gso.hdr_len) : GOODCOPY_LEN;
-   if (copylen > good_linear)
-   copylen = good_linear;
+   copylen = min(hdr_len ? hdr_len : GOODCOPY_LEN, good_linear);
linear = copylen;
iov_iter_advance(&i, copylen);
if (iov_iter_npages(&i, INT_MAX) <= MAX_SKB_FRAGS)
@@ -1821,10 +1822,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
} else {
if (!zerocopy) {
copylen = len;
-   if (tun16_to_cpu(tun, gso.hdr_len) > good_linear)
-   linear = good_linear;
-   else
-   linear = tun16_to_cpu(tun, gso.hdr_len);
+   linear = min(hdr_len, good_linear);
}
 
if (frags) {

-- 
2.48.1




[PATCH net-next v6 0/7] tun: Unify vnet implementation

2025-02-06 Thread Akihiko Odaki
When I implemented virtio's hash-related features for tun/tap [1],
I found that tun/tap does not fill the entire region reserved for the
virtio header, leaving an uninitialized hole in the middle of the buffer
after read()/recvmsg().

This series fills the uninitialized hole. More concretely, the
num_buffers field will be initialized with 1, and the other fields will
be initialized with 0. Setting the num_buffers field to 1 is mandated by
virtio 1.0 [2].
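
A minimal userspace sketch of the fill described above, not the code from this series. It assumes the virtio 1.0 layout, in which a 16-bit num_buffers field directly follows the 10-byte base header; the function name, the byte-buffer interface, and the little_endian parameter are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BASE_HDR_LEN    10  /* size of the base virtio-net header */
#define SKETCH_NUM_BUFFERS_OFF 10  /* num_buffers follows the base header */

/*
 * Fill a reserved header region of sz bytes: copy the base header,
 * zero everything after it, then store num_buffers = 1 in the
 * requested byte order. Returns 0 on success, -1 if sz is too small.
 */
static int sketch_fill_vnet_hdr(uint8_t *buf, size_t sz,
                                const uint8_t *base, bool little_endian)
{
	assert(buf != NULL && base != NULL);

	if (sz < SKETCH_BASE_HDR_LEN)
		return -1;

	memcpy(buf, base, SKETCH_BASE_HDR_LEN);
	memset(buf + SKETCH_BASE_HDR_LEN, 0, sz - SKETCH_BASE_HDR_LEN);

	/* Only write num_buffers when the region has room for it. */
	if (sz >= SKETCH_NUM_BUFFERS_OFF + 2)
		buf[SKETCH_NUM_BUFFERS_OFF + (little_endian ? 0 : 1)] = 1;

	return 0;
}
```

Zeroing first and then writing the one nonzero field means no byte of the reserved region keeps stale contents, whatever the configured header size.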

The change to virtio header is preceded by another change that refactors
tun and tap to unify their virtio-related code.

[1]: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df0...@daynix.com
[2]: https://lore.kernel.org/r/20241227084256-mutt-send-email-...@kernel.org/

Signed-off-by: Akihiko Odaki 
---
Changes in v6:
- Added an intermediate variable in tun_vnet_legacy_is_little_endian()
  to reduce the complexity of an expression.
- Noted that functions are renamed in the message of patch
  "tun: Extract the vnet handling code".
- Used clamp() in patch "tap: Keep hdr_len in tap_get_user()".
- Link to v5: 
https://lore.kernel.org/r/20250205-tun-v5-0-15d0b32e8...@daynix.com

Changes in v5:
- s/vnet_hdr_len_sz/vnet_hdr_sz/ for patch "tun: Decouple vnet handling"
  (Willem de Bruijn)
- Changed to inline vnet implementations to TUN and TAP.
- Dropped patch "tun: Avoid double-tracking iov_iter length changes" and
  "tap: Avoid double-tracking iov_iter length changes".
- Link to v4: 
https://lore.kernel.org/r/20250120-tun-v4-0-ee81dda03...@daynix.com

Changes in v4:
- s/sz/vnet_hdr_len_sz/ for patch "tun: Decouple vnet handling"
  (Willem de Bruijn)
- Reverted to add CONFIG_TUN_VNET.
- Link to v3: 
https://lore.kernel.org/r/20250116-tun-v3-0-c6b2871e9...@daynix.com

Changes in v3:
- Dropped changes to fill the vnet header.
- Split patch "tun: Unify vnet implementation".
- Reverted spurious changes in patch "tun: Unify vnet implementation".
- Merged tun_vnet.c into TAP.
- Link to v2: 
https://lore.kernel.org/r/20250109-tun-v2-0-388d7d5a2...@daynix.com

Changes in v2:
- Fixed num_buffers endian.
- Link to v1: 
https://lore.kernel.org/r/20250108-tun-v1-0-67d784b34...@daynix.com

---
Akihiko Odaki (7):
  tun: Refactor CONFIG_TUN_VNET_CROSS_LE
  tun: Keep hdr_len in tun_get_user()
  tun: Decouple vnet from tun_struct
  tun: Decouple vnet handling
  tun: Extract the vnet handling code
  tap: Keep hdr_len in tap_get_user()
  tap: Use tun's vnet-related code

 MAINTAINERS|   2 +-
 drivers/net/tap.c  | 166 +-
 drivers/net/tun.c  | 193 ++---
 drivers/net/tun_vnet.h | 185 +++
 4 files changed, 229 insertions(+), 317 deletions(-)
---
base-commit: a32e14f8aef69b42826cf0998b068a43d486a9e9
change-id: 20241230-tun-66e10a49b0c7

Best regards,
-- 
Akihiko Odaki 




[PATCH net-next v6 1/7] tun: Refactor CONFIG_TUN_VNET_CROSS_LE

2025-02-06 Thread Akihiko Odaki
Check IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) to save some lines and make
future changes easier.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Willem de Bruijn 
---
 drivers/net/tun.c | 29 ++---
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 
e816aaba8e5f2ed06f8832f79553b6c976e75bb8..4b189cbd28e63ec6325073d9a7678f4210bff3e1
 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -298,17 +298,21 @@ static bool tun_napi_frags_enabled(const struct tun_file 
*tfile)
return tfile->napi_frags_enabled;
 }
 
-#ifdef CONFIG_TUN_VNET_CROSS_LE
 static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
 {
-   return tun->flags & TUN_VNET_BE ? false :
-   virtio_legacy_is_little_endian();
+   bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
+ (tun->flags & TUN_VNET_BE);
+
+   return !be && virtio_legacy_is_little_endian();
 }
 
 static long tun_get_vnet_be(struct tun_struct *tun, int __user *argp)
 {
int be = !!(tun->flags & TUN_VNET_BE);
 
+   if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
+   return -EINVAL;
+
if (put_user(be, argp))
return -EFAULT;
 
@@ -319,6 +323,9 @@ static long tun_set_vnet_be(struct tun_struct *tun, int 
__user *argp)
 {
int be;
 
+   if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
+   return -EINVAL;
+
if (get_user(be, argp))
return -EFAULT;
 
@@ -329,22 +336,6 @@ static long tun_set_vnet_be(struct tun_struct *tun, int 
__user *argp)
 
return 0;
 }
-#else
-static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
-{
-   return virtio_legacy_is_little_endian();
-}
-
-static long tun_get_vnet_be(struct tun_struct *tun, int __user *argp)
-{
-   return -EINVAL;
-}
-
-static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
-{
-   return -EINVAL;
-}
-#endif /* CONFIG_TUN_VNET_CROSS_LE */
 
 static inline bool tun_is_little_endian(struct tun_struct *tun)
 {

-- 
2.48.1




[PATCH net-next v6 5/7] tun: Extract the vnet handling code

2025-02-06 Thread Akihiko Odaki
The vnet handling code will be reused by tap.

Functions are renamed to ensure that their names contain "vnet" to
clarify that they are part of the decoupled vnet handling code.

Signed-off-by: Akihiko Odaki 
---
 MAINTAINERS|   2 +-
 drivers/net/tun.c  | 180 +--
 drivers/net/tun_vnet.h | 185 +
 3 files changed, 188 insertions(+), 179 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 
910305c11e8a882da5b49ce5bd55011b93f28c32..bc32b7e23c79ab80b19c8207f14c5e51a47ec89f
 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23902,7 +23902,7 @@ W:  http://vtun.sourceforge.net/tun
 F: Documentation/networking/tuntap.rst
 F: arch/um/os-Linux/drivers/
 F: drivers/net/tap.c
-F: drivers/net/tun.c
+F: drivers/net/tun*
 
 TURBOCHANNEL SUBSYSTEM
 M: "Maciej W. Rozycki" 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 
a52c5d00e75c28bb1574f07f59be9f96702a6f0a..b14231a743915c2851eaae49d757b763ec4a8841
 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -83,6 +83,8 @@
 #include 
 #include 
 
+#include "tun_vnet.h"
+
 static void tun_default_link_ksettings(struct net_device *dev,
   struct ethtool_link_ksettings *cmd);
 
@@ -94,9 +96,6 @@ static void tun_default_link_ksettings(struct net_device *dev,
  * overload it to mean fasync when stored there.
  */
 #define TUN_FASYNC IFF_ATTACH_QUEUE
-/* High bits in flags field are unused. */
-#define TUN_VNET_LE 0x8000
-#define TUN_VNET_BE 0x4000
 
 #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
  IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)
@@ -298,181 +297,6 @@ static bool tun_napi_frags_enabled(const struct tun_file 
*tfile)
return tfile->napi_frags_enabled;
 }
 
-static inline bool tun_legacy_is_little_endian(unsigned int flags)
-{
-   bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
- (flags & TUN_VNET_BE);
-
-   return !be && virtio_legacy_is_little_endian();
-}
-
-static long tun_get_vnet_be(unsigned int flags, int __user *argp)
-{
-   int be = !!(flags & TUN_VNET_BE);
-
-   if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
-   return -EINVAL;
-
-   if (put_user(be, argp))
-   return -EFAULT;
-
-   return 0;
-}
-
-static long tun_set_vnet_be(unsigned int *flags, int __user *argp)
-{
-   int be;
-
-   if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
-   return -EINVAL;
-
-   if (get_user(be, argp))
-   return -EFAULT;
-
-   if (be)
-   *flags |= TUN_VNET_BE;
-   else
-   *flags &= ~TUN_VNET_BE;
-
-   return 0;
-}
-
-static inline bool tun_is_little_endian(unsigned int flags)
-{
-   return flags & TUN_VNET_LE || tun_legacy_is_little_endian(flags);
-}
-
-static inline u16 tun16_to_cpu(unsigned int flags, __virtio16 val)
-{
-   return __virtio16_to_cpu(tun_is_little_endian(flags), val);
-}
-
-static inline __virtio16 cpu_to_tun16(unsigned int flags, u16 val)
-{
-   return __cpu_to_virtio16(tun_is_little_endian(flags), val);
-}
-
-static long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
-  unsigned int cmd, int __user *sp)
-{
-   int s;
-
-   switch (cmd) {
-   case TUNGETVNETHDRSZ:
-   s = *vnet_hdr_sz;
-   if (put_user(s, sp))
-   return -EFAULT;
-   return 0;
-
-   case TUNSETVNETHDRSZ:
-   if (get_user(s, sp))
-   return -EFAULT;
-   if (s < (int)sizeof(struct virtio_net_hdr))
-   return -EINVAL;
-
-   *vnet_hdr_sz = s;
-   return 0;
-
-   case TUNGETVNETLE:
-   s = !!(*flags & TUN_VNET_LE);
-   if (put_user(s, sp))
-   return -EFAULT;
-   return 0;
-
-   case TUNSETVNETLE:
-   if (get_user(s, sp))
-   return -EFAULT;
-   if (s)
-   *flags |= TUN_VNET_LE;
-   else
-   *flags &= ~TUN_VNET_LE;
-   return 0;
-
-   case TUNGETVNETBE:
-   return tun_get_vnet_be(*flags, sp);
-
-   case TUNSETVNETBE:
-   return tun_set_vnet_be(flags, sp);
-
-   default:
-   return -EINVAL;
-   }
-}
-
-static int tun_vnet_hdr_get(int sz, unsigned int flags, struct iov_iter *from,
-   struct virtio_net_hdr *hdr)
-{
-   u16 hdr_len;
-
-   if (iov_iter_count(from) < sz)
-   return -EINVAL;
-
-   if (!copy_from_iter_full(hdr, sizeof(*hdr), from))
-   return -EFAULT;
-
-   hdr_len = tun16_to_cpu(flags, hdr->hdr_len);
-
-   if (hdr->flags & VIRTIO_

[PATCH net-next v6 4/7] tun: Decouple vnet handling

2025-02-06 Thread Akihiko Odaki
Decouple the vnet handling code so that we can reuse it for tap.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Willem de Bruijn 
---
 drivers/net/tun.c | 237 --
 1 file changed, 139 insertions(+), 98 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 
28a1af1de9d704ed5cc51aac3d99bc095cb28ba5..a52c5d00e75c28bb1574f07f59be9f96702a6f0a
 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -352,6 +352,127 @@ static inline __virtio16 cpu_to_tun16(unsigned int flags, 
u16 val)
return __cpu_to_virtio16(tun_is_little_endian(flags), val);
 }
 
+static long tun_vnet_ioctl(int *vnet_hdr_sz, unsigned int *flags,
+  unsigned int cmd, int __user *sp)
+{
+   int s;
+
+   switch (cmd) {
+   case TUNGETVNETHDRSZ:
+   s = *vnet_hdr_sz;
+   if (put_user(s, sp))
+   return -EFAULT;
+   return 0;
+
+   case TUNSETVNETHDRSZ:
+   if (get_user(s, sp))
+   return -EFAULT;
+   if (s < (int)sizeof(struct virtio_net_hdr))
+   return -EINVAL;
+
+   *vnet_hdr_sz = s;
+   return 0;
+
+   case TUNGETVNETLE:
+   s = !!(*flags & TUN_VNET_LE);
+   if (put_user(s, sp))
+   return -EFAULT;
+   return 0;
+
+   case TUNSETVNETLE:
+   if (get_user(s, sp))
+   return -EFAULT;
+   if (s)
+   *flags |= TUN_VNET_LE;
+   else
+   *flags &= ~TUN_VNET_LE;
+   return 0;
+
+   case TUNGETVNETBE:
+   return tun_get_vnet_be(*flags, sp);
+
+   case TUNSETVNETBE:
+   return tun_set_vnet_be(flags, sp);
+
+   default:
+   return -EINVAL;
+   }
+}
+
+static int tun_vnet_hdr_get(int sz, unsigned int flags, struct iov_iter *from,
+   struct virtio_net_hdr *hdr)
+{
+   u16 hdr_len;
+
+   if (iov_iter_count(from) < sz)
+   return -EINVAL;
+
+   if (!copy_from_iter_full(hdr, sizeof(*hdr), from))
+   return -EFAULT;
+
+   hdr_len = tun16_to_cpu(flags, hdr->hdr_len);
+
+   if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+   hdr_len = max(tun16_to_cpu(flags, hdr->csum_start) + tun16_to_cpu(flags, hdr->csum_offset) + 2, hdr_len);
+   hdr->hdr_len = cpu_to_tun16(flags, hdr_len);
+   }
+
+   if (hdr_len > iov_iter_count(from))
+   return -EINVAL;
+
+   iov_iter_advance(from, sz - sizeof(*hdr));
+
+   return hdr_len;
+}
+
+static int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
+   const struct virtio_net_hdr *hdr)
+{
+   if (unlikely(iov_iter_count(iter) < sz))
+   return -EINVAL;
+
+   if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
+   return -EFAULT;
+
+   iov_iter_advance(iter, sz - sizeof(*hdr));
+
+   return 0;
+}
+
+static int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
+  const struct virtio_net_hdr *hdr)
+{
+   return virtio_net_hdr_to_skb(skb, hdr, tun_is_little_endian(flags));
+}
+
+static int tun_vnet_hdr_from_skb(unsigned int flags,
+const struct net_device *dev,
+const struct sk_buff *skb,
+struct virtio_net_hdr *hdr)
+{
+   int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
+
+   if (virtio_net_hdr_from_skb(skb, hdr,
+   tun_is_little_endian(flags), true,
+   vlan_hlen)) {
+   struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+   if (net_ratelimit()) {
+   netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
+  sinfo->gso_type, tun16_to_cpu(flags, hdr->gso_size),
+  tun16_to_cpu(flags, hdr->hdr_len));
+   print_hex_dump(KERN_ERR, "tun: ",
+  DUMP_PREFIX_NONE,
+  16, 1, skb->head,
+  min(tun16_to_cpu(flags, hdr->hdr_len), 64), true);
+   }
+   WARN_ON_ONCE(1);
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static inline u32 tun_hashfn(u32 rxhash)
 {
return rxhash & TUN_MASK_FLOW_ENTRIES;
@@ -1765,25 +1886,12 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
 
if (tun->flags & IFF_VNET_HDR) {
int vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
-   int flags = tun->flags;
-
-   if (le

[PATCH net-next v6 7/7] tap: Use tun's vnet-related code

2025-02-06 Thread Akihiko Odaki
tun and tap implement the same vnet-related features, so reuse the code.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Willem de Bruijn 
---
 drivers/net/tap.c | 152 ++
 1 file changed, 16 insertions(+), 136 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 
8cb002616a6143b54258b65b483fed0c3af2c7a0..1287e241f4454fb8ec4975bbaded5fbaa88e3cc8
 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -26,74 +26,9 @@
 #include 
 #include 
 
-#define TAP_IFFEATURES (IFF_VNET_HDR | IFF_MULTI_QUEUE)
-
-#define TAP_VNET_LE 0x8000
-#define TAP_VNET_BE 0x4000
-
-#ifdef CONFIG_TUN_VNET_CROSS_LE
-static inline bool tap_legacy_is_little_endian(struct tap_queue *q)
-{
-   return q->flags & TAP_VNET_BE ? false :
-   virtio_legacy_is_little_endian();
-}
-
-static long tap_get_vnet_be(struct tap_queue *q, int __user *sp)
-{
-   int s = !!(q->flags & TAP_VNET_BE);
-
-   if (put_user(s, sp))
-   return -EFAULT;
-
-   return 0;
-}
-
-static long tap_set_vnet_be(struct tap_queue *q, int __user *sp)
-{
-   int s;
-
-   if (get_user(s, sp))
-   return -EFAULT;
-
-   if (s)
-   q->flags |= TAP_VNET_BE;
-   else
-   q->flags &= ~TAP_VNET_BE;
-
-   return 0;
-}
-#else
-static inline bool tap_legacy_is_little_endian(struct tap_queue *q)
-{
-   return virtio_legacy_is_little_endian();
-}
-
-static long tap_get_vnet_be(struct tap_queue *q, int __user *argp)
-{
-   return -EINVAL;
-}
-
-static long tap_set_vnet_be(struct tap_queue *q, int __user *argp)
-{
-   return -EINVAL;
-}
-#endif /* CONFIG_TUN_VNET_CROSS_LE */
-
-static inline bool tap_is_little_endian(struct tap_queue *q)
-{
-   return q->flags & TAP_VNET_LE ||
-   tap_legacy_is_little_endian(q);
-}
-
-static inline u16 tap16_to_cpu(struct tap_queue *q, __virtio16 val)
-{
-   return __virtio16_to_cpu(tap_is_little_endian(q), val);
-}
+#include "tun_vnet.h"
 
-static inline __virtio16 cpu_to_tap16(struct tap_queue *q, u16 val)
-{
-   return __cpu_to_virtio16(tap_is_little_endian(q), val);
-}
+#define TAP_IFFEATURES (IFF_VNET_HDR | IFF_MULTI_QUEUE)
 
 static struct proto tap_proto = {
.name = "tap",
@@ -655,25 +590,13 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
if (q->flags & IFF_VNET_HDR) {
vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
 
-   err = -EINVAL;
-   if (len < vnet_hdr_len)
+   hdr_len = tun_vnet_hdr_get(vnet_hdr_len, q->flags, from, &vnet_hdr);
+   if (hdr_len < 0) {
+   err = hdr_len;
goto err;
-   len -= vnet_hdr_len;
-
-   err = -EFAULT;
-   if (!copy_from_iter_full(&vnet_hdr, sizeof(vnet_hdr), from))
-   goto err;
-   iov_iter_advance(from, vnet_hdr_len - sizeof(vnet_hdr));
-   hdr_len = tap16_to_cpu(q, vnet_hdr.hdr_len);
-   if (vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-   hdr_len = max(tap16_to_cpu(q, vnet_hdr.csum_start) +
- tap16_to_cpu(q, vnet_hdr.csum_offset) + 2,
- hdr_len);
-   vnet_hdr.hdr_len = cpu_to_tap16(q, hdr_len);
}
-   err = -EINVAL;
-   if (tap16_to_cpu(q, vnet_hdr.hdr_len) > len)
-   goto err;
+
+   len -= vnet_hdr_len;
}
 
err = -EINVAL;
@@ -725,8 +648,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
skb->dev = tap->dev;
 
if (vnet_hdr_len) {
-   err = virtio_net_hdr_to_skb(skb, &vnet_hdr,
-   tap_is_little_endian(q));
+   err = tun_vnet_hdr_to_skb(q->flags, skb, &vnet_hdr);
if (err) {
rcu_read_unlock();
drop_reason = SKB_DROP_REASON_DEV_HDR;
@@ -789,23 +711,17 @@ static ssize_t tap_put_user(struct tap_queue *q,
int total;
 
if (q->flags & IFF_VNET_HDR) {
-   int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
struct virtio_net_hdr vnet_hdr;
 
vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
-   if (iov_iter_count(iter) < vnet_hdr_len)
-   return -EINVAL;
-
-   if (virtio_net_hdr_from_skb(skb, &vnet_hdr,
-   tap_is_little_endian(q), true,
-   vlan_hlen))
-   BUG();
 
-   if (copy_to_iter(&vnet_hdr, sizeof(vnet_hdr), iter) !=
-   sizeof(vnet_hdr))
-

[PATCH net-next v6 3/7] tun: Decouple vnet from tun_struct

2025-02-06 Thread Akihiko Odaki
Decouple vnet-related functions from tun_struct so that we can reuse
them for tap in the future.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Willem de Bruijn 
---
 drivers/net/tun.c | 51 ++-
 1 file changed, 26 insertions(+), 25 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 
c204c1c0d75bc7d336ec315099a5a60d5d70ea82..28a1af1de9d704ed5cc51aac3d99bc095cb28ba5
 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -298,17 +298,17 @@ static bool tun_napi_frags_enabled(const struct tun_file 
*tfile)
return tfile->napi_frags_enabled;
 }
 
-static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
+static inline bool tun_legacy_is_little_endian(unsigned int flags)
 {
bool be = IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
- (tun->flags & TUN_VNET_BE);
+ (flags & TUN_VNET_BE);
 
return !be && virtio_legacy_is_little_endian();
 }
 
-static long tun_get_vnet_be(struct tun_struct *tun, int __user *argp)
+static long tun_get_vnet_be(unsigned int flags, int __user *argp)
 {
-   int be = !!(tun->flags & TUN_VNET_BE);
+   int be = !!(flags & TUN_VNET_BE);
 
if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
return -EINVAL;
@@ -319,7 +319,7 @@ static long tun_get_vnet_be(struct tun_struct *tun, int 
__user *argp)
return 0;
 }
 
-static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
+static long tun_set_vnet_be(unsigned int *flags, int __user *argp)
 {
int be;
 
@@ -330,27 +330,26 @@ static long tun_set_vnet_be(struct tun_struct *tun, int 
__user *argp)
return -EFAULT;
 
if (be)
-   tun->flags |= TUN_VNET_BE;
+   *flags |= TUN_VNET_BE;
else
-   tun->flags &= ~TUN_VNET_BE;
+   *flags &= ~TUN_VNET_BE;
 
return 0;
 }
 
-static inline bool tun_is_little_endian(struct tun_struct *tun)
+static inline bool tun_is_little_endian(unsigned int flags)
 {
-   return tun->flags & TUN_VNET_LE ||
-   tun_legacy_is_little_endian(tun);
+   return flags & TUN_VNET_LE || tun_legacy_is_little_endian(flags);
 }
 
-static inline u16 tun16_to_cpu(struct tun_struct *tun, __virtio16 val)
+static inline u16 tun16_to_cpu(unsigned int flags, __virtio16 val)
 {
-   return __virtio16_to_cpu(tun_is_little_endian(tun), val);
+   return __virtio16_to_cpu(tun_is_little_endian(flags), val);
 }
 
-static inline __virtio16 cpu_to_tun16(struct tun_struct *tun, u16 val)
+static inline __virtio16 cpu_to_tun16(unsigned int flags, u16 val)
 {
-   return __cpu_to_virtio16(tun_is_little_endian(tun), val);
+   return __cpu_to_virtio16(tun_is_little_endian(flags), val);
 }
 
 static inline u32 tun_hashfn(u32 rxhash)
@@ -1766,6 +1765,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
 
if (tun->flags & IFF_VNET_HDR) {
int vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
+   int flags = tun->flags;
 
if (len < vnet_hdr_sz)
return -EINVAL;
@@ -1774,11 +1774,11 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
if (!copy_from_iter_full(&gso, sizeof(gso), from))
return -EFAULT;
 
-   hdr_len = tun16_to_cpu(tun, gso.hdr_len);
+   hdr_len = tun16_to_cpu(flags, gso.hdr_len);
 
if (gso.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-   hdr_len = max(tun16_to_cpu(tun, gso.csum_start) + 
tun16_to_cpu(tun, gso.csum_offset) + 2, hdr_len);
-   gso.hdr_len = cpu_to_tun16(tun, hdr_len);
+   hdr_len = max(tun16_to_cpu(flags, gso.csum_start) + 
tun16_to_cpu(flags, gso.csum_offset) + 2, hdr_len);
+   gso.hdr_len = cpu_to_tun16(flags, hdr_len);
}
 
if (hdr_len > len)
@@ -1857,7 +1857,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
}
}
 
-   if (virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun))) {
+   if (virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun->flags))) 
{
atomic_long_inc(&tun->rx_frame_errors);
err = -EINVAL;
goto free_skb;
@@ -2111,23 +2111,24 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 
if (vnet_hdr_sz) {
struct virtio_net_hdr gso;
+   int flags = tun->flags;
 
if (iov_iter_count(iter) < vnet_hdr_sz)
return -EINVAL;
 
if (virtio_net_hdr_from_skb(skb, &gso,
-   tun_is_little_endian(tun), true,
+   tun_is_little_endian(flag

[PATCH net-next v6 6/7] tap: Keep hdr_len in tap_get_user()

2025-02-06 Thread Akihiko Odaki
hdr_len is repeatedly used so keep it in a local variable.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tap.c | 28 ++++++++++------------------
 1 file changed, 10 insertions(+), 18 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 
5aa41d5f7765a6dcf185bccd3cba2299bad89398..8cb002616a6143b54258b65b483fed0c3af2c7a0
 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -645,6 +645,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
int vnet_hdr_len = 0;
+   int hdr_len = 0;
int copylen = 0;
int depth;
bool zerocopy = false;
@@ -663,13 +664,13 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
if (!copy_from_iter_full(&vnet_hdr, sizeof(vnet_hdr), from))
goto err;
iov_iter_advance(from, vnet_hdr_len - sizeof(vnet_hdr));
-   if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-tap16_to_cpu(q, vnet_hdr.csum_start) +
-tap16_to_cpu(q, vnet_hdr.csum_offset) + 2 >
-tap16_to_cpu(q, vnet_hdr.hdr_len))
-   vnet_hdr.hdr_len = cpu_to_tap16(q,
-tap16_to_cpu(q, vnet_hdr.csum_start) +
-tap16_to_cpu(q, vnet_hdr.csum_offset) + 2);
+   hdr_len = tap16_to_cpu(q, vnet_hdr.hdr_len);
+   if (vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+   hdr_len = max(tap16_to_cpu(q, vnet_hdr.csum_start) +
+ tap16_to_cpu(q, vnet_hdr.csum_offset) + 2,
+ hdr_len);
+   vnet_hdr.hdr_len = cpu_to_tap16(q, hdr_len);
+   }
err = -EINVAL;
if (tap16_to_cpu(q, vnet_hdr.hdr_len) > len)
goto err;
@@ -682,12 +683,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
if (msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY)) {
struct iov_iter i;
 
-   copylen = vnet_hdr.hdr_len ?
-   tap16_to_cpu(q, vnet_hdr.hdr_len) : GOODCOPY_LEN;
-   if (copylen > good_linear)
-   copylen = good_linear;
-   else if (copylen < ETH_HLEN)
-   copylen = ETH_HLEN;
+   copylen = clamp(hdr_len ?: GOODCOPY_LEN, ETH_HLEN, good_linear);
linear = copylen;
i = *from;
iov_iter_advance(&i, copylen);
@@ -697,11 +693,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
 
if (!zerocopy) {
copylen = len;
-   linear = tap16_to_cpu(q, vnet_hdr.hdr_len);
-   if (linear > good_linear)
-   linear = good_linear;
-   else if (linear < ETH_HLEN)
-   linear = ETH_HLEN;
+   linear = clamp(hdr_len, ETH_HLEN, good_linear);
}
 
skb = tap_alloc_skb(&q->sk, TAP_RESERVE, copylen,

-- 
2.48.1
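
Note: the length handling in the patch above can be sketched in user space
as follows. This is only an illustration under stated assumptions: all
values are taken to be in CPU byte order already, clamp_int() stands in
for the kernel's clamp() and is equivalent only when lo <= hi, and the
helper names (effective_hdr_len, zerocopy_copylen, linear_len) are
hypothetical and do not appear in tap.c.

```c
#include <stdint.h>

#define ETH_HLEN 14      /* Ethernet header length */
#define GOODCOPY_LEN 128 /* fallback copy length used when hdr_len is 0 */
#define NEEDS_CSUM 1     /* mirrors VIRTIO_NET_HDR_F_NEEDS_CSUM */

/* Stand-in for the kernel's clamp(); assumes lo <= hi. */
static int clamp_int(int val, int lo, int hi)
{
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/* Header length after accounting for the checksum field location. */
static int effective_hdr_len(uint8_t flags, uint16_t hdr_len,
			     uint16_t csum_start, uint16_t csum_offset)
{
	int len = hdr_len;

	if (flags & NEEDS_CSUM) {
		/* The 16-bit checksum field must lie inside the header. */
		int csum_end = csum_start + csum_offset + 2;

		if (csum_end > len)
			len = csum_end;
	}

	return len;
}

/* Bytes to copy in the zerocopy path. */
static int zerocopy_copylen(int hdr_len, int good_linear)
{
	return clamp_int(hdr_len ? hdr_len : GOODCOPY_LEN, ETH_HLEN,
			 good_linear);
}

/* Linear part of the skb in the non-zerocopy path. */
static int linear_len(int hdr_len, int good_linear)
{
	return clamp_int(hdr_len, ETH_HLEN, good_linear);
}
```

Computing the header length once and feeding it to both paths is what
lets the patch drop the repeated tap16_to_cpu() conversions.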




[PATCH v6 0/6] tun: Introduce virtio-net hashing feature

2025-01-08 Thread Akihiko Odaki
This series depends on: "[PATCH v2 0/3] tun: Unify vnet implementation
and fill full vnet header"
https://lore.kernel.org/r/20250109-tun-v2-0-388d7d5a2...@daynix.com

virtio-net has two uses for hashes: one is RSS and the other is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Introduce code to compute hashes in the kernel in order to overcome
these challenges.

An alternative solution is to extend the eBPF steering program so that it
can report to userspace, but it is based on context rewrites, which are
in feature freeze. We can adopt kfuncs, but they will not be UAPIs. We
opt for ioctl to align with other relevant UAPIs (KVM and vhost_net).

The patches for QEMU to use this new feature were submitted as RFC and
are available at:
https://patchew.org/QEMU/20240915-hash-v3-0-79cb08d28...@daynix.com/

This work was presented at LPC 2024:
https://lpc.events/event/18/contributions/1963/

V1 -> V2:
  Changed to introduce a new BPF program type.

Signed-off-by: Akihiko Odaki 
---
Changes in v6:
- Extracted changes to fill vnet header holes into another series.
- Squashed patches "skbuff: Introduce SKB_EXT_TUN_VNET_HASH", "tun:
  Introduce virtio-net hash reporting feature", and "tun: Introduce
  virtio-net RSS" into patch "tun: Introduce virtio-net hash feature".
- Dropped the RFC tag.
- Link to v5: 
https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df0...@daynix.com

Changes in v5:
- Fixed a compilation error with CONFIG_TUN_VNET_CROSS_LE.
- Optimized the calculation of the hash value according to:
  https://git.dpdk.org/dpdk/commit/?id=3fb1ea032bd6ff8317af5dac9af901f1f324cab4
- Added patch "tun: Unify vnet implementation".
- Dropped patch "tap: Pad virtio header with zero".
- Added patch "selftest: tun: Test vnet ioctls without device".
- Reworked selftests to skip for older kernels.
- Documented the case when the underlying device is deleted and packets
  have queue_mapping set by TC.
- Reordered test harness arguments.
- Added code to handle fragmented packets.
- Link to v4: 
https://lore.kernel.org/r/20240924-rss-v4-0-84e932ec0...@daynix.com

Changes in v4:
- Moved tun_vnet_hash_ext to if_tun.h.
- Renamed virtio_net_toeplitz() to virtio_net_toeplitz_calc().
- Replaced htons() with cpu_to_be16().
- Changed virtio_net_hash_rss() to return void.
- Reordered variable declarations in virtio_net_hash_rss().
- Removed virtio_net_hdr_v1_hash_from_skb().
- Updated messages of "tap: Pad virtio header with zero" and
  "tun: Pad virtio header with zero".
- Fixed vnet_hash allocation size.
- Ensured to free vnet_hash when destructing tun_struct.
- Link to v3: 
https://lore.kernel.org/r/20240915-rss-v3-0-c630015db...@daynix.com

Changes in v3:
- Reverted back to add ioctl.
- Split patch "tun: Introduce virtio-net hashing feature" into
  "tun: Introduce virtio-net hash reporting feature" and
  "tun: Introduce virtio-net RSS".
- Changed to reuse hash values computed for automq instead of performing
  RSS hashing when hash reporting is requested but RSS is not.
- Extracted relevant data from struct tun_struct to keep it minimal.
- Added kernel-doc.
- Changed to allow calling TUNGETVNETHASHCAP before TUNSETIFF.
- Initialized num_buffers with 1.
- Added a test case for unclassified packets.
- Fixed error handling in tests.
- Changed tests to verify that the queue index will not overflow.
- Rebased.
- Link to v2: 
https://lore.kernel.org/r/20231015141644.260646-1-akihiko.od...@daynix.com

---
Akihiko Odaki (6):
  virtio_net: Add functions for hashing
  net: flow_dissector: Export flow_keys_dissector_symmetric
  tun: Introduce virtio-net hash feature
  selftest: tun: Test vnet ioctls without device
  selftest: tun: Add tests for virtio-net hashing
  vhost/net: Support VIRTIO_NET_F_HASH_REPORT

 Documentation/networking/tuntap.rst  |   7 +
 drivers/net/Kconfig  |   1 +
 drivers/net/tap.c|  50 ++-
 drivers/net/tun.c|  93 --
 drivers/net/tun_vnet.c   | 167 +-
 drivers/net/tun_vnet.h   |  33 +-
 drivers/vhost/net.c  |  16 +-
 include/linux/if_tap.h   |   2 +
 include/linux/skbuff.h   |   3 +
 include/linux/virtio_net.h   | 188 +++
 include/net/flow_dissector.h |   1 +
 include/uapi/linux/if_tun.h  |  75 +
 net/core/flow_dissector.c|   3 +-
 net/core/skbuff.c|   4 +
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/tun.c| 630 ++-
 16

[PATCH v6 1/6] virtio_net: Add functions for hashing

2025-01-08 Thread Akihiko Odaki
They are useful to implement VIRTIO_NET_F_RSS and
VIRTIO_NET_F_HASH_REPORT.

Signed-off-by: Akihiko Odaki 
---
 include/linux/virtio_net.h | 188 +
 1 file changed, 188 insertions(+)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 02a9f4dc594d..3b25ca75710b 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -9,6 +9,194 @@
 #include 
 #include 
 
+struct virtio_net_hash {
+   u32 value;
+   u16 report;
+};
+
+struct virtio_net_toeplitz_state {
+   u32 hash;
+   const u32 *key;
+};
+
+#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
+
+#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
+
+static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
+{
+   while (len >= sizeof(*input)) {
+   *input = be32_to_cpu((__force __be32)*input);
+   input++;
+   len -= sizeof(*input);
+   }
+}
+
+static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state 
*state,
+   const __be32 *input, size_t len)
+{
+   while (len >= sizeof(*input)) {
+   for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
+   u32 i = ffs(map);
+
+   state->hash ^= state->key[0] << (32 - i) |
+  (u32)((u64)state->key[1] >> i);
+   }
+
+   state->key++;
+   input++;
+   len -= sizeof(*input);
+   }
+}
+
+static inline u8 virtio_net_hash_key_length(u32 types)
+{
+   size_t len = 0;
+
+   if (types & VIRTIO_NET_HASH_REPORT_IPv4)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs));
+
+   if (types &
+   (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   if (types & VIRTIO_NET_HASH_REPORT_IPv6)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs));
+
+   if (types &
+   (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   return len + 4;
+}
+
+static inline u32 virtio_net_hash_report(u32 types,
+const struct flow_keys_basic *keys)
+{
+   switch (keys->basic.n_proto) {
+   case cpu_to_be16(ETH_P_IP):
+   if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
+   if (keys->basic.ip_proto == IPPROTO_TCP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv4))
+   return VIRTIO_NET_HASH_REPORT_TCPv4;
+
+   if (keys->basic.ip_proto == IPPROTO_UDP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv4))
+   return VIRTIO_NET_HASH_REPORT_UDPv4;
+   }
+
+   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv4)
+   return VIRTIO_NET_HASH_REPORT_IPv4;
+
+   return VIRTIO_NET_HASH_REPORT_NONE;
+
+   case cpu_to_be16(ETH_P_IPV6):
+   if (!(keys->control.flags & FLOW_DIS_IS_FRAGMENT)) {
+   if (keys->basic.ip_proto == IPPROTO_TCP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_TCPv6))
+   return VIRTIO_NET_HASH_REPORT_TCPv6;
+
+   if (keys->basic.ip_proto == IPPROTO_UDP &&
+   (types & VIRTIO_NET_RSS_HASH_TYPE_UDPv6))
+   return VIRTIO_NET_HASH_REPORT_UDPv6;
+   }
+
+   if (types & VIRTIO_NET_RSS_HASH_TYPE_IPv6)
+   return VIRTIO_NET_HASH_REPORT_IPv6;
+
+   return VIRTIO_NET_HASH_REPORT_NONE;
+
+   default:
+   return VIRTIO_NET_HASH_REPORT_NONE;
+   }
+}
+
+static inline void virtio_net_hash_rss(const struct sk_buff *skb,
+  u32 types, const u32 *key,
+  struct virtio_net_hash *hash)
+{
+   struct virtio_net_toeplitz_state toep
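
Note: for readers unfamiliar with the Toeplitz hash used by RSS, here is a
minimal user-space sketch of the textbook bit-by-bit form. It is not the
loop from the patch above, which walks only the set bits of each 32-bit
input word against a key pre-converted to CPU byte order; the function
name toeplitz_hash and its signature are assumptions made for this
illustration. The key is passed as raw bytes (40 bytes at most in the
patch, per VIRTIO_NET_RSS_MAX_KEY_SIZE).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Textbook Toeplitz hash: for every set bit of the input (most
 * significant bit first), XOR in the 32-bit window of the key that
 * starts at the same bit position.
 */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
			      const uint8_t *data, size_t data_len)
{
	uint32_t hash = 0;
	uint32_t window;
	size_t next_key_bit = 32;

	if (key_len < 4)
		return 0;

	/* The window starts as the first 32 bits of the key. */
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
		 ((uint32_t)key[2] << 8) | key[3];

	for (size_t i = 0; i < data_len; i++) {
		for (int bit = 7; bit >= 0; bit--) {
			uint32_t in = 0;

			if (data[i] & (1u << bit))
				hash ^= window;

			/* Slide the window by one bit, padding with zero
			 * once the key is exhausted. */
			if (next_key_bit < key_len * 8)
				in = (key[next_key_bit / 8] >>
				      (7 - next_key_bit % 8)) & 1;

			window = (window << 1) | in;
			next_key_bit++;
		}
	}

	return hash;
}
```

For a TCP/IPv4 packet the input would be the concatenation of source
address, destination address, source port, and destination port in
network byte order; hashing only the first 8 bytes gives the address-only
variant.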

[PATCH v6 2/6] net: flow_dissector: Export flow_keys_dissector_symmetric

2025-01-08 Thread Akihiko Odaki
flow_keys_dissector_symmetric is useful to derive a symmetric hash
and to know its source such as IPv4, IPv6, TCP, and UDP.

Signed-off-by: Akihiko Odaki 
---
 include/net/flow_dissector.h | 1 +
 net/core/flow_dissector.c| 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index ced79dc8e856..d01c1ec77b7d 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -423,6 +423,7 @@ __be32 flow_get_u32_src(const struct flow_keys *flow);
 __be32 flow_get_u32_dst(const struct flow_keys *flow);
 
 extern struct flow_dissector flow_keys_dissector;
+extern struct flow_dissector flow_keys_dissector_symmetric;
 extern struct flow_dissector flow_keys_basic_dissector;
 
 /* struct flow_keys_digest:
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 0e638a37aa09..9822988f2d49 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1852,7 +1852,8 @@ void make_flow_keys_digest(struct flow_keys_digest 
*digest,
 }
 EXPORT_SYMBOL(make_flow_keys_digest);
 
-static struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
+struct flow_dissector flow_keys_dissector_symmetric __read_mostly;
+EXPORT_SYMBOL(flow_keys_dissector_symmetric);
 
 u32 __skb_get_hash_symmetric_net(const struct net *net, const struct sk_buff 
*skb)
 {

-- 
2.47.1




[PATCH v6 3/6] tun: Introduce virtio-net hash feature

2025-01-08 Thread Akihiko Odaki
Hash reporting
--------------

Allow the guest to reuse the hash value to make receive steering
consistent between the host and guest, and to save hash computation.

RSS
---

RSS is a receive steering algorithm that can be negotiated to use with
virtio_net. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use an eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF steering programs.

Introduce code to perform RSS in the kernel in order to overcome
these challenges. An alternative solution is to extend the eBPF steering
program so that it can report to userspace, but I didn't opt for it:
extending the current eBPF steering mechanism as is would rely on legacy
context rewriting, and introducing kfunc-based eBPF would result in a
non-UAPI dependency, while the other relevant virtualization APIs such
as KVM and vhost_net are UAPIs.

Signed-off-by: Akihiko Odaki 
---
 Documentation/networking/tuntap.rst |   7 ++
 drivers/net/Kconfig |   1 +
 drivers/net/tap.c   |  50 ++-
 drivers/net/tun.c   |  93 +++-
 drivers/net/tun_vnet.c  | 167 +---
 drivers/net/tun_vnet.h  |  33 ++-
 include/linux/if_tap.h  |   2 +
 include/linux/skbuff.h  |   3 +
 include/uapi/linux/if_tun.h |  75 
 net/core/skbuff.c   |   4 +
 10 files changed, 397 insertions(+), 38 deletions(-)

diff --git a/Documentation/networking/tuntap.rst 
b/Documentation/networking/tuntap.rst
index 4d7087f727be..86b4ae8caa8a 100644
--- a/Documentation/networking/tuntap.rst
+++ b/Documentation/networking/tuntap.rst
@@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
   return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
   }
 
+3.4 Reference
+-------------
+
+``linux/if_tun.h`` defines the interface described below:
+
+.. kernel-doc:: include/uapi/linux/if_tun.h
+
 Universal TUN/TAP device driver Frequently Asked Question
 =========================================================
 
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 255c8f9f1d7c..f7b0d9a89a71 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
tristate "Universal TUN/TAP device driver support"
depends on INET
select CRC32
+   select SKB_EXTENSIONS
select TUN_VNET
help
  TUN/TAP provides packet reception and transmission for user space
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index fe9554ee5b8b..27659df1f96e 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -179,6 +179,16 @@ static void tap_put_queue(struct tap_queue *q)
sock_put(&q->sk);
 }
 
+static struct virtio_net_hash *tap_add_hash(struct sk_buff *skb)
+{
+   return (struct virtio_net_hash *)skb->cb;
+}
+
+static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
+{
+   return (const struct virtio_net_hash *)skb->cb;
+}
+
 /*
  * Select a queue based on the rxq of the device on which this packet
  * arrived. If the incoming device is not mq, calculate a flow hash
@@ -189,6 +199,7 @@ static void tap_put_queue(struct tap_queue *q)
 static struct tap_queue *tap_get_queue(struct tap_dev *tap,
   struct sk_buff *skb)
 {
+   struct flow_keys_basic keys_basic;
struct tap_queue *queue = NULL;
/* Access to taps array is protected by rcu, but access to numvtaps
 * isn't. Below we use it to lookup a queue, but treat it as a hint
@@ -196,17 +207,41 @@ static struct tap_queue *tap_get_queue(struct tap_dev 
*tap,
 * racing against queue removal.
 */
int numvtaps = READ_ONCE(tap->numvtaps);
+   struct tun_vnet_hash_container *vnet_hash = 
rcu_dereference(tap->vnet_hash);
__u32 rxq;
 
+   *tap_add_hash(skb) = (struct virtio_net_hash) { .report = 
VIRTIO_NET_HASH_REPORT_NONE };
+
if (!numvtaps)
goto out;
 
if (numvtaps == 1)
goto single;
 
+   if (vnet_hash && (vnet_hash->common.flags & TUN_VNET_HASH_RSS)) {
+   rxq = tun_vnet_rss_select_queue(numvtaps, vnet_hash, skb, 
tap_add_hash);
+   queue = rcu_dereference(tap->taps[rxq]);
+   goto out;
+   }
+
+   if (!skb->l4_hash && !skb->sw_hash) {
+   struct flow_keys keys;
+
+   skb_flow_dissect_flow_keys(skb, &keys, 
FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+   rxq = flow_hash_from_keys(&keys);
+   keys_basic = (struct flow_keys_basic) {
+   .control = keys.con
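
Note: the RSS branch above delegates to tun_vnet_rss_select_queue(), whose
body is not shown in this excerpt. The sketch below illustrates the queue
selection rule as the virtio specification describes it (indirection
table lookup, with a fallback queue for packets that produced no hash).
The struct rss_config layout, the function name rss_select_queue, and the
final modulo that keeps the index within the number of attached queues
are all assumptions made for this illustration, not the patch's code.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_REPORT_NONE 0 /* mirrors VIRTIO_NET_HASH_REPORT_NONE */

/* Hypothetical configuration; field names follow the virtio spec. */
struct rss_config {
	uint16_t indirection_table_mask; /* table length - 1 (power of two) */
	uint16_t unclassified_queue;     /* used when no hash was computed */
	const uint16_t *indirection_table;
};

/*
 * Pick a receive queue for a packet.
 * report == HASH_REPORT_NONE means the packet matched none of the
 * enabled hash types, so it goes to the unclassified queue.
 */
static uint16_t rss_select_queue(const struct rss_config *cfg, uint32_t hash,
				 uint16_t report, uint16_t num_queues)
{
	uint16_t queue;

	if (!num_queues)
		return 0;

	if (report == HASH_REPORT_NONE)
		queue = cfg->unclassified_queue;
	else
		queue = cfg->indirection_table[hash &
					       cfg->indirection_table_mask];

	/* Assumption: bound the result so it never exceeds the number of
	 * queues currently attached. */
	return queue % num_queues;
}
```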

[PATCH v6 4/6] selftest: tun: Test vnet ioctls without device

2025-01-08 Thread Akihiko Odaki
Ensure that vnet ioctls result in EBADFD when the underlying device is
deleted.

Signed-off-by: Akihiko Odaki 
---
 tools/testing/selftests/net/tun.c | 74 +++
 1 file changed, 74 insertions(+)

diff --git a/tools/testing/selftests/net/tun.c 
b/tools/testing/selftests/net/tun.c
index fa83918b62d1..463dd98f2b80 100644
--- a/tools/testing/selftests/net/tun.c
+++ b/tools/testing/selftests/net/tun.c
@@ -159,4 +159,78 @@ TEST_F(tun, reattach_close_delete) {
EXPECT_EQ(tun_delete(self->ifname), 0);
 }
 
+FIXTURE(tun_deleted)
+{
+   char ifname[IFNAMSIZ];
+   int fd;
+};
+
+FIXTURE_SETUP(tun_deleted)
+{
+   self->ifname[0] = 0;
+   self->fd = tun_alloc(self->ifname);
+   ASSERT_LE(0, self->fd);
+
+   ASSERT_EQ(0, tun_delete(self->ifname))
+           EXPECT_EQ(0, close(self->fd));
+}
+
+FIXTURE_TEARDOWN(tun_deleted)
+{
+   EXPECT_EQ(0, close(self->fd));
+}
+
+TEST_F(tun_deleted, getvnethdrsz)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNGETVNETHDRSZ));
+   EXPECT_EQ(EBADFD, errno);
+}
+
+TEST_F(tun_deleted, setvnethdrsz)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNSETVNETHDRSZ));
+   EXPECT_EQ(EBADFD, errno);
+}
+
+TEST_F(tun_deleted, getvnetle)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNGETVNETLE));
+   EXPECT_EQ(EBADFD, errno);
+}
+
+TEST_F(tun_deleted, setvnetle)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNSETVNETLE));
+   EXPECT_EQ(EBADFD, errno);
+}
+
+TEST_F(tun_deleted, getvnetbe)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNGETVNETBE));
+   EXPECT_EQ(EBADFD, errno);
+}
+
+TEST_F(tun_deleted, setvnetbe)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNSETVNETBE));
+   EXPECT_EQ(EBADFD, errno);
+}
+
+TEST_F(tun_deleted, getvnethashcap)
+{
+   struct tun_vnet_hash cap;
+   int i = ioctl(self->fd, TUNGETVNETHASHCAP, &cap);
+
+   if (i == -1 && errno == EBADFD)
+   SKIP(return, "TUNGETVNETHASHCAP not supported");
+
+   EXPECT_EQ(0, i);
+}
+
+TEST_F(tun_deleted, setvnethash)
+{
+   ASSERT_EQ(-1, ioctl(self->fd, TUNSETVNETHASH));
+   EXPECT_EQ(EBADFD, errno);
+}
+
 TEST_HARNESS_MAIN

-- 
2.47.1




[PATCH v6 6/6] vhost/net: Support VIRTIO_NET_F_HASH_REPORT

2025-01-08 Thread Akihiko Odaki
VIRTIO_NET_F_HASH_REPORT allows reporting hash values calculated on the
host. When VHOST_NET_F_VIRTIO_NET_HDR is employed, it will report no
hash values (i.e., the hash_report member is always set to
VIRTIO_NET_HASH_REPORT_NONE). Otherwise, the values reported by the
underlying socket will be reported.

VIRTIO_NET_F_HASH_REPORT requires VIRTIO_F_VERSION_1.

Signed-off-by: Akihiko Odaki 
---
 drivers/vhost/net.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9ad37c012189..ed1bf01a7fcf 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -73,6 +73,7 @@ enum {
VHOST_NET_FEATURES = VHOST_FEATURES |
 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+(1ULL << VIRTIO_NET_F_HASH_REPORT) |
 (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
 (1ULL << VIRTIO_F_RING_RESET)
 };
@@ -1604,10 +1605,13 @@ static int vhost_net_set_features(struct vhost_net *n, 
u64 features)
size_t vhost_hlen, sock_hlen, hdr_len;
int i;
 
-   hdr_len = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
-  (1ULL << VIRTIO_F_VERSION_1))) ?
-   sizeof(struct virtio_net_hdr_mrg_rxbuf) :
-   sizeof(struct virtio_net_hdr);
+   if (features & (1ULL << VIRTIO_NET_F_HASH_REPORT))
+   hdr_len = sizeof(struct virtio_net_hdr_v1_hash);
+   else if (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+(1ULL << VIRTIO_F_VERSION_1)))
+   hdr_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+   else
+   hdr_len = sizeof(struct virtio_net_hdr);
if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
/* vhost provides vnet_hdr */
vhost_hlen = hdr_len;
@@ -1688,6 +1692,10 @@ static long vhost_net_ioctl(struct file *f, unsigned int 
ioctl,
return -EFAULT;
if (features & ~VHOST_NET_FEATURES)
return -EOPNOTSUPP;
+   if ((features & ((1ULL << VIRTIO_F_VERSION_1) |
+(1ULL << VIRTIO_NET_F_HASH_REPORT))) ==
+   (1ULL << VIRTIO_NET_F_HASH_REPORT))
+   return -EINVAL;
return vhost_net_set_features(n, features);
case VHOST_GET_BACKEND_FEATURES:
features = VHOST_NET_BACKEND_FEATURES;

-- 
2.47.1
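
Note: the two checks added by the patch above can be restated as a small
user-space sketch. The feature bit numbers and header sizes below are the
ones defined by the virtio specification (10 bytes for struct
virtio_net_hdr, 12 for struct virtio_net_hdr_mrg_rxbuf, 20 for struct
virtio_net_hdr_v1_hash); the function names are hypothetical and only
mirror the logic of the diff.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_NET_F_MRG_RXBUF   15
#define VIRTIO_F_VERSION_1       32
#define VIRTIO_NET_F_HASH_REPORT 57

enum {
	HDR_LEN_LEGACY    = 10, /* struct virtio_net_hdr */
	HDR_LEN_MRG_RXBUF = 12, /* struct virtio_net_hdr_mrg_rxbuf */
	HDR_LEN_V1_HASH   = 20, /* struct virtio_net_hdr_v1_hash */
};

/* HASH_REPORT is only acceptable together with VERSION_1. */
static bool hash_report_features_valid(uint64_t features)
{
	const uint64_t v1 = 1ULL << VIRTIO_F_VERSION_1;
	const uint64_t hash = 1ULL << VIRTIO_NET_F_HASH_REPORT;

	return (features & (v1 | hash)) != hash;
}

/* Header length implied by the negotiated features. */
static size_t vnet_hdr_len(uint64_t features)
{
	if (features & (1ULL << VIRTIO_NET_F_HASH_REPORT))
		return HDR_LEN_V1_HASH;

	if (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) |
			(1ULL << VIRTIO_F_VERSION_1)))
		return HDR_LEN_MRG_RXBUF;

	return HDR_LEN_LEGACY;
}
```

The validity check corresponds to the -EINVAL branch in
vhost_net_ioctl(), and the length selection to the new if/else chain in
vhost_net_set_features().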




[PATCH v6 5/6] selftest: tun: Add tests for virtio-net hashing

2025-01-08 Thread Akihiko Odaki
The added tests confirm tun can perform RSS and hash reporting, and
reject invalid configurations for them.

Signed-off-by: Akihiko Odaki 
---
 tools/testing/selftests/net/Makefile |   2 +-
 tools/testing/selftests/net/tun.c| 558 ++-
 2 files changed, 551 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index cb2fc601de66..92762ce3ebd4 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -121,6 +121,6 @@ $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma
 $(OUTPUT)/tcp_mmap: LDLIBS += -lpthread -lcrypto
 $(OUTPUT)/tcp_inq: LDLIBS += -lpthread
 $(OUTPUT)/bind_bhash: LDLIBS += -lpthread
-$(OUTPUT)/io_uring_zerocopy_tx: CFLAGS += -I../../../include/
+$(OUTPUT)/io_uring_zerocopy_tx $(OUTPUT)/tun: CFLAGS += -I../../../include/
 
 include bpf.mk
diff --git a/tools/testing/selftests/net/tun.c 
b/tools/testing/selftests/net/tun.c
index 463dd98f2b80..9424d897e341 100644
--- a/tools/testing/selftests/net/tun.c
+++ b/tools/testing/selftests/net/tun.c
@@ -2,21 +2,37 @@
 
 #define _GNU_SOURCE
 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
-#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
 #include 
+#include 
 #include 
 #include 
-#include 
-#include 
+#include 
+#include 
+#include 
+#include 
 
 #include "../kselftest_harness.h"
 
+#define TUN_HWADDR_SOURCE { 0x02, 0x00, 0x00, 0x00, 0x00, 0x00 }
+#define TUN_HWADDR_DEST { 0x02, 0x00, 0x00, 0x00, 0x00, 0x01 }
+#define TUN_IPADDR_SOURCE htonl((172 << 24) | (17 << 16) | 0)
+#define TUN_IPADDR_DEST htonl((172 << 24) | (17 << 16) | 1)
+
 static int tun_attach(int fd, char *dev)
 {
struct ifreq ifr;
@@ -39,7 +55,7 @@ static int tun_detach(int fd, char *dev)
return ioctl(fd, TUNSETQUEUE, (void *) &ifr);
 }
 
-static int tun_alloc(char *dev)
+static int tun_alloc(char *dev, short flags)
 {
struct ifreq ifr;
int fd, err;
@@ -52,7 +68,8 @@ static int tun_alloc(char *dev)
 
memset(&ifr, 0, sizeof(ifr));
strcpy(ifr.ifr_name, dev);
-   ifr.ifr_flags = IFF_TAP | IFF_NAPI | IFF_MULTI_QUEUE;
+   ifr.ifr_flags = flags | IFF_TAP | IFF_NAPI | IFF_NO_PI |
+   IFF_MULTI_QUEUE;
 
err = ioctl(fd, TUNSETIFF, (void *) &ifr);
if (err < 0) {
@@ -64,6 +81,40 @@ static int tun_alloc(char *dev)
return fd;
 }
 
+static bool tun_add_to_bridge(int local_fd, const char *name)
+{
+   struct ifreq ifreq = {
+   .ifr_name = "xbridge",
+   .ifr_ifindex = if_nametoindex(name)
+   };
+
+   if (!ifreq.ifr_ifindex) {
+   perror("if_nametoindex");
+   return false;
+   }
+
+   if (ioctl(local_fd, SIOCBRADDIF, &ifreq)) {
+   perror("SIOCBRADDIF");
+   return false;
+   }
+
+   return true;
+}
+
+static bool tun_set_flags(int local_fd, const char *name, short flags)
+{
+   struct ifreq ifreq = { .ifr_flags = flags };
+
+   strcpy(ifreq.ifr_name, name);
+
+   if (ioctl(local_fd, SIOCSIFFLAGS, &ifreq)) {
+   perror("SIOCSIFFLAGS");
+   return false;
+   }
+
+   return true;
+}
+
 static int tun_delete(char *dev)
 {
struct {
@@ -102,6 +153,159 @@ static int tun_delete(char *dev)
return ret;
 }
 
+static uint32_t tun_sum(const void *buf, size_t len)
+{
+   const uint16_t *sbuf = buf;
+   uint32_t sum = 0;
+
+   while (len > 1) {
+   sum += *sbuf++;
+   len -= 2;
+   }
+
+   if (len)
+   sum += *(uint8_t *)sbuf;
+
+   return sum;
+}
+
+static uint16_t tun_build_ip_check(uint32_t sum)
+{
+   return ~((sum & 0xffff) + (sum >> 16));
+}
+
+static uint32_t tun_build_ip_pseudo_sum(const void *iphdr)
+{
+   uint16_t tot_len = ntohs(((struct iphdr *)iphdr)->tot_len);
+
+   return tun_sum((char *)iphdr + offsetof(struct iphdr, saddr), 8) +
+  htons(((struct iphdr *)iphdr)->protocol) +
+  htons(tot_len - sizeof(struct iphdr));
+}
+
+static uint32_t tun_build_ipv6_pseudo_sum(const void *ipv6hdr)
+{
+   return tun_sum((char *)ipv6hdr + offsetof(struct ipv6hdr, saddr), 32) +
+  ((struct ipv6hdr *)ipv6hdr)->payload_len +
+  htons(((struct ipv6hdr *)ipv6hdr)->nexthdr);
+}
+
+static void tun_build_ethhdr(struct ethhdr *ethhdr, uint16_t proto)
+{
+   *ethhdr = (struct ethhdr) {
+   .h_dest = TUN_HWADDR_DEST,
+   .h_source = TUN_HWADDR_SOURCE,
+   .h_proto = htons(proto)
+   };
+}
+
+static void tun_build_iphdr(void *dest, uint16_t len, uint8_t protocol)
+{
+   struct iphdr iphdr = {
+   .ihl = sizeof(iphdr) / 4,
+   .version = 
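
Note: the tun_sum() and tun_build_ip_check() helpers above implement the
Internet checksum used by IPv4, TCP, and UDP. A self-contained sketch of
the same idea follows. It differs from the test code in two labeled ways:
it reads the buffer as big-endian 16-bit words so the result does not
depend on host byte order, and it folds carries in a loop rather than
once. The name inet_checksum is an assumption made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * One's-complement checksum over buf, returned in host byte order.
 * A trailing odd byte is treated as the high byte of a zero-padded word.
 */
static uint16_t inet_checksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	while (len > 1) {
		sum += ((uint32_t)buf[0] << 8) | buf[1];
		buf += 2;
		len -= 2;
	}

	if (len)
		sum += (uint32_t)buf[0] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Running it over a header whose checksum field is zero yields the value to
store in that field; running it over a header with a correct checksum
yields 0.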

Re: [PATCH v2 2/3] tun: Pad virtio header with zero

2025-01-08 Thread Akihiko Odaki

On 2025/01/09 16:31, Michael S. Tsirkin wrote:

On Thu, Jan 09, 2025 at 03:58:44PM +0900, Akihiko Odaki wrote:

tun used to simply advance iov_iter when it needs to pad virtio header,
which leaves the garbage in the buffer as is. This is especially
problematic when tun starts to allow enabling the hash reporting
feature; even if the feature is enabled, the packet may lack a hash
value and may contain a hole in the virtio header because the packet
arrived before the feature gets enabled or does not contain the
header fields to be hashed. If the hole is not filled with zero, it is
impossible to tell if the packet lacks a hash value.

In theory, a user of tun can fill the buffer with zero before calling
read() to avoid such a problem, but leaving the garbage in the buffer is
awkward anyway so fill the buffer in tun.

Signed-off-by: Akihiko Odaki 


But if the user did it, you have just overwritten his value,
did you not?


Yes, but that means the user expects some part of the buffer to be left
unfilled after read() or recvmsg(). I'm a bit worried that not filling the
buffer may break assumptions that others (especially the filesystem and
socket infrastructures in the kernel) may have.


If we are really confident that it will not cause problems, this 
behavior can be opt-in based on a flag or we can just write some 
documentation warning userspace programmers to initialize the buffer.





---
  drivers/net/tun_vnet.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun_vnet.c b/drivers/net/tun_vnet.c
index fe842df9e9ef..ffb2186facd3 100644
--- a/drivers/net/tun_vnet.c
+++ b/drivers/net/tun_vnet.c
@@ -138,7 +138,8 @@ int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
if (copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr))
return -EFAULT;
  
-	iov_iter_advance(iter, sz - sizeof(*hdr));

+   if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
+   return -EFAULT;
  
  	return 0;

  }

--
2.47.1
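
Note: the difference between advancing past the padding and zeroing it can
be shown with a small user-space model of the quoted change. This is only
a sketch: put_vnet_hdr is a hypothetical stand-in for tun_vnet_hdr_put(),
plain memory copies replace the iov_iter calls, and BASE_HDR_LEN assumes
the 10-byte struct virtio_net_hdr.

```c
#include <stddef.h>
#include <string.h>

#define BASE_HDR_LEN 10 /* sizeof(struct virtio_net_hdr) */

/*
 * Write a base header into buf and zero the rest of the sz bytes
 * reserved for the virtio header. Returns 0 on success, -1 if the
 * reserved size or the buffer is too small.
 */
static int put_vnet_hdr(unsigned char *buf, size_t buf_len, size_t sz,
			const unsigned char hdr[BASE_HDR_LEN])
{
	if (sz < BASE_HDR_LEN || buf_len < sz)
		return -1;

	memcpy(buf, hdr, BASE_HDR_LEN);

	/* Fill the hole instead of skipping it, so stale bytes cannot be
	 * mistaken for a hash value or hash report type. */
	memset(buf + BASE_HDR_LEN, 0, sz - BASE_HDR_LEN);

	return 0;
}
```

With a 20-byte reserved header, the 10 bytes after the base header come
back as zero, which a reader can interpret as "no hash reported" without
having to clear the buffer first.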







[PATCH v2 1/3] tun: Unify vnet implementation

2025-01-08 Thread Akihiko Odaki
Both tun and tap expose the same set of virtio-net-related features.
Unify their implementations to ease future changes.

Signed-off-by: Akihiko Odaki 
---
 MAINTAINERS|   1 +
 drivers/net/Kconfig|   5 ++
 drivers/net/Makefile   |   1 +
 drivers/net/tap.c  | 172 ++--
 drivers/net/tun.c  | 208 -
 drivers/net/tun_vnet.c | 186 +++
 drivers/net/tun_vnet.h |  24 ++
 7 files changed, 273 insertions(+), 324 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 910305c11e8a..1be8a452d11f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23903,6 +23903,7 @@ F:  Documentation/networking/tuntap.rst
 F: arch/um/os-Linux/drivers/
 F: drivers/net/tap.c
 F: drivers/net/tun.c
+F: drivers/net/tun_vnet.h
 
 TURBOCHANNEL SUBSYSTEM
 M: "Maciej W. Rozycki" 
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 1fd5acdc73c6..255c8f9f1d7c 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
tristate "Universal TUN/TAP device driver support"
depends on INET
select CRC32
+   select TUN_VNET
help
  TUN/TAP provides packet reception and transmission for user space
  programs.  It can be viewed as a simple Point-to-Point or Ethernet
@@ -417,10 +418,14 @@ config TUN
 
 config TAP
tristate
+   select TUN_VNET
help
  This option is selected by any driver implementing tap user space
  interface for a virtual interface to re-use core tap functionality.
 
+config TUN_VNET
+   tristate
+
 config TUN_VNET_CROSS_LE
bool "Support for cross-endian vnet headers on little-endian kernels"
default n
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 13743d0e83b5..bc1f193eccb1 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -30,6 +30,7 @@ obj-y += pcs/
 obj-$(CONFIG_RIONET) += rionet.o
 obj-$(CONFIG_NET_TEAM) += team/
 obj-$(CONFIG_TUN) += tun.o
+obj-$(CONFIG_TUN_VNET) += tun_vnet.o
 obj-$(CONFIG_TAP) += tap.o
 obj-$(CONFIG_VETH) += veth.o
 obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 5aa41d5f7765..60804855510b 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -26,74 +26,9 @@
 #include 
 #include 
 
-#define TAP_IFFEATURES (IFF_VNET_HDR | IFF_MULTI_QUEUE)
-
-#define TAP_VNET_LE 0x80000000
-#define TAP_VNET_BE 0x40000000
-
-#ifdef CONFIG_TUN_VNET_CROSS_LE
-static inline bool tap_legacy_is_little_endian(struct tap_queue *q)
-{
-   return q->flags & TAP_VNET_BE ? false :
-   virtio_legacy_is_little_endian();
-}
-
-static long tap_get_vnet_be(struct tap_queue *q, int __user *sp)
-{
-   int s = !!(q->flags & TAP_VNET_BE);
-
-   if (put_user(s, sp))
-   return -EFAULT;
-
-   return 0;
-}
-
-static long tap_set_vnet_be(struct tap_queue *q, int __user *sp)
-{
-   int s;
-
-   if (get_user(s, sp))
-   return -EFAULT;
-
-   if (s)
-   q->flags |= TAP_VNET_BE;
-   else
-   q->flags &= ~TAP_VNET_BE;
-
-   return 0;
-}
-#else
-static inline bool tap_legacy_is_little_endian(struct tap_queue *q)
-{
-   return virtio_legacy_is_little_endian();
-}
-
-static long tap_get_vnet_be(struct tap_queue *q, int __user *argp)
-{
-   return -EINVAL;
-}
+#include "tun_vnet.h"
 
-static long tap_set_vnet_be(struct tap_queue *q, int __user *argp)
-{
-   return -EINVAL;
-}
-#endif /* CONFIG_TUN_VNET_CROSS_LE */
-
-static inline bool tap_is_little_endian(struct tap_queue *q)
-{
-   return q->flags & TAP_VNET_LE ||
-   tap_legacy_is_little_endian(q);
-}
-
-static inline u16 tap16_to_cpu(struct tap_queue *q, __virtio16 val)
-{
-   return __virtio16_to_cpu(tap_is_little_endian(q), val);
-}
-
-static inline __virtio16 cpu_to_tap16(struct tap_queue *q, u16 val)
-{
-   return __cpu_to_virtio16(tap_is_little_endian(q), val);
-}
+#define TAP_IFFEATURES (IFF_VNET_HDR | IFF_MULTI_QUEUE)
 
 static struct proto tap_proto = {
.name = "tap",
@@ -641,10 +576,10 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
struct sk_buff *skb;
struct tap_dev *tap;
unsigned long total_len = iov_iter_count(from);
-   unsigned long len = total_len;
+   unsigned long len;
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
-   int vnet_hdr_len = 0;
+   int hdr_len;
int copylen = 0;
int depth;
bool zerocopy = false;
@@ -652,38 +587,20 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
enum skb_drop_reason drop_reason;
 
if (q->flags & IFF_VNET_HDR) {
-   vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
-
-   

[PATCH v2 0/3] tun: Unify vnet implementation and fill full vnet header

2025-01-08 Thread Akihiko Odaki
When I implemented virtio's hash-related features to tun/tap [1],
I found tun/tap does not fill the entire region reserved for the virtio
header, leaving some uninitialized hole in the middle of the buffer
after read()/recvmsg().

This series fills the uninitialized hole. More concretely, the
num_buffers field will be initialized with 1, and the other fields will
be initialized with 0. Setting the num_buffers field to 1 is mandated by
virtio 1.0 [2].
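
As a rough userspace illustration of the behaviour described above (not the
kernel code itself), the sketch below copies as much of the header as fits
into the reserved region and zeroes the rest. The names vnet_hdr_v1,
le16_one() and vnet_hdr_put() are hypothetical stand-ins, and memcpy()/memset()
stand in for copy_to_iter()/iov_iter_zero().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct virtio_net_hdr_v1 (12 bytes). */
struct vnet_hdr_v1 {
	uint8_t flags;
	uint8_t gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
	uint16_t num_buffers;
};

/* Build the 16-bit value 1 in little-endian byte order on any host. */
static uint16_t le16_one(void)
{
	const uint8_t bytes[2] = { 1, 0 };
	uint16_t v;

	memcpy(&v, bytes, sizeof(v));
	return v;
}

/*
 * Fill a header region of sz bytes at the start of buf: copy the part of
 * the header that fits, then zero the remainder so no stale bytes survive.
 * Returns 0 on success, -1 if the buffer is shorter than sz.
 */
static int vnet_hdr_put(uint8_t *buf, size_t buf_len, size_t sz,
			const struct vnet_hdr_v1 *hdr)
{
	size_t content_sz = sizeof(*hdr) < sz ? sizeof(*hdr) : sz;

	assert(buf && hdr);

	if (buf_len < sz)
		return -1;

	memcpy(buf, hdr, content_sz);
	memset(buf + content_sz, 0, sz - content_sz);
	return 0;
}
```

With a 20-byte header region and a header whose num_buffers is le16_one(),
bytes 10 and 11 of the buffer hold 01 00 and bytes 12 to 19 are zero,
whatever the buffer contained before the call.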

The change to virtio header is preceded by another change that refactors
tun and tap to unify their virtio-related code.

[1]: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df0...@daynix.com
[2]: https://lore.kernel.org/r/20241227084256-mutt-send-email-...@kernel.org/

Signed-off-by: Akihiko Odaki 
---
Changes in v2:
- Fixed num_buffers endian.
- Link to v1: 
https://lore.kernel.org/r/20250108-tun-v1-0-67d784b34...@daynix.com

---
Akihiko Odaki (3):
  tun: Unify vnet implementation
  tun: Pad virtio header with zero
  tun: Set num_buffers for virtio 1.0

 MAINTAINERS|   1 +
 drivers/net/Kconfig|   5 ++
 drivers/net/Makefile   |   1 +
 drivers/net/tap.c  | 174 ++--
 drivers/net/tun.c  | 214 +
 drivers/net/tun_vnet.c | 191 +++
 drivers/net/tun_vnet.h |  24 ++
 7 files changed, 283 insertions(+), 327 deletions(-)
---
base-commit: a32e14f8aef69b42826cf0998b068a43d486a9e9
change-id: 20241230-tun-66e10a49b0c7

Best regards,
-- 
Akihiko Odaki 




[PATCH v2 2/3] tun: Pad virtio header with zero

2025-01-08 Thread Akihiko Odaki
tun used to simply advance iov_iter when it needs to pad virtio header,
which leaves the garbage in the buffer as is. This is especially
problematic when tun starts to allow enabling the hash reporting
feature; even if the feature is enabled, the packet may lack a hash
value and may contain a hole in the virtio header because the packet
arrived before the feature gets enabled or does not contain the
header fields to be hashed. If the hole is not filled with zero, it is
impossible to tell if the packet lacks a hash value.

In theory, a user of tun can fill the buffer with zero before calling
read() to avoid such a problem, but leaving the garbage in the buffer is
awkward anyway so fill the buffer in tun.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tun_vnet.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun_vnet.c b/drivers/net/tun_vnet.c
index fe842df9e9ef..ffb2186facd3 100644
--- a/drivers/net/tun_vnet.c
+++ b/drivers/net/tun_vnet.c
@@ -138,7 +138,8 @@ int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
if (copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr))
return -EFAULT;
 
-   iov_iter_advance(iter, sz - sizeof(*hdr));
+   if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
+   return -EFAULT;
 
return 0;
 }

-- 
2.47.1




[PATCH v2 3/3] tun: Set num_buffers for virtio 1.0

2025-01-08 Thread Akihiko Odaki
The specification says the device MUST set num_buffers to 1 if
VIRTIO_NET_F_MRG_RXBUF has not been negotiated.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tap.c  |  2 +-
 drivers/net/tun.c  |  6 --
 drivers/net/tun_vnet.c | 14 +-
 drivers/net/tun_vnet.h |  4 ++--
 4 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 60804855510b..fe9554ee5b8b 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -713,7 +713,7 @@ static ssize_t tap_put_user(struct tap_queue *q,
int total;
 
if (q->flags & IFF_VNET_HDR) {
-   struct virtio_net_hdr vnet_hdr;
+   struct virtio_net_hdr_v1 vnet_hdr;
 
vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index dbf0dee92e93..f211d0580887 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1991,7 +1991,9 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
size_t total;
 
if (tun->flags & IFF_VNET_HDR) {
-   struct virtio_net_hdr gso = { 0 };
+   struct virtio_net_hdr_v1 gso = {
+   .num_buffers = __virtio16_to_cpu(true, 1)
+   };
 
vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
ret = tun_vnet_hdr_put(vnet_hdr_sz, iter, &gso);
@@ -2044,7 +2046,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
}
 
if (vnet_hdr_sz) {
-   struct virtio_net_hdr gso;
+   struct virtio_net_hdr_v1 gso;
 
ret = tun_vnet_hdr_from_skb(tun->flags, tun->dev, skb, &gso);
if (ret < 0)
diff --git a/drivers/net/tun_vnet.c b/drivers/net/tun_vnet.c
index ffb2186facd3..a7a7989fae56 100644
--- a/drivers/net/tun_vnet.c
+++ b/drivers/net/tun_vnet.c
@@ -130,15 +130,17 @@ int tun_vnet_hdr_get(int sz, unsigned int flags, struct 
iov_iter *from,
 EXPORT_SYMBOL_GPL(tun_vnet_hdr_get);
 
 int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
-const struct virtio_net_hdr *hdr)
+const struct virtio_net_hdr_v1 *hdr)
 {
+   int content_sz = MIN(sizeof(*hdr), sz);
+
if (iov_iter_count(iter) < sz)
return -EINVAL;
 
-   if (copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr))
+   if (copy_to_iter(hdr, content_sz, iter) != content_sz)
return -EFAULT;
 
-   if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
+   if (iov_iter_zero(sz - content_sz, iter) != sz - content_sz)
return -EFAULT;
 
return 0;
@@ -154,11 +156,11 @@ EXPORT_SYMBOL_GPL(tun_vnet_hdr_to_skb);
 
 int tun_vnet_hdr_from_skb(unsigned int flags, const struct net_device *dev,
  const struct sk_buff *skb,
- struct virtio_net_hdr *hdr)
+ struct virtio_net_hdr_v1 *hdr)
 {
int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
 
-   if (virtio_net_hdr_from_skb(skb, hdr,
+   if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)hdr,
tun_vnet_is_little_endian(flags), true,
vlan_hlen)) {
struct skb_shared_info *sinfo = skb_shinfo(skb);
@@ -176,6 +178,8 @@ int tun_vnet_hdr_from_skb(unsigned int flags, const struct 
net_device *dev,
return -EINVAL;
}
 
+   hdr->num_buffers = __virtio16_to_cpu(true, 1);
+
return 0;
 }
 EXPORT_SYMBOL_GPL(tun_vnet_hdr_from_skb);
diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
index 2dfdbe92bb24..d8fd94094227 100644
--- a/drivers/net/tun_vnet.h
+++ b/drivers/net/tun_vnet.h
@@ -12,13 +12,13 @@ int tun_vnet_hdr_get(int sz, unsigned int flags, struct 
iov_iter *from,
 struct virtio_net_hdr *hdr);
 
 int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
-const struct virtio_net_hdr *hdr);
+const struct virtio_net_hdr_v1 *hdr);
 
 int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
const struct virtio_net_hdr *hdr);
 
 int tun_vnet_hdr_from_skb(unsigned int flags, const struct net_device *dev,
  const struct sk_buff *skb,
- struct virtio_net_hdr *hdr);
+ struct virtio_net_hdr_v1 *hdr);
 
 #endif /* TUN_VNET_H */

-- 
2.47.1




Re: [PATCH v2 3/3] tun: Set num_buffers for virtio 1.0

2025-01-10 Thread Akihiko Odaki

On 2025/01/10 12:27, Jason Wang wrote:

On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki  wrote:


The specification says the device MUST set num_buffers to 1 if
VIRTIO_NET_F_MRG_RXBUF has not been negotiated.


Have we agreed on how to fix the spec or not?

As I replied in the spec patch, if we just remove this "MUST", it
looks like we are all fine?


My understanding is that we should fix the kernel and QEMU instead. 
There may be some driver implementations that assume num_buffers is 1,
so the kernel and QEMU should be fixed to be compatible with such
potential implementations.


It is also possible to make future drivers work with existing kernels and
QEMU by ensuring they will not read num_buffers when
VIRTIO_NET_F_MRG_RXBUF has not been negotiated, and that's what "[PATCH v3]
virtio-net: Ignore num_buffers when unused" does.

https://lore.kernel.org/r/20250110-reserved-v3-1-2ade0a5d2...@daynix.com
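
To illustrate the driver-side idea in userspace C (a sketch only, not the
code from the patch linked above): a driver that has not negotiated
VIRTIO_NET_F_MRG_RXBUF can simply treat every packet as occupying one buffer
and never look at the field. The function name rx_num_buffers() and its
parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return the number of buffers a received packet spans.
 * num_buffers_le holds the raw little-endian bytes of the header field.
 * When mergeable RX buffers were not negotiated the field is ignored,
 * so a device that left it uninitialized cannot confuse the driver.
 */
static unsigned int rx_num_buffers(bool mrg_rxbuf_negotiated,
				   const uint8_t num_buffers_le[2])
{
	assert(num_buffers_le);

	if (!mrg_rxbuf_negotiated)
		return 1;

	return (unsigned int)num_buffers_le[0] |
	       ((unsigned int)num_buffers_le[1] << 8);
}
```

With the feature off, garbage such as { 0xff, 0xff } in the field still
yields 1; with the feature on, { 2, 1 } is decoded as 258.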

Regards,
Akihiko Odaki



Re: [PATCH v2 2/3] tun: Pad virtio header with zero

2025-01-10 Thread Akihiko Odaki

On 2025/01/10 12:27, Jason Wang wrote:

On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki  wrote:


tun used to simply advance iov_iter when it needs to pad virtio header,
which leaves the garbage in the buffer as is. This is especially
problematic when tun starts to allow enabling the hash reporting
feature; even if the feature is enabled, the packet may lack a hash
value and may contain a hole in the virtio header because the packet
arrived before the feature gets enabled or does not contain the
header fields to be hashed. If the hole is not filled with zero, it is
impossible to tell if the packet lacks a hash value.


I'm not sure I will get here, could we do this in the series of hash reporting?


I'll create another series dedicated to this and the num_buffers change,
as suggested by Willem.






In theory, a user of tun can fill the buffer with zero before calling
read() to avoid such a problem, but leaving the garbage in the buffer is
awkward anyway so fill the buffer in tun.

Signed-off-by: Akihiko Odaki 
---
  drivers/net/tun_vnet.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun_vnet.c b/drivers/net/tun_vnet.c
index fe842df9e9ef..ffb2186facd3 100644
--- a/drivers/net/tun_vnet.c
+++ b/drivers/net/tun_vnet.c
@@ -138,7 +138,8 @@ int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
 if (copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr))
 return -EFAULT;

-   iov_iter_advance(iter, sz - sizeof(*hdr));
+   if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
+   return -EFAULT;

 return 0;


There're various callers of iov_iter_advance(), do we need to fix them all?


No. For example, there are iov_iter_advance() calls for SOCK_ZEROCOPY in 
tun_get_user() and tap_get_user(). They are fine as they are not writing 
buffers after skipping.


The problem is that read_iter() and recvmsg() say they wrote N bytes but
leave some of these N bytes uninitialized. Such an implementation may
be created even without iov_iter_advance() (for example just returning a
too big number), and it is equally problematic with the current
tun_get_user()/tap_get_user().


Regards,
Akihiko Odaki



Thanks


  }




--
2.47.1








Re: [PATCH v2 1/3] tun: Unify vnet implementation

2025-01-10 Thread Akihiko Odaki

On 2025/01/09 23:06, Willem de Bruijn wrote:

Akihiko Odaki wrote:

Both tun and tap expose the same set of virtio-net-related features.
Unify their implementations to ease future changes.

Signed-off-by: Akihiko Odaki 
---
  MAINTAINERS|   1 +
  drivers/net/Kconfig|   5 ++
  drivers/net/Makefile   |   1 +
  drivers/net/tap.c  | 172 ++--
  drivers/net/tun.c  | 208 -
  drivers/net/tun_vnet.c | 186 +++
  drivers/net/tun_vnet.h |  24 ++
  7 files changed, 273 insertions(+), 324 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 910305c11e8a..1be8a452d11f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23903,6 +23903,7 @@ F:  Documentation/networking/tuntap.rst
  F:arch/um/os-Linux/drivers/
  F:drivers/net/tap.c
  F:drivers/net/tun.c
+F: drivers/net/tun_vnet.h
  
  TURBOCHANNEL SUBSYSTEM

  M:"Maciej W. Rozycki" 
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 1fd5acdc73c6..255c8f9f1d7c 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
tristate "Universal TUN/TAP device driver support"
depends on INET
select CRC32
+   select TUN_VNET


No need for this new Kconfig


I will merge tun_vnet.c into TAP.




  static struct proto tap_proto = {
.name = "tap",
@@ -641,10 +576,10 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
struct sk_buff *skb;
struct tap_dev *tap;
unsigned long total_len = iov_iter_count(from);
-   unsigned long len = total_len;
+   unsigned long len;
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
-   int vnet_hdr_len = 0;
+   int hdr_len;
int copylen = 0;
int depth;
bool zerocopy = false;
@@ -652,38 +587,20 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
enum skb_drop_reason drop_reason;
  
  	if (q->flags & IFF_VNET_HDR) {

-   vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
-
-   err = -EINVAL;
-   if (len < vnet_hdr_len)
-   goto err;
-   len -= vnet_hdr_len;
-
-   err = -EFAULT;
-   if (!copy_from_iter_full(&vnet_hdr, sizeof(vnet_hdr), from))
-   goto err;
-   iov_iter_advance(from, vnet_hdr_len - sizeof(vnet_hdr));
-   if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-tap16_to_cpu(q, vnet_hdr.csum_start) +
-tap16_to_cpu(q, vnet_hdr.csum_offset) + 2 >
-tap16_to_cpu(q, vnet_hdr.hdr_len))
-   vnet_hdr.hdr_len = cpu_to_tap16(q,
-tap16_to_cpu(q, vnet_hdr.csum_start) +
-tap16_to_cpu(q, vnet_hdr.csum_offset) + 2);
-   err = -EINVAL;
-   if (tap16_to_cpu(q, vnet_hdr.hdr_len) > len)
+   hdr_len = tun_vnet_hdr_get(READ_ONCE(q->vnet_hdr_sz), q->flags, 
from, &vnet_hdr);
+   if (hdr_len < 0) {
+   err = hdr_len;
goto err;
+   }
+   } else {
+   hdr_len = 0;
}
  
-	err = -EINVAL;

-   if (unlikely(len < ETH_HLEN))
-   goto err;
-


Is this check removal intentional?


No, I'm not sure what this check is for, but it is irrelevant to the vnet
header and shouldn't be modified by this patch. I'll restore the check
in the next version.





+   len = iov_iter_count(from);
if (msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY)) {
struct iov_iter i;
  
-		copylen = vnet_hdr.hdr_len ?

-   tap16_to_cpu(q, vnet_hdr.hdr_len) : GOODCOPY_LEN;
+   copylen = hdr_len ? hdr_len : GOODCOPY_LEN;
if (copylen > good_linear)
copylen = good_linear;
else if (copylen < ETH_HLEN)
@@ -697,7 +614,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
  
  	if (!zerocopy) {

copylen = len;
-   linear = tap16_to_cpu(q, vnet_hdr.hdr_len);
+   linear = hdr_len;
if (linear > good_linear)
linear = good_linear;
else if (linear < ETH_HLEN)
@@ -732,9 +649,8 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
}
skb->dev = tap->dev;
  
-	if (vnet_hdr_len) {

-   err = virtio_net_hdr_to_skb(skb, &vnet_hdr,
-   tap_is_little_endian(q));
+   if (q->flags & IFF_VNET_HDR) {
+   err = tun_vnet_hdr_to_skb(q->flags, skb, &vnet_hdr);
  

Re: [PATCH v6 5/6] selftest: tun: Add tests for virtio-net hashing

2025-01-10 Thread Akihiko Odaki

On 2025/01/09 23:36, Willem de Bruijn wrote:

Akihiko Odaki wrote:

The added tests confirm tun can perform RSS and hash reporting, and
reject invalid configurations for them.

Signed-off-by: Akihiko Odaki 
---
  tools/testing/selftests/net/Makefile |   2 +-
  tools/testing/selftests/net/tun.c| 558 ++-
  2 files changed, 551 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index cb2fc601de66..92762ce3ebd4 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -121,6 +121,6 @@ $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma
  $(OUTPUT)/tcp_mmap: LDLIBS += -lpthread -lcrypto
  $(OUTPUT)/tcp_inq: LDLIBS += -lpthread
  $(OUTPUT)/bind_bhash: LDLIBS += -lpthread
-$(OUTPUT)/io_uring_zerocopy_tx: CFLAGS += -I../../../include/
+$(OUTPUT)/io_uring_zerocopy_tx $(OUTPUT)/tun: CFLAGS += -I../../../include/
  
  include bpf.mk

diff --git a/tools/testing/selftests/net/tun.c 
b/tools/testing/selftests/net/tun.c
index 463dd98f2b80..9424d897e341 100644
--- a/tools/testing/selftests/net/tun.c
+++ b/tools/testing/selftests/net/tun.c
@@ -2,21 +2,37 @@
  
  #define _GNU_SOURCE
  
+#include 

  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
  #include 
-#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
  #include 
+#include 
  #include 
  #include 
-#include 
-#include 
+#include 
+#include 
+#include 
+#include 


Are all these include changes strictly needed? Iff so, might as well
fix ordering to be alphabetical (lexicographic).
   


Yes. I placed header files in linux/ after the other header files 
because include/uapi/linux/libc-compat.h requires libc header files to 
be placed before linux/ ones.




Re: [PATCH v6 1/6] virtio_net: Add functions for hashing

2025-01-10 Thread Akihiko Odaki

On 2025/01/09 23:13, Willem de Bruijn wrote:

Akihiko Odaki wrote:

They are useful to implement VIRTIO_NET_F_RSS and
VIRTIO_NET_F_HASH_REPORT.


Toeplitz potentially has users beyond virtio. I wonder if we should
from the start implement this as net/core/rss.c.


Or in lib/toeplitz.c, just like lib/siphash.c. I just chose the
easiest option, which is to implement everything in include/linux/virtio_net.h.




  

Signed-off-by: Akihiko Odaki 
---
  include/linux/virtio_net.h | 188 +
  1 file changed, 188 insertions(+)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 02a9f4dc594d..3b25ca75710b 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -9,6 +9,194 @@
  #include 
  #include 
  
+struct virtio_net_hash {

+   u32 value;
+   u16 report;
+};
+
+struct virtio_net_toeplitz_state {
+   u32 hash;
+   const u32 *key;
+};
+
+#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
+
+#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
+
+static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
+{
+   while (len >= sizeof(*input)) {
+   *input = be32_to_cpu((__force __be32)*input);
+   input++;
+   len -= sizeof(*input);
+   }
+}
+
+static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state 
*state,
+   const __be32 *input, size_t len)
+{
+   while (len >= sizeof(*input)) {
+   for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
+   u32 i = ffs(map);
+
+   state->hash ^= state->key[0] << (32 - i) |
+  (u32)((u64)state->key[1] >> i);
+   }
+
+   state->key++;
+   input++;
+   len -= sizeof(*input);
+   }
+}


Have you verified that this algorithm matches a known Toeplitz
implementation, and that it computes the expected values for the test
inputs in the following document?

https://learn.microsoft.com/en-us/windows-hardware/drivers/network/verifying-the-rss-hash-calculation


Yes.



We have a toeplitz implementation in
tools/testing/selftests/net/toeplitz.c that can also be used as
reference.
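
For cross-checking, a minimal bit-at-a-time Toeplitz sketch in userspace C is
given here. It is the textbook formulation (slide a 32-bit window over the key,
XOR the window into the hash for every set input bit), not the word-at-a-time
code from the patch, and toeplitz_hash() is a hypothetical name. Feeding it the
sample key and inputs from the Microsoft page linked above is one way to compare
an implementation against the published values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toeplitz hash over data using key, both treated as big-endian bit
 * strings. Key bits past the end of the key are taken as zero.
 */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
			      const uint8_t *data, size_t data_len)
{
	uint32_t hash = 0;
	uint32_t window;
	size_t i;
	int bit;

	assert(key_len >= 4);

	/* The window starts as the first 32 bits of the key. */
	window = (uint32_t)key[0] << 24 | (uint32_t)key[1] << 16 |
		 (uint32_t)key[2] << 8 | (uint32_t)key[3];

	for (i = 0; i < data_len; i++) {
		for (bit = 7; bit >= 0; bit--) {
			if (data[i] & (1u << bit))
				hash ^= window;

			/* Shift in the next key bit from the right. */
			window <<= 1;
			if (i + 4 < key_len && (key[i + 4] & (1u << bit)))
				window |= 1;
		}
	}

	return hash;
}
```

Two properties follow directly from the construction and make handy sanity
checks: an all-zero input hashes to 0, and the hash is linear over XOR, so
hash(a ^ b) equals hash(a) ^ hash(b) for equal-length inputs.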

+

+static inline u8 virtio_net_hash_key_length(u32 types)
+{
+   size_t len = 0;
+
+   if (types & VIRTIO_NET_HASH_REPORT_IPv4)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs));
+
+   if (types &
+   (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   if (types & VIRTIO_NET_HASH_REPORT_IPv6)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs));
+
+   if (types &
+   (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   return len + 4;


Avoid magic constants. Please use sizeof or something else to signal
what this 4 derives from.




Re: [PATCH v6 3/6] tun: Introduce virtio-net hash feature

2025-01-10 Thread Akihiko Odaki

On 2025/01/09 23:28, Willem de Bruijn wrote:

Akihiko Odaki wrote:

Hash reporting
--------------

Allow the guest to reuse the hash value to make receive steering
consistent between the host and guest, and to save hash computation.

RSS
---

RSS is a receive steering algorithm that can be negotiated to use with
virtio_net. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF steering program.

Introduce the code to perform RSS to the kernel in order to overcome
these challenges. An alternative solution is to extend the eBPF steering
program so that it will be able to report to the userspace, but I didn't
opt for it: extending the current mechanism of the eBPF steering program
as is would rely on legacy context rewriting, and introducing
kfunc-based eBPF would result in a non-UAPI dependency, while the other
relevant virtualization APIs such as KVM and vhost_net are UAPIs.

Signed-off-by: Akihiko Odaki 
---
  Documentation/networking/tuntap.rst |   7 ++
  drivers/net/Kconfig |   1 +
  drivers/net/tap.c   |  50 ++-
  drivers/net/tun.c   |  93 +++-
  drivers/net/tun_vnet.c  | 167 +---
  drivers/net/tun_vnet.h  |  33 ++-
  include/linux/if_tap.h  |   2 +
  include/linux/skbuff.h  |   3 +
  include/uapi/linux/if_tun.h |  75 
  net/core/skbuff.c   |   4 +
  10 files changed, 397 insertions(+), 38 deletions(-)

diff --git a/Documentation/networking/tuntap.rst 
b/Documentation/networking/tuntap.rst
index 4d7087f727be..86b4ae8caa8a 100644
--- a/Documentation/networking/tuntap.rst
+++ b/Documentation/networking/tuntap.rst
@@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
}
  
+3.4 Reference
+-------------
+
+``linux/if_tun.h`` defines the interface described below:
+
+.. kernel-doc:: include/uapi/linux/if_tun.h
+
  Universal TUN/TAP device driver Frequently Asked Question
  =========================================================
  
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig

index 255c8f9f1d7c..f7b0d9a89a71 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
tristate "Universal TUN/TAP device driver support"
depends on INET
select CRC32
+   select SKB_EXTENSIONS
select TUN_VNET
help
  TUN/TAP provides packet reception and transmission for user space
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index fe9554ee5b8b..27659df1f96e 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -179,6 +179,16 @@ static void tap_put_queue(struct tap_queue *q)
sock_put(&q->sk);
  }
  
+static struct virtio_net_hash *tap_add_hash(struct sk_buff *skb)

+{
+   return (struct virtio_net_hash *)skb->cb;
+}
+
+static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
+{
+   return (const struct virtio_net_hash *)skb->cb;
+}
+


If introducing a cb for tap, define a struct tuntap_skb_cb.

So that we do not have to change types if we ever need to extend it further.

And in line with your other patch that deduplicates between tun and tap,
define only one new struct, not two (as this patch currently does).


The previous version did that, but Jason suggested the added 
TUNSETVNETHASH ioctl should support all the flags we are implementing 
(TUN_VNET_HASH_REPORT and TUN_VNET_HASH_RSS) in one patch.





  /*
   * Select a queue based on the rxq of the device on which this packet
   * arrived. If the incoming device is not mq, calculate a flow hash
@@ -189,6 +199,7 @@ static void tap_put_queue(struct tap_queue *q)
  static struct tap_queue *tap_get_queue(struct tap_dev *tap,
   struct sk_buff *skb)
  {
+   struct flow_keys_basic keys_basic;
struct tap_queue *queue = NULL;
/* Access to taps array is protected by rcu, but access to numvtaps
 * isn't. Below we use it to lookup a queue, but treat it as a hint
@@ -196,17 +207,41 @@ static struct tap_queue *tap_get_queue(struct tap_dev 
*tap,
 * racing against queue removal.
 */
int numvtaps = READ_ONCE(tap->numvtaps);
+   struct tun_vnet_hash_container *vnet_hash = 
rcu_dereference(tap->vnet_hash);
__u32 rxq;
  
+	*tap_add_hash(skb) = (struct virtio_net_hash) { .report = VIRTIO_NET_HASH_REPORT_NONE };

+
if (!numvtaps)
goto out;
  
  	if (numvtaps == 1)

goto single;
  
+	if (vnet_hash && (vnet_hash->common.flags & TUN_VNET_HAS

Re: [PATCH v2 3/3] tun: Set num_buffers for virtio 1.0

2025-01-10 Thread Akihiko Odaki

On 2025/01/10 19:23, Michael S. Tsirkin wrote:

On Fri, Jan 10, 2025 at 11:27:13AM +0800, Jason Wang wrote:

On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki  wrote:


The specification says the device MUST set num_buffers to 1 if
VIRTIO_NET_F_MRG_RXBUF has not been negotiated.


Have we agreed on how to fix the spec or not?

As I replied in the spec patch, if we just remove this "MUST", it
looks like we are all fine?

Thanks


We should replace MUST with SHOULD but it is not all fine,
ignoring SHOULD is a quality of implementation issue.



Should we really replace it? It would mean that a driver conformant with 
the current specification may not be compatible with a device conformant 
with the future specification.


We are going to fix all implementations known to be buggy (QEMU and Linux)
anyway, so I think it's just fine to leave that part of the specification as is.




Re: [PATCH v2 2/3] tun: Pad virtio header with zero

2025-01-09 Thread Akihiko Odaki

On 2025/01/09 21:46, Willem de Bruijn wrote:

Akihiko Odaki wrote:

On 2025/01/09 16:31, Michael S. Tsirkin wrote:

On Thu, Jan 09, 2025 at 03:58:44PM +0900, Akihiko Odaki wrote:

tun used to simply advance iov_iter when it needs to pad virtio header,
which leaves the garbage in the buffer as is. This is especially
problematic when tun starts to allow enabling the hash reporting
feature; even if the feature is enabled, the packet may lack a hash
value and may contain a hole in the virtio header because the packet
arrived before the feature gets enabled or does not contain the
header fields to be hashed. If the hole is not filled with zero, it is
impossible to tell if the packet lacks a hash value.


Zero is a valid hash value, so cannot be used as an indication that
hashing is inactive.


Zeroing will initialize the hash_report field to
VIRTIO_NET_HASH_REPORT_NONE, which indicates that the packet does not have
a hash value.
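
A small userspace sketch of how a reader of the header could tell the two
cases apart, assuming the region has been zero-filled: validity is decided by
hash_report, not by the hash value, so a hash of zero stays usable. The
struct vnet_hash_tail and the helper name are hypothetical; HASH_REPORT_NONE
mirrors VIRTIO_NET_HASH_REPORT_NONE, which is 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Mirrors VIRTIO_NET_HASH_REPORT_NONE (value 0). */
#define HASH_REPORT_NONE 0

/* Hypothetical stand-in for the hash fields that follow the v1 header. */
struct vnet_hash_tail {
	uint32_t hash_value;
	uint16_t hash_report;
	uint16_t padding;
};

/*
 * A packet carries a hash only if hash_report is not NONE. The test is
 * byte-order independent because it only compares against zero.
 */
static bool vnet_hash_is_valid(const struct vnet_hash_tail *tail)
{
	assert(tail);
	return tail->hash_report != HASH_REPORT_NONE;
}

/* Parse the hash fields out of a raw, already received header region. */
static struct vnet_hash_tail vnet_hash_tail_read(const uint8_t *raw)
{
	struct vnet_hash_tail tail;

	memcpy(&tail, raw, sizeof(tail));
	return tail;
}
```

A zero-filled region therefore reads as "no hash", while a region with a
non-zero hash_report and a hash_value of 0 reads as a valid hash of zero.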





In theory, a user of tun can fill the buffer with zero before calling
read() to avoid such a problem, but leaving the garbage in the buffer is
awkward anyway so fill the buffer in tun.

Signed-off-by: Akihiko Odaki 


But if the user did it, you have just overwritten his value,
did you not?


Yes, but that means the user expects some part of the buffer is not filled
after read() or recvmsg(). I'm a bit worried that not filling the buffer
may break assumptions others (especially the filesystem and socket
infrastructures in the kernel) may have.


If this is user memory that is ignored by the kernel, just reflected
back, then there is no need in general to zero it. There are many such
instances, also in msg_control.


More specifically, is there any instance of recvmsg() implementation 
which returns N and does not fill the complete N bytes of msg_iter?




If not zeroing leads to ambiguity with the new feature, that would be
a reason to add it -- it is always safe to do so.
  

If we are really confident that it will not cause problems, this
behavior can be opt-in based on a flag or we can just write some
documentation warning userspace programmers to initialize the buffer.





Re: [PATCH net v3 9/9] tap: Use tun's vnet-related code

2025-01-20 Thread Akihiko Odaki

On 2025/01/20 20:19, Willem de Bruijn wrote:

On Mon, Jan 20, 2025 at 1:37 AM Jason Wang  wrote:


On Fri, Jan 17, 2025 at 6:35 PM Akihiko Odaki  wrote:


On 2025/01/17 18:23, Willem de Bruijn wrote:

Akihiko Odaki wrote:

tun and tap implement the same vnet-related features so reuse the code.

Signed-off-by: Akihiko Odaki 
---
   drivers/net/Kconfig|   1 +
   drivers/net/Makefile   |   6 +-
   drivers/net/tap.c  | 152 
+
   drivers/net/tun_vnet.c |   5 ++
   4 files changed, 24 insertions(+), 140 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 1fd5acdc73c6..c420418473fc 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
  tristate "Universal TUN/TAP device driver support"
  depends on INET
  select CRC32
+select TAP
  help
TUN/TAP provides packet reception and transmission for user space
programs.  It can be viewed as a simple Point-to-Point or Ethernet
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index bb8eb3053772..2275309a97ee 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -29,9 +29,9 @@ obj-y += mdio/
   obj-y += pcs/
   obj-$(CONFIG_RIONET) += rionet.o
   obj-$(CONFIG_NET_TEAM) += team/
-obj-$(CONFIG_TUN) += tun-drv.o
-tun-drv-y := tun.o tun_vnet.o
-obj-$(CONFIG_TAP) += tap.o
+obj-$(CONFIG_TUN) += tun.o


Is reversing the previous changes to tun.ko intentional?

Perhaps the previous approach with a new CONFIG_TUN_VNET is preferable
over this. In particular over making TUN select TAP, a new dependency.


Jason, you also commented about CONFIG_TUN_VNET for the previous
version. Do you prefer the old approach, or the new one? (Or if you have
another idea, please tell me.)


Ideally, if we can make TUN select TAP that would be better. But there
are some subtle differences in the multi queue implementation. We will
end up with some useless code for TUN unless we can unify the multi
queue logic. It might not be worth it to change the TUN's multi queue
logic so having a new file seems to be better.


+1 on deduplicating further. But this series is complex enough. Let's not
expand that.

The latest approach with a separate .o file may have some performance
cost by converting likely inlined code into real function calls.
Another option is to move it all into tun_vnet.h. That also resolves
the Makefile issues.


I measured the size difference between the latest approach and the
inlining approach. The numbers may vary depending on the system
configuration of course, but they should be useful for reference.


The below shows sizes when having a separate module: 106496 bytes in total

# lsmod
Module  Size  Used by
tap28672  0
tun61440  0
tun_vnet   16384  2 tun,tap

The below shows sizes when inlining: 102400 bytes in total

# lsmod
Module  Size  Used by
tap32768  0
tun69632  0

So having a separate module costs 4096 bytes more.

These two approaches should have similar tendency for run-time and 
compile-time performance; the code is so trivial that the overhead of 
having one additional module is dominant.


The only downside of having everything in tun_vnet.h is that it will expose
its internal macros and functions, which I think is tolerable.




Re: [PATCH net-next v4 8/9] tap: Keep hdr_len in tap_get_user()

2025-01-20 Thread Akihiko Odaki

On 2025/01/20 20:24, Willem de Bruijn wrote:

Akihiko Odaki wrote:

hdr_len is repeatedly used so keep it in a local variable.

Signed-off-by: Akihiko Odaki 
---
  drivers/net/tap.c | 17 +++--
  1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 061c2f27dfc83f5e6d0bea4da0e845cc429b1fd8..7ee2e9ee2a89fd539b087496b92d2f6198266f44 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -645,6 +645,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
int vnet_hdr_len = 0;
+   int hdr_len = 0;
int copylen = 0;
int depth;
bool zerocopy = false;
@@ -672,6 +673,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
err = -EINVAL;
if (tap16_to_cpu(q, vnet_hdr.hdr_len) > iov_iter_count(from))
goto err;
+   hdr_len = tap16_to_cpu(q, vnet_hdr.hdr_len);
}
  
  	len = iov_iter_count(from);

@@ -683,11 +685,8 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
if (msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY)) {
struct iov_iter i;
  
-		copylen = vnet_hdr.hdr_len ?

-   tap16_to_cpu(q, vnet_hdr.hdr_len) : GOODCOPY_LEN;
-   if (copylen > good_linear)
-   copylen = good_linear;
-   else if (copylen < ETH_HLEN)
+   copylen = min(hdr_len ? hdr_len : GOODCOPY_LEN, good_linear);
+   if (copylen < ETH_HLEN)
copylen = ETH_HLEN;
linear = copylen;
i = *from;
@@ -698,11 +697,9 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
  
  	if (!zerocopy) {

copylen = len;
-   linear = tap16_to_cpu(q, vnet_hdr.hdr_len);
-   if (linear > good_linear)
-   linear = good_linear;
-   else if (linear < ETH_HLEN)
-   linear = ETH_HLEN;
+   linear = min(hdr_len, good_linear);
+   if (linear < ETH_HLEN)
+   linear = ETH_HLEN;


Similar to previous patch, I don't think this patch is significant
enough to warrant the code churn.


The following patch will require replacing
tap16_to_cpu(q, vnet_hdr.hdr_len)
with
tap16_to_cpu(q->flags, vnet_hdr.hdr_len)

It will make some lines a bit too long. Calling tap16_to_cpu() in
multiple places also works against keeping the vnet implementation
unified, as the function inspects vnet_hdr.


On its own this patch is too trivial, but I think it is a worthwhile
cleanup when combined with the following patch.




Re: [PATCH v2 3/3] tun: Set num_buffers for virtio 1.0

2025-01-19 Thread Akihiko Odaki

On 2025/01/20 9:40, Jason Wang wrote:

On Thu, Jan 16, 2025 at 1:30 PM Akihiko Odaki  wrote:


On 2025/01/16 10:06, Jason Wang wrote:

On Wed, Jan 15, 2025 at 1:07 PM Akihiko Odaki  wrote:


On 2025/01/13 12:04, Jason Wang wrote:

On Fri, Jan 10, 2025 at 7:12 PM Akihiko Odaki  wrote:


On 2025/01/10 19:23, Michael S. Tsirkin wrote:

On Fri, Jan 10, 2025 at 11:27:13AM +0800, Jason Wang wrote:

On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki  wrote:


The specification says the device MUST set num_buffers to 1 if
VIRTIO_NET_F_MRG_RXBUF has not been negotiated.


Have we agreed on how to fix the spec or not?

As I replied in the spec patch, if we just remove this "MUST", it
looks like we are all fine?

Thanks


We should replace MUST with SHOULD but it is not all fine,
ignoring SHOULD is a quality of implementation issue.



So is this something that the driver should notice?



Should we really replace it? It would mean that a driver conformant with
the current specification may not be compatible with a device conformant
with the future specification.


I don't get this. We are talking about devices and we want to relax the
requirement, so it should be compatible.



The problem is:
1) On the device side, the num_buffers can be left uninitialized due to bugs
2) On the driver side, the specification allows assuming the num_buffers
is set to one.

Relaxing the device requirement will replace "due to bugs" with
"according to the specification" in 1). It still contradicts 2), so it
does not fix compatibility.


Just to clarify I meant we can simply remove the following:

"""
The device MUST use only a single descriptor if VIRTIO_NET_F_MRG_RXBUF
was not negotiated. Note: This means that num_buffers will always be 1
if VIRTIO_NET_F_MRG_RXBUF is not negotiated.
"""

And

"""
If VIRTIO_NET_F_MRG_RXBUF has not been negotiated, the device MUST set
num_buffers to 1.
"""

This seems easier as it reflects the fact that some devices don't set
it. It also eases things for transitional devices, as they don't need
any special care.


That can potentially break existing drivers that are compliant with the
current specification and assume that num_buffers is set to 1.


Those drivers are already 'broken'. Aren't they?


The drivers are not broken, but vhost_net is. The driver works fine as 
long as it's used with a device compliant with the specification. If we 
relax the device requirement in the future specification, the drivers 
may not work with devices compliant with the revised specification.


Regards,
Akihiko Odaki



Thanks



Regards,
Akihiko Odaki



Then we don't need any driver normative so I don't see any conflict.

Michael suggests we use "SHOULD", but if this is something that the
driver needs to be aware of, I don't know whether "SHOULD" helps much.



Instead, we should make the driver requirement stricter to change 2).
That is what "[PATCH v3] virtio-net: Ignore num_buffers when unused" does:
https://lore.kernel.org/r/20250110-reserved-v3-1-2ade0a5d2...@daynix.com
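
As a rough illustration of a stricter driver-side rule, the sketch below
ignores num_buffers unless mergeable receive buffers were negotiated. It is
not taken from the linked patch: sketch_rx_buffer_count() and
SKETCH_F_MRG_RXBUF are names made up for this example, and num_buffers is
assumed to be already converted to CPU byte order.

```c
#include <stdint.h>

/* Hypothetical stand-in for VIRTIO_NET_F_MRG_RXBUF (feature bit 15). */
#define SKETCH_F_MRG_RXBUF (1ULL << 15)

/*
 * Number of receive buffers a packet occupies, from the driver's view.
 * Without mergeable buffers the answer is always 1, so the field is not
 * read at all and a device that left it uninitialized does no harm.
 */
static unsigned int sketch_rx_buffer_count(uint64_t features,
					   uint16_t num_buffers)
{
	if (!(features & SKETCH_F_MRG_RXBUF))
		return 1;

	return num_buffers;
}
```

With this rule the driver behaves the same whether or not the device
initialized the field, which is the compatibility property discussed above.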





We are going to fix all implementations known to be buggy (QEMU and Linux)
anyway, so I think it's just fine to leave that part of the specification
as is.


I don't think we can fix it all.


It essentially only requires storing 16 bits. There are details we need
to work out, but it should be possible to fix.


I meant it's not realistic to fix all the hypervisors. Note that
modern devices have been implemented for about a decade so we may have
too many versions of various hypervisors. (E.g., DPDK seems to stick
with the same behaviour as the current kernel.)


Regards,
Akihiko Odaki



Thanks










[PATCH net-next v4 0/9] tun: Unify vnet implementation

2025-01-20 Thread Akihiko Odaki
When I implemented virtio's hash-related features for tun/tap [1],
I found that tun/tap does not fill the entire region reserved for the
virtio header, leaving an uninitialized hole in the middle of the buffer
after read()/recvmsg().

This series fills the uninitialized hole. More concretely, the
num_buffers field will be initialized with 1, and the other fields will
be initialized with 0. Setting the num_buffers field to 1 is mandated by
virtio 1.0 [2].
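
A minimal userspace sketch of that initialization follows. It is not code
from this series: sketch_init_vnet_region() and the SKETCH_* constants are
hypothetical, the offsets assume the struct virtio_net_hdr_v1 layout
(num_buffers as the last 16-bit field of a 12-byte header), and the value is
written little-endian as for a virtio 1.0 header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of struct virtio_net_hdr_v1: 12 bytes in total, with
 * num_buffers occupying the last two bytes. */
#define SKETCH_HDR_V1_LEN	12
#define SKETCH_NUM_BUFFERS_OFF	10

/*
 * Zero the whole region reserved for the virtio header and then set
 * num_buffers to 1. Returns -1 if the region cannot hold the header.
 */
static int sketch_init_vnet_region(uint8_t *region, size_t region_len)
{
	if (region_len < SKETCH_HDR_V1_LEN)
		return -1;

	memset(region, 0, region_len);

	/* Little-endian encoding of 1. */
	region[SKETCH_NUM_BUFFERS_OFF] = 1;
	region[SKETCH_NUM_BUFFERS_OFF + 1] = 0;

	return 0;
}
```

Zeroing the full region first means any bytes beyond the 12-byte header are
also defined when the reserved size is larger.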

The change to virtio header is preceded by another change that refactors
tun and tap to unify their virtio-related code.

[1]: https://lore.kernel.org/r/20241008-rss-v5-0-f3cf68df0...@daynix.com
[2]: https://lore.kernel.org/r/20241227084256-mutt-send-email-...@kernel.org/

Signed-off-by: Akihiko Odaki 
---
Changes in v4:
- s/sz/vnet_hdr_len_sz/ for patch "tun: Decouple vnet handling"
  (Willem de Bruijn)
- Reverted to add CONFIG_TUN_VNET.
- Link to v3: 
https://lore.kernel.org/r/20250116-tun-v3-0-c6b2871e9...@daynix.com

Changes in v3:
- Dropped changes to fill the vnet header.
- Split patch "tun: Unify vnet implementation".
- Reverted spurious changes in patch "tun: Unify vnet implementation".
- Merged tun_vnet.c into TAP.
- Link to v2: 
https://lore.kernel.org/r/20250109-tun-v2-0-388d7d5a2...@daynix.com

Changes in v2:
- Fixed num_buffers endian.
- Link to v1: 
https://lore.kernel.org/r/20250108-tun-v1-0-67d784b34...@daynix.com

---
Akihiko Odaki (9):
  tun: Refactor CONFIG_TUN_VNET_CROSS_LE
  tun: Avoid double-tracking iov_iter length changes
  tun: Keep hdr_len in tun_get_user()
  tun: Decouple vnet from tun_struct
  tun: Decouple vnet handling
  tun: Extract the vnet handling code
  tap: Avoid double-tracking iov_iter length changes
  tap: Keep hdr_len in tap_get_user()
  tap: Use tun's vnet-related code

 MAINTAINERS            |   2 +-
 drivers/net/Kconfig    |   5 ++
 drivers/net/Makefile   |   1 +
 drivers/net/tap.c      | 172 ++
 drivers/net/tun.c      | 200 +++--
 drivers/net/tun_vnet.c | 184 +
 drivers/net/tun_vnet.h |  25 +++
 7 files changed, 267 insertions(+), 322 deletions(-)
---
base-commit: a32e14f8aef69b42826cf0998b068a43d486a9e9
change-id: 20241230-tun-66e10a49b0c7

Best regards,
-- 
Akihiko Odaki 




[PATCH net-next v4 5/9] tun: Decouple vnet handling

2025-01-20 Thread Akihiko Odaki
Decouple the vnet handling code so that we can reuse it for tap.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tun.c | 229 +++---
 1 file changed, 133 insertions(+), 96 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index add09dfdada5f76da87ae568072d121c2fc21caf..20659a62bb51d2a497a9d3e9e3b3ee7e9fad4f35 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -351,6 +351,122 @@ static inline __virtio16 cpu_to_tun16(unsigned int flags, 
u16 val)
return __cpu_to_virtio16(tun_is_little_endian(flags), val);
 }
 
+static long tun_vnet_ioctl(int *vnet_hdr_len_sz, unsigned int *flags,
+  unsigned int cmd, int __user *sp)
+{
+   int s;
+
+   switch (cmd) {
+   case TUNGETVNETHDRSZ:
+   s = *vnet_hdr_len_sz;
+   if (put_user(s, sp))
+   return -EFAULT;
+   return 0;
+
+   case TUNSETVNETHDRSZ:
+   if (get_user(s, sp))
+   return -EFAULT;
+   if (s < (int)sizeof(struct virtio_net_hdr))
+   return -EINVAL;
+
+   *vnet_hdr_len_sz = s;
+   return 0;
+
+   case TUNGETVNETLE:
+   s = !!(*flags & TUN_VNET_LE);
+   if (put_user(s, sp))
+   return -EFAULT;
+   return 0;
+
+   case TUNSETVNETLE:
+   if (get_user(s, sp))
+   return -EFAULT;
+   if (s)
+   *flags |= TUN_VNET_LE;
+   else
+   *flags &= ~TUN_VNET_LE;
+   return 0;
+
+   case TUNGETVNETBE:
+   return tun_get_vnet_be(*flags, sp);
+
+   case TUNSETVNETBE:
+   return tun_set_vnet_be(flags, sp);
+
+   default:
+   return -EINVAL;
+   }
+}
+
+static int tun_vnet_hdr_get(int sz, unsigned int flags, struct iov_iter *from,
+   struct virtio_net_hdr *hdr)
+{
+   if (iov_iter_count(from) < sz)
+   return -EINVAL;
+
+   if (!copy_from_iter_full(hdr, sizeof(*hdr), from))
+   return -EFAULT;
+
+   if ((hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+   tun16_to_cpu(flags, hdr->csum_start) + tun16_to_cpu(flags, hdr->csum_offset) + 2 > tun16_to_cpu(flags, hdr->hdr_len))
+   hdr->hdr_len = cpu_to_tun16(flags, tun16_to_cpu(flags, hdr->csum_start) + tun16_to_cpu(flags, hdr->csum_offset) + 2);
+
+   if (tun16_to_cpu(flags, hdr->hdr_len) > iov_iter_count(from))
+   return -EINVAL;
+
+   iov_iter_advance(from, sz - sizeof(*hdr));
+
+   return tun16_to_cpu(flags, hdr->hdr_len);
+}
+
+static int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
+   const struct virtio_net_hdr *hdr)
+{
+   if (unlikely(iov_iter_count(iter) < sz))
+   return -EINVAL;
+
+   if (unlikely(copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr)))
+   return -EFAULT;
+
+   iov_iter_advance(iter, sz - sizeof(*hdr));
+
+   return 0;
+}
+
+static int tun_vnet_hdr_to_skb(unsigned int flags, struct sk_buff *skb,
+  const struct virtio_net_hdr *hdr)
+{
+   return virtio_net_hdr_to_skb(skb, hdr, tun_is_little_endian(flags));
+}
+
+static int tun_vnet_hdr_from_skb(unsigned int flags,
+const struct net_device *dev,
+const struct sk_buff *skb,
+struct virtio_net_hdr *hdr)
+{
+   int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
+
+   if (virtio_net_hdr_from_skb(skb, hdr,
+   tun_is_little_endian(flags), true,
+   vlan_hlen)) {
+   struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+   if (net_ratelimit()) {
+   netdev_err(dev, "unexpected GSO type: 0x%x, gso_size %d, hdr_len %d\n",
+  sinfo->gso_type, tun16_to_cpu(flags, hdr->gso_size),
+  tun16_to_cpu(flags, hdr->hdr_len));
+   print_hex_dump(KERN_ERR, "tun: ",
+  DUMP_PREFIX_NONE,
+  16, 1, skb->head,
+  min(tun16_to_cpu(flags, hdr->hdr_len), 64), true);
+   }
+   WARN_ON_ONCE(1);
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static inline u32 tun_hashfn(u32 rxhash)
 {
return rxhash & TUN_MASK_FLOW_ENTRIES;
@@ -1763,22 +1879,10 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
 
if (tun->flags & IFF_VNET_HDR) {
int vnet_hdr_sz = READ_ONCE(tun->
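
The hdr_len adjustment inside tun_vnet_hdr_get() can be read in isolation as
the sketch below. It is a userspace approximation, not the kernel code:
sketch_effective_hdr_len() is a hypothetical name, all fields are assumed to
be already in CPU byte order, and SKETCH_HDR_F_NEEDS_CSUM stands in for
VIRTIO_NET_HDR_F_NEEDS_CSUM (value 1).

```c
#include <stdint.h>

/* Stand-in for VIRTIO_NET_HDR_F_NEEDS_CSUM. */
#define SKETCH_HDR_F_NEEDS_CSUM 1

/*
 * When checksum offload is requested, the 2-byte checksum field at
 * csum_start + csum_offset must lie inside the header, so hdr_len is
 * raised to cover it if it was smaller. The result is truncated to
 * 16 bits, as storing it back into the header field would do.
 */
static uint16_t sketch_effective_hdr_len(uint8_t flags, uint16_t hdr_len,
					 uint16_t csum_start,
					 uint16_t csum_offset)
{
	unsigned int csum_end = (unsigned int)csum_start + csum_offset + 2;

	if ((flags & SKETCH_HDR_F_NEEDS_CSUM) && csum_end > hdr_len)
		return (uint16_t)csum_end;

	return hdr_len;
}
```

For example, a TCP checksum at csum_start 34 and csum_offset 16 ends at byte
52, so a header length of 14 would be raised to 52.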

[PATCH net-next v4 9/9] tap: Use tun's vnet-related code

2025-01-20 Thread Akihiko Odaki
tun and tap implement the same vnet-related features so reuse the code.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/Kconfig |   1 +
 drivers/net/tap.c   | 152 ++--
 2 files changed, 16 insertions(+), 137 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 924bf61f12a49566b26a78f42cea5ca1c48537c5..f8aa35bf8a93ac1c7f76b85919ac110cb06f21fb 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -421,6 +421,7 @@ config TUN
 
 config TAP
tristate
+   select TUN_VNET
help
  This option is selected by any driver implementing tap user space
  interface for a virtual interface to re-use core tap functionality.
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 7ee2e9ee2a89fd539b087496b92d2f6198266f44..4f3cc3b2e3c6fb387ee2aaeef54c3faf39d90f10 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -26,74 +26,9 @@
 #include 
 #include 
 
-#define TAP_IFFEATURES (IFF_VNET_HDR | IFF_MULTI_QUEUE)
-
-#define TAP_VNET_LE 0x80000000
-#define TAP_VNET_BE 0x40000000
-
-#ifdef CONFIG_TUN_VNET_CROSS_LE
-static inline bool tap_legacy_is_little_endian(struct tap_queue *q)
-{
-   return q->flags & TAP_VNET_BE ? false :
-   virtio_legacy_is_little_endian();
-}
-
-static long tap_get_vnet_be(struct tap_queue *q, int __user *sp)
-{
-   int s = !!(q->flags & TAP_VNET_BE);
-
-   if (put_user(s, sp))
-   return -EFAULT;
-
-   return 0;
-}
-
-static long tap_set_vnet_be(struct tap_queue *q, int __user *sp)
-{
-   int s;
-
-   if (get_user(s, sp))
-   return -EFAULT;
-
-   if (s)
-   q->flags |= TAP_VNET_BE;
-   else
-   q->flags &= ~TAP_VNET_BE;
-
-   return 0;
-}
-#else
-static inline bool tap_legacy_is_little_endian(struct tap_queue *q)
-{
-   return virtio_legacy_is_little_endian();
-}
-
-static long tap_get_vnet_be(struct tap_queue *q, int __user *argp)
-{
-   return -EINVAL;
-}
-
-static long tap_set_vnet_be(struct tap_queue *q, int __user *argp)
-{
-   return -EINVAL;
-}
-#endif /* CONFIG_TUN_VNET_CROSS_LE */
-
-static inline bool tap_is_little_endian(struct tap_queue *q)
-{
-   return q->flags & TAP_VNET_LE ||
-   tap_legacy_is_little_endian(q);
-}
-
-static inline u16 tap16_to_cpu(struct tap_queue *q, __virtio16 val)
-{
-   return __virtio16_to_cpu(tap_is_little_endian(q), val);
-}
+#include "tun_vnet.h"
 
-static inline __virtio16 cpu_to_tap16(struct tap_queue *q, u16 val)
-{
-   return __cpu_to_virtio16(tap_is_little_endian(q), val);
-}
+#define TAP_IFFEATURES (IFF_VNET_HDR | IFF_MULTI_QUEUE)
 
 static struct proto tap_proto = {
.name = "tap",
@@ -655,25 +590,11 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
if (q->flags & IFF_VNET_HDR) {
vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
 
-   err = -EINVAL;
-   if (iov_iter_count(from) < vnet_hdr_len)
-   goto err;
-
-   err = -EFAULT;
-   if (!copy_from_iter_full(&vnet_hdr, sizeof(vnet_hdr), from))
-   goto err;
-   iov_iter_advance(from, vnet_hdr_len - sizeof(vnet_hdr));
-   if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-tap16_to_cpu(q, vnet_hdr.csum_start) +
-tap16_to_cpu(q, vnet_hdr.csum_offset) + 2 >
-tap16_to_cpu(q, vnet_hdr.hdr_len))
-   vnet_hdr.hdr_len = cpu_to_tap16(q,
-tap16_to_cpu(q, vnet_hdr.csum_start) +
-tap16_to_cpu(q, vnet_hdr.csum_offset) + 2);
-   err = -EINVAL;
-   if (tap16_to_cpu(q, vnet_hdr.hdr_len) > iov_iter_count(from))
+   hdr_len = tun_vnet_hdr_get(vnet_hdr_len, q->flags, from, &vnet_hdr);
+   if (hdr_len < 0) {
+   err = hdr_len;
goto err;
-   hdr_len = tap16_to_cpu(q, vnet_hdr.hdr_len);
+   }
}
 
len = iov_iter_count(from);
@@ -731,8 +652,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
skb->dev = tap->dev;
 
if (vnet_hdr_len) {
-   err = virtio_net_hdr_to_skb(skb, &vnet_hdr,
-   tap_is_little_endian(q));
+   err = tun_vnet_hdr_to_skb(q->flags, skb, &vnet_hdr);
if (err) {
rcu_read_unlock();
drop_reason = SKB_DROP_REASON_DEV_HDR;
@@ -795,23 +715,17 @@ static ssize_t tap_put_user(struct tap_queue *q,
int total;
 
if (q->flags & IFF_VNET_HDR) {
-   int vlan_hlen = skb_vlan_tag_present(skb) ? VLAN_HLEN : 0;
struct virt

[PATCH net-next v4 2/9] tun: Avoid double-tracking iov_iter length changes

2025-01-20 Thread Akihiko Odaki
tun_get_user() used to track the length of iov_iter with another
variable. We can use iov_iter_count() to determine the current length
to avoid such chores.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Willem de Bruijn 
---
 drivers/net/tun.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 452fc5104260fe7ff5fdd5cedc5d2647cbe35c79..bd272b4736fb7e9004f7d91dc83c69af5239bfe0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1742,7 +1742,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
struct sk_buff *skb;
size_t total_len = iov_iter_count(from);
-   size_t len = total_len, align = tun->align, linear;
+   size_t len, align = tun->align, linear;
struct virtio_net_hdr gso = { 0 };
int good_linear;
int copylen;
@@ -1754,9 +1754,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
enum skb_drop_reason drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
 
if (!(tun->flags & IFF_NO_PI)) {
-   if (len < sizeof(pi))
+   if (iov_iter_count(from) < sizeof(pi))
return -EINVAL;
-   len -= sizeof(pi);
 
if (!copy_from_iter_full(&pi, sizeof(pi), from))
return -EFAULT;
@@ -1765,9 +1764,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
if (tun->flags & IFF_VNET_HDR) {
int vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
 
-   if (len < vnet_hdr_sz)
+   if (iov_iter_count(from) < vnet_hdr_sz)
return -EINVAL;
-   len -= vnet_hdr_sz;
 
if (!copy_from_iter_full(&gso, sizeof(gso), from))
return -EFAULT;
@@ -1776,11 +1774,13 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2 > tun16_to_cpu(tun, gso.hdr_len))
gso.hdr_len = cpu_to_tun16(tun, tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2);
 
-   if (tun16_to_cpu(tun, gso.hdr_len) > len)
+   if (tun16_to_cpu(tun, gso.hdr_len) > iov_iter_count(from))
return -EINVAL;
iov_iter_advance(from, vnet_hdr_sz - sizeof(gso));
}
 
+   len = iov_iter_count(from);
+
if ((tun->flags & TUN_TYPE_MASK) == IFF_TAP) {
align += NET_IP_ALIGN;
if (unlikely(len < ETH_HLEN ||

-- 
2.47.1
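
The idea of letting the iterator be the only record of the remaining length
can be sketched in userspace as below. This is a toy model, not the kernel's
iov_iter: struct sketch_iter and the sketch_* helpers are hypothetical and
only mimic the roles of iov_iter_count(), copy_from_iter_full() and
iov_iter_advance().

```c
#include <stddef.h>
#include <string.h>

/* Toy iterator over a flat buffer; count is the number of bytes left. */
struct sketch_iter {
	const unsigned char *pos;
	size_t count;
};

static size_t sketch_iter_count(const struct sketch_iter *it)
{
	return it->count;
}

/* Copy exactly n bytes or nothing at all; returns 1 on success. */
static int sketch_copy_from_iter_full(void *dst, size_t n,
				      struct sketch_iter *it)
{
	if (n > it->count)
		return 0;

	memcpy(dst, it->pos, n);
	it->pos += n;
	it->count -= n;
	return 1;
}

static void sketch_iter_advance(struct sketch_iter *it, size_t n)
{
	if (n > it->count)
		n = it->count;

	it->pos += n;
	it->count -= n;
}

/*
 * Consume a header of hdr_sz bytes of which only the first fixed_sz are
 * parsed. The payload length is read back from the iterator afterwards
 * instead of being tracked in a second variable.
 */
static long sketch_strip_header(struct sketch_iter *it, void *fixed,
				size_t fixed_sz, size_t hdr_sz)
{
	if (hdr_sz < fixed_sz || sketch_iter_count(it) < hdr_sz)
		return -1;

	if (!sketch_copy_from_iter_full(fixed, fixed_sz, it))
		return -1;

	sketch_iter_advance(it, hdr_sz - fixed_sz);

	return (long)sketch_iter_count(it);
}
```

Because every consumer updates the iterator, there is no separate len that
could drift out of sync with it.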




[PATCH net-next v4 1/9] tun: Refactor CONFIG_TUN_VNET_CROSS_LE

2025-01-20 Thread Akihiko Odaki
Check IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) to save some lines and make
future changes easier.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Willem de Bruijn 
---
 drivers/net/tun.c | 26 --
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e816aaba8e5f2ed06f8832f79553b6c976e75bb8..452fc5104260fe7ff5fdd5cedc5d2647cbe35c79 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -298,10 +298,10 @@ static bool tun_napi_frags_enabled(const struct tun_file 
*tfile)
return tfile->napi_frags_enabled;
 }
 
-#ifdef CONFIG_TUN_VNET_CROSS_LE
 static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
 {
-   return tun->flags & TUN_VNET_BE ? false :
+   return !(IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
+(tun->flags & TUN_VNET_BE)) &&
virtio_legacy_is_little_endian();
 }
 
@@ -309,6 +309,9 @@ static long tun_get_vnet_be(struct tun_struct *tun, int 
__user *argp)
 {
int be = !!(tun->flags & TUN_VNET_BE);
 
+   if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
+   return -EINVAL;
+
if (put_user(be, argp))
return -EFAULT;
 
@@ -319,6 +322,9 @@ static long tun_set_vnet_be(struct tun_struct *tun, int 
__user *argp)
 {
int be;
 
+   if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
+   return -EINVAL;
+
if (get_user(be, argp))
return -EFAULT;
 
@@ -329,22 +335,6 @@ static long tun_set_vnet_be(struct tun_struct *tun, int 
__user *argp)
 
return 0;
 }
-#else
-static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
-{
-   return virtio_legacy_is_little_endian();
-}
-
-static long tun_get_vnet_be(struct tun_struct *tun, int __user *argp)
-{
-   return -EINVAL;
-}
-
-static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
-{
-   return -EINVAL;
-}
-#endif /* CONFIG_TUN_VNET_CROSS_LE */
 
 static inline bool tun_is_little_endian(struct tun_struct *tun)
 {

-- 
2.47.1
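
For readers unfamiliar with the pattern, the sketch below shows the same
shape in plain userspace C. It only approximates the idea: SKETCH_CROSS_LE is
a hypothetical compile-time constant playing the role of
IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE), SKETCH_VNET_BE is an assumed flag
value, and sketch_host_is_little_endian() stands in for
virtio_legacy_is_little_endian().

```c
#include <stdbool.h>

/* Hypothetical build-time switch; define it to 1 to enable the feature. */
#ifndef SKETCH_CROSS_LE
#define SKETCH_CROSS_LE 0
#endif

/* Assumed flag bit for "legacy device is big-endian". */
#define SKETCH_VNET_BE 0x40000000u

/* Run-time probe of the host byte order. */
static bool sketch_host_is_little_endian(void)
{
	const unsigned int probe = 1;

	return *(const unsigned char *)&probe == 1;
}

/*
 * A single function body replaces the #ifdef/#else pair. The constant is
 * an ordinary expression, so both sides are parsed and type-checked in
 * every configuration, and the compiler can drop the dead side.
 */
static bool sketch_legacy_is_little_endian(unsigned int flags)
{
	return !(SKETCH_CROSS_LE && (flags & SKETCH_VNET_BE)) &&
	       sketch_host_is_little_endian();
}
```

With the switch left at 0 the flag has no effect and the function simply
reports the host byte order.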




[PATCH net-next v4 4/9] tun: Decouple vnet from tun_struct

2025-01-20 Thread Akihiko Odaki
Decouple vnet-related functions from tun_struct so that we can reuse
them for tap in the future.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tun.c | 53 +++--
 1 file changed, 27 insertions(+), 26 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index ec56ac86584813f990fabf4633e4d96ca81176ae..add09dfdada5f76da87ae568072d121c2fc21caf 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -298,16 +298,16 @@ static bool tun_napi_frags_enabled(const struct tun_file 
*tfile)
return tfile->napi_frags_enabled;
 }
 
-static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
+static inline bool tun_legacy_is_little_endian(unsigned int flags)
 {
return !(IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
-(tun->flags & TUN_VNET_BE)) &&
+(flags & TUN_VNET_BE)) &&
virtio_legacy_is_little_endian();
 }
 
-static long tun_get_vnet_be(struct tun_struct *tun, int __user *argp)
+static long tun_get_vnet_be(unsigned int flags, int __user *argp)
 {
-   int be = !!(tun->flags & TUN_VNET_BE);
+   int be = !!(flags & TUN_VNET_BE);
 
if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
return -EINVAL;
@@ -318,7 +318,7 @@ static long tun_get_vnet_be(struct tun_struct *tun, int 
__user *argp)
return 0;
 }
 
-static long tun_set_vnet_be(struct tun_struct *tun, int __user *argp)
+static long tun_set_vnet_be(unsigned int *flags, int __user *argp)
 {
int be;
 
@@ -329,27 +329,26 @@ static long tun_set_vnet_be(struct tun_struct *tun, int 
__user *argp)
return -EFAULT;
 
if (be)
-   tun->flags |= TUN_VNET_BE;
+   *flags |= TUN_VNET_BE;
else
-   tun->flags &= ~TUN_VNET_BE;
+   *flags &= ~TUN_VNET_BE;
 
return 0;
 }
 
-static inline bool tun_is_little_endian(struct tun_struct *tun)
+static inline bool tun_is_little_endian(unsigned int flags)
 {
-   return tun->flags & TUN_VNET_LE ||
-   tun_legacy_is_little_endian(tun);
+   return flags & TUN_VNET_LE || tun_legacy_is_little_endian(flags);
 }
 
-static inline u16 tun16_to_cpu(struct tun_struct *tun, __virtio16 val)
+static inline u16 tun16_to_cpu(unsigned int flags, __virtio16 val)
 {
-   return __virtio16_to_cpu(tun_is_little_endian(tun), val);
+   return __virtio16_to_cpu(tun_is_little_endian(flags), val);
 }
 
-static inline __virtio16 cpu_to_tun16(struct tun_struct *tun, u16 val)
+static inline __virtio16 cpu_to_tun16(unsigned int flags, u16 val)
 {
-   return __cpu_to_virtio16(tun_is_little_endian(tun), val);
+   return __cpu_to_virtio16(tun_is_little_endian(flags), val);
 }
 
 static inline u32 tun_hashfn(u32 rxhash)
@@ -1764,6 +1763,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
 
if (tun->flags & IFF_VNET_HDR) {
int vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);
+   int flags = tun->flags;
 
if (iov_iter_count(from) < vnet_hdr_sz)
return -EINVAL;
@@ -1772,12 +1772,12 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
return -EFAULT;
 
if ((gso.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-   tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2 > tun16_to_cpu(tun, gso.hdr_len))
-   gso.hdr_len = cpu_to_tun16(tun, tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2);
+   tun16_to_cpu(flags, gso.csum_start) + tun16_to_cpu(flags, gso.csum_offset) + 2 > tun16_to_cpu(flags, gso.hdr_len))
+   gso.hdr_len = cpu_to_tun16(flags, tun16_to_cpu(flags, gso.csum_start) + tun16_to_cpu(flags, gso.csum_offset) + 2);
 
-   if (tun16_to_cpu(tun, gso.hdr_len) > iov_iter_count(from))
+   if (tun16_to_cpu(flags, gso.hdr_len) > iov_iter_count(from))
return -EINVAL;
-   hdr_len = tun16_to_cpu(tun, gso.hdr_len);
+   hdr_len = tun16_to_cpu(flags, gso.hdr_len);
iov_iter_advance(from, vnet_hdr_sz - sizeof(gso));
}
 
@@ -1854,7 +1854,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
}
}
 
-   if (virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun))) {
+   if (virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun->flags))) {
atomic_long_inc(&tun->rx_frame_errors);
err = -EINVAL;
goto free_skb;
@@ -2108,23 +2108,24 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 
if (vnet_hdr_sz) {
struct virtio_net_hdr gso;
+   int flags 
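
The decoupling can be pictured with the userspace sketch below, in which a
byte-order helper depends only on a flags word and is therefore usable by two
unrelated owner structures. Everything here is hypothetical (struct
sketch_tun, struct sketch_tap_queue, SKETCH_VNET_LE and the helpers), and the
big-endian override is left out for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag bit for "device uses little-endian fields". */
#define SKETCH_VNET_LE 0x80000000u

/* Two unrelated owners that both happen to carry a flags word. */
struct sketch_tun {
	unsigned int flags;
};

struct sketch_tap_queue {
	unsigned int flags;
};

static bool sketch_host_is_little_endian(void)
{
	const unsigned int probe = 1;

	return *(const unsigned char *)&probe == 1;
}

/* Depends on flags only, not on the type that stores them. */
static bool sketch_is_little_endian(unsigned int flags)
{
	return (flags & SKETCH_VNET_LE) || sketch_host_is_little_endian();
}

/* Decode a 16-bit field from its two bytes in wire order. */
static uint16_t sketch16_to_cpu(unsigned int flags, const uint8_t wire[2])
{
	if (sketch_is_little_endian(flags))
		return (uint16_t)(wire[0] | (wire[1] << 8));

	return (uint16_t)((wire[0] << 8) | wire[1]);
}
```

A caller passes t.flags or q.flags as appropriate; the helper itself never
needs to know which structure the value came from.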

[PATCH net-next v4 3/9] tun: Keep hdr_len in tun_get_user()

2025-01-20 Thread Akihiko Odaki
hdr_len is repeatedly used so keep it in a local variable.

Signed-off-by: Akihiko Odaki 
Reviewed-by: Willem de Bruijn 
---
 drivers/net/tun.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index bd272b4736fb7e9004f7d91dc83c69af5239bfe0..ec56ac86584813f990fabf4633e4d96ca81176ae 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1746,6 +1746,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
struct virtio_net_hdr gso = { 0 };
int good_linear;
int copylen;
+   int hdr_len = 0;
bool zerocopy = false;
int err;
u32 rxhash = 0;
@@ -1776,6 +1777,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
 
if (tun16_to_cpu(tun, gso.hdr_len) > iov_iter_count(from))
return -EINVAL;
+   hdr_len = tun16_to_cpu(tun, gso.hdr_len);
iov_iter_advance(from, vnet_hdr_sz - sizeof(gso));
}
 
@@ -1783,8 +1785,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
 
if ((tun->flags & TUN_TYPE_MASK) == IFF_TAP) {
align += NET_IP_ALIGN;
-   if (unlikely(len < ETH_HLEN ||
-(gso.hdr_len && tun16_to_cpu(tun, gso.hdr_len) < ETH_HLEN)))
+   if (unlikely(len < ETH_HLEN || (hdr_len && hdr_len < ETH_HLEN)))
return -EINVAL;
}
 
@@ -1797,9 +1798,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
 * enough room for skb expand head in case it is used.
 * The rest of the buffer is mapped from userspace.
 */
-   copylen = gso.hdr_len ? tun16_to_cpu(tun, gso.hdr_len) : GOODCOPY_LEN;
-   if (copylen > good_linear)
-   copylen = good_linear;
+   copylen = min(hdr_len ? hdr_len : GOODCOPY_LEN, good_linear);
linear = copylen;
iov_iter_advance(&i, copylen);
if (iov_iter_npages(&i, INT_MAX) <= MAX_SKB_FRAGS)
@@ -1820,10 +1819,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
} else {
if (!zerocopy) {
copylen = len;
-   if (tun16_to_cpu(tun, gso.hdr_len) > good_linear)
-   linear = good_linear;
-   else
-   linear = tun16_to_cpu(tun, gso.hdr_len);
+   linear = min(hdr_len, good_linear);
}
 
if (frags) {

-- 
2.47.1
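
The new copylen expression for the zerocopy path can be checked with a small
standalone sketch. The names are hypothetical, and SKETCH_GOODCOPY_LEN is
assumed to be 128 here purely for illustration.

```c
/* Assumed fallback copy length when the header reports hdr_len == 0. */
#define SKETCH_GOODCOPY_LEN 128

static int sketch_min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * Bytes to copy into the linear area: the reported header length, or the
 * fallback if none was reported, capped at what fits in good_linear.
 */
static int sketch_zerocopy_copylen(int hdr_len, int good_linear)
{
	return sketch_min_int(hdr_len ? hdr_len : SKETCH_GOODCOPY_LEN,
			      good_linear);
}
```

This is equivalent to the removed two-step form, which first picked the
value and then capped it with a separate if statement.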




[PATCH net-next v4 6/9] tun: Extract the vnet handling code

2025-01-20 Thread Akihiko Odaki
The vnet handling code will be reused by tap.

Signed-off-by: Akihiko Odaki 
---
 MAINTAINERS            |   2 +-
 drivers/net/Kconfig    |   4 ++
 drivers/net/Makefile   |   1 +
 drivers/net/tun.c      | 174 +-
 drivers/net/tun_vnet.c | 184 +
 drivers/net/tun_vnet.h |  25 +++
 6 files changed, 217 insertions(+), 173 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 910305c11e8a882da5b49ce5bd55011b93f28c32..bc32b7e23c79ab80b19c8207f14c5e51a47ec89f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23902,7 +23902,7 @@ W:  http://vtun.sourceforge.net/tun
 F: Documentation/networking/tuntap.rst
 F: arch/um/os-Linux/drivers/
 F: drivers/net/tap.c
-F: drivers/net/tun.c
+F: drivers/net/tun*
 
 TURBOCHANNEL SUBSYSTEM
 M: "Maciej W. Rozycki" 
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 1fd5acdc73c6af0e1a861867039c3624fc618e25..924bf61f12a49566b26a78f42cea5ca1c48537c5 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -391,10 +391,14 @@ config RIONET_RX_SIZE
depends on RIONET
default "128"
 
+config TUN_VNET
+   tristate
+
 config TUN
tristate "Universal TUN/TAP device driver support"
depends on INET
select CRC32
+   select TUN_VNET
help
  TUN/TAP provides packet reception and transmission for user space
  programs.  It can be viewed as a simple Point-to-Point or Ethernet
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 13743d0e83b5fde479e9b30ad736be402d880dee..f6590f2795cf742ab15047d8f1b2d2d8661954a3 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -29,6 +29,7 @@ obj-y += mdio/
 obj-y += pcs/
 obj-$(CONFIG_RIONET) += rionet.o
 obj-$(CONFIG_NET_TEAM) += team/
+obj-$(CONFIG_TUN_VNET) += tun_vnet.o
 obj-$(CONFIG_TUN) += tun.o
 obj-$(CONFIG_TAP) += tap.o
 obj-$(CONFIG_VETH) += veth.o
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 20659a62bb51d2a497a9d3e9e3b3ee7e9fad4f35..21abd3613cacda175d4f469f580a2994b2f836e8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -83,6 +83,8 @@
 #include 
 #include 
 
+#include "tun_vnet.h"
+
 static void tun_default_link_ksettings(struct net_device *dev,
   struct ethtool_link_ksettings *cmd);
 
@@ -94,9 +96,6 @@ static void tun_default_link_ksettings(struct net_device *dev,
  * overload it to mean fasync when stored there.
  */
 #define TUN_FASYNC IFF_ATTACH_QUEUE
-/* High bits in flags field are unused. */
-#define TUN_VNET_LE 0x80000000
-#define TUN_VNET_BE 0x40000000
 
 #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
  IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)
@@ -298,175 +297,6 @@ static bool tun_napi_frags_enabled(const struct tun_file 
*tfile)
return tfile->napi_frags_enabled;
 }
 
-static inline bool tun_legacy_is_little_endian(unsigned int flags)
-{
-   return !(IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE) &&
-(flags & TUN_VNET_BE)) &&
-   virtio_legacy_is_little_endian();
-}
-
-static long tun_get_vnet_be(unsigned int flags, int __user *argp)
-{
-   int be = !!(flags & TUN_VNET_BE);
-
-   if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
-   return -EINVAL;
-
-   if (put_user(be, argp))
-   return -EFAULT;
-
-   return 0;
-}
-
-static long tun_set_vnet_be(unsigned int *flags, int __user *argp)
-{
-   int be;
-
-   if (!IS_ENABLED(CONFIG_TUN_VNET_CROSS_LE))
-   return -EINVAL;
-
-   if (get_user(be, argp))
-   return -EFAULT;
-
-   if (be)
-   *flags |= TUN_VNET_BE;
-   else
-   *flags &= ~TUN_VNET_BE;
-
-   return 0;
-}
-
-static inline bool tun_is_little_endian(unsigned int flags)
-{
-   return flags & TUN_VNET_LE || tun_legacy_is_little_endian(flags);
-}
-
-static inline u16 tun16_to_cpu(unsigned int flags, __virtio16 val)
-{
-   return __virtio16_to_cpu(tun_is_little_endian(flags), val);
-}
-
-static inline __virtio16 cpu_to_tun16(unsigned int flags, u16 val)
-{
-   return __cpu_to_virtio16(tun_is_little_endian(flags), val);
-}
-
-static long tun_vnet_ioctl(int *vnet_hdr_len_sz, unsigned int *flags,
-  unsigned int cmd, int __user *sp)
-{
-   int s;
-
-   switch (cmd) {
-   case TUNGETVNETHDRSZ:
-   s = *vnet_hdr_len_sz;
-   if (put_user(s, sp))
-   return -EFAULT;
-   return 0;
-
-   case TUNSETVNETHDRSZ:
-   if (get_user(s, sp))
-   return -EFAULT;
-   if (s < (int)sizeof(struct virtio_net_hdr))
-   return -EINVAL;
-
-   *vnet_hdr_len_sz = s;
-   return 0;
-
-   case TUNGETVNETLE:
- 

[PATCH net-next v4 8/9] tap: Keep hdr_len in tap_get_user()

2025-01-20 Thread Akihiko Odaki
hdr_len is repeatedly used so keep it in a local variable.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tap.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 061c2f27dfc83f5e6d0bea4da0e845cc429b1fd8..7ee2e9ee2a89fd539b087496b92d2f6198266f44 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -645,6 +645,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
int vnet_hdr_len = 0;
+   int hdr_len = 0;
int copylen = 0;
int depth;
bool zerocopy = false;
@@ -672,6 +673,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
err = -EINVAL;
if (tap16_to_cpu(q, vnet_hdr.hdr_len) > iov_iter_count(from))
goto err;
+   hdr_len = tap16_to_cpu(q, vnet_hdr.hdr_len);
}
 
len = iov_iter_count(from);
@@ -683,11 +685,8 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
if (msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY)) {
struct iov_iter i;
 
-   copylen = vnet_hdr.hdr_len ?
-   tap16_to_cpu(q, vnet_hdr.hdr_len) : GOODCOPY_LEN;
-   if (copylen > good_linear)
-   copylen = good_linear;
-   else if (copylen < ETH_HLEN)
+   copylen = min(hdr_len ? hdr_len : GOODCOPY_LEN, good_linear);
+   if (copylen < ETH_HLEN)
copylen = ETH_HLEN;
linear = copylen;
i = *from;
@@ -698,11 +697,9 @@ static ssize_t tap_get_user(struct tap_queue *q, void *msg_control,
 
if (!zerocopy) {
copylen = len;
-   linear = tap16_to_cpu(q, vnet_hdr.hdr_len);
-   if (linear > good_linear)
-   linear = good_linear;
-   else if (linear < ETH_HLEN)
-   linear = ETH_HLEN;
+   linear = min(hdr_len, good_linear);
+   if (linear < ETH_HLEN)
+   linear = ETH_HLEN;
}
 
skb = tap_alloc_skb(&q->sk, TAP_RESERVE, copylen,

-- 
2.47.1
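
For reference, the bounds that the removed lines applied to the linear
length in the non-zerocopy path (at most good_linear, at least ETH_HLEN)
can be modeled in userspace. This is only a sketch: linear_len() is a
hypothetical helper that does not exist in drivers/net/tap.c, and ETH_HLEN
is redefined locally with the value from <linux/if_ether.h>.

```c
#define ETH_HLEN 14 /* Ethernet header length, as in <linux/if_ether.h> */

/*
 * Hypothetical helper, not part of tap.c: bound the linear length so that
 * it never exceeds good_linear and never drops below ETH_HLEN.
 */
static int linear_len(int hdr_len, int good_linear)
{
	int linear = hdr_len < good_linear ? hdr_len : good_linear;

	if (linear < ETH_HLEN)
		linear = ETH_HLEN;

	return linear;
}
```

With hdr_len == 0 (no virtio-net header supplied) the result is ETH_HLEN,
which is the case the lower bound exists for.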




[PATCH net-next v4 7/9] tap: Avoid double-tracking iov_iter length changes

2025-01-20 Thread Akihiko Odaki
tap_get_user() used to track the length of iov_iter with another
variable. We can use iov_iter_count() to determine the current length
to avoid such chores.

Signed-off-by: Akihiko Odaki 
---
 drivers/net/tap.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 5aa41d5f7765a6dcf185bccd3cba2299bad89398..061c2f27dfc83f5e6d0bea4da0e845cc429b1fd8 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -641,7 +641,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void *msg_control,
struct sk_buff *skb;
struct tap_dev *tap;
unsigned long total_len = iov_iter_count(from);
-   unsigned long len = total_len;
+   unsigned long len;
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
int vnet_hdr_len = 0;
@@ -655,9 +655,8 @@ static ssize_t tap_get_user(struct tap_queue *q, void *msg_control,
vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
 
err = -EINVAL;
-   if (len < vnet_hdr_len)
+   if (iov_iter_count(from) < vnet_hdr_len)
goto err;
-   len -= vnet_hdr_len;
 
err = -EFAULT;
if (!copy_from_iter_full(&vnet_hdr, sizeof(vnet_hdr), from))
@@ -671,10 +670,12 @@ static ssize_t tap_get_user(struct tap_queue *q, void *msg_control,
 tap16_to_cpu(q, vnet_hdr.csum_start) +
 tap16_to_cpu(q, vnet_hdr.csum_offset) + 2);
err = -EINVAL;
-   if (tap16_to_cpu(q, vnet_hdr.hdr_len) > len)
+   if (tap16_to_cpu(q, vnet_hdr.hdr_len) > iov_iter_count(from))
goto err;
}
 
+   len = iov_iter_count(from);
+
err = -EINVAL;
if (unlikely(len < ETH_HLEN))
goto err;

-- 
2.47.1
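
The idea of the patch, asking the iterator for its remaining length instead
of maintaining a second counter, can be illustrated with a small userspace
model. Everything below is an assumption made for illustration: struct
mini_iter and its helpers are hypothetical stand-ins for struct iov_iter,
iov_iter_count(), copy_from_iter_full() and iov_iter_advance(), and the
10-byte buffer stands in for struct virtio_net_hdr.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct iov_iter: a cursor over a flat buffer. */
struct mini_iter {
	const unsigned char *pos;
	size_t count; /* bytes remaining */
};

static size_t mini_iter_count(const struct mini_iter *it)
{
	return it->count;
}

/* Copy exactly len bytes or nothing at all. */
static int mini_copy_full(void *dst, size_t len, struct mini_iter *it)
{
	if (len > it->count)
		return 0;
	memcpy(dst, it->pos, len);
	it->pos += len;
	it->count -= len;
	return 1;
}

/* Skip up to len bytes. */
static void mini_iter_advance(struct mini_iter *it, size_t len)
{
	if (len > it->count)
		len = it->count;
	it->pos += len;
	it->count -= len;
}

/*
 * Consume a header of hdr_sz bytes and return the payload length, or -1
 * on error. No separate "len" is kept in sync with the iterator; the
 * remaining length is read back from the iterator once it is needed.
 */
static long payload_len(struct mini_iter *it, size_t hdr_sz)
{
	unsigned char hdr[10]; /* models struct virtio_net_hdr */

	if (hdr_sz < sizeof(hdr) || mini_iter_count(it) < hdr_sz)
		return -1;
	if (!mini_copy_full(hdr, sizeof(hdr), it))
		return -1;
	mini_iter_advance(it, hdr_sz - sizeof(hdr));

	return (long)mini_iter_count(it);
}
```

Because every consumer updates the iterator itself, the remaining length
cannot drift from what has actually been consumed.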




Re: [PATCH v2 3/3] tun: Set num_buffers for virtio 1.0

2025-01-14 Thread Akihiko Odaki

On 2025/01/13 12:04, Jason Wang wrote:

On Fri, Jan 10, 2025 at 7:12 PM Akihiko Odaki  wrote:


On 2025/01/10 19:23, Michael S. Tsirkin wrote:

On Fri, Jan 10, 2025 at 11:27:13AM +0800, Jason Wang wrote:

On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki  wrote:


The specification says the device MUST set num_buffers to 1 if
VIRTIO_NET_F_MRG_RXBUF has not been negotiated.


Have we agreed on how to fix the spec or not?

As I replied in the spec patch, if we just remove this "MUST", it
looks like we are all fine?

Thanks


We should replace MUST with SHOULD but it is not all fine,
ignoring SHOULD is a quality of implementation issue.



So is this something that the driver should notice?



Should we really replace it? It would mean that a driver conformant with
the current specification may not be compatible with a device conformant
with the future specification.


I don't get this. We are talking about devices and we want to relax the
requirement, so it should be compatible.



The problem is:
1) On the device side, the num_buffers can be left uninitialized due to bugs
2) On the driver side, the specification allows assuming that num_buffers
is set to one.


Relaxing the device requirement will replace "due to bugs" with
"according to the specification" in 1). It still contradicts 2), so it
does not fix compatibility.


Instead, we should make the driver requirement stricter to change 2). 
That is what "[PATCH v3] virtio-net: Ignore num_buffers when unused" does:

https://lore.kernel.org/r/20250110-reserved-v3-1-2ade0a5d2...@daynix.com
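
A stricter driver-side rule of this kind could look roughly like the
following. This is only a sketch and is not taken from the spec patch:
rx_buffers_used() is a hypothetical name, and num_buffers is assumed to
have already been converted to CPU endianness.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical driver-side helper: return how many receive buffers a
 * packet spans. The num_buffers field is trusted only when
 * VIRTIO_NET_F_MRG_RXBUF has been negotiated; otherwise it is ignored
 * and the packet is treated as occupying exactly one buffer.
 */
static unsigned int rx_buffers_used(bool mrg_rxbuf_negotiated,
				    uint16_t num_buffers)
{
	if (!mrg_rxbuf_negotiated)
		return 1;

	return num_buffers;
}
```

With this rule, a device that leaves the field uninitialized cannot confuse
the driver as long as the feature is not negotiated.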





We are going to fix all implementations known to be buggy (QEMU and Linux)
anyway, so I think it's just fine to leave that part of the specification
as is.


I don't think we can fix it all.


It essentially only requires storing 16 bits. There are details we need 
to work out, but it should be possible to fix.
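
As a rough illustration of the device side, here is a sketch that assumes
the 12-byte virtio 1.0 header layout (struct virtio_net_hdr_v1), where
num_buffers is a little-endian 16-bit field at byte offset 10.
set_num_buffers_one() and both macros are hypothetical names used only for
this example, not existing kernel or QEMU code.

```c
#include <stddef.h>
#include <stdint.h>

#define VNET_HDR_V1_LEN    12 /* size of struct virtio_net_hdr_v1 */
#define NUM_BUFFERS_OFFSET 10 /* offset of the num_buffers field */

/*
 * Hypothetical device-side helper: store num_buffers = 1 as a
 * little-endian 16-bit value in a raw header buffer.
 * Returns 0 on success, or -1 if the buffer cannot hold the field.
 */
static int set_num_buffers_one(uint8_t *hdr, size_t hdr_len)
{
	if (hdr_len < VNET_HDR_V1_LEN)
		return -1;

	hdr[NUM_BUFFERS_OFFSET] = 1;     /* low byte */
	hdr[NUM_BUFFERS_OFFSET + 1] = 0; /* high byte */

	return 0;
}
```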


Regards,
Akihiko Odaki


