Re: [PATCH v10 00/14] riscv: Add support for xtheadvector
On Wed, 2024-11-13 at 18:44 -0800, Charlie Jenkins wrote: > On Mon, Sep 30, 2024 at 12:07:23AM +0800, Aoba K wrote: > > > > On 2024/9/12 13:55, Charlie Jenkins wrote: > > > xtheadvector is a custom extension that is based upon riscv > > > vector > > > version 0.7.1 [1]. All of the vector routines have been modified > > > to > > > support this alternative vector version based upon whether > > > xtheadvector > > > was determined to be supported at boot. > > > > > > vlenb is not supported on the existing xtheadvector hardware, so > > > a > > > devicetree property thead,vlenb is added to provide the vlenb to > > > Linux. > > > > > > There is a new hwprobe key RISCV_HWPROBE_KEY_VENDOR_EXT_THEAD_0 > > > that is > > > used to request which thead vendor extensions are supported on > > > the > > > current platform. This allows future vendors to allocate hwprobe > > > keys > > > for their vendor. > > > > > > Support for xtheadvector is also added to the vector kselftests. > > > > > > Signed-off-by: Charlie Jenkins > > > > > > [1] > > > https://github.com/T-head-Semi/thead-extension-spec/blob/95358cb2cca9489361c61d335e03d3134b14133f/xtheadvector.adoc > > > > > > --- > > > This series is a continuation of a different series that was > > > fragmented > > > into two other series in an attempt to get part of it merged in > > > the 6.10 > > > merge window. The split-off series did not get merged due to a > > > NAK on > > > the series that added the generic riscv,vlenb devicetree entry. > > > This > > > series has converted riscv,vlenb to thead,vlenb to remedy this > > > issue. > > > > > > The original series is titled "riscv: Support vendor extensions > > > and > > > xtheadvector" [3]. > > > > > > The series titled "riscv: Extend cpufeature.c to detect vendor > > > extensions" is still under development and this series is based > > > on that > > > series! [4] > > > > > > I have tested this with an Allwinner Nezha board. 
I used SkiffOS > > > [1] to > > > manage building the image, but upgraded the U-Boot version to > > > Samuel > > > Holland's more up-to-date version [2] and changed out the device > > > tree > > > used by U-Boot with the device trees that are present in upstream > > > linux > > > and this series. Thank you Samuel for all of the work you did to > > > make > > > this task possible. > > > > > > [1] > > > https://github.com/skiffos/SkiffOS/tree/master/configs/allwinner/nezha > > > [2] > > > https://github.com/smaeul/u-boot/commit/2e89b706f5c956a70c989cd31665f1429e9a0b48 > > > [3] > > > https://lore.kernel.org/all/20240503-dev-charlie-support_thead_vector_6_9-v6-0-cb7624e65...@rivosinc.com/ > > > [4] > > > https://lore.kernel.org/lkml/20240719-support_vendor_extensions-v3-4-0af7587bb...@rivosinc.com/T/ > > > > > > --- > > > Changes in v10: > > > - In DT probing disable vector with new function to clear vendor > > > extension bits for xtheadvector > > > - Add ghostwrite mitigations for c9xx CPUs. This disables > > > xtheadvector > > > unless mitigations=off is set as a kernel boot arg > > > - Link to v9: > > > https://lore.kernel.org/r/20240806-xtheadvector-v9-0-62a56d2da...@rivosinc.com > > > > > > Changes in v9: > > > - Rebase onto palmer's for-next > > > - Fix sparse error in arch/riscv/kernel/vendor_extensions/thead.c > > > - Fix maybe-uninitialized warning in > > > arch/riscv/include/asm/vendor_extensions/vendor_hwprobe.h > > > - Wrap some long lines > > > - Link to v8: > > > https://lore.kernel.org/r/20240724-xtheadvector-v8-0-cf043168e...@rivosinc.com > > > > > > Changes in v8: > > > - Rebase onto palmer's for-next > > > - Link to v7: > > > https://lore.kernel.org/r/20240724-xtheadvector-v7-0-b741910ad...@rivosinc.com > > > > > > Changes in v7: > > > - Add defs for has_xtheadvector_no_alternatives() and > > > has_xtheadvector() > > > when vector disabled. 
(Palmer) > > > - Link to v6: > > > https://lore.kernel.org/r/20240722-xtheadvector-v6-0-c9af0130f...@rivosinc.com > > > > > > Changes in v6: > > > - Fix return type of is_vector_supported()/is_xthead_supported() > > > to be bool > > > - Link to v5: > > > https://lore.kernel.org/r/20240719-xtheadvector-v5-0-4b485fc7d...@rivosinc.com > > > > > > Changes in v5: > > > - Rebase on for-next > > > - Link to v4: > > > https://lore.kernel.org/r/20240702-xtheadvector-v4-0-2bad6820d...@rivosinc.com > > > > > > Changes in v4: > > > - Replace inline asm with C (Samuel) > > > - Rename VCSRs to CSRs (Samuel) > > > - Replace .insn directives with .4byte directives > > > - Link to v3: > > > https://lore.kernel.org/r/20240619-xtheadvector-v3-0-bff39eb96...@rivosinc.com > > > > > > Changes in v3: > > > - Add back Heiko's signed-off-by (Conor) > > > - Mark RISCV_HWPROBE_KEY_VENDOR_EXT_THEAD_0 as a bitmask > > > - Link to v2: > > > https://lore.kernel.org/r/20240610-xtheadvector-v2-0-97a48613a...@rivosinc.com > > > > > > Changes in v2: > > > - Removed extraneous references to "riscv,vlenb" (Jess) > > > - Moved declaration of "thead,vlenb" into c
Re: [PATCH bpf v8 1/5] strparser: add read_sock callback
On Tue, Jan 21, 2025 at 01:07 PM +08, Jiayuan Chen wrote: > Added a new read_sock handler, allowing users to customize read operations > instead of relying on the native socket's read_sock. > > Signed-off-by: Jiayuan Chen > --- Reviewed-by: Jakub Sitnicki
Re: [PATCH v6 01/15] lib: Add TLV parser
On Tue, 2025-01-21 at 15:21 +0100, Thomas Weißschuh wrote:
> On 2025-01-21 14:48:09+0100, Roberto Sassu wrote:
> > On Tue, 2025-01-21 at 14:29 +0100, Thomas Weißschuh wrote:
> > > Hi Robert,
> > > 
> > > On 2024-11-19 11:49:08+0100, Roberto Sassu wrote:
> > > > From: Roberto Sassu
> > > > 
> > > > Add a parser of a generic Type-Length-Value (TLV) format:
> > > > 
> > > > +--------------+------------+------------------+
> > > > | field1 (u16) | len1 (u32) | value1 (u8 len1) |
> > > > +--------------+------------+------------------+
> > > > | ...          | ...        | ...              |
> > > > +--------------+------------+------------------+
> > > > | fieldN (u16) | lenN (u32) | valueN (u8 lenN) |
> > > > +--------------+------------+------------------+
> > > 
> > > Should mention that it's big endian.
> > 
> > Ok.
> > 
> > > > Each adopter can define its own fields. The TLV parser does not need to be
> > > > aware of those, but lets the adopter obtain the data and decide how to
> > > 
> > > "adopter" -> "user".
> > 
> > Ok.
> > 
> > > > continue.
> > > > 
> > > > After processing a TLV entry, call the callback function also with the
> > > > callback data provided by the adopter. The latter can decide how to
> > > > interpret the TLV entry depending on the field ID.
> > > > 
> > > > Nesting TLVs is also possible, the callback function can call tlv_parse()
> > > > to parse the inner structure.
> > > 
> > > Given that we already have the netlink data structures, helpers and
> > > infrastructure, what is the advantage over those?
> > 
> > Sorry, I'm not too familiar with how netlink works, so I might not
> > understand your point.
> 
> Netlink is a TLV format used by the Linux networking subsystem:
> 
> struct nlmsghdr {
> 	__u32 nlmsg_len;	/* Length of message including header */
> 	__u16 nlmsg_type;	/* Type of message content */
> 	__u16 nlmsg_flags;	/* Additional flags */
> 	__u32 nlmsg_seq;	/* Sequence number */
> 	__u32 nlmsg_pid;	/* Sender port ID */
> };
> 
> https://man.archlinux.org/man/core/man-pages/netlink.7.en
> 
> There are both userspace and in-kernel infrastructures to handle it.
> Looking at it again however it has some unnecessary fields, wasting
> space and uses "host" byteorder, which is a problem for an on-disk
> data structure.
> So maybe not a good alternative after all.
> 
> > I think the benefit of this data structure is the backward compatibility.
> > If you add new data fields, you don't need to introduce a v2, v3 data
> > format.
> > 
> > New versions of the parser can consume the new information, while the
> > older can still take the ones they are able to understand.
> 
> This is also exactly how netlink is used.

Ok, perfect!

> FYI there were also some review comments inside the patch diff itself.

Oops, thanks! Missed them. Will reply to that email.

Roberto
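For readers following along: the big-endian walk described in the commit message (u16 field, u32 length, value bytes, repeated, with a per-entry callback) can be sketched in plain user-space C. This is only an illustration of the wire format under discussion; the function and callback names below are made up and are not the series' actual API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative callback type: invoked once per parsed TLV entry.
 * A non-zero return aborts the walk, mirroring the documented
 * "zero on success, negative on error" convention. */
typedef int (*tlv_cb)(void *cb_data, uint16_t field,
		      const uint8_t *value, uint32_t len);

/* Walk a big-endian (field u16, len u32, value u8[len]) stream. */
static int tlv_parse_sketch(tlv_cb cb, void *cb_data,
			    const uint8_t *data, size_t data_len)
{
	size_t off = 0;

	while (off < data_len) {
		uint16_t field;
		uint32_t len;
		int ret;

		if (data_len - off < 6)		/* truncated header */
			return -1;

		field = (uint16_t)((data[off] << 8) | data[off + 1]);
		len = (uint32_t)data[off + 2] << 24 |
		      (uint32_t)data[off + 3] << 16 |
		      (uint32_t)data[off + 4] << 8 |
		      (uint32_t)data[off + 5];
		off += 6;

		if (data_len - off < len)	/* truncated value */
			return -1;

		ret = cb(cb_data, field, data + off, len);
		if (ret)			/* let the caller abort */
			return ret;
		off += len;
	}
	return 0;
}

/* Demo callback: sum up the value lengths seen. */
static int sum_lens(void *cb_data, uint16_t field,
		    const uint8_t *value, uint32_t len)
{
	(void)field;
	(void)value;
	*(uint32_t *)cb_data += len;
	return 0;
}
```

Note how the bounds checks make a truncated header or value an error rather than a silent short read; nested TLVs would simply call `tlv_parse_sketch()` again from inside the callback on `value`/`len`.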
Re: [PATCH net-next v4 8/9] tap: Keep hdr_len in tap_get_user()
Akihiko Odaki wrote: > On 2025/01/20 20:24, Willem de Bruijn wrote: > > Akihiko Odaki wrote: > >> hdr_len is repeatedly used so keep it in a local variable. > >> > >> Signed-off-by: Akihiko Odaki > >> --- > >> drivers/net/tap.c | 17 +++-- > >> 1 file changed, 7 insertions(+), 10 deletions(-) > >> > >> diff --git a/drivers/net/tap.c b/drivers/net/tap.c > >> index > >> 061c2f27dfc83f5e6d0bea4da0e845cc429b1fd8..7ee2e9ee2a89fd539b087496b92d2f6198266f44 > >> 100644 > >> --- a/drivers/net/tap.c > >> +++ b/drivers/net/tap.c > >> @@ -645,6 +645,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void > >> *msg_control, > >>int err; > >>struct virtio_net_hdr vnet_hdr = { 0 }; > >>int vnet_hdr_len = 0; > >> + int hdr_len = 0; > >>int copylen = 0; > >>int depth; > >>bool zerocopy = false; > >> @@ -672,6 +673,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void > >> *msg_control, > >>err = -EINVAL; > >>if (tap16_to_cpu(q, vnet_hdr.hdr_len) > iov_iter_count(from)) > >>goto err; > >> + hdr_len = tap16_to_cpu(q, vnet_hdr.hdr_len); > >>} > >> > >>len = iov_iter_count(from); > >> @@ -683,11 +685,8 @@ static ssize_t tap_get_user(struct tap_queue *q, void > >> *msg_control, > >>if (msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY)) { > >>struct iov_iter i; > >> > >> - copylen = vnet_hdr.hdr_len ? > >> - tap16_to_cpu(q, vnet_hdr.hdr_len) : GOODCOPY_LEN; > >> - if (copylen > good_linear) > >> - copylen = good_linear; > >> - else if (copylen < ETH_HLEN) > >> + copylen = min(hdr_len ? 
hdr_len : GOODCOPY_LEN, good_linear); > >> + if (copylen < ETH_HLEN) > >>copylen = ETH_HLEN; > >>linear = copylen; > >>i = *from; > >> @@ -698,11 +697,9 @@ static ssize_t tap_get_user(struct tap_queue *q, void > >> *msg_control, > >> > >>if (!zerocopy) { > >>copylen = len; > >> - linear = tap16_to_cpu(q, vnet_hdr.hdr_len); > >> - if (linear > good_linear) > >> - linear = good_linear; > >> - else if (linear < ETH_HLEN) > >> - linear = ETH_HLEN; > >> + linear = min(hdr_len, good_linear); > >> + if (copylen < ETH_HLEN) > >> + copylen = ETH_HLEN; > > > > Similar to previous patch, I don't think this patch is significant > > enough to warrant the code churn. > > The following patch will require replacing > tap16_to_cpu(q, vnet_hdr.hdr_len) > with > tap16_to_cpu(q->flags, vnet_hdr.hdr_len) > > It will make some lines a bit too long. Calling tap16_to_cpu() at > multiple places is also not good to keep the vnet implementation unified > as the function inspects vnet_hdr. > > This patch is independently too trivial, but I think it is a worthwhile > cleanup combined with the following patch. Ok. I see what you mean.
Re: [PATCH net v3 9/9] tap: Use tun's vnet-related code
Akihiko Odaki wrote: > On 2025/01/20 20:19, Willem de Bruijn wrote: > > On Mon, Jan 20, 2025 at 1:37 AM Jason Wang wrote: > >> > >> On Fri, Jan 17, 2025 at 6:35 PM Akihiko Odaki > >> wrote: > >>> > >>> On 2025/01/17 18:23, Willem de Bruijn wrote: > Akihiko Odaki wrote: > > tun and tap implements the same vnet-related features so reuse the code. > > > > Signed-off-by: Akihiko Odaki > > --- > >drivers/net/Kconfig| 1 + > >drivers/net/Makefile | 6 +- > >drivers/net/tap.c | 152 > > + > >drivers/net/tun_vnet.c | 5 ++ > >4 files changed, 24 insertions(+), 140 deletions(-) > > > > diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig > > index 1fd5acdc73c6..c420418473fc 100644 > > --- a/drivers/net/Kconfig > > +++ b/drivers/net/Kconfig > > @@ -395,6 +395,7 @@ config TUN > > tristate "Universal TUN/TAP device driver support" > > depends on INET > > select CRC32 > > +select TAP > > help > > TUN/TAP provides packet reception and transmission for user > > space > > programs. It can be viewed as a simple Point-to-Point or > > Ethernet > > diff --git a/drivers/net/Makefile b/drivers/net/Makefile > > index bb8eb3053772..2275309a97ee 100644 > > --- a/drivers/net/Makefile > > +++ b/drivers/net/Makefile > > @@ -29,9 +29,9 @@ obj-y += mdio/ > >obj-y += pcs/ > >obj-$(CONFIG_RIONET) += rionet.o > >obj-$(CONFIG_NET_TEAM) += team/ > > -obj-$(CONFIG_TUN) += tun-drv.o > > -tun-drv-y := tun.o tun_vnet.o > > -obj-$(CONFIG_TAP) += tap.o > > +obj-$(CONFIG_TUN) += tun.o > > Is reversing the previous changes to tun.ko intentional? > > Perhaps the previous approach with a new CONFIG_TUN_VNET is preferable > over this. In particular over making TUN select TAP, a new dependency. > >>> > >>> Jason, you also commented about CONFIG_TUN_VNET for the previous > >>> version. Do you prefer the old approach, or the new one? (Or if you have > >>> another idea, please tell me.) > >> > >> Ideally, if we can make TUN select TAP that would be better. 
But there > >> are some subtle differences in the multi queue implementation. We will > >> end up with some useless code for TUN unless we can unify the multi > >> queue logic. It might not be worth it to change the TUN's multi queue > >> logic so having a new file seems to be better. > > > > +1 on deduplicating further. But this series is complex enough. Let's not > > expand that. > > > > The latest approach with a separate .o file may have some performance > > cost by converting likely inlined code into real function calls. > > Another option is to move it all into tun_vnet.h. That also resolves > > the Makefile issues. > > I measured the size difference between the latest inlining approaches. > The numbers may vary depending on the system configuration of course, > but they should be useful for reference. > > The below shows sizes when having a separate module: 106496 bytes in total > > # lsmod > Module Size Used by > tap28672 0 > tun61440 0 > tun_vnet 16384 2 tun,tap > > The below shows sizes when inlining: 102400 bytes in total > > # lsmod > Module Size Used by > tap32768 0 > tun69632 0 > > So having a separate module costs 4096 bytes more. > > These two approaches should have similar tendency for run-time and > compile-time performance; the code is so trivial that the overhead of > having one additional module is dominant. The concern raised was not about object size, but about inlined vs true calls in the hot path. > The only downside of having all in tun_vnet.h is that it will expose its > internal macros and functions, which I think tolerable. As long as only included by tun.c and tap.c, I think that's okay. The slow path code (ioctl) could remain in a .c file. But it's simpler to just have the header file.
Re: [PATCH bpf v8 2/5] bpf: fix wrong copied_seq calculation
On Tue, Jan 21, 2025 at 01:07 PM +08, Jiayuan Chen wrote: > 'sk->copied_seq' was updated in the tcp_eat_skb() function when the > action of a BPF program was SK_REDIRECT. For other actions, like SK_PASS, > the update logic for 'sk->copied_seq' was moved to > tcp_bpf_recvmsg_parser() to ensure the accuracy of the 'fionread' feature. > > It works for a single stream_verdict scenario, as it also modified > 'sk_data_ready->sk_psock_verdict_data_ready->tcp_read_skb' > to remove updating 'sk->copied_seq'. > > However, for programs where both stream_parser and stream_verdict are > active(strparser purpose), tcp_read_sock() was used instead of Nit: missing space, "active (strparser purpose)" ^ > tcp_read_skb() (sk_data_ready->strp_data_ready->tcp_read_sock) Nit: missing period, "… (sk_data_ready->strp_data_ready->tcp_read_sock)." ^ > tcp_read_sock() now still update 'sk->copied_seq', leading to duplicated Nit: grammar "still updates" ^ Please run commit descriptions through a language tool or an LLM, if possible. This makes reviewing easier. > updates. > > In summary, for strparser + SK_PASS, copied_seq is redundantly calculated > in both tcp_read_sock() and tcp_bpf_recvmsg_parser(). > > The issue causes incorrect copied_seq calculations, which prevent > correct data reads from the recv() interface in user-land. > > We do not want to add new proto_ops to implement a new version of > tcp_read_sock, as this would introduce code complexity [1]. > > We add new callback for strparser for customized read operation. 
> > [1]: https://lore.kernel.org/bpf/20241218053408.437295-1-mr...@163.com > Fixes: e5c6de5fa025 ("bpf, sockmap: Incorrectly handling copied_seq") > Suggested-by: Jakub Sitnicki > Signed-off-by: Jiayuan Chen > --- > include/linux/skmsg.h | 2 ++ > include/net/tcp.h | 8 > net/core/skmsg.c | 7 +++ > net/ipv4/tcp.c| 29 - > net/ipv4/tcp_bpf.c| 42 ++ > 5 files changed, 83 insertions(+), 5 deletions(-) > > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h > index 2cbe0c22a32f..0b9095a281b8 100644 > --- a/include/linux/skmsg.h > +++ b/include/linux/skmsg.h > @@ -91,6 +91,8 @@ struct sk_psock { > struct sk_psock_progs progs; > #if IS_ENABLED(CONFIG_BPF_STREAM_PARSER) > struct strparserstrp; > + u32 copied_seq; > + u32 ingress_bytes; > #endif > struct sk_buff_head ingress_skb; > struct list_headingress_msg; > diff --git a/include/net/tcp.h b/include/net/tcp.h > index e9b37b76e894..06affc653247 100644 > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -729,6 +729,9 @@ void tcp_get_info(struct sock *, struct tcp_info *); > /* Read 'sendfile()'-style from a TCP socket */ > int tcp_read_sock(struct sock *sk, read_descriptor_t *desc, > sk_read_actor_t recv_actor); > +int tcp_read_sock_noack(struct sock *sk, read_descriptor_t *desc, > + sk_read_actor_t recv_actor, bool noack, > + u32 *copied_seq); > int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor); > struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off); > void tcp_read_done(struct sock *sk, size_t len); > @@ -2599,6 +2602,11 @@ struct sk_psock; > #ifdef CONFIG_BPF_SYSCALL > int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool > restore); > void tcp_bpf_clone(const struct sock *sk, struct sock *newsk); > +#ifdef CONFIG_BPF_STREAM_PARSER > +struct strparser; > +int tcp_bpf_strp_read_sock(struct strparser *strp, read_descriptor_t *desc, > +sk_read_actor_t recv_actor); > +#endif /* CONFIG_BPF_STREAM_PARSER */ > #endif /* CONFIG_BPF_SYSCALL */ > > #ifdef CONFIG_INET > diff 
--git a/net/core/skmsg.c b/net/core/skmsg.c > index 61f3f3d4e528..0ddc4c718833 100644 > --- a/net/core/skmsg.c > +++ b/net/core/skmsg.c > @@ -549,6 +549,9 @@ static int sk_psock_skb_ingress_enqueue(struct sk_buff > *skb, > return num_sge; > } > > +#if IS_ENABLED(CONFIG_BPF_STREAM_PARSER) > + psock->ingress_bytes += len; > +#endif > copied = len; > msg->sg.start = 0; > msg->sg.size = copied; > @@ -1144,6 +1147,10 @@ int sk_psock_init_strp(struct sock *sk, struct > sk_psock *psock) > if (!ret) > sk_psock_set_state(psock, SK_PSOCK_RX_STRP_ENABLED); > > + if (sk_is_tcp(sk)) { > + psock->strp.cb.read_sock = tcp_bpf_strp_read_sock; > + psock->copied_seq = tcp_sk(sk)->copied_seq; > + } > return ret; > } > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 0d704bda6c41..285678d8ce07 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -1565,12 +1565,13 @@ EXPORT_SYMBOL(tcp_recv_skb
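The double-accounting this patch fixes, one mechanism (`tcp_read_sock()`) and one wrapper (the strparser/BPF receive path) both advancing `copied_seq`, can be reduced to a toy model. All names below are illustrative stand-ins, not kernel code; the fixed variant mirrors the idea of `tcp_read_sock_noack()` plus the new `psock->copied_seq`: exactly one owner does the accounting.

```c
#include <assert.h>
#include <stdint.h>

/* Toy socket: copied_seq is the next byte the reader expects. */
struct toy_sock {
	uint32_t copied_seq;
};

/* Generic reader: like tcp_read_sock(), it unconditionally
 * advances copied_seq for every byte it hands out. */
static void toy_read_sock(struct toy_sock *sk, uint32_t len)
{
	sk->copied_seq += len;
}

/* Buggy wrapper: advances copied_seq again itself, mirroring the
 * strparser + SK_PASS path before the fix. */
static void toy_recvmsg_buggy(struct toy_sock *sk, uint32_t len)
{
	toy_read_sock(sk, len);
	sk->copied_seq += len;	/* duplicated update: seq moves 2x */
}

/* Fixed wrapper: keeps its own counter (like psock->copied_seq),
 * asks the generic reader not to ack ("noack"), and publishes the
 * single authoritative value back to the socket. */
static void toy_recvmsg_fixed(struct toy_sock *sk, uint32_t *own_seq,
			      uint32_t len)
{
	*own_seq += len;		/* one owner of the accounting */
	sk->copied_seq = *own_seq;
}
```

In the buggy variant a 10-byte read moves `copied_seq` by 20, which is exactly the kind of skew that breaks `FIONREAD` and subsequent `recv()` calls in user space.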
Re: [PATCH v6 01/15] lib: Add TLV parser
On 2025-01-21 14:48:09+0100, Roberto Sassu wrote:
> On Tue, 2025-01-21 at 14:29 +0100, Thomas Weißschuh wrote:
> > Hi Robert,
> > 
> > On 2024-11-19 11:49:08+0100, Roberto Sassu wrote:
> > > From: Roberto Sassu
> > > 
> > > Add a parser of a generic Type-Length-Value (TLV) format:
> > > 
> > > +--------------+------------+------------------+
> > > | field1 (u16) | len1 (u32) | value1 (u8 len1) |
> > > +--------------+------------+------------------+
> > > | ...          | ...        | ...              |
> > > +--------------+------------+------------------+
> > > | fieldN (u16) | lenN (u32) | valueN (u8 lenN) |
> > > +--------------+------------+------------------+
> > 
> > Should mention that it's big endian.
> 
> Ok.
> 
> > > Each adopter can define its own fields. The TLV parser does not need to be
> > > aware of those, but lets the adopter obtain the data and decide how to
> > 
> > "adopter" -> "user".
> 
> Ok.
> 
> > > continue.
> > > 
> > > After processing a TLV entry, call the callback function also with the
> > > callback data provided by the adopter. The latter can decide how to
> > > interpret the TLV entry depending on the field ID.
> > > 
> > > Nesting TLVs is also possible, the callback function can call tlv_parse()
> > > to parse the inner structure.
> > 
> > Given that we already have the netlink data structures, helpers and
> > infrastructure, what is the advantage over those?
> 
> Sorry, I'm not too familiar with how netlink works, so I might not
> understand your point.

Netlink is a TLV format used by the Linux networking subsystem:

struct nlmsghdr {
	__u32 nlmsg_len;	/* Length of message including header */
	__u16 nlmsg_type;	/* Type of message content */
	__u16 nlmsg_flags;	/* Additional flags */
	__u32 nlmsg_seq;	/* Sequence number */
	__u32 nlmsg_pid;	/* Sender port ID */
};

https://man.archlinux.org/man/core/man-pages/netlink.7.en

There are both userspace and in-kernel infrastructures to handle it.
Looking at it again however it has some unnecessary fields, wasting
space and uses "host" byteorder, which is a problem for an on-disk
data structure.
So maybe not a good alternative after all.

> I think the benefit of this data structure is the backward compatibility.
> If you add new data fields, you don't need to introduce a v2, v3 data
> format.
> 
> New versions of the parser can consume the new information, while the
> older can still take the ones they are able to understand.

This is also exactly how netlink is used.

FYI there were also some review comments inside the patch diff itself.

Thomas
Re: [PATCH bpf v8 3/5] bpf: disable non stream socket for strparser
On Tue, Jan 21, 2025 at 01:07 PM +08, Jiayuan Chen wrote: > Currently, only TCP supports strparser, but sockmap doesn't intercept > non-TCP to attach strparser. For example, with UDP, although the > read/write handlers are replaced, strparser is not executed due to the > lack of read_sock operation. > > Furthermore, in udp_bpf_recvmsg(), it checks whether psock has data, and > if not, it falls back to the native UDP read interface, making > UDP + strparser appear to read correctly. According to it's commit Nit: typo, "its" > history, the behavior is unexpected. > > Moreover, since UDP lacks the concept of streams, we intercept it > directly. > > Signed-off-by: Jiayuan Chen > --- > net/core/sock_map.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > index f1b9b3958792..3b0f59d9b4db 100644 > --- a/net/core/sock_map.c > +++ b/net/core/sock_map.c > @@ -303,7 +303,10 @@ static int sock_map_link(struct bpf_map *map, struct > sock *sk) > > write_lock_bh(&sk->sk_callback_lock); > if (stream_parser && stream_verdict && !psock->saved_data_ready) { > - ret = sk_psock_init_strp(sk, psock); > + if (sk_is_tcp(sk)) > + ret = sk_psock_init_strp(sk, psock); > + else > + ret = -EOPNOTSUPP; > if (ret) { > write_unlock_bh(&sk->sk_callback_lock); > sk_psock_put(sk, psock); Reviewed-by: Jakub Sitnicki
Re: [PATCH v6 01/15] lib: Add TLV parser
On Tue, 2025-01-21 at 14:29 +0100, Thomas Weißschuh wrote: > Hi Robert, > > On 2024-11-19 11:49:08+0100, Roberto Sassu wrote: > > From: Roberto Sassu > > > > Add a parser of a generic Type-Length-Value (TLV) format: > > > > +--+--+-++-+ > > > field1 (u16) | len1 (u32) | value1 (u8 len1) | > > +--++--+ > > > ... |... |... | > > +--++--+ > > > fieldN (u16) | lenN (u32) | valueN (u8 lenN) | > > +--++--+ > > Should mention that its big endian. > > > Each adopter can define its own fields. The TLV parser does not need to be > > aware of those, but lets the adopter obtain the data and decide how to > > "adopter" -> "user". > > > continue. > > > > After processing a TLV entry, call the callback function also with the > > callback data provided by the adopter. The latter can decide how to > > interpret the TLV entry depending on the field ID. > > > > Nesting TLVs is also possible, the callback function can call tlv_parse() > > to parse the inner structure. > > Given that we already have the netlink data structures, helpers and > infrastructure, what is the advantage over those? 
> > > > > Signed-off-by: Roberto Sassu > > --- > > MAINTAINERS | 8 +++ > > include/linux/tlv_parser.h | 32 > > include/uapi/linux/tlv_parser.h | 41 > > lib/Kconfig | 3 ++ > > lib/Makefile| 2 + > > lib/tlv_parser.c| 87 + > > lib/tlv_parser.h| 18 +++ > > 7 files changed, 191 insertions(+) > > create mode 100644 include/linux/tlv_parser.h > > create mode 100644 include/uapi/linux/tlv_parser.h > > create mode 100644 lib/tlv_parser.c > > create mode 100644 lib/tlv_parser.h > > > > diff --git a/MAINTAINERS b/MAINTAINERS > > index a097afd76ded..1f7ffa1c9dbd 100644 > > --- a/MAINTAINERS > > +++ b/MAINTAINERS > > @@ -23388,6 +23388,14 @@ W: http://sourceforge.net/projects/tlan/ > > F: Documentation/networking/device_drivers/ethernet/ti/tlan.rst > > F: drivers/net/ethernet/ti/tlan.* > > > > +TLV PARSER > > +M: Roberto Sassu > > +L: linux-ker...@vger.kernel.org > > +S: Maintained > > +F: include/linux/tlv_parser.h > > +F: include/uapi/linux/tlv_parser.h > > +F: lib/tlv_parser.* > > + > > TMIO/SDHI MMC DRIVER > > M: Wolfram Sang > > L: linux-...@vger.kernel.org > > diff --git a/include/linux/tlv_parser.h b/include/linux/tlv_parser.h > > new file mode 100644 > > index ..0c72742af548 > > --- /dev/null > > +++ b/include/linux/tlv_parser.h > > @@ -0,0 +1,32 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +/* > > + * Copyright (C) 2023-2024 Huawei Technologies Duesseldorf GmbH > > + * > > + * Author: Roberto Sassu > > + * > > + * Header file of TLV parser. > > + */ > > + > > +#ifndef _LINUX_TLV_PARSER_H > > +#define _LINUX_TLV_PARSER_H > > + > > +#include > > + > > +/** > > + * typedef callback - Callback after parsing TLV entry > > + * @callback_data: Opaque data to supply to the callback function > > + * @field: Field identifier > > + * @field_data: Field data > > + * @field_len: Length of @field_data > > + * > > + * This callback is invoked after a TLV entry is parsed. > > + * > > + * Return: Zero on success, a negative value on error. 
> > It's not explained what happens on error. Ok, will be more specific. > > + */ > > +typedef int (*callback)(void *callback_data, __u16 field, > > + const __u8 *field_data, __u32 field_len); > > No need for __underscore types in kernel-only signatures. It is just for convenience. I'm reusing the same file for the userspace counterpart digest-cache-tools. In that case, the parser is used for example to show the content of the digest list. > > + > > +int tlv_parse(callback callback, void *callback_data, const __u8 *data, > > + size_t data_len, const char **fields, __u32 num_fields); > > + > > +#endif /* _LINUX_TLV_PARSER_H */ > > diff --git a/include/uapi/linux/tlv_parser.h > > b/include/uapi/linux/tlv_parser.h > > new file mode 100644 > > index ..171d0cfd2c4c > > --- /dev/null > > +++ b/include/uapi/linux/tlv_parser.h > > @@ -0,0 +1,41 @@ > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > > +/* > > + * Copyright (C) 2023-2024 Huawei Technologies Duesseldorf GmbH > > + * > > + * Author: Roberto Sassu > > + * > > + * Implement the user space interface for the TLV parser. > > + */ > > Can you explain in the commit message where this will be exposed to > userspace as binary? I see that my explanation is not ideal. This is the format for data exchange between user space and kernel space, but it is still the kernel that reads and parses the TLV- formatted file for extracting the digests and adding them to the digest cache. > > + > > +#ifndef _UAPI_LINUX_TLV_PARS
Re: [PATCH v5 2/3] RISC-V: hwprobe: Expose Zicbom extension and its block size
On Wed, Jan 15, 2025 at 10:40:23AM +0800, Yunhui Cui wrote:
> Expose Zicbom through hwprobe and also provide a key to extract its
> respective block size.
> 
> Signed-off-by: Yunhui Cui
> ---
>  Documentation/arch/riscv/hwprobe.rst  | 6 ++
>  arch/riscv/include/asm/hwprobe.h      | 2 +-
>  arch/riscv/include/uapi/asm/hwprobe.h | 2 ++
>  arch/riscv/kernel/sys_hwprobe.c       | 6 ++
>  4 files changed, 15 insertions(+), 1 deletion(-)
> 

As the bot points out, we need to add the following to this patch.

Thanks,
drew

diff --git a/arch/riscv/kernel/sys_hwprobe.c b/arch/riscv/kernel/sys_hwprobe.c
index cb93adfffc48..6b5b24b399c3 100644
--- a/arch/riscv/kernel/sys_hwprobe.c
+++ b/arch/riscv/kernel/sys_hwprobe.c
@@ -160,7 +160,7 @@ static void hwprobe_isa_ext0(struct riscv_hwprobe *pair,
 	pair->value &= ~missing;
 }
 
-static bool hwprobe_ext0_has(const struct cpumask *cpus, unsigned long ext)
+static bool hwprobe_ext0_has(const struct cpumask *cpus, u64 ext)
 {
 	struct riscv_hwprobe pair;
Re: [PATCH v2 0/6] module: Introduce hash-based integrity checking
On Mon, 2025-01-20 at 18:44 +0100, Thomas Weißschuh wrote: > The current signature-based module integrity checking has some drawbacks > in combination with reproducible builds: > Either the module signing key is generated at build time, which makes > the build unreproducible, or a static key is used, which precludes > rebuilds by third parties and makes the whole build and packaging > process much more complicated. > Introduce a new mechanism to ensure only well-known modules are loaded > by embedding a list of hashes of all modules built as part of the full > kernel build into vmlinux. > > Interest has been proclaimed by NixOS, Arch Linux, Proxmox, SUSE and the > general reproducible builds community. > > To properly test the reproducibility in combination with CONFIG_INFO_BTF > another patch is needed: > "[PATCH bpf-next] kbuild, bpf: Enable reproducible BTF generation" [0] > (If you happen to test that one, please give some feedback) > > Questions for current patch: > * Naming > * Can the number of built-in modules be retrieved while building > kernel/module/hashes.o? This would remove the need for the > preallocation step in link-vmlinux.sh. > > Further improvements: > * Use a LSM/IMA/Keyring to store and validate hashes + linux-integrity, Mimi Hi Thomas I developed something related to it, it is called Integrity Digest Cache [1]. It has the ability to store in the kernel memory a cache of digests extracted from a file (or if desired in the future, from a reserved area in the kernel image). It exposes an API to query a digest (get/lookup/put) from a digest cache and to verify whether or not the integrity of the file digests were extracted from was verified by IMA or another LSM (verif_set/verif_get). Roberto [1]: https://lore.kernel.org/linux-integrity/20241119104922.2772571-1-roberto.sa...@huaweicloud.com/ > * Use MODULE_SIG_HASH for configuration > * UAPI for discovery? 
> > [0] > https://lore.kernel.org/lkml/20241211-pahole-reproducible-v1-1-22feae19b...@weissschuh.net/ > > Signed-off-by: Thomas Weißschuh > --- > Changes in v2: > - Drop RFC state > - Mention interested parties in cover letter > - Expand Kconfig description > - Add compatibility with CONFIG_MODULE_SIG > - Parallelize module-hashes.sh > - Update Documentation/kbuild/reproducible-builds.rst > - Link to v1: > https://lore.kernel.org/r/20241225-module-hashes-v1-0-d710ce7a3...@weissschuh.net > > --- > Thomas Weißschuh (6): > kbuild: add stamp file for vmlinux BTF data > module: Make module loading policy usable without MODULE_SIG > module: Move integrity checks into dedicated function > module: Move lockdown check into generic module loader > lockdown: Make the relationship to MODULE_SIG a dependency > module: Introduce hash-based integrity checking > > .gitignore | 1 + > Documentation/kbuild/reproducible-builds.rst | 5 ++- > Makefile | 8 - > include/asm-generic/vmlinux.lds.h| 11 ++ > include/linux/module.h | 8 ++--- > include/linux/module_hashes.h| 17 + > kernel/module/Kconfig| 21 ++- > kernel/module/Makefile | 1 + > kernel/module/hashes.c | 52 +++ > kernel/module/internal.h | 8 + > kernel/module/main.c | 54 > +--- > kernel/module/signing.c | 24 + > scripts/Makefile.modfinal| 10 -- > scripts/Makefile.vmlinux | 5 +++ > scripts/link-vmlinux.sh | 31 +++- > scripts/module-hashes.sh | 26 ++ > security/lockdown/Kconfig| 2 +- > 17 files changed, 238 insertions(+), 46 deletions(-) > --- > base-commit: 2cd5917560a84d69dd6128b640d7a68406ff019b > change-id: 20241225-module-hashes-7a50a7cc2a30 > > Best regards,
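At its core, the mechanism proposed in this cover letter is a membership test: hash the incoming .ko and look it up in a table of well-known hashes linked into vmlinux at build time. A user-space sketch of that check, with all names and hash strings made up for illustration (the real series hashes module contents, e.g. with SHA-256, and embeds raw digests, not hex strings):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the hash table that the build would embed into
 * vmlinux; entries here are arbitrary example strings. */
static const char *const module_hashes[] = {
	"3f79bb7b435b05321651daefd374cd21",
	"b10a8db164e0754105b7a99be72e3fe5",
};

/* Return 1 if the hash of a module-to-be-loaded is in the
 * build-time table (allow the load), 0 otherwise (reject). */
static int module_hash_allowed(const char *hash)
{
	size_t i;

	for (i = 0; i < sizeof(module_hashes) / sizeof(module_hashes[0]); i++)
		if (!strcmp(module_hashes[i], hash))
			return 1;
	return 0;
}
```

A real implementation would keep the table sorted and use a binary search, but the policy itself is just this lookup, which is what makes it reproducible: the table depends only on the module contents, not on a signing key generated at build time.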
Re: [PATCH v6 01/15] lib: Add TLV parser
Hi Roberto,

On 2024-11-19 11:49:08+0100, Roberto Sassu wrote:
> From: Roberto Sassu
>
> Add a parser of a generic Type-Length-Value (TLV) format:
>
> +--------------+------------+------------------+
> | field1 (u16) | len1 (u32) | value1 (u8 len1) |
> +--------------+------------+------------------+
> | ...          | ...        | ...              |
> +--------------+------------+------------------+
> | fieldN (u16) | lenN (u32) | valueN (u8 lenN) |
> +--------------+------------+------------------+

Should mention that it's big endian.

> Each adopter can define its own fields. The TLV parser does not need to be
> aware of those, but lets the adopter obtain the data and decide how to

"adopter" -> "user".

> continue.
>
> After processing a TLV entry, call the callback function also with the
> callback data provided by the adopter. The latter can decide how to
> interpret the TLV entry depending on the field ID.
>
> Nesting TLVs is also possible, the callback function can call tlv_parse()
> to parse the inner structure.

Given that we already have the netlink data structures, helpers and
infrastructure, what is the advantage over those?

> Signed-off-by: Roberto Sassu
> ---
>  MAINTAINERS                     |  8 +++
>  include/linux/tlv_parser.h      | 32
>  include/uapi/linux/tlv_parser.h | 41
>  lib/Kconfig                     |  3 ++
>  lib/Makefile                    |  2 +
>  lib/tlv_parser.c                | 87 +
>  lib/tlv_parser.h                | 18 +++
>  7 files changed, 191 insertions(+)
>  create mode 100644 include/linux/tlv_parser.h
>  create mode 100644 include/uapi/linux/tlv_parser.h
>  create mode 100644 lib/tlv_parser.c
>  create mode 100644 lib/tlv_parser.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index a097afd76ded..1f7ffa1c9dbd 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -23388,6 +23388,14 @@ W: http://sourceforge.net/projects/tlan/
> F: Documentation/networking/device_drivers/ethernet/ti/tlan.rst
> F: drivers/net/ethernet/ti/tlan.*
>
> +TLV PARSER
> +M: Roberto Sassu
> +L: linux-ker...@vger.kernel.org
> +S: Maintained
> +F: include/linux/tlv_parser.h
> +F: include/uapi/linux/tlv_parser.h
> +F: lib/tlv_parser.*
> +
> TMIO/SDHI MMC DRIVER
> M: Wolfram Sang
> L: linux-...@vger.kernel.org
> diff --git a/include/linux/tlv_parser.h
b/include/linux/tlv_parser.h > new file mode 100644 > index ..0c72742af548 > --- /dev/null > +++ b/include/linux/tlv_parser.h > @@ -0,0 +1,32 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * Copyright (C) 2023-2024 Huawei Technologies Duesseldorf GmbH > + * > + * Author: Roberto Sassu > + * > + * Header file of TLV parser. > + */ > + > +#ifndef _LINUX_TLV_PARSER_H > +#define _LINUX_TLV_PARSER_H > + > +#include > + > +/** > + * typedef callback - Callback after parsing TLV entry > + * @callback_data: Opaque data to supply to the callback function > + * @field: Field identifier > + * @field_data: Field data > + * @field_len: Length of @field_data > + * > + * This callback is invoked after a TLV entry is parsed. > + * > + * Return: Zero on success, a negative value on error. It's not explained what happens on error. > + */ > +typedef int (*callback)(void *callback_data, __u16 field, > + const __u8 *field_data, __u32 field_len); No need for __underscore types in kernel-only signatures. > + > +int tlv_parse(callback callback, void *callback_data, const __u8 *data, > + size_t data_len, const char **fields, __u32 num_fields); > + > +#endif /* _LINUX_TLV_PARSER_H */ > diff --git a/include/uapi/linux/tlv_parser.h b/include/uapi/linux/tlv_parser.h > new file mode 100644 > index ..171d0cfd2c4c > --- /dev/null > +++ b/include/uapi/linux/tlv_parser.h > @@ -0,0 +1,41 @@ > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > +/* > + * Copyright (C) 2023-2024 Huawei Technologies Duesseldorf GmbH > + * > + * Author: Roberto Sassu > + * > + * Implement the user space interface for the TLV parser. > + */ Can you explain in the commit message where this will be exposed to userspace as binary? > + > +#ifndef _UAPI_LINUX_TLV_PARSER_H > +#define _UAPI_LINUX_TLV_PARSER_H > + > +#include > + > +/* > + * TLV format: > + * > + * +--+--+-++-+ > + * | field1 (u16) | len1 (u32) | value1 (u8 len1) | > + * +--++--+ > + * | ... |... |... 
| > + * +--++--+ > + * | fieldN (u16) | lenN (u32) | valueN (u8 lenN) | > + * +--++--+ > + */ > + > +/** > + * struct tlv_entry - Entry of TLV format > + * @field: Field identifier > + * @length: Data length > + * @data: Data > + * > + * This structure represents an entry of the TLV format. > + */ > +struct tlv_entry { > + __u16 field; > + __u32 length; Use __be16 and
Re: [PATCH v2 0/6] module: Introduce hash-based integrity checking
On Tue, 2025-01-21 at 13:58 +0100, Thomas Weißschuh wrote: > Hi Roberto, > > On 2025-01-21 11:26:29+0100, Roberto Sassu wrote: > > On Mon, 2025-01-20 at 18:44 +0100, Thomas Weißschuh wrote: > > > The current signature-based module integrity checking has some drawbacks > > > in combination with reproducible builds: > > > Either the module signing key is generated at build time, which makes > > > the build unreproducible, or a static key is used, which precludes > > > rebuilds by third parties and makes the whole build and packaging > > > process much more complicated. > > > Introduce a new mechanism to ensure only well-known modules are loaded > > > by embedding a list of hashes of all modules built as part of the full > > > kernel build into vmlinux. > > > > > > Interest has been proclaimed by NixOS, Arch Linux, Proxmox, SUSE and the > > > general reproducible builds community. > > > > > > To properly test the reproducibility in combination with CONFIG_INFO_BTF > > > another patch is needed: > > > "[PATCH bpf-next] kbuild, bpf: Enable reproducible BTF generation" [0] > > > (If you happen to test that one, please give some feedback) > > > > > > Questions for current patch: > > > * Naming > > > * Can the number of built-in modules be retrieved while building > > > kernel/module/hashes.o? This would remove the need for the > > > preallocation step in link-vmlinux.sh. > > > > > > Further improvements: > > > * Use a LSM/IMA/Keyring to store and validate hashes > > > > + linux-integrity, Mimi > > > > Hi Thomas > > > > I developed something related to it, it is called Integrity Digest > > Cache [1]. > > Thanks for the pointer. > > > It has the ability to store in the kernel memory a cache of digests > > extracted from a file (or if desired in the future, from a reserved > > area in the kernel image). 
> > > > It exposes an API to query a digest (get/lookup/put) from a digest > > cache and to verify whether or not the integrity of the file digests > > were extracted from was verified by IMA or another LSM > > (verif_set/verif_get). > > I see how this could be used together with the module hashes. > For now I think both features should be developed independently. > Integrating them will require some extra code and coordination. Yes, I agree. > While the current linear, unsorted list of hashes used by my code may be > slightly inefficient, it shouldn't matter in practize as the hash > validation is only a bunch of memcmp()s over a contiguous chunk of > memory, which is very cheap. Ok, I guess so, should not be too slow for this use case. > When both features are well established we can look at integrating them. > At least a build-time userspace generator of a digest cache would be > necessary. > And due to the current implementation details it would be necessary to > estimate the size of a static digest cache more or less exactly by its > number of elements alone. This information is included in the digest list, since it is also used by the Integrity Digest Cache itself to determine the correct size of the hash table. Thanks Roberto > Thomas > > > [1]: > > https://lore.kernel.org/linux-integrity/20241119104922.2772571-1-roberto.sa...@huaweicloud.com/ > > > > > * Use MODULE_SIG_HASH for configuration > > > * UAPI for discovery? 
> > > > > > [0] > > > https://lore.kernel.org/lkml/20241211-pahole-reproducible-v1-1-22feae19b...@weissschuh.net/ > > > > > > Signed-off-by: Thomas Weißschuh > > > --- > > > Changes in v2: > > > - Drop RFC state > > > - Mention interested parties in cover letter > > > - Expand Kconfig description > > > - Add compatibility with CONFIG_MODULE_SIG > > > - Parallelize module-hashes.sh > > > - Update Documentation/kbuild/reproducible-builds.rst > > > - Link to v1: > > > https://lore.kernel.org/r/20241225-module-hashes-v1-0-d710ce7a3...@weissschuh.net > > > > > > --- > > > Thomas Weißschuh (6): > > > kbuild: add stamp file for vmlinux BTF data > > > module: Make module loading policy usable without MODULE_SIG > > > module: Move integrity checks into dedicated function > > > module: Move lockdown check into generic module loader > > > lockdown: Make the relationship to MODULE_SIG a dependency > > > module: Introduce hash-based integrity checking > > > > > > .gitignore | 1 + > > > Documentation/kbuild/reproducible-builds.rst | 5 ++- > > > Makefile | 8 - > > > include/asm-generic/vmlinux.lds.h| 11 ++ > > > include/linux/module.h | 8 ++--- > > > include/linux/module_hashes.h| 17 + > > > kernel/module/Kconfig| 21 ++- > > > kernel/module/Makefile | 1 + > > > kernel/module/hashes.c | 52 > > > +++ > > > kernel/module/internal.h | 8 + > > > kernel/module/main.c
Re: [PATCH v6 01/15] lib: Add TLV parser
On Tue, 2025-01-21 at 14:29 +0100, Thomas Weißschuh wrote:
> Hi Roberto,
>
> On 2024-11-19 11:49:08+0100, Roberto Sassu wrote:
> > From: Roberto Sassu
> >
> > Add a parser of a generic Type-Length-Value (TLV) format:
> >
> > +--------------+------------+------------------+
> > | field1 (u16) | len1 (u32) | value1 (u8 len1) |
> > +--------------+------------+------------------+
> > | ...          | ...        | ...              |
> > +--------------+------------+------------------+
> > | fieldN (u16) | lenN (u32) | valueN (u8 lenN) |
> > +--------------+------------+------------------+
>
> Should mention that it's big endian.

Ok.

> > Each adopter can define its own fields. The TLV parser does not need to be
> > aware of those, but lets the adopter obtain the data and decide how to
>
> "adopter" -> "user".

Ok.

> > continue.
> >
> > After processing a TLV entry, call the callback function also with the
> > callback data provided by the adopter. The latter can decide how to
> > interpret the TLV entry depending on the field ID.
> >
> > Nesting TLVs is also possible, the callback function can call tlv_parse()
> > to parse the inner structure.
>
> Given that we already have the netlink data structures, helpers and
> infrastructure, what is the advantage over those?

Sorry, I'm not too familiar with how netlink works, so I might not
understand your point.

I think the benefit of this data structure is backward compatibility.
If you add new data fields, you don't need to introduce a v2, v3 data
format. New versions of the parser can consume the new information,
while the older ones can still take those they are able to understand.
Thanks Roberto > > > > Signed-off-by: Roberto Sassu > > --- > > MAINTAINERS | 8 +++ > > include/linux/tlv_parser.h | 32 > > include/uapi/linux/tlv_parser.h | 41 > > lib/Kconfig | 3 ++ > > lib/Makefile| 2 + > > lib/tlv_parser.c| 87 + > > lib/tlv_parser.h| 18 +++ > > 7 files changed, 191 insertions(+) > > create mode 100644 include/linux/tlv_parser.h > > create mode 100644 include/uapi/linux/tlv_parser.h > > create mode 100644 lib/tlv_parser.c > > create mode 100644 lib/tlv_parser.h > > > > diff --git a/MAINTAINERS b/MAINTAINERS > > index a097afd76ded..1f7ffa1c9dbd 100644 > > --- a/MAINTAINERS > > +++ b/MAINTAINERS > > @@ -23388,6 +23388,14 @@ W: http://sourceforge.net/projects/tlan/ > > F: Documentation/networking/device_drivers/ethernet/ti/tlan.rst > > F: drivers/net/ethernet/ti/tlan.* > > > > +TLV PARSER > > +M: Roberto Sassu > > +L: linux-ker...@vger.kernel.org > > +S: Maintained > > +F: include/linux/tlv_parser.h > > +F: include/uapi/linux/tlv_parser.h > > +F: lib/tlv_parser.* > > + > > TMIO/SDHI MMC DRIVER > > M: Wolfram Sang > > L: linux-...@vger.kernel.org > > diff --git a/include/linux/tlv_parser.h b/include/linux/tlv_parser.h > > new file mode 100644 > > index ..0c72742af548 > > --- /dev/null > > +++ b/include/linux/tlv_parser.h > > @@ -0,0 +1,32 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +/* > > + * Copyright (C) 2023-2024 Huawei Technologies Duesseldorf GmbH > > + * > > + * Author: Roberto Sassu > > + * > > + * Header file of TLV parser. > > + */ > > + > > +#ifndef _LINUX_TLV_PARSER_H > > +#define _LINUX_TLV_PARSER_H > > + > > +#include > > + > > +/** > > + * typedef callback - Callback after parsing TLV entry > > + * @callback_data: Opaque data to supply to the callback function > > + * @field: Field identifier > > + * @field_data: Field data > > + * @field_len: Length of @field_data > > + * > > + * This callback is invoked after a TLV entry is parsed. > > + * > > + * Return: Zero on success, a negative value on error. 
> > It's not explained what happens on error. > > > + */ > > +typedef int (*callback)(void *callback_data, __u16 field, > > + const __u8 *field_data, __u32 field_len); > > No need for __underscore types in kernel-only signatures. > > > + > > +int tlv_parse(callback callback, void *callback_data, const __u8 *data, > > + size_t data_len, const char **fields, __u32 num_fields); > > + > > +#endif /* _LINUX_TLV_PARSER_H */ > > diff --git a/include/uapi/linux/tlv_parser.h > > b/include/uapi/linux/tlv_parser.h > > new file mode 100644 > > index ..171d0cfd2c4c > > --- /dev/null > > +++ b/include/uapi/linux/tlv_parser.h > > @@ -0,0 +1,41 @@ > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > > +/* > > + * Copyright (C) 2023-2024 Huawei Technologies Duesseldorf GmbH > > + * > > + * Author: Roberto Sassu > > + * > > + * Implement the user space interface for the TLV parser. > > + */ > > Can you explain in the commit message where this will be exposed to > userspace as binary? > > > + > > +#ifndef _UAPI_LINUX_TLV_PARSER_H > > +#define _UAPI_LINUX_TLV_PARSER_H > > + > > +#include > > + > > +/
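For reference, the walk being discussed — a big-endian stream of [field: u16][len: u32][value: len bytes] entries handed to a per-entry callback — can be sketched in userspace C. Names are illustrative and do not match the patch's exact API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Callback invoked for each parsed TLV entry; non-zero aborts the walk. */
typedef int (*tlv_cb)(void *cb_data, uint16_t field,
		      const uint8_t *value, uint32_t len);

/* Big-endian field decoding, as the on-wire format requires. */
static uint16_t get_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | p[3];
}

static int tlv_parse_sketch(tlv_cb cb, void *cb_data,
			    const uint8_t *data, size_t data_len)
{
	while (data_len) {
		uint16_t field;
		uint32_t len;
		int ret;

		if (data_len < 6)
			return -1;	/* truncated header */
		field = get_be16(data);
		len = get_be32(data + 2);
		data += 6;
		data_len -= 6;
		if (len > data_len)
			return -1;	/* value overruns the buffer */
		ret = cb(cb_data, field, data, len);
		if (ret)
			return ret;	/* callback aborts the walk */
		data += len;
		data_len -= len;
	}
	return 0;
}

/* Demo: one entry, field 0x0001 carrying the 4-byte value "abcd". */
struct demo_state { int entries; uint16_t last_field; };

static int demo_cb(void *cb_data, uint16_t field,
		   const uint8_t *value, uint32_t len)
{
	struct demo_state *s = cb_data;

	(void)value; (void)len;
	s->entries++;
	s->last_field = field;
	return 0;
}

static int tlv_demo(void)
{
	const uint8_t buf[] = { 0x00, 0x01, 0x00, 0x00, 0x00, 0x04,
				'a', 'b', 'c', 'd' };
	struct demo_state s = { 0, 0 };

	if (tlv_parse_sketch(demo_cb, &s, buf, sizeof(buf)))
		return 0;
	return s.entries == 1 && s.last_field == 1;
}
```

The backward-compatibility property claimed in the thread falls out of the loop structure: a callback that does not recognize a field id simply returns 0 and the parser skips `len` bytes to the next entry.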
Re: [PATCH v2 0/6] module: Introduce hash-based integrity checking
Hi Roberto, On 2025-01-21 11:26:29+0100, Roberto Sassu wrote: > On Mon, 2025-01-20 at 18:44 +0100, Thomas Weißschuh wrote: > > The current signature-based module integrity checking has some drawbacks > > in combination with reproducible builds: > > Either the module signing key is generated at build time, which makes > > the build unreproducible, or a static key is used, which precludes > > rebuilds by third parties and makes the whole build and packaging > > process much more complicated. > > Introduce a new mechanism to ensure only well-known modules are loaded > > by embedding a list of hashes of all modules built as part of the full > > kernel build into vmlinux. > > > > Interest has been proclaimed by NixOS, Arch Linux, Proxmox, SUSE and the > > general reproducible builds community. > > > > To properly test the reproducibility in combination with CONFIG_INFO_BTF > > another patch is needed: > > "[PATCH bpf-next] kbuild, bpf: Enable reproducible BTF generation" [0] > > (If you happen to test that one, please give some feedback) > > > > Questions for current patch: > > * Naming > > * Can the number of built-in modules be retrieved while building > > kernel/module/hashes.o? This would remove the need for the > > preallocation step in link-vmlinux.sh. > > > > Further improvements: > > * Use a LSM/IMA/Keyring to store and validate hashes > > + linux-integrity, Mimi > > Hi Thomas > > I developed something related to it, it is called Integrity Digest > Cache [1]. Thanks for the pointer. > It has the ability to store in the kernel memory a cache of digests > extracted from a file (or if desired in the future, from a reserved > area in the kernel image). > > It exposes an API to query a digest (get/lookup/put) from a digest > cache and to verify whether or not the integrity of the file digests > were extracted from was verified by IMA or another LSM > (verif_set/verif_get). I see how this could be used together with the module hashes. 
For now I think both features should be developed independently. Integrating them will require some extra code and coordination. While the current linear, unsorted list of hashes used by my code may be slightly inefficient, it shouldn't matter in practice as the hash validation is only a bunch of memcmp()s over a contiguous chunk of memory, which is very cheap. When both features are well established we can look at integrating them. At least a build-time userspace generator of a digest cache would be necessary. And due to the current implementation details it would be necessary to estimate the size of a static digest cache more or less exactly by its number of elements alone. Thomas > [1]: > https://lore.kernel.org/linux-integrity/20241119104922.2772571-1-roberto.sa...@huaweicloud.com/ > > > * Use MODULE_SIG_HASH for configuration > > * UAPI for discovery? > > > > [0] > > https://lore.kernel.org/lkml/20241211-pahole-reproducible-v1-1-22feae19b...@weissschuh.net/ > > > > Signed-off-by: Thomas Weißschuh > > --- > > Changes in v2: > > - Drop RFC state > > - Mention interested parties in cover letter > > - Expand Kconfig description > > - Add compatibility with CONFIG_MODULE_SIG > > - Parallelize module-hashes.sh > > - Update Documentation/kbuild/reproducible-builds.rst > > - Link to v1: > > https://lore.kernel.org/r/20241225-module-hashes-v1-0-d710ce7a3...@weissschuh.net > > > > --- > > Thomas Weißschuh (6): > > kbuild: add stamp file for vmlinux BTF data > > module: Make module loading policy usable without MODULE_SIG > > module: Move integrity checks into dedicated function > > module: Move lockdown check into generic module loader > > lockdown: Make the relationship to MODULE_SIG a dependency > > module: Introduce hash-based integrity checking > > > > .gitignore | 1 + > > Documentation/kbuild/reproducible-builds.rst | 5 ++- > > Makefile | 8 - > > include/asm-generic/vmlinux.lds.h| 11 ++ > > include/linux/module.h | 8 ++--- > > include/linux/module_hashes.h| 17 + >
> kernel/module/Kconfig| 21 ++- > > kernel/module/Makefile | 1 + > > kernel/module/hashes.c | 52 > > +++ > > kernel/module/internal.h | 8 + > > kernel/module/main.c | 54 > > +--- > > kernel/module/signing.c | 24 + > > scripts/Makefile.modfinal| 10 -- > > scripts/Makefile.vmlinux | 5 +++ > > scripts/link-vmlinux.sh | 31 +++- > > scripts/module-hashes.sh | 26 ++ > > security/lockdown/Kconfig| 2 +- > > 17 files changed, 238 insertions(+), 46 deletions(-) > > --- > > b
Re: [PATCH v5 2/3] RISC-V: hwprobe: Expose Zicbom extension and its block size
Hi Yunhui, kernel test robot noticed the following build warnings: [auto build test WARNING on shuah-kselftest/next] [also build test WARNING on shuah-kselftest/fixes linus/master v6.13 next-20250121] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Yunhui-Cui/RISC-V-Enable-cbo-clean-flush-in-usermode/20250115-104456 base: https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git next patch link: https://lore.kernel.org/r/20250115024024.84365-3-cuiyunhui%40bytedance.com patch subject: [PATCH v5 2/3] RISC-V: hwprobe: Expose Zicbom extension and its block size config: riscv-randconfig-r133-20250121 (https://download.01.org/0day-ci/archive/20250121/202501212220.5ghttuf7-...@intel.com/config) compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project c23f2417dc5f6dc371afb07af5627ec2a9d373a0) reproduce: (https://download.01.org/0day-ci/archive/20250121/202501212220.5ghttuf7-...@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202501212220.5ghttuf7-...@intel.com/ All warnings (new ones prefixed by >>): >> arch/riscv/kernel/sys_hwprobe.c:284:30: warning: implicit conversion from >> 'unsigned long long' to 'unsigned long' changes value from 1125899906842624 >> to 0 [-Wconstant-conversion] 284 | if (hwprobe_ext0_has(cpus, RISCV_HWPROBE_EXT_ZICBOM)) | ^~~~ arch/riscv/include/uapi/asm/hwprobe.h:76:41: note: expanded from macro 'RISCV_HWPROBE_EXT_ZICBOM' 76 | #define RISCV_HWPROBE_EXT_ZICBOM(1ULL << 50) | ~^ 1 warning generated. 
vim +284 arch/riscv/kernel/sys_hwprobe.c 244 245 static void hwprobe_one_pair(struct riscv_hwprobe *pair, 246 const struct cpumask *cpus) 247 { 248 switch (pair->key) { 249 case RISCV_HWPROBE_KEY_MVENDORID: 250 case RISCV_HWPROBE_KEY_MARCHID: 251 case RISCV_HWPROBE_KEY_MIMPID: 252 hwprobe_arch_id(pair, cpus); 253 break; 254 /* 255 * The kernel already assumes that the base single-letter ISA 256 * extensions are supported on all harts, and only supports the 257 * IMA base, so just cheat a bit here and tell that to 258 * userspace. 259 */ 260 case RISCV_HWPROBE_KEY_BASE_BEHAVIOR: 261 pair->value = RISCV_HWPROBE_BASE_BEHAVIOR_IMA; 262 break; 263 264 case RISCV_HWPROBE_KEY_IMA_EXT_0: 265 hwprobe_isa_ext0(pair, cpus); 266 break; 267 268 case RISCV_HWPROBE_KEY_CPUPERF_0: 269 case RISCV_HWPROBE_KEY_MISALIGNED_SCALAR_PERF: 270 pair->value = hwprobe_misaligned(cpus); 271 break; 272 273 case RISCV_HWPROBE_KEY_MISALIGNED_VECTOR_PERF: 274 pair->value = hwprobe_vec_misaligned(cpus); 275 break; 276 277 case RISCV_HWPROBE_KEY_ZICBOZ_BLOCK_SIZE: 278 pair->value = 0; 279 if (hwprobe_ext0_has(cpus, RISCV_HWPROBE_EXT_ZICBOZ)) 280 pair->value = riscv_cboz_block_size; 281 break; 282 case RISCV_HWPROBE_KEY_ZICBOM_BLOCK_SIZE: 283 pair->value = 0; > 284 if (hwprobe_ext0_has(cpus, RISCV_HWPROBE_EXT_ZICBOM)) 285 pair->value = riscv_cbom_block_size; 286 break; 287 case RISCV_HWPROBE_KEY_HIGHEST_VIRT_ADDRESS: 288 pair->value = user_max_virt_addr(); 289 break; 290 291 case RISCV_HWPROBE_KEY_TIME_CSR_FREQ: 292 pair->value = riscv_timebase; 293 break; 294 295 /* 296 * For forward compatibility, unknown keys don't fail the whole 297 * call, but get their element key set to -1 and value set to 0 298 * indicating they're unrecognized. 299 */ 300 default: 301 pair->key = -1; 302 pair->value = 0; 303 break; 304 } 305 } 306 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
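The robot's warning is the classic 64-bit-constant-in-32-bit-`long` trap: on rv32, `unsigned long` is 32 bits wide, so assigning `1ULL << 50` to it silently drops the bit and the Zicbom check can never succeed. The effect can be reproduced on any host by letting a fixed-width 32-bit type stand in for rv32's `unsigned long` (the macro name is taken from the patch; the fix suggested by the warning — keeping the mask in a 64-bit type — is an assumption, not the submitter's final patch):

```c
#include <assert.h>
#include <stdint.h>

/* Bit 50, as RISCV_HWPROBE_EXT_ZICBOM is defined in the patch. */
#define RISCV_HWPROBE_EXT_ZICBOM (1ULL << 50)

/*
 * On rv32 'unsigned long' is 32 bits; uint32_t stands in for it here.
 * Assigning a bit-50 constant to it truncates the value to 0, which is
 * exactly what -Wconstant-conversion flags at sys_hwprobe.c:284.
 */
static int zicbom_bit_in_u32(void)
{
	uint32_t ext = (uint32_t)RISCV_HWPROBE_EXT_ZICBOM;

	return ext != 0;	/* 0: the extension bit was lost */
}

static int zicbom_bit_in_u64(void)
{
	uint64_t ext = RISCV_HWPROBE_EXT_ZICBOM;

	return ext != 0;	/* 1: a 64-bit type keeps bit 50 */
}
```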
Re: [PATCH bpf v8 5/5] selftests/bpf: add strparser test for bpf
Thanks for expanding tests. On Tue, Jan 21, 2025 at 01:07 PM +08, Jiayuan Chen wrote: > Add test cases for bpf + strparser and separated them from > sockmap_basic, as strparser has more encapsulation and parsing > capabilities compared to sockmap. > > Signed-off-by: Jiayuan Chen > --- > .../selftests/bpf/prog_tests/sockmap_basic.c | 53 -- > .../selftests/bpf/prog_tests/sockmap_strp.c | 452 ++ > .../selftests/bpf/progs/test_sockmap_strp.c | 53 ++ > 3 files changed, 505 insertions(+), 53 deletions(-) > create mode 100644 tools/testing/selftests/bpf/prog_tests/sockmap_strp.c > create mode 100644 tools/testing/selftests/bpf/progs/test_sockmap_strp.c > > diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c > b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c > index 0c51b7288978..f8953455db29 100644 > --- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c > +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c > @@ -531,57 +531,6 @@ static void test_sockmap_skb_verdict_shutdown(void) > test_sockmap_pass_prog__destroy(skel); > } > > -static void test_sockmap_stream_pass(void) > -{ > - int zero = 0, sent, recvd; > - int verdict, parser; > - int err, map; > - int c = -1, p = -1; > - struct test_sockmap_pass_prog *pass = NULL; > - char snd[256] = "0123456789"; > - char rcv[256] = "0"; > - > - pass = test_sockmap_pass_prog__open_and_load(); > - verdict = bpf_program__fd(pass->progs.prog_skb_verdict); > - parser = bpf_program__fd(pass->progs.prog_skb_parser); > - map = bpf_map__fd(pass->maps.sock_map_rx); > - > - err = bpf_prog_attach(parser, map, BPF_SK_SKB_STREAM_PARSER, 0); > - if (!ASSERT_OK(err, "bpf_prog_attach stream parser")) > - goto out; > - > - err = bpf_prog_attach(verdict, map, BPF_SK_SKB_STREAM_VERDICT, 0); > - if (!ASSERT_OK(err, "bpf_prog_attach stream verdict")) > - goto out; > - > - err = create_pair(AF_INET, SOCK_STREAM, &c, &p); > - if (err) > - goto out; > - > - /* sk_data_ready of 'p' will be replaced by strparser 
handler */ > - err = bpf_map_update_elem(map, &zero, &p, BPF_NOEXIST); > - if (!ASSERT_OK(err, "bpf_map_update_elem(p)")) > - goto out_close; > - > - /* > - * as 'prog_skb_parser' return the original skb len and > - * 'prog_skb_verdict' return SK_PASS, the kernel will just > - * pass it through to original socket 'p' > - */ > - sent = xsend(c, snd, sizeof(snd), 0); > - ASSERT_EQ(sent, sizeof(snd), "xsend(c)"); > - > - recvd = recv_timeout(p, rcv, sizeof(rcv), SOCK_NONBLOCK, > - IO_TIMEOUT_SEC); > - ASSERT_EQ(recvd, sizeof(rcv), "recv_timeout(p)"); > - > -out_close: > - close(c); > - close(p); > - > -out: > - test_sockmap_pass_prog__destroy(pass); > -} > > static void test_sockmap_skb_verdict_fionread(bool pass_prog) > { > @@ -1101,8 +1050,6 @@ void test_sockmap_basic(void) > test_sockmap_progs_query(BPF_SK_SKB_VERDICT); > if (test__start_subtest("sockmap skb_verdict shutdown")) > test_sockmap_skb_verdict_shutdown(); > - if (test__start_subtest("sockmap stream parser and verdict pass")) > - test_sockmap_stream_pass(); > if (test__start_subtest("sockmap skb_verdict fionread")) > test_sockmap_skb_verdict_fionread(true); > if (test__start_subtest("sockmap skb_verdict fionread on drop")) > diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c > b/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c > new file mode 100644 > index ..01ed1fca1d9c > --- /dev/null > +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_strp.c > @@ -0,0 +1,452 @@ > +// SPDX-License-Identifier: GPL-2.0 > +#include > +#include > +#include > +#include "sockmap_helpers.h" > +#include "test_skmsg_load_helpers.skel.h" > +#include "test_sockmap_strp.skel.h" Nit: add new line to separate cpp defines visually > +#define STRP_PKT_HEAD_LEN 4 > +#define STRP_PKT_BODY_LEN 6 > +#define STRP_PKT_FULL_LEN (STRP_PKT_HEAD_LEN + STRP_PKT_BODY_LEN) Nit: add new line to constants visually > +static const char packet[STRP_PKT_FULL_LEN] = "head+body\0"; > +static const int test_packet_num = 100; > + > 
+/* current implementation of tcp_bpf_recvmsg_parser() invoke Nit: grammar, "invoke*s*" > + * data_ready with sk held if skb exist in sk_receive_queue. Nit: grammar, "exist*s*" > + * Then for data_ready implementation of strparser, it will > + * delay the read operation if sk was held and EAGAIN is returned. > + */ > +static int sockmap_strp_consume_pre_data(int p) > +{ > + int recvd; > + bool retried = false; > + char rcv[10]; > + > +retry: > + errno = 0; > + recvd = recv_timeout(p, rcv, sizeof(rcv), 0, 1); > + if (recvd < 0 && errno == EAGAIN && retried == false) { > + /* On the first call, EAGAIN will certainly
Re: [PATCH bpf v8 4/5] selftests/bpf: fix invalid flag of recv()
On Tue, Jan 21, 2025 at 01:07 PM +08, Jiayuan Chen wrote: > SOCK_NONBLOCK flag is only effective during socket creation, not during > recv. Use MSG_DONTWAIT instead. > > Signed-off-by: Jiayuan Chen > --- Good catch. Fixes: 1fa1fe8ff161 ("bpf, sockmap: Test shutdown() correctly exits epoll and recv()=0") Reviewed-by: Jakub Sitnicki
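The distinction the fix relies on: SOCK_NONBLOCK is a socket()/accept4() creation-time flag, so passing it in recv()'s flags argument does nothing; MSG_DONTWAIT is the per-call non-blocking flag. A minimal sketch demonstrating the behavior:

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * On a blocking socket with no pending data, recv(..., MSG_DONTWAIT)
 * must return immediately with -1/EAGAIN instead of sleeping --
 * which is what the selftest actually wants from its recv() call.
 */
static int recv_dontwait_demo(void)
{
	int fds[2];
	char buf[8];
	ssize_t n;
	int ok = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds))
		return 0;

	/* Nothing has been sent yet: fail fast, don't block. */
	n = recv(fds[0], buf, sizeof(buf), MSG_DONTWAIT);
	if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
		ok = 1;

	close(fds[0]);
	close(fds[1]);
	return ok;
}
```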
Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
On Mon, Jan 20, 2025 at 12:52:09PM -0800, Nicolin Chen wrote:
> The counter of the number of events in the vEVENTQ could decrease
> when userspace reads the queue. But you were saying "the number of
> events that were sent into the queue", which is like a PROD index
> that would keep growing but reset to 0 after UINT_MAX?

yes

> > IOMMU_VEVENTQ_STATE_OVERFLOW with a 0 length event is seen if events
> > have been lost and no subsequent events are present. It exists to
> > ensure timely delivery of the overflow event to userspace. counter
> > will be the sequence number of the next successful event.
>
> So userspace should first read the header to decide whether or not
> to read a vEVENT. If header is overflows, it should skip the vEVENT
> struct and read the next header?

Yes, but there won't be a next header. Overflow would always be the
last thing in a read() response. If there is another event then
overflow is indicated by non-monotonic count.

> > If events are lost in the middle of the queue then flags will remain 0
> > but counter will become non-montonic. A counter delta > 1 indicates
> > that many events have been lost.
>
> I don't quite get the "no subsequent events" v.s. "in the middle of
> the queue"..

I mean to suppress specific overflow events to userspace if the counter
already fully indicates overflow. The purpose of the overflow event is
specifically, and only, to indicate immediately that an overflow
occurred at the end of the queue, and no additional events have been
pushed since the overflow. Without this we could lose an event and
userspace may not realize it for a long time.

> The producer is the driver calling iommufd_viommu_report_event that
> only produces a single vEVENT at a time. When the number of vEVENTs
> in the vEVENTQ hits the @veventq_depth, it won't insert new vEVENTs
> but add an overflow (or exception) node to the head of deliver list
> and increase the producer index so the next vEVENT that can find an
> empty space in the queue will have an index with a gap (delta >= 1)?

Yes, but each new overflow should move the single preallocated overflow
node back to the end of the list, and the read side should skip the
overflow node if it is not the last entry in the list.

Jason
Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
On Tue, Jan 21, 2025 at 04:09:24PM -0400, Jason Gunthorpe wrote: > On Tue, Jan 21, 2025 at 11:55:16AM -0800, Nicolin Chen wrote: > > > > > IOMMU_VEVENTQ_STATE_OVERFLOW with a 0 length event is seen if events > > > > > have been lost and no subsequent events are present. It exists to > > > > > ensure timely delivery of the overflow event to userspace. counter > > > > > will be the sequence number of the next successful event. > > > > > > > > So userspace should first read the header to decide whether or not > > > > to read a vEVENT. If header is overflows, it should skip the vEVENT > > > > struct and read the next header? > > > > > > Yes, but there won't be a next header. overflow would always be the > > > last thing in a read() response. If there is another event then > > > overflow is indicated by non-monotonic count. > > > > I am not 100% sure why "overflow would always be the last thing > > in a read() response". I thought that kernel should immediately > > report an overflow to user space when the vEVENTQ is overflowed. > > As below, if you observe overflow then it was at the end of the kernel > queue and there is no further events after it. So it should always end > up last. > > Perhaps we could enforce this directly in the kernel's read by making > it the only, first and last, response to read. Hmm, since the overflow node is the last node in the list, isn't it already an enforcement it's the only/first/last node to read? (Assuming we choose to delete the overflow node from the list if new event can be inserted.) > > Yet, thinking about this once again: user space actually has its > > own queue. There's probably no point in letting it know about an > > overflow immediately when the kernel vEVENTQ overflows until its > > own user queue overflows after it reads the entire vEVENTQ so it > > can trigger a vHW event/irq to the VM? 
> > The kernel has no idea what userspace is doing, the kernel's job > should be to ensure timely delivery of all events, if an event is lost > it should ensure timely delivery of the lost event notification. There > is little else it can do. Yet, "timely" means still having an entire-queue-size-long delay since the overflow node is at the end of the queue, right? Thanks Nicolin
Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
On Tue, Jan 21, 2025 at 01:02:16PM -0800, Nicolin Chen wrote: > On Tue, Jan 21, 2025 at 04:09:24PM -0400, Jason Gunthorpe wrote: > > On Tue, Jan 21, 2025 at 11:55:16AM -0800, Nicolin Chen wrote: > > > > > > IOMMU_VEVENTQ_STATE_OVERFLOW with a 0 length event is seen if events > > > > > > have been lost and no subsequent events are present. It exists to > > > > > > ensure timely delivery of the overflow event to userspace. counter > > > > > > will be the sequence number of the next successful event. > > > > > > > > > > So userspace should first read the header to decide whether or not > > > > > to read a vEVENT. If header is overflows, it should skip the vEVENT > > > > > struct and read the next header? > > > > > > > > Yes, but there won't be a next header. overflow would always be the > > > > last thing in a read() response. If there is another event then > > > > overflow is indicated by non-monotonic count. > > > > > > I am not 100% sure why "overflow would always be the last thing > > > in a read() response". I thought that kernel should immediately > > > report an overflow to user space when the vEVENTQ is overflowed. > > > > As below, if you observe overflow then it was at the end of the kernel > > queue and there is no further events after it. So it should always end > > up last. > > > > Perhaps we could enforce this directly in the kernel's read by making > > it the only, first and last, response to read. > > Hmm, since the overflow node is the last node in the list, isn't > it already an enforcement it's the only/first/last node to read? > (Assuming we choose to delete the overflow node from the list if > new event can be inserted.) Since we don't hold the spinlock the whole time there is a race where we could pull the overflow off and then another entry could be pushed while we do the copy_to_user. > > > Yet, thinking about this once again: user space actually has its > > > own queue. 
There's probably no point in letting it know about an > > > overflow immediately when the kernel vEVENTQ overflows until its > > > own user queue overflows after it reads the entire vEVENTQ so it > > > can trigger a vHW event/irq to the VM? > > > > The kernel has no idea what userspace is doing, the kernel's job > > should be to ensure timely delivery of all events, if an event is lost > > it should ensure timely delivery of the lost event notification. There > > is little else it can do. > > Yet, "timely" means still having an entire-queue-size-long delay > since the overflow node is at the end of the queue, right? Yes, but also in this case the vIOMMU isn't experiencing an overflow so it doesn't need to know about it. The main point here is if we somehow lose an event the vIOMMU driver may need to do something to recover from the lost event. It doesn't become relevant until the lost event is present in the virtual HW queue. There is also the minor detail of what happens if the hypervisor HW queue overflows - I don't know the answer here. It is security concerning since the VM can spam DMA errors at high rate. :| Jason
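[For readers following the thread: the read()-side contract being discussed — each event carries a sequence number, and a zero-length OVERFLOW header can only appear as the last entry of a response — can be sketched in plain C. The struct layout and flag name below are illustrative, not the actual iommufd uAPI.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VEVENTQ_STATE_OVERFLOW 1u /* illustrative flag, not the real uAPI name */

struct vevent_header {
	uint32_t flags;    /* 0 for a normal event */
	uint32_t sequence; /* per-queue sequence number */
	uint32_t len;      /* payload bytes that follow; 0 on overflow */
};

struct parse_result {
	unsigned events;   /* events successfully parsed */
	unsigned lost;     /* events inferred lost from sequence gaps */
	int overflow_last; /* overflow header seen as the final entry */
};

/* Walk one read() buffer: a gap in the sequence (delta > 1) means lost
 * events in the middle, and an overflow header terminates the response. */
static struct parse_result parse_veventq(const uint8_t *buf, size_t n)
{
	struct parse_result r = { 0, 0, 0 };
	uint32_t expect = 0;
	int have_expect = 0;
	size_t off = 0;

	while (n - off >= sizeof(struct vevent_header)) {
		struct vevent_header h;

		memcpy(&h, buf + off, sizeof(h));
		off += sizeof(h) + h.len;

		if (have_expect && h.sequence > expect)
			r.lost += h.sequence - expect;
		if (h.flags & VEVENTQ_STATE_OVERFLOW) {
			r.overflow_last = 1; /* nothing valid may follow */
			break;
		}
		expect = h.sequence + 1;
		have_expect = 1;
		r.events++;
	}
	return r;
}
```

Here the overflow header's sequence being "the sequence number of the next successful event" is what lets the consumer compute how many events were dropped at the tail.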
Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
On Tue, Jan 21, 2025 at 11:55:16AM -0800, Nicolin Chen wrote: > Ack. Then I think we should name it "index", beside a "counter" > indicating the number of events in the queue. Or perhaps a pair > of consumer and producer indexes that wrap at end of a limit. sequence perhaps would be a good name > > > > IOMMU_VEVENTQ_STATE_OVERFLOW with a 0 length event is seen if events > > > > have been lost and no subsequent events are present. It exists to > > > > ensure timely delivery of the overflow event to userspace. counter > > > > will be the sequence number of the next successful event. > > > > > > So userspace should first read the header to decide whether or not > > > to read a vEVENT. If header is overflows, it should skip the vEVENT > > > struct and read the next header? > > > > Yes, but there won't be a next header. overflow would always be the > > last thing in a read() response. If there is another event then > > overflow is indicated by non-monotonic count. > > I am not 100% sure why "overflow would always be the last thing > in a read() response". I thought that kernel should immediately > report an overflow to user space when the vEVENTQ is overflowed. As below, if you observe overflow then it was at the end of the kernel queue and there is no further events after it. So it should always end up last. Perhaps we could enforce this directly in the kernel's read by making it the only, first and last, response to read. > Yet, thinking about this once again: user space actually has its > own queue. There's probably no point in letting it know about an > overflow immediately when the kernel vEVENTQ overflows until its > own user queue overflows after it reads the entire vEVENTQ so it > can trigger a vHW event/irq to the VM? The kernel has no idea what userspace is doing, the kernel's job should be to ensure timely delivery of all events, if an event is lost it should ensure timely delivery of the lost event notification. There is little else it can do. 
I suppose userspace has a choice, it could discard events from the kernel when its virtual HW queue gets full, or it could backpressure the kernel and stop reading hoping the kernel queue will buffer it further. > > Without this we could lose an event and userspace may not realize > > it for a long time. > > I see. Because there is no further new event, there would be no > new index to indicate a gap. Thus, we need an overflow node. yes > If the number of events in the queue is below @veventq_depth as > userspace consumed the events from the queue, I think a new > iommufd_viommu_report_event call should delete the overflow node > from the end of the list, right? You can do that, or the read side can ignore a non-end overflow node. I'm not sure which option will turn out to be easier to implement.. Jason
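[A toy model of the producer-side bookkeeping being debated here — a bounded queue whose producer index always advances, with a single preallocated "overflow" marker that is only meaningful while it sits at the tail. All names and the array-backed list are invented for illustration; the real code uses a kernel linked list:]

```c
#include <assert.h>
#include <string.h>

#define VEVENTQ_DEPTH 2 /* illustrative @veventq_depth */

struct veventq {
	unsigned num;               /* events currently queued */
	unsigned seq;               /* producer index: always advances */
	unsigned ev[VEVENTQ_DEPTH]; /* sequence numbers, FIFO order */
	int overflow_at_tail;       /* preallocated overflow node is last */
};

/* Model of iommufd_viommu_report_event(): insert if there is room,
 * otherwise (re)attach the overflow marker at the tail.  Either way the
 * producer index advances, so a later insert shows up as a gap. */
static void report_event(struct veventq *q)
{
	if (q->num < VEVENTQ_DEPTH) {
		q->ev[q->num++] = q->seq;
		q->overflow_at_tail = 0; /* overflow is no longer the last node */
	} else {
		q->overflow_at_tail = 1;
	}
	q->seq++;
}

/* Model of the read side popping one event. */
static unsigned pop_event(struct veventq *q)
{
	unsigned s = q->ev[0];

	memmove(q->ev, q->ev + 1, --q->num * sizeof(q->ev[0]));
	return s;
}
```

Clearing `overflow_at_tail` on a successful insert is the "delete the overflow node" option; keeping it set but having the reader ignore a non-end marker is the alternative Jason mentions.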
Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
On Tue, Jan 21, 2025 at 02:36:11PM -0400, Jason Gunthorpe wrote: > On Mon, Jan 20, 2025 at 12:52:09PM -0800, Nicolin Chen wrote: > > The counter of the number of events in the vEVENTQ could decrease > > when userspace reads the queue. But you were saying "the number of > > events that were sent into the queue", which is like a PROD index > > that would keep growing but reset to 0 after UINT_MAX? > > yes Ack. Then I think we should name it "index", beside a "counter" indicating the number of events in the queue. Or perhaps a pair of consumer and producer indexes that wrap at end of a limit. > > > IOMMU_VEVENTQ_STATE_OVERFLOW with a 0 length event is seen if events > > > have been lost and no subsequent events are present. It exists to > > > ensure timely delivery of the overflow event to userspace. counter > > > will be the sequence number of the next successful event. > > > > So userspace should first read the header to decide whether or not > > to read a vEVENT. If header is overflows, it should skip the vEVENT > > struct and read the next header? > > Yes, but there won't be a next header. overflow would always be the > last thing in a read() response. If there is another event then > overflow is indicated by non-monotonic count. I am not 100% sure why "overflow would always be the last thing in a read() response". I thought that kernel should immediately report an overflow to user space when the vEVENTQ is overflowed. Yet, thinking about this once again: user space actually has its own queue. There's probably no point in letting it know about an overflow immediately when the kernel vEVENTQ overflows until its own user queue overflows after it reads the entire vEVENTQ so it can trigger a vHW event/irq to the VM? > > > If events are lost in the middle of the queue then flags will remain 0 > > > but counter will become non-montonic. A counter delta > 1 indicates > > > that many events have been lost. > > > > I don't quite get the "no subsequent events" v.s. 
"in the middle of > > the queue".. > > I mean to supress specific overflow events to userspace if the counter already > fully indicates overflow. > > The purpose of the overflow event is specifically, and only, to > indicate immediately that an overflow occured at the end of the queue, > and no additional events have been pushed since the overflow. > > Without this we could loose an event and userspace may not realize > it for a long time. I see. Because there is no further new event, there would be no new index to indicate a gap. Thus, we need an overflow node. > > The producer is the driver calling iommufd_viommu_report_event that > > only produces a single vEVENT at a time. When the number of vEVENTs > > in the vEVENTQ hits the @veventq_depth, it won't insert new vEVENTs > > but add an overflow (or exception) node to the head of deliver list > > and increase the producer index so the next vEVENT that can find an > > empty space in the queue will have an index with a gap (delta >= 1)? > > Yes, but each new overflow should move the single preallocated > overflow node back to the end of the list, and the read side should > skip the overflow node if it is not the last entry in the list If the number of events in the queue is below @veventq_depth as userspace consumed the events from the queue, I think a new iommufd_viommu_report_event call should delete the overflow node from the end of the list, right? Since it can just insert a new event to the list by marking it a non-monotonic index.. Thanks! Nicolin
Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
On Tue, Jan 21, 2025 at 01:40:05PM -0800, Nicolin Chen wrote: > > There is also the minor detail of what happens if the hypervisor HW > > queue overflows - I don't know the answer here. It is security > > concerning since the VM can spam DMA errors at high rate. :| > > In my view, the hypervisor queue is the vHW queue for the VM, so > it should act like a HW, which means it's up to the guest kernel > driver that handles the high rate DMA errors.. I'm mainly wondering what happens if the single physical kernel event queue overflows because it is DOS'd by a VM and the hypervisor cannot drain it fast enough? I haven't looked closely but is there some kind of rate limiting or otherwise to mitigate DOS attacks on the shared event queue from VMs? Jason
Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
On Tue, Jan 21, 2025 at 05:14:04PM -0400, Jason Gunthorpe wrote: > On Tue, Jan 21, 2025 at 01:02:16PM -0800, Nicolin Chen wrote: > > On Tue, Jan 21, 2025 at 04:09:24PM -0400, Jason Gunthorpe wrote: > > > On Tue, Jan 21, 2025 at 11:55:16AM -0800, Nicolin Chen wrote: > > > > > > > IOMMU_VEVENTQ_STATE_OVERFLOW with a 0 length event is seen if > > > > > > > events > > > > > > > have been lost and no subsequent events are present. It exists to > > > > > > > ensure timely delivery of the overflow event to userspace. counter > > > > > > > will be the sequence number of the next successful event. > > > > > > > > > > > > So userspace should first read the header to decide whether or not > > > > > > to read a vEVENT. If header is overflows, it should skip the vEVENT > > > > > > struct and read the next header? > > > > > > > > > > Yes, but there won't be a next header. overflow would always be the > > > > > last thing in a read() response. If there is another event then > > > > > overflow is indicated by non-monotonic count. > > > > > > > > I am not 100% sure why "overflow would always be the last thing > > > > in a read() response". I thought that kernel should immediately > > > > report an overflow to user space when the vEVENTQ is overflowed. > > > > > > As below, if you observe overflow then it was at the end of the kernel > > > queue and there is no further events after it. So it should always end > > > up last. > > > > > > Perhaps we could enforce this directly in the kernel's read by making > > > it the only, first and last, response to read. > > > > Hmm, since the overflow node is the last node in the list, isn't > > it already an enforcement it's the only/first/last node to read? > > (Assuming we choose to delete the overflow node from the list if > > new event can be inserted.) > > Since we don't hold the spinlock the whole time there is a race where > we could pull the overflow off and then another entry could be pushed > while we do the copy_to_user. I see. 
I'll be careful around that. I can imagine that one tricky thing can be to restore the overflow node back to the list when a copy_to_user fails.. > > > > Yet, thinking about this once again: user space actually has its > > > > own queue. There's probably no point in letting it know about an > > > > overflow immediately when the kernel vEVENTQ overflows until its > > > > own user queue overflows after it reads the entire vEVENTQ so it > > > > can trigger a vHW event/irq to the VM? > > > > > > The kernel has no idea what userspace is doing, the kernel's job > > > should be to ensure timely delivery of all events, if an event is lost > > > it should ensure timely delivery of the lost event notification. There > > > is little else it can do. > > > > Yet, "timely" means still having an entire-queue-size-long delay > > since the overflow node is at the end of the queue, right? > > Yes, but also in this case the vIOMMU isn't experiencing an overflow > so it doesn't need to know about it. > > The main point here is if we somehow lose an event the vIOMMU driver > may need to do something to recover from the lost event. It doesn't > become relevant until the lost event is present in the virtual HW > queue. Ack. > There is also the minor detail of what happens if the hypervisor HW > queue overflows - I don't know the answer here. It is security > concerning since the VM can spam DMA errors at high rate. :| In my view, the hypervisor queue is the vHW queue for the VM, so it should act like a HW, which means it's up to the guest kernel driver that handles the high rate DMA errors.. The entire thing is complicated since we have a double buffer: a kernel queue and a hypervisor queue. I am not very sure if both queues should have the same depth or perhaps the kernel one might have a slightly bigger size to buffer the hypervisor one. If only kernel could directly add events to the hypervisor queue and also set the overflow flag in the hypervisor.. Thanks Nicolin
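[The "restore the node when copy_to_user fails" pattern discussed above is the classic pop-under-lock / copy-outside-lock / push-back-on-failure shape. A single-threaded toy model (locking elided, a flaky copy function standing in for copy_to_user; everything here is invented for illustration):]

```c
#include <assert.h>
#include <string.h>

/* Toy deliver list backed by an array: pop the head entry, attempt the
 * fallible copy, and push the entry back on failure so it is not lost. */
struct toy_list {
	unsigned vals[8];
	unsigned num;
};

static int pop_head(struct toy_list *l, unsigned *out)
{
	if (!l->num)
		return 0;
	*out = l->vals[0];
	memmove(l->vals, l->vals + 1, --l->num * sizeof(l->vals[0]));
	return 1;
}

static void push_head(struct toy_list *l, unsigned v)
{
	memmove(l->vals + 1, l->vals, l->num++ * sizeof(l->vals[0]));
	l->vals[0] = v;
}

/* Stand-in for copy_to_user(): fails on the first attempt. */
static int flaky_copy(unsigned v, unsigned *dst, int *fail_once)
{
	if (*fail_once) {
		*fail_once = 0;
		return -1;
	}
	*dst = v;
	return 0;
}

static int read_one(struct toy_list *l, unsigned *dst, int *fail_once)
{
	unsigned v;

	if (!pop_head(l, &v))
		return 0;
	if (flaky_copy(v, dst, fail_once)) {
		push_head(l, v); /* restore so the event is not dropped */
		return -1;
	}
	return 1;
}
```

The race Jason points out is exactly the window between `pop_head()` and the copy: with the lock dropped, the producer may append new entries (or re-arm the overflow node) before the restore happens, so the restore path must re-check ordering, not just prepend blindly.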
Re: [PATCH v6 01/15] lib: Add TLV parser
On 2025-01-21 15:55:28+0100, Roberto Sassu wrote: > On Tue, 2025-01-21 at 14:29 +0100, Thomas Weißschuh wrote: > > On 2024-11-19 11:49:08+0100, Roberto Sassu wrote: [..] > > > +typedef int (*callback)(void *callback_data, __u16 field, > > > + const __u8 *field_data, __u32 field_len); > > > > No need for __underscore types in kernel-only signatures. > > It is just for convenience. I'm reusing the same file for the userspace > counterpart digest-cache-tools. In that case, the parser is used for > example to show the content of the digest list. This reuse leads to quite some ugly constructs. Given that the single function will be really simple after removing the unnecessary parts, maybe two clean copies are easier. One copy is needed for Frama-C anyways. > > > + > > > +int tlv_parse(callback callback, void *callback_data, const __u8 *data, > > > + size_t data_len, const char **fields, __u32 num_fields); > > > + > > > +#endif /* _LINUX_TLV_PARSER_H */ > > > diff --git a/include/uapi/linux/tlv_parser.h > > > b/include/uapi/linux/tlv_parser.h > > > new file mode 100644 > > > index ..171d0cfd2c4c > > > --- /dev/null > > > +++ b/include/uapi/linux/tlv_parser.h > > > @@ -0,0 +1,41 @@ > > > +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ > > > +/* > > > + * Copyright (C) 2023-2024 Huawei Technologies Duesseldorf GmbH > > > + * > > > + * Author: Roberto Sassu > > > + * > > > + * Implement the user space interface for the TLV parser. > > > + */ > > > > Can you explain in the commit message where this will be exposed to > > userspace as binary? > > I see that my explanation is not ideal. > > This is the format for data exchange between user space and kernel > space, but it is still the kernel that reads and parses the TLV- > formatted file for extracting the digests and adding them to the digest > cache. I figured that out :-) It should be clear from the commit itself, though. 
> > > + > > > +#ifndef _UAPI_LINUX_TLV_PARSER_H > > > +#define _UAPI_LINUX_TLV_PARSER_H > > > + > > > +#include > > > + > > > +/* > > > + * TLV format: > > > + *
> > > + * +--------------+------------+------------------+
> > > + * | field1 (u16) | len1 (u32) | value1 (u8 len1) |
> > > + * +--------------+------------+------------------+
> > > + * | ...          | ...        | ...              |
> > > + * +--------------+------------+------------------+
> > > + * | fieldN (u16) | lenN (u32) | valueN (u8 lenN) |
> > > + * +--------------+------------+------------------+
> > > + */ > > > + > > > +/** > > > + * struct tlv_entry - Entry of TLV format > > > + * @field: Field identifier > > > + * @length: Data length > > > + * @data: Data > > > + * > > > + * This structure represents an entry of the TLV format. > > > + */ > > > +struct tlv_entry { > > > + __u16 field; > > > + __u32 length; Looking at this again, the "length" field is unaligned by default. Also FYI there is already a TLV implementation in include/uapi/linux/tipc_config. > > > +} __attribute__((packed)); [..] > > Some kunit tests would be great. > > I implemented kselftests also injecting errors (patch 13). If it is not > enough, I implement kunit tests too. These selftests are for the digest_cache. If the TLV library is meant to be used alone, some dedicated tests would be nice. kunit has the advantage that it can directly call kernel functions with arbitrary parameters and does not require any userspace setup.
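[To illustrate Thomas's point about the packed layout: with a `__u16 field` followed by a `__u32 length`, the length sits at offset 2 and is unaligned, so a portable parser reads both header fields with memcpy(). A minimal userspace sketch of the callback-driven walk — names are invented here, this is not Roberto's actual tlv_parse():]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TLV_HDR_LEN 6 /* packed u16 field + u32 length */

typedef int (*tlv_cb)(void *priv, uint16_t field,
		      const uint8_t *data, uint32_t len);

/* Walk a buffer of packed TLV entries.  "length" is unaligned (offset 2
 * in a packed struct), so both header fields are read via memcpy(). */
static int toy_tlv_parse(const uint8_t *buf, size_t n, tlv_cb cb, void *priv)
{
	size_t off = 0;

	while (n - off >= TLV_HDR_LEN) {
		uint16_t field;
		uint32_t len;
		int ret;

		memcpy(&field, buf + off, sizeof(field));
		memcpy(&len, buf + off + sizeof(field), sizeof(len));
		off += TLV_HDR_LEN;
		if (len > n - off)
			return -1; /* truncated value */
		ret = cb(priv, field, buf + off, len);
		if (ret)
			return ret;
		off += len;
	}
	return off == n ? 0 : -1; /* trailing garbage is an error */
}

/* Example callback: just count entries. */
static int count_cb(void *priv, uint16_t field, const uint8_t *data,
		    uint32_t len)
{
	(void)field; (void)data; (void)len;
	*(unsigned *)priv += 1;
	return 0;
}
```

Note this sketch uses host endianness for the header fields, which matters once the format crosses the user/kernel boundary on foreign-endian machines — another argument for nailing down the on-disk encoding in the commit message.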
Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
On Tue, Jan 21, 2025 at 08:21:28PM -0400, Jason Gunthorpe wrote: > On Tue, Jan 21, 2025 at 01:40:05PM -0800, Nicolin Chen wrote: > > > There is also the minor detail of what happens if the hypervisor HW > > > queue overflows - I don't know the answer here. It is security > > > concerning since the VM can spam DMA errors at high rate. :| > > > > In my view, the hypervisor queue is the vHW queue for the VM, so > > it should act like a HW, which means it's up to the guest kernel > > driver that handles the high rate DMA errors.. > > I'm mainly wondering what happens if the single physical kernel > event queue overflows because it is DOS'd by a VM and the hypervisor > cannot drain it fast enough? > > I haven't looked closely but is there some kind of rate limiting or > otherwise to mitigate DOS attacks on the shared event queue from VMs? SMMUv3 reads the event out of the physical kernel event queue, and adds that to faultq or veventq or prints it out. So, it'd not overflow because of DOS? And all other drivers should do the same? Thanks Nicolin
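[On the rate-limiting question Jason raises: one common mitigation shape, whether applied per-VM in the hypervisor or in the kernel delivery path, is a token bucket that drops (or coalesces into a single overflow record) events beyond a budget. A hypothetical userspace sketch, not taken from any of the patches under discussion:]

```c
#include <assert.h>

/* Per-VM token bucket: each event consumes one token; tokens refill at
 * "rate" per second up to "burst".  Events beyond the budget return 0
 * and could be folded into one overflow/lost-events record instead. */
struct token_bucket {
	double tokens;
	double rate;  /* tokens added per second */
	double burst; /* bucket capacity */
	double last;  /* timestamp of the last update, in seconds */
};

static int tb_allow(struct token_bucket *tb, double now)
{
	tb->tokens += (now - tb->last) * tb->rate;
	if (tb->tokens > tb->burst)
		tb->tokens = tb->burst;
	tb->last = now;
	if (tb->tokens >= 1.0) {
		tb->tokens -= 1.0;
		return 1;
	}
	return 0; /* over budget: drop or coalesce the event */
}
```

The trade-off mirrors the thread: dropping under pressure keeps the shared queue live for other VMs, at the cost of the DOS'ing VM losing visibility into its own faults, which is arguably the right incentive.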