Re: [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts

2025-01-10 Thread Dan Williams
Andrew Morton wrote:
> On Thu, 9 Jan 2025 23:05:56 -0800 Dan Williams wrote:
> 
> > >  - Remove PTE_DEVMAP definitions from Loongarch which were added since
> > >this series was initially written.
> > [..]
> > > 
> > > base-commit: e25c8d66f6786300b680866c0e0139981273feba
> > 
> > If this is going to go through nvdimm.git I will need it against a
> > mainline tag baseline. Linus will want to see the merge conflicts.
> > 
> > Otherwise if that merge commit is too messy, or you would rather not
> > rebase, then it needs to go via one of two options:
> > 
> > - Andrew's tree which is the only tree I know of that can carry
> >   patches relative to linux-next.
> 
> I used to be able to do that but haven't got around to setting up such
> a thing with mm.git.  This is the first time the need has arisen,
> really.

Oh, good to know.

> 
> > - Wait for v6.14-rc1 
> 
> I'm thinking so.  Darrick's review comments indicate that we'll be seeing a 
> v7.
> 
> > and get this into nvdimm.git early in the cycle
> >   when the conflict storm will be low.
> 
> erk.  This patchset hits mm/ a lot, and nvdimm hardly at all.  Is it
> not practical to carry this in mm.git?

I'm totally fine with it going through mm.git. nvdimm.git is just the
historical path for touches to fs/dax.c, and git blame points mostly to
me for the issues Alistair is fixing. I am happy to review and ack and
watch this go through mm.git.



Re: [PATCH v13 5/5] rust: Use gendwarfksyms + extended modversions for CONFIG_MODVERSIONS

2025-01-10 Thread Masahiro Yamada
On Sat, Jan 4, 2025 at 2:37 AM Matthew Maurer  wrote:
>
> From: Sami Tolvanen 
>
> Previously, two things stopped Rust from using MODVERSIONS:
> 1. Rust symbols are occasionally too long to be represented in the
>original versions table
> 2. Rust types cannot be properly hashed by the existing genksyms
>approach because:
> * Looking up type definitions in Rust is more complex than C
> * Type layout is potentially dependent on the compiler in Rust,
>   not just the source type declaration.
>
> CONFIG_EXTENDED_MODVERSIONS addresses the first point, and
> CONFIG_GENDWARFKSYMS the second. If Rust wants to use MODVERSIONS, allow
> it to do so by selecting both features.
>
> Signed-off-by: Sami Tolvanen 
> Co-developed-by: Matthew Maurer 
> Signed-off-by: Matthew Maurer 
> ---
>  init/Kconfig  |  3 ++-
>  rust/Makefile | 34 --
>  2 files changed, 34 insertions(+), 3 deletions(-)
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 
> c1f9eb3d5f2e892e977ba1425599502dc830f552..b60acfd9431e0ac2bf401ecb6523b5104ad31150
>  100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1959,7 +1959,8 @@ config RUST
> bool "Rust support"
> depends on HAVE_RUST
> depends on RUST_IS_AVAILABLE
> -   depends on !MODVERSIONS
> +   select EXTENDED_MODVERSIONS if MODVERSIONS
> +   depends on !MODVERSIONS || GENDWARFKSYMS
> depends on !GCC_PLUGIN_RANDSTRUCT
> depends on !RANDSTRUCT
> depends on !DEBUG_INFO_BTF || PAHOLE_HAS_LANG_EXCLUDE
> diff --git a/rust/Makefile b/rust/Makefile
> index 
> a40a3936126d603836e0ec9b42a1285916b60e45..80f970ad81f7989afe5ff2b5f633f50feb7f6006
>  100644
> --- a/rust/Makefile
> +++ b/rust/Makefile
> @@ -329,10 +329,11 @@ $(obj)/bindings/bindings_helpers_generated.rs: private 
> bindgen_target_extra = ;
>  $(obj)/bindings/bindings_helpers_generated.rs: $(src)/helpers/helpers.c FORCE
> $(call if_changed_dep,bindgen)
>
> +rust_exports = $(NM) -p --defined-only $(1) | awk '$$2~/(T|R|D|B)/ && $$3!~/__cfi/ { printf $(2),$(3) }'
> +
>  quiet_cmd_exports = EXPORTS $@
>cmd_exports = \
> -   $(NM) -p --defined-only $< \
> -   | awk '$$2~/(T|R|D|B)/ && $$3!~/__cfi/ {printf "EXPORT_SYMBOL_RUST_GPL(%s);\n",$$3}' > $@
> +   $(call rust_exports,$<,"EXPORT_SYMBOL_RUST_GPL(%s);\n",$$3) > $@

I noticed a nit:

Both callsites of rust_exports pass
'$$3' as the last parameter instead of hardcoding it.

Is that for flexibility for future extensions?

I cannot think of any use other than printing
the third column, i.e. the symbol name.





-- 
Best Regards
Masahiro Yamada



Re: [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts

2025-01-10 Thread Andrew Morton
On Thu, 9 Jan 2025 23:05:56 -0800 Dan Williams  wrote:

> >  - Remove PTE_DEVMAP definitions from Loongarch which were added since
> >this series was initially written.
> [..]
> > 
> > base-commit: e25c8d66f6786300b680866c0e0139981273feba
> 
> If this is going to go through nvdimm.git I will need it against a
> mainline tag baseline. Linus will want to see the merge conflicts.
> 
> Otherwise if that merge commit is too messy, or you would rather not
> rebase, then it needs to go via one of two options:
> 
> - Andrew's tree which is the only tree I know of that can carry
>   patches relative to linux-next.

I used to be able to do that but haven't got around to setting up such
a thing with mm.git.  This is the first time the need has arisen,
really.

> - Wait for v6.14-rc1 

I'm thinking so.  Darrick's review comments indicate that we'll be seeing a v7.

> and get this into nvdimm.git early in the cycle
>   when the conflict storm will be low.

erk.  This patchset hits mm/ a lot, and nvdimm hardly at all.  Is it
not practical to carry this in mm.git?




Re: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC

2025-01-10 Thread Nicolin Chen
On Fri, Jan 10, 2025 at 03:49:50PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 11:27:53AM -0800, Nicolin Chen wrote:
> > On Fri, Jan 10, 2025 at 01:48:42PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Jan 07, 2025 at 09:10:09AM -0800, Nicolin Chen wrote:
> > > 
> > > > +static ssize_t iommufd_veventq_fops_read(struct iommufd_eventq *eventq,
> > > > +char __user *buf, size_t count,
> > > > +loff_t *ppos)
> > > > +{
> > > > +   size_t done = 0;
> > > > +   int rc = 0;
> > > > +
> > > > +   if (*ppos)
> > > > +   return -ESPIPE;
> > > > +
> > > > +   mutex_lock(&eventq->mutex);
> > > > +   while (!list_empty(&eventq->deliver) && count > done) {
> > > > +   struct iommufd_vevent *cur = list_first_entry(
> > > > +   &eventq->deliver, struct iommufd_vevent, node);
> > > > +
> > > > +   if (cur->data_len > count - done)
> > > > +   break;
> > > > +
> > > > +   if (copy_to_user(buf + done, cur->event_data, cur->data_len)) {
> > > > +   rc = -EFAULT;
> > > > +   break;
> > > > +   }
> > > 
> > > Now that I look at this more closely, the fault path this is copied
> > > from is not great.
> > > 
> > > This copy_to_user() can block while waiting on a page fault, possibly
> > > for a long time. While blocked the mutex is held and we can't add more
> > > entries to the list.
> > >
> > > That will cause the shared IRQ handler in the iommu driver to back up,
> > > which would cause a global DOS.
> > >
> > > This probably wants to be organized to look more like
> > > 
> > > while (itm = eventq_get_next_item(eventq)) {
> > >if (..) {
> > >eventq_restore_failed_item(eventq);
> > >return -1;
> > >}
> > > }
> > > 
> > > Where the next_item would just be a simple spinlock across the linked
> > > list manipulation.
> > 
> > Would it be simpler to just limit it to one node per read(), i.e.
> > no "while (!list_empty)" and no blocking?
> > 
> > The report() adds one node at a time, and wakes up the poll()
> > each time a node is added. And user space could read one event
> > at a time too?
> 
> That doesn't really help, the issue is it holds the lock over the
> copy_to_user(), which it is doing because it doesn't want to pull the item off
> the list and then try to handle the failure and put it back.

Hmm, it seems that I haven't got your first narrative straight...

Would you mind elaborating on "copy_to_user() can block while waiting
on a page fault"? When would this happen?

Thanks
Nicolin
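
For reference, a minimal sketch of the shape Jason describes, with the
copy done outside the lock; the helper names and the spinlock field are
hypothetical, not the actual iommufd code:

/* Sketch only: pop one event under a short spinlock, copy it to
 * userspace outside the lock, and restore it on failure.
 */
static struct iommufd_vevent *eventq_get_next_item(struct iommufd_eventq *eventq)
{
	struct iommufd_vevent *cur = NULL;

	spin_lock(&eventq->lock);
	if (!list_empty(&eventq->deliver)) {
		cur = list_first_entry(&eventq->deliver,
				       struct iommufd_vevent, node);
		list_del(&cur->node);
	}
	spin_unlock(&eventq->lock);
	return cur;
}

static void eventq_restore_failed_item(struct iommufd_eventq *eventq,
				       struct iommufd_vevent *cur)
{
	/* Put the event back at the head so nothing is lost or reordered. */
	spin_lock(&eventq->lock);
	list_add(&cur->node, &eventq->deliver);
	spin_unlock(&eventq->lock);
}

With this shape the IRQ-side report path only ever contends on the short
spinlock, never on a mutex held across a potentially faulting
copy_to_user().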



Re: [PATCH] Add short author date to Fixes tag

2025-01-10 Thread Greg Kroah-Hartman
On Fri, Jan 10, 2025 at 04:21:35PM -0800, Jacob Keller wrote:
> However, all of the existing tooling we have for the kernel does not
> support the date, and I think it's not worth trying to change it at this
> point. It doesn't make sense to break all this tooling for information
> which is accessible in other forms. Indeed, as long as the hash is
> sufficiently long, the chance of a collision is minimal.

And if it isn't long enough, tools like:
https://lore.kernel.org/r/20241226220555.3540872-1-sas...@kernel.org
can help figure it out as well.

thanks,

greg k-h



Re: [PATCH v2 2/3] tun: Pad virtio header with zero

2025-01-10 Thread Michael S. Tsirkin
On Fri, Jan 10, 2025 at 01:38:06PM +0900, Akihiko Odaki wrote:
> On 2025/01/09 21:46, Willem de Bruijn wrote:
> > Akihiko Odaki wrote:
> > > On 2025/01/09 16:31, Michael S. Tsirkin wrote:
> > > > On Thu, Jan 09, 2025 at 03:58:44PM +0900, Akihiko Odaki wrote:
> > > > > tun used to simply advance iov_iter when it needs to pad virtio 
> > > > > header,
> > > > > which leaves the garbage in the buffer as is. This is especially
> > > > > problematic when tun starts to allow enabling the hash reporting
> > > > > feature; even if the feature is enabled, the packet may lack a hash
> > > > > value and may contain a hole in the virtio header because the packet
> > > > > arrived before the feature gets enabled or does not contain the
> > > > > header fields to be hashed. If the hole is not filled with zero, it is
> > > > > impossible to tell if the packet lacks a hash value.
> > 
> > Zero is a valid hash value, so cannot be used as an indication that
> > hashing is inactive.
> 
> Zeroing will initialize the hash_report field to
> VIRTIO_NET_HASH_REPORT_NONE, which tells it does not have a hash value.
> 
> > 
> > > > > In theory, a user of tun can fill the buffer with zero before calling
> > > > > read() to avoid such a problem, but leaving the garbage in the buffer 
> > > > > is
> > > > > awkward anyway so fill the buffer in tun.
> > > > > 
> > > > > Signed-off-by: Akihiko Odaki 
> > > > 
> > > > But if the user did it, you have just overwritten his value,
> > > > did you not?
> > > 
> > > Yes, but that means the user expects some part of the buffer is not filled
> > > after read() or recvmsg(). I'm a bit worried that not filling the buffer
> > > may break assumptions others (especially the filesystem and socket
> > > infrastructures in the kernel) may have.
> > 
> > If this is user memory that is ignored by the kernel, just reflected
> > back, then there is no need in general to zero it. There are many such
> > instances, also in msg_control.
> 
> More specifically, is there any instance of recvmsg() implementation which
> returns N and does not fill the complete N bytes of msg_iter?

The one in tun. It was a silly idea but it has been here for years now.


> > 
> > If not zeroing leads to ambiguity with the new feature, that would be
> > a reason to add it -- it is always safe to do so.
> > > If we are really confident that it will not cause problems, this
> > > behavior can be opt-in based on a flag or we can just write some
> > > documentation warning userspace programmers to initialize the buffer.




Re: [PATCH v2 3/3] tun: Set num_buffers for virtio 1.0

2025-01-10 Thread Akihiko Odaki

On 2025/01/10 12:27, Jason Wang wrote:

On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki  wrote:


The specification says the device MUST set num_buffers to 1 if
VIRTIO_NET_F_MRG_RXBUF has not been negotiated.


Have we agreed on how to fix the spec or not?

As I replied in the spec patch, if we just remove this "MUST", it
looks like we are all fine?


My understanding is that we should fix the kernel and QEMU instead. 
There may be some driver implementations that assume num_buffers is 1, 
so the kernel and QEMU should be fixed to be compatible with such 
potential implementations.


It is also possible to make future drivers work with existing kernels and 
QEMU by ensuring they do not read num_buffers when VIRTIO_NET_F_MRG_RXBUF 
has not been negotiated, and that's what "[PATCH v3] virtio-net: Ignore 
num_buffers when unused" does.

https://lore.kernel.org/r/20250110-reserved-v3-1-2ade0a5d2...@daynix.com
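
For illustration, a minimal sketch of such a driver-side guard; it mirrors
the shape of the receive path in drivers/net/virtio_net.c, but this is not
the actual patch, and hdr is assumed to be a struct virtio_net_hdr_mrg_rxbuf
pointer:

/* Sketch: only trust num_buffers when VIRTIO_NET_F_MRG_RXBUF was
 * negotiated; otherwise assume 1, whatever the device wrote there.
 */
u16 num_buffers;

if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
	num_buffers = virtio16_to_cpu(vdev, hdr->num_buffers);
else
	num_buffers = 1;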

Regards,
Akihiko Odaki



Re: [PATCH v2 2/3] tun: Pad virtio header with zero

2025-01-10 Thread Akihiko Odaki

On 2025/01/10 12:27, Jason Wang wrote:

On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki  wrote:


tun used to simply advance iov_iter when it needs to pad virtio header,
which leaves the garbage in the buffer as is. This is especially
problematic when tun starts to allow enabling the hash reporting
feature; even if the feature is enabled, the packet may lack a hash
value and may contain a hole in the virtio header because the packet
arrived before the feature gets enabled or does not contain the
header fields to be hashed. If the hole is not filled with zero, it is
impossible to tell if the packet lacks a hash value.


I'm not sure I will get here; could we do this in the hash reporting series?


I'll create another series dedicated to this and the num_buffers change, as 
suggested by Willem.






In theory, a user of tun can fill the buffer with zero before calling
read() to avoid such a problem, but leaving the garbage in the buffer is
awkward anyway so fill the buffer in tun.

Signed-off-by: Akihiko Odaki 
---
  drivers/net/tun_vnet.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun_vnet.c b/drivers/net/tun_vnet.c
index fe842df9e9ef..ffb2186facd3 100644
--- a/drivers/net/tun_vnet.c
+++ b/drivers/net/tun_vnet.c
@@ -138,7 +138,8 @@ int tun_vnet_hdr_put(int sz, struct iov_iter *iter,
 if (copy_to_iter(hdr, sizeof(*hdr), iter) != sizeof(*hdr))
 return -EFAULT;

-   iov_iter_advance(iter, sz - sizeof(*hdr));
+   if (iov_iter_zero(sz - sizeof(*hdr), iter) != sz - sizeof(*hdr))
+   return -EFAULT;

 return 0;


There're various callers of iov_iter_advance(), do we need to fix them all?


No. For example, there are iov_iter_advance() calls for SOCK_ZEROCOPY in 
tun_get_user() and tap_get_user(). They are fine as they do not write to 
the buffer after skipping.


The problem is that read_iter() and recvmsg() say they wrote N bytes but 
leave some of those N bytes uninitialized. Such an implementation may be 
created even without iov_iter_advance() (for example, by just returning 
too big a number), and it is equally problematic with the current 
tun_get_user()/tap_get_user().
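
As an aside, the ambiguity is easy to demonstrate from userspace with a
sketch like the one below; the tun fd setup (IFF_VNET_HDR plus a
TUNSETVNETHDRSZ larger than sizeof(struct virtio_net_hdr)) is assumed to
happen elsewhere:

/* Sketch: poison the buffer, read one packet, and check whether the
 * pad between the base header and the configured vnet header size was
 * actually written (zero) or left holding the poison.
 */
#include <string.h>
#include <unistd.h>
#include <linux/virtio_net.h>

static int hdr_pad_is_zero(int tun_fd, size_t vnet_hdr_sz)
{
	unsigned char buf[65536];
	ssize_t n;
	size_t i;

	memset(buf, 0xaa, sizeof(buf));
	n = read(tun_fd, buf, sizeof(buf));
	if (n < (ssize_t)vnet_hdr_sz)
		return -1;
	for (i = sizeof(struct virtio_net_hdr); i < vnet_hdr_sz; i++)
		if (buf[i] != 0)
			return 0;	/* poison (or garbage) survived */
	return 1;
}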


Regards,
Akihiko Odaki



Thanks


  }




--
2.47.1








Re: [PATCH v2 3/3] tun: Set num_buffers for virtio 1.0

2025-01-10 Thread Michael S. Tsirkin
On Fri, Jan 10, 2025 at 11:27:13AM +0800, Jason Wang wrote:
> On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki  wrote:
> >
> > The specification says the device MUST set num_buffers to 1 if
> > VIRTIO_NET_F_MRG_RXBUF has not been negotiated.
> 
> Have we agreed on how to fix the spec or not?
> 
> As I replied in the spec patch, if we just remove this "MUST", it
> looks like we are all fine?
> 
> Thanks

We should replace MUST with SHOULD, but it is not all fine;
ignoring SHOULD is a quality-of-implementation issue.




Re: [PATCH v2 1/3] tun: Unify vnet implementation

2025-01-10 Thread Akihiko Odaki

On 2025/01/09 23:06, Willem de Bruijn wrote:

Akihiko Odaki wrote:

Both tun and tap exposes the same set of virtio-net-related features.
Unify their implementations to ease future changes.

Signed-off-by: Akihiko Odaki 
---
  MAINTAINERS|   1 +
  drivers/net/Kconfig|   5 ++
  drivers/net/Makefile   |   1 +
  drivers/net/tap.c  | 172 ++--
  drivers/net/tun.c  | 208 -
  drivers/net/tun_vnet.c | 186 +++
  drivers/net/tun_vnet.h |  24 ++
  7 files changed, 273 insertions(+), 324 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 910305c11e8a..1be8a452d11f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23903,6 +23903,7 @@ F:  Documentation/networking/tuntap.rst
  F:arch/um/os-Linux/drivers/
  F:drivers/net/tap.c
  F:drivers/net/tun.c
+F: drivers/net/tun_vnet.h
  
  TURBOCHANNEL SUBSYSTEM

  M:"Maciej W. Rozycki" 
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 1fd5acdc73c6..255c8f9f1d7c 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
tristate "Universal TUN/TAP device driver support"
depends on INET
select CRC32
+   select TUN_VNET


No need for this new Kconfig


I will merge tun_vnet.c into TAP.




  static struct proto tap_proto = {
.name = "tap",
@@ -641,10 +576,10 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
struct sk_buff *skb;
struct tap_dev *tap;
unsigned long total_len = iov_iter_count(from);
-   unsigned long len = total_len;
+   unsigned long len;
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
-   int vnet_hdr_len = 0;
+   int hdr_len;
int copylen = 0;
int depth;
bool zerocopy = false;
@@ -652,38 +587,20 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
enum skb_drop_reason drop_reason;
  
  	if (q->flags & IFF_VNET_HDR) {

-   vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz);
-
-   err = -EINVAL;
-   if (len < vnet_hdr_len)
-   goto err;
-   len -= vnet_hdr_len;
-
-   err = -EFAULT;
-   if (!copy_from_iter_full(&vnet_hdr, sizeof(vnet_hdr), from))
-   goto err;
-   iov_iter_advance(from, vnet_hdr_len - sizeof(vnet_hdr));
-   if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
-tap16_to_cpu(q, vnet_hdr.csum_start) +
-tap16_to_cpu(q, vnet_hdr.csum_offset) + 2 >
-tap16_to_cpu(q, vnet_hdr.hdr_len))
-   vnet_hdr.hdr_len = cpu_to_tap16(q,
-tap16_to_cpu(q, vnet_hdr.csum_start) +
-tap16_to_cpu(q, vnet_hdr.csum_offset) + 2);
-   err = -EINVAL;
-   if (tap16_to_cpu(q, vnet_hdr.hdr_len) > len)
+   hdr_len = tun_vnet_hdr_get(READ_ONCE(q->vnet_hdr_sz), q->flags, from, &vnet_hdr);
+   if (hdr_len < 0) {
+   err = hdr_len;
goto err;
+   }
+   } else {
+   hdr_len = 0;
}
  
-	err = -EINVAL;

-   if (unlikely(len < ETH_HLEN))
-   goto err;
-


Is this check removal intentional?


No, I'm not sure what this check is for, but it is irrelevant to the vnet 
header and shouldn't be modified with this patch. I'll restore the check 
with the next version.





+   len = iov_iter_count(from);
if (msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY)) {
struct iov_iter i;
  
-		copylen = vnet_hdr.hdr_len ?

-   tap16_to_cpu(q, vnet_hdr.hdr_len) : GOODCOPY_LEN;
+   copylen = hdr_len ? hdr_len : GOODCOPY_LEN;
if (copylen > good_linear)
copylen = good_linear;
else if (copylen < ETH_HLEN)
@@ -697,7 +614,7 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
  
  	if (!zerocopy) {

copylen = len;
-   linear = tap16_to_cpu(q, vnet_hdr.hdr_len);
+   linear = hdr_len;
if (linear > good_linear)
linear = good_linear;
else if (linear < ETH_HLEN)
@@ -732,9 +649,8 @@ static ssize_t tap_get_user(struct tap_queue *q, void 
*msg_control,
}
skb->dev = tap->dev;
  
-	if (vnet_hdr_len) {

-   err = virtio_net_hdr_to_skb(skb, &vnet_hdr,
-   tap_is_little_endian(q));
+   if (q->flags & IFF_VNET_HDR) {
+   err = tun_vnet_hdr_to_skb(q->flags, skb, &vnet_hdr);
if (err) {
rcu_read_unlock();
drop_reason = SKB_DROP_REASON_DEV_HDR;
@@ -797,23 +713,17 @@ static ss

Re: [PATCH v6 5/6] selftest: tun: Add tests for virtio-net hashing

2025-01-10 Thread Akihiko Odaki

On 2025/01/09 23:36, Willem de Bruijn wrote:

Akihiko Odaki wrote:

The added tests confirm tun can perform RSS and hash reporting, and
reject invalid configurations for them.

Signed-off-by: Akihiko Odaki 
---
  tools/testing/selftests/net/Makefile |   2 +-
  tools/testing/selftests/net/tun.c| 558 ++-
  2 files changed, 551 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index cb2fc601de66..92762ce3ebd4 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -121,6 +121,6 @@ $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma
  $(OUTPUT)/tcp_mmap: LDLIBS += -lpthread -lcrypto
  $(OUTPUT)/tcp_inq: LDLIBS += -lpthread
  $(OUTPUT)/bind_bhash: LDLIBS += -lpthread
-$(OUTPUT)/io_uring_zerocopy_tx: CFLAGS += -I../../../include/
+$(OUTPUT)/io_uring_zerocopy_tx $(OUTPUT)/tun: CFLAGS += -I../../../include/
  
  include bpf.mk

diff --git a/tools/testing/selftests/net/tun.c 
b/tools/testing/selftests/net/tun.c
index 463dd98f2b80..9424d897e341 100644
--- a/tools/testing/selftests/net/tun.c
+++ b/tools/testing/selftests/net/tun.c
@@ -2,21 +2,37 @@
  
  #define _GNU_SOURCE
  
+#include 

  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
  #include 
-#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
  #include 
+#include 
  #include 
  #include 
-#include 
-#include 
+#include 
+#include 
+#include 
+#include 


Are all these include changes strictly needed? Iff so, might as well
fix ordering to be alphabetical (lexicographic).
   


Yes. I placed header files in linux/ after the other header files 
because include/uapi/linux/libc-compat.h requires libc header files to 
be placed before linux/ ones.




Re: [PATCH v6 1/6] virtio_net: Add functions for hashing

2025-01-10 Thread Akihiko Odaki

On 2025/01/09 23:13, Willem de Bruijn wrote:

Akihiko Odaki wrote:

They are useful to implement VIRTIO_NET_F_RSS and
VIRTIO_NET_F_HASH_REPORT.


Toeplitz potentially has users beyond virtio. I wonder if we should
from the start implement this as net/core/rss.c.


Or in lib/toeplitz.c, just like lib/siphash.c. I just chose the 
easiest option of implementing everything in include/linux/virtio_net.h.




  

Signed-off-by: Akihiko Odaki 
---
  include/linux/virtio_net.h | 188 +
  1 file changed, 188 insertions(+)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 02a9f4dc594d..3b25ca75710b 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -9,6 +9,194 @@
  #include 
  #include 
  
+struct virtio_net_hash {

+   u32 value;
+   u16 report;
+};
+
+struct virtio_net_toeplitz_state {
+   u32 hash;
+   const u32 *key;
+};
+
+#define VIRTIO_NET_SUPPORTED_HASH_TYPES (VIRTIO_NET_RSS_HASH_TYPE_IPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | \
+VIRTIO_NET_RSS_HASH_TYPE_IPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_TCPv6 | \
+VIRTIO_NET_RSS_HASH_TYPE_UDPv6)
+
+#define VIRTIO_NET_RSS_MAX_KEY_SIZE 40
+
+static inline void virtio_net_toeplitz_convert_key(u32 *input, size_t len)
+{
+   while (len >= sizeof(*input)) {
+   *input = be32_to_cpu((__force __be32)*input);
+   input++;
+   len -= sizeof(*input);
+   }
+}
+
+static inline void virtio_net_toeplitz_calc(struct virtio_net_toeplitz_state *state,
+   const __be32 *input, size_t len)
+{
+   while (len >= sizeof(*input)) {
+   for (u32 map = be32_to_cpu(*input); map; map &= (map - 1)) {
+   u32 i = ffs(map);
+
+   state->hash ^= state->key[0] << (32 - i) |
+  (u32)((u64)state->key[1] >> i);
+   }
+
+   state->key++;
+   input++;
+   len -= sizeof(*input);
+   }
+}


Have you verified that this algorithm matches a known Toeplitz
implementation, and computes the expected values for the test
inputs in

https://learn.microsoft.com/en-us/windows-hardware/drivers/network/verifying-the-rss-hash-calculation


Yes.



We have a toeplitz implementation in
tools/testing/selftests/net/toeplitz.c that can also be used as
reference.
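
For reference, a minimal sketch of checking the helpers added by this
patch against the first verification vector from the page linked above
(source 66.9.149.187, destination 161.142.100.80, IPv4-only hash
0x323e8fc2); the key words are the standard test key, already in host
order so no virtio_net_toeplitz_convert_key() call is needed:

/* Sketch only: feed one Microsoft RSS test vector through the helpers
 * and compare against the published hash value.
 */
static const u32 rss_test_key[10] = {
	0x6d5a56da, 0x255b0ec2, 0x4167253d, 0x43a38fb0, 0xd0ca2bcb,
	0xae7b30b4, 0x77cb2da3, 0x8030f20c, 0x6a42b73b, 0xbeac01fa,
};

static bool toeplitz_ipv4_selftest(void)
{
	struct virtio_net_toeplitz_state state = { .key = rss_test_key };
	__be32 input[2] = {
		cpu_to_be32(0x420995bb),	/* source 66.9.149.187 */
		cpu_to_be32(0xa18e6450),	/* destination 161.142.100.80 */
	};

	virtio_net_toeplitz_calc(&state, input, sizeof(input));
	return state.hash == 0x323e8fc2;
}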


+static inline u8 virtio_net_hash_key_length(u32 types)
+{
+   size_t len = 0;
+
+   if (types & VIRTIO_NET_HASH_REPORT_IPv4)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs));
+
+   if (types &
+   (VIRTIO_NET_HASH_REPORT_TCPv4 | VIRTIO_NET_HASH_REPORT_UDPv4))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv4_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   if (types & VIRTIO_NET_HASH_REPORT_IPv6)
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs));
+
+   if (types &
+   (VIRTIO_NET_HASH_REPORT_TCPv6 | VIRTIO_NET_HASH_REPORT_UDPv6))
+   len = max(len,
+ sizeof(struct flow_dissector_key_ipv6_addrs) +
+ sizeof(struct flow_dissector_key_ports));
+
+   return len + 4;


Avoid magic constants. Please use sizeof or something else to signal
what this 4 derives from.




Re: [PATCH] kcov: add unique cover, edge, and cmp modes

2025-01-10 Thread Marco Elver
On Fri, 10 Jan 2025 at 08:33, Joey Jiao  wrote:
>
> From: "Jiao, Joey" 
>
> The current design of KCOV risks frequent buffer overflows. To mitigate
> this, new modes are introduced: KCOV_TRACE_UNIQ_PC, KCOV_TRACE_UNIQ_EDGE,
> and KCOV_TRACE_UNIQ_CMP. These modes allow for the recording of unique
> PCs, edges, and comparison operands (CMP).

There ought to be a cover letter explaining the motivation for this,
and explaining why the new modes would help. Ultimately, what are you
using KCOV for where you encountered this problem?

> Key changes include:
> - KCOV_TRACE_UNIQ_[PC|EDGE] can be used together to replace KCOV_TRACE_PC.
> - KCOV_TRACE_UNIQ_CMP can be used to replace KCOV_TRACE_CMP mode.
> - Introduction of hashmaps to store unique coverage data.
> - Pre-allocated entries in kcov_map_init during KCOV_INIT_TRACE to avoid
>   performance issues with kmalloc.
> - New structs and functions for managing memory and unique coverage data.
> - Example program demonstrating the usage of the new modes.

This should be a patch series, carefully splitting each change into a
separate patch.
https://docs.kernel.org/process/submitting-patches.html#split-changes

> With the new hashmap and pre-allocated memory pool added, the cover size
> can't be set to a higher value like 1MB in KCOV_TRACE_PC or KCOV_TRACE_CMP
> modes on a 2GB device with 8 procs; otherwise it causes frequent OOMs.
>
> For KCOV_TRACE_UNIQ_[PC|EDGE|CMP] modes, smaller cover size like 8KB can
> be used.
>
> Signed-off-by: Jiao, Joey 

As-is it's hard to review, and the motivation is unclear. A lot of
code was moved and changed, and reviewers need to understand why that
was done besides your brief explanation above.

Generally, KCOV has very tricky constraints, due to being callable
from any context, including NMI. This means new dependencies need
to be carefully reviewed. For one, we can see this in genalloc's
header:

> * The lockless operation only works if there is enough memory
> * available.  If new memory is added to the pool a lock has to be
> * still taken.  So any user relying on locklessness has to ensure
> * that sufficient memory is preallocated.
> *
> * The basic atomic operation of this allocator is cmpxchg on long.
> * On architectures that don't have NMI-safe cmpxchg implementation,
> * the allocator can NOT be used in NMI handler.  So code uses the
> * allocator in NMI handler should depend on
> * CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.

And you are calling gen_pool_alloc() from __sanitizer_cov_trace_pc.
Which means this implementation is likely broken on
!CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG architectures (do we have
architectures like that, that support KCOV?).

There are probably other sharp corners due to the contexts KCOV can
run in, but I would simply ask you to carefully reason about why each
new dependency is safe.
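
For context, the stock KCOV_TRACE_PC usage from
Documentation/dev-tools/kcov.rst is sketched below; the proposed
KCOV_TRACE_UNIQ_* modes would be passed at KCOV_ENABLE time the same
way, and the fixed-size buffer is exactly where the overflows the patch
targets occur:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
#define KCOV_ENABLE     _IO('c', 100)
#define KCOV_DISABLE    _IO('c', 101)
#define KCOV_TRACE_PC   0
#define COVER_SIZE      (64 << 10)	/* entries, not bytes */

int main(void)
{
	unsigned long *cover, n, i;
	int fd = open("/sys/kernel/debug/kcov", O_RDWR);

	if (fd == -1 || ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE))
		exit(1);
	cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
		     PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (cover == MAP_FAILED)
		exit(1);
	/* cover[0] holds the entry count; it only grows while tracing,
	 * so a long-running trace against a fixed COVER_SIZE overflows. */
	__atomic_store_n(&cover[0], 0, __ATOMIC_RELAXED);
	if (ioctl(fd, KCOV_ENABLE, KCOV_TRACE_PC))
		exit(1);
	read(-1, NULL, 0);		/* the syscall being traced */
	n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
	for (i = 0; i < n; i++)
		printf("0x%lx\n", cover[i + 1]);
	if (ioctl(fd, KCOV_DISABLE, 0))
		exit(1);
	return 0;
}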



Re: [PATCH v6 3/6] tun: Introduce virtio-net hash feature

2025-01-10 Thread Akihiko Odaki

On 2025/01/09 23:28, Willem de Bruijn wrote:

Akihiko Odaki wrote:

Hash reporting
--

Allow the guest to reuse the hash value to make receive steering
consistent between the host and guest, and to save hash computation.

RSS
---

RSS is a receive steering algorithm that can be negotiated to use with
virtio_net. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF steering program.

Introduce the code to perform RSS in the kernel in order to overcome
these challenges. An alternative solution is to extend the eBPF steering
program so that it will be able to report to userspace, but I didn't
opt for it: extending the current eBPF steering program mechanism as is
relies on legacy context rewriting, and introducing kfunc-based eBPF
will result in a non-UAPI dependency, while the other relevant
virtualization APIs such as KVM and vhost_net are UAPIs.

Signed-off-by: Akihiko Odaki 
---
  Documentation/networking/tuntap.rst |   7 ++
  drivers/net/Kconfig |   1 +
  drivers/net/tap.c   |  50 ++-
  drivers/net/tun.c   |  93 +++-
  drivers/net/tun_vnet.c  | 167 +---
  drivers/net/tun_vnet.h  |  33 ++-
  include/linux/if_tap.h  |   2 +
  include/linux/skbuff.h  |   3 +
  include/uapi/linux/if_tun.h |  75 
  net/core/skbuff.c   |   4 +
  10 files changed, 397 insertions(+), 38 deletions(-)

diff --git a/Documentation/networking/tuntap.rst 
b/Documentation/networking/tuntap.rst
index 4d7087f727be..86b4ae8caa8a 100644
--- a/Documentation/networking/tuntap.rst
+++ b/Documentation/networking/tuntap.rst
@@ -206,6 +206,13 @@ enable is true we enable it, otherwise we disable it::
return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
}
  
+3.4 Reference

+-
+
+``linux/if_tun.h`` defines the interface described below:
+
+.. kernel-doc:: include/uapi/linux/if_tun.h
+
  Universal TUN/TAP device driver Frequently Asked Question
  =
  
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig

index 255c8f9f1d7c..f7b0d9a89a71 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -395,6 +395,7 @@ config TUN
tristate "Universal TUN/TAP device driver support"
depends on INET
select CRC32
+   select SKB_EXTENSIONS
select TUN_VNET
help
  TUN/TAP provides packet reception and transmission for user space
diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index fe9554ee5b8b..27659df1f96e 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -179,6 +179,16 @@ static void tap_put_queue(struct tap_queue *q)
sock_put(&q->sk);
  }
  
+static struct virtio_net_hash *tap_add_hash(struct sk_buff *skb)

+{
+   return (struct virtio_net_hash *)skb->cb;
+}
+
+static const struct virtio_net_hash *tap_find_hash(const struct sk_buff *skb)
+{
+   return (const struct virtio_net_hash *)skb->cb;
+}
+


If introducing a cb for tap, define a struct tuntap_skb_cb.

So that we do not have to change types if we ever need to extend it further.

And in line with your other patch that deduplicates between tun and tap,
define only one new struct, not two (as this patch currently does).


The previous version did that, but Jason suggested the added 
TUNSETVNETHASH ioctl should support all the flags we are implementing 
(TUN_VNET_HASH_REPORT and TUN_VNET_HASH_RSS) in one patch.





  /*
   * Select a queue based on the rxq of the device on which this packet
   * arrived. If the incoming device is not mq, calculate a flow hash
@@ -189,6 +199,7 @@ static void tap_put_queue(struct tap_queue *q)
  static struct tap_queue *tap_get_queue(struct tap_dev *tap,
   struct sk_buff *skb)
  {
+   struct flow_keys_basic keys_basic;
struct tap_queue *queue = NULL;
/* Access to taps array is protected by rcu, but access to numvtaps
 * isn't. Below we use it to lookup a queue, but treat it as a hint
@@ -196,17 +207,41 @@ static struct tap_queue *tap_get_queue(struct tap_dev 
*tap,
 * racing against queue removal.
 */
int numvtaps = READ_ONCE(tap->numvtaps);
+   struct tun_vnet_hash_container *vnet_hash = rcu_dereference(tap->vnet_hash);
__u32 rxq;
  
+	*tap_add_hash(skb) = (struct virtio_net_hash) { .report = VIRTIO_NET_HASH_REPORT_NONE };

+
if (!numvtaps)
goto out;
  
  	if (numvtaps == 1)

goto single;
  
+	if (vnet_hash && (vnet_hash->common.flags & TUN_VNET_HASH_RSS)) {

+   rxq = tun_vnet_rss_select_queue(numvta

Re: [PATCH v2 3/3] tun: Set num_buffers for virtio 1.0

2025-01-10 Thread Akihiko Odaki

On 2025/01/10 19:23, Michael S. Tsirkin wrote:

On Fri, Jan 10, 2025 at 11:27:13AM +0800, Jason Wang wrote:

On Thu, Jan 9, 2025 at 2:59 PM Akihiko Odaki  wrote:


The specification says the device MUST set num_buffers to 1 if
VIRTIO_NET_F_MRG_RXBUF has not been negotiated.


Have we agreed on how to fix the spec or not?

As I replied in the spec patch, if we just remove this "MUST", it
looks like we are all fine?

Thanks


We should replace MUST with SHOULD, but it is not all fine;
ignoring SHOULD is a quality-of-implementation issue.



Should we really replace it? It would mean that a driver conformant with 
the current specification may not be compatible with a device conformant 
with the future specification.


We are going to fix all implementations known to be buggy (QEMU and Linux) 
anyway, so I think it's just fine to leave that part of the specification as is.




Re: [PATCH] Add short author date to Fixes tag

2025-01-10 Thread Jacob Keller



On 1/10/2025 5:03 AM, Steven Rostedt wrote:
> On Fri, 10 Jan 2025 12:20:09 +
> yek...@red54.com wrote:
> 
>> The old Fixes tag style is at least 10 years old. It lacks date
>> information, which can lead to misjudgment. So I added a short author date
>> to avoid this. This makes it clear at a glance and reduces
>> misunderstandings.
> 
> How can it lead to misjudgment? If you have two or more hashes matching, do
> you really think they'll have the same subjects?
> 
> I do not plan on doing this. It's pointless.
> 
> -- Steve

While the addition of the date is a widely used variant within the git
community, this was rejected by the kernel community in the past as
well. I remember posting fixes tags with the date several years ago and
getting pushback.

I tried to find references to these discussions but I can't seem to
locate them in the archives anymore.

I personally find the date helpful as it can help place a commit without
needing to take the extra time to do a lookup.

However, all of the existing tooling we have for the kernel does not
support the date, and I think it's not worth trying to change it at this
point. It doesn't make sense to break all this tooling for information
which is accessible in other forms. Indeed, as long as the hash is
sufficiently long, the chance of a collision is minimal.



Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper

2025-01-10 Thread Jason Gunthorpe
On Fri, Jan 10, 2025 at 07:12:46AM +, Tian, Kevin wrote:

> > +   if (!viommu)
> > +   return -ENODEV;
> > +   if (WARN_ON_ONCE(!viommu->ops || !viommu->ops->supports_veventq ||
> > +!viommu->ops->supports_veventq(type)))
> > +   return -EOPNOTSUPP;
> 
> Hmm, the driver knows which type is supported by itself before
> calling this helper. Why bother having the helper call into
> the driver again to verify?

Indeed, it might make sense to protect this with

if (IS_ENABLED(CONFIG_IOMMUFD_TEST))

As a compiled out assertion

Or drop it

We shouldn't have unnecessary argument validation on fast paths;
!viommu should go too.

Jason
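
A minimal sketch of the compiled-out form Jason suggests, keeping the
ops validation as a test-only assertion (names follow the quoted patch;
this is not the final code):

/* Sketch: only pay for the validation in CONFIG_IOMMUFD_TEST builds. */
if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
    WARN_ON_ONCE(!viommu->ops || !viommu->ops->supports_veventq ||
		 !viommu->ops->supports_veventq(type)))
	return -EOPNOTSUPP;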



Re: [PATCH v2 2/3] tun: Pad virtio header with zero

2025-01-10 Thread Willem de Bruijn
Akihiko Odaki wrote:
> On 2025/01/10 17:33, Michael S. Tsirkin wrote:
> > On Fri, Jan 10, 2025 at 01:38:06PM +0900, Akihiko Odaki wrote:
> >> On 2025/01/09 21:46, Willem de Bruijn wrote:
> >>> Akihiko Odaki wrote:
>  On 2025/01/09 16:31, Michael S. Tsirkin wrote:
> > On Thu, Jan 09, 2025 at 03:58:44PM +0900, Akihiko Odaki wrote:
> >> tun used to simply advance iov_iter when it needs to pad virtio header,
> >> which leaves the garbage in the buffer as is. This is especially
> >> problematic when tun starts to allow enabling the hash reporting
> >> feature; even if the feature is enabled, the packet may lack a hash
> >> value and may contain a hole in the virtio header because the packet
> >> arrived before the feature gets enabled or does not contain the
> >> header fields to be hashed. If the hole is not filled with zero, it is
> >> impossible to tell if the packet lacks a hash value.
> >>>
> >>> Zero is a valid hash value, so cannot be used as an indication that
> >>> hashing is inactive.
> >>
> >> Zeroing will initialize the hash_report field to
> >> VIRTIO_NET_HASH_REPORT_NONE, which tells it does not have a hash value.
> >>
> >>>
> >> In theory, a user of tun can fill the buffer with zero before calling
> >> read() to avoid such a problem, but leaving the garbage in the buffer 
> >> is
> >> awkward anyway so fill the buffer in tun.
> >>
> >> Signed-off-by: Akihiko Odaki 
> >
> > But if the user did it, you have just overwritten his value,
> > did you not?
> 
>  Yes, but that means the user expects some part of the buffer is not filled
>  after read() or recvmsg(). I'm a bit worried that not filling the buffer
>  may break assumptions others (especially the filesystem and socket
>  infrastructures in the kernel) may have.
> >>>
> >>> If this is user memory that is ignored by the kernel, just reflected
> >>> back, then there is no need in general to zero it. There are many such
> >>> instances, also in msg_control.
> >>
> >> More specifically, is there any instance of recvmsg() implementation which
> >> returns N and does not fill the complete N bytes of msg_iter?
> > 
> > The one in tun. It was a silly idea but it has been here for years now.
> 
> Except tun. If there is such an example of recvmsg() implementation and 
> it is not accidental and people have agreed to keep it functioning, we 
> can confidently say this construct is safe without fearing pushback from 
> people maintaining filesystem/networking infrastructure. Ultimately I 
> want those people to decide if this can be supported in the future or not.

It seems preferable to write a value.

Userspace should have no assumption that whatever it writes there
will be reflected unmodified. That said, that is the tiny risk of
changing this in established code.

If it worked without issue so far, without hashing, then probably the
change should only go to net-next.

As said, there are examples in msg_control. I don't immediately have
an example where this is the case in msg_data today. A search for
iov_iter_advance might show something.



Re: [PATCH] Add short author date to Fixes tag

2025-01-10 Thread Greg Kroah-Hartman
On Fri, Jan 10, 2025 at 12:20:09PM +, yek...@red54.com wrote:
> From: 谢致邦 (XIE Zhibang) 
> 
> The old Fixes tag style is at least 10 years old. It lacks date
> information, which can lead to misjudgment. So I added a short author date
> to avoid this. This makes it clear at a glance and reduces
> misunderstandings.
> 
> For example:
> 
> Old Fixes tag style:
> * Fixes: fd3040b9394c ("net: ethernet: Add driver for Sunplus SP7021")
> * Fixes: a76053707dbf ("dev_ioctl: split out ndo_eth_ioctl")
> This will make people mistakenly think that "a76053707dbf" broke
> "fd3040b9394c".[1]
> 
> New Fixes tag style:
> * Fixes: fd3040b9394c ("net: ethernet: Add driver for Sunplus SP7021", 2022-05-08)
> * Fixes: a76053707dbf ("dev_ioctl: split out ndo_eth_ioctl", 2021-07-27)
> This makes it clear that the newly introduced driver did not follow the
> existing changes.

Please no, you will break all of our tooling and scripts that parse
these types of fields.  The git commit id and commit header are all we
need to properly determine this type of information, no need to add a
date in here at all.

There's no issue with "making things clear" that I can tell, and no,
listing multiple fixes lines does NOT mean that a previous line broke
something at all.  It just means that a single commit happened to fix
multiple commits (which frankly is a rare thing, and one that our
scripts already have a hard enough time dealing with...)

So I don't think you need to add a date here.  Dates also really do not
mean much with commits; what matters is what release a commit is in, not
when it was originally made.  We have commits that take years to show up
in a release, so if you only look at a date, you will often be mistaken,
as it's not showing what came before what (i.e. we apply commits
out-of-date-order all the time).

thanks,

greg k-h



[PATCH] Add short author date to Fixes tag

2025-01-10 Thread Yeking
From: 谢致邦 (XIE Zhibang) 

The old Fixes tag style is at least 10 years old. It lacks date
information, which can lead to misjudgment. So I added a short author date
to avoid this. This makes it clear at a glance and reduces
misunderstandings.

For example:

Old Fixes tag style:
* Fixes: fd3040b9394c ("net: ethernet: Add driver for Sunplus SP7021")
* Fixes: a76053707dbf ("dev_ioctl: split out ndo_eth_ioctl")
This will make people mistakenly think that "a76053707dbf" broke
"fd3040b9394c".[1]

New Fixes tag style:
* Fixes: fd3040b9394c ("net: ethernet: Add driver for Sunplus SP7021", 2022-05-08)
* Fixes: a76053707dbf ("dev_ioctl: split out ndo_eth_ioctl", 2021-07-27)
This makes it clear that the newly introduced driver did not follow the
existing changes.

[1] https://lore.kernel.org/all/20250109180212.71e4e...@kernel.org/

docs: submitting-patches: The short author date of the old example
"54a4f0239f2e" is 2010-05-05, which is not immediately obvious as
YYYY-MM-DD, so change the example.

Fixes: 8401aa1f5997 ("Documentation/SubmittingPatches: describe the Fixes: tag", 2014-06-06)
Signed-off-by: 谢致邦 (XIE Zhibang) 
---
 Documentation/process/5.Posting.rst   |  2 +-
 Documentation/process/maintainer-tip.rst  |  4 +--
 .../process/researcher-guidelines.rst |  2 +-
 Documentation/process/submitting-patches.rst  |  8 ++---
 scripts/checkpatch.pl | 30 +++
 5 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/Documentation/process/5.Posting.rst 
b/Documentation/process/5.Posting.rst
index b3eff03ea249..38c9c94e7448 100644
--- a/Documentation/process/5.Posting.rst
+++ b/Documentation/process/5.Posting.rst
@@ -199,7 +199,7 @@ document; what follows here is a brief summary.
 One tag is used to refer to earlier commits which introduced problems fixed by
 the patch::
 
-   Fixes: 1f2e3d4c5b6a ("The first line of the commit specified by the first 12 characters of its SHA-1 ID")
+   Fixes: 1f2e3d4c5b6a ("The first line of the commit specified by the first 12 characters of its SHA-1 ID", 2024-12-31)
 
 Another tag is used for linking web pages with additional backgrounds or
 details, for example an earlier discussion which leads to the patch or a
diff --git a/Documentation/process/maintainer-tip.rst 
b/Documentation/process/maintainer-tip.rst
index e374b67b3277..fb97a6853e42 100644
--- a/Documentation/process/maintainer-tip.rst
+++ b/Documentation/process/maintainer-tip.rst
@@ -270,7 +270,7 @@ Ordering of commit tags
 To have a uniform view of the commit tags, the tip maintainers use the
 following tag ordering scheme:
 
- - Fixes: 12char-SHA1 ("sub/sys: Original subject line")
+ - Fixes: 12char-SHA1 ("sub/sys: Original subject line", YYYY-MM-DD)
 
A Fixes tag should be added even for changes which do not need to be
backported to stable kernels, i.e. when addressing a recently introduced
@@ -295,7 +295,7 @@ following tag ordering scheme:
  The recent replacement of foo with bar left an unused instance of
  variable foo around. Remove it.
 
- Fixes: abcdef012345678 ("x86/xxx: Replace foo with bar")
+ Fixes: abcdef012345678 ("x86/xxx: Replace foo with bar", 2024-12-31)
  Signed-off-by: J.Dev 
 
The latter puts the information about the patch into the focus and
diff --git a/Documentation/process/researcher-guidelines.rst 
b/Documentation/process/researcher-guidelines.rst
index beb484c5965d..472ca0a4684d 100644
--- a/Documentation/process/researcher-guidelines.rst
+++ b/Documentation/process/researcher-guidelines.rst
@@ -149,7 +149,7 @@ For example::
   [1] https://url/to/leakmagic/details
 
   Reported-by: Researcher 
-  Fixes:  ("Introduce support for FooBar")
+  Fixes:  ("Introduce support for FooBar", 2024-12-31)
   Signed-off-by: Author 
   Reviewed-by: Reviewer 
 
diff --git a/Documentation/process/submitting-patches.rst 
b/Documentation/process/submitting-patches.rst
index 1518bd57adab..4df1e722130b 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -148,7 +148,7 @@ the SHA-1 ID, and the one line summary.  Do not split the 
tag across multiple
 lines, tags are exempt from the "wrap at 75 columns" rule in order to simplify
 parsing scripts.  For example::
 
-   Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed")
+   Fixes: 6a451e2c5c03 ("ALSA: hda/tas2781: Ignore SUBSYS_ID not found for tas2563 projects", 2024-12-30)
 
 The following ``git config`` settings can be used to add a pretty format for
 outputting the above style in the ``git log`` or ``git show`` commands::
@@ -156,12 +156,12 @@ outputting the above style in the ``git log`` or ``git 
show`` commands::
[core]
abbrev = 12
[pretty]
-   fixes = Fixes: %h (\"%s\")
+   fixes = Fixes: %h (\"%s\", %as)
 
 An example call::
 
-   $ git log -1 --pretty=fixes 

Re: [PATCH] kcov: add unique cover, edge, and cmp modes

2025-01-10 Thread Dmitry Vyukov
On Fri, 10 Jan 2025 at 10:23, Marco Elver  wrote:
> > From: "Jiao, Joey" 
> >
> > The current design of KCOV risks frequent buffer overflows. To mitigate
> > this, new modes are introduced: KCOV_TRACE_UNIQ_PC, KCOV_TRACE_UNIQ_EDGE,
> > and KCOV_TRACE_UNIQ_CMP. These modes allow for the recording of unique
> > PCs, edges, and comparison operands (CMP).
>
> There ought to be a cover letter explaining the motivation for this,
> and explaining why the new modes would help. Ultimately, what are you
> using KCOV for where you encountered this problem?
>
> > Key changes include:
> > - KCOV_TRACE_UNIQ_[PC|EDGE] can be used together to replace KCOV_TRACE_PC.
> > - KCOV_TRACE_UNIQ_CMP can be used to replace KCOV_TRACE_CMP mode.
> > - Introduction of hashmaps to store unique coverage data.
> > - Pre-allocated entries in kcov_map_init during KCOV_INIT_TRACE to avoid
> >   performance issues with kmalloc.
> > - New structs and functions for managing memory and unique coverage data.
> > - Example program demonstrating the usage of the new modes.
>
> This should be a patch series, carefully splitting each change into a
> separate patch.
> https://docs.kernel.org/process/submitting-patches.html#split-changes
>
> > With the new hashmap and pre-allocated memory pool added, the cover size
> > can't be set to a higher value like 1MB in KCOV_TRACE_PC or KCOV_TRACE_CMP
> > modes on a 2GB device with 8 procs; otherwise it causes frequent OOMs.
> >
> > For KCOV_TRACE_UNIQ_[PC|EDGE|CMP] modes, smaller cover size like 8KB can
> > be used.
> >
> > Signed-off-by: Jiao, Joey 
>
> As-is it's hard to review, and the motivation is unclear. A lot of
> code was moved and changed, and reviewers need to understand why that
> was done besides your brief explanation above.
>
> Generally, KCOV has very tricky constraints, due to being callable
> from any context, including NMI. This means new dependencies need
> to be carefully reviewed. For one, we can see this in genalloc's
> header:
>
> > * The lockless operation only works if there is enough memory
> > * available.  If new memory is added to the pool a lock has to be
> > * still taken.  So any user relying on locklessness has to ensure
> > * that sufficient memory is preallocated.
> > *
> > * The basic atomic operation of this allocator is cmpxchg on long.
> > * On architectures that don't have NMI-safe cmpxchg implementation,
> > * the allocator can NOT be used in NMI handler.  So code uses the
> > * allocator in NMI handler should depend on
> > * CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG.
>
> And you are calling gen_pool_alloc() from __sanitizer_cov_trace_pc.
> Which means this implementation is likely broken on
> !CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG architectures (do we have
> architectures like that, that support KCOV?).
>
> There are probably other sharp corners due to the contexts KCOV can
> run in, but I would simply ask you to carefully reason about why each
> new dependency is safe.

I am also concerned about the performance effect. Does it add a stack
frame to __sanitizer_cov_trace_pc()? Please show the disassembly of the
function before/after.

Also, I have concerns about interrupts and reentrancy. We are still
getting some reentrant calls from interrupts (not all of them are
filtered by the in_task() check). I am afraid these complex hashmaps will
get corrupted.



Re: [PATCH] Add short author date to Fixes tag

2025-01-10 Thread Steven Rostedt
On Fri, 10 Jan 2025 12:20:09 +
yek...@red54.com wrote:

> The old Fixes tag style is at least 10 years old. It lacks date
> information, which can lead to misjudgment. So I added a short author date
> to avoid this. This makes it clear at a glance and reduces
> misunderstandings.

How can it lead to misjudgment? If you have two or more hashes matching, do
you really think they'll have the same subjects?

I do not plan on doing this. It's pointless.

-- Steve



Re: [PATCH net-next v2 07/14] net: ethernet: qualcomm: Initialize PPE queue settings

2025-01-10 Thread Jie Luo




On 1/10/2025 1:52 AM, Simon Horman wrote:

On Wed, Jan 08, 2025 at 09:47:14PM +0800, Luo Jie wrote:

Configure unicast and multicast hardware queues for the PPE
ports to enable packet forwarding between the ports.

Each PPE port is assigned with a range of queues. The queue ID
selection for a packet is decided by the queue base and queue
offset that is configured based on the internal priority and
the RSS hash value of the packet.

Signed-off-by: Luo Jie 
---
  drivers/net/ethernet/qualcomm/ppe/ppe_config.c | 357 -
  drivers/net/ethernet/qualcomm/ppe/ppe_config.h |  63 +
  drivers/net/ethernet/qualcomm/ppe/ppe_regs.h   |  21 ++
  3 files changed, 440 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qualcomm/ppe/ppe_config.c 
b/drivers/net/ethernet/qualcomm/ppe/ppe_config.c


...


@@ -673,6 +701,111 @@ static struct ppe_scheduler_port_config 
ppe_port_sch_config[] = {
},
  };
  
+/* The scheduler resource is applied to each PPE port. The resource

+ * includes the unicast & multicast queues, flow nodes and DRR nodes.
+ */
+static struct ppe_port_schedule_resource ppe_scheduler_res[] = {
+   {   .ucastq_start   = 0,
+   .ucastq_end = 63,
+   .mcastq_start   = 256,
+   .ucastq_end = 271,


Hi Luo Jie,

This appears to duplicate the initialisation of .ucastq_end.
Should the line above initialise .mcastq_end instead?

Likewise for other elements of this array.

Flagged by W=1 builds with both clang-19 and gcc-14.


Thanks for pointing this out. I will update the code and build the
patches with the latest GCC version (gcc-14) with W=1 to check the
patches.




+   .flow_id_start  = 0,
+   .flow_id_end= 0,
+   .l0node_start   = 0,
+   .l0node_end = 7,
+   .l1node_start   = 0,
+   .l1node_end = 0,
+   },
+   {   .ucastq_start   = 144,
+   .ucastq_end = 159,
+   .mcastq_start   = 272,
+   .ucastq_end = 275,
+   .flow_id_start  = 36,
+   .flow_id_end= 39,
+   .l0node_start   = 48,
+   .l0node_end = 63,
+   .l1node_start   = 8,
+   .l1node_end = 11,
+   },


...


+};


...





Re: [PATCH net-next v2 04/14] net: ethernet: qualcomm: Initialize PPE buffer management for IPQ9574

2025-01-10 Thread Jie Luo




On 1/10/2025 4:11 AM, Simon Horman wrote:

On Thu, Jan 09, 2025 at 05:27:14PM +, Simon Horman wrote:

On Wed, Jan 08, 2025 at 09:47:11PM +0800, Luo Jie wrote:

The BM (Buffer Management) config controls the pause frames generated
on the PPE ports. There are a maximum of 15 BM ports and 4 groups
supported; all BM ports are assigned to group 0 by default. The number
of hardware buffers configured for a port influences the flow control
threshold for that port.

Signed-off-by: Luo Jie 


...


diff --git a/drivers/net/ethernet/qualcomm/ppe/ppe_config.c 
b/drivers/net/ethernet/qualcomm/ppe/ppe_config.c


...


+/* The buffer configurations per PPE port. There are 15 BM ports and
+ * 4 BM groups supported by PPE. BM port (0-7) is for EDMA port 0,
+ * BM port (8-13) is for PPE physical port 1-6 and BM port 14 is for
+ * EIP port.
+ */
+static struct ppe_bm_port_config ipq9574_ppe_bm_port_config[] = {
+   {
+   /* Buffer configuration for the BM port ID 0 of EDMA. */
+   .port_id_start  = 0,
+   .port_id_end= 0,
+   .pre_alloc  = 0,
+   .in_fly_buf = 100,
+   .ceil   = 1146,
+   .weight = 7,
+   .resume_offset  = 8,
+   .resume_ceil= 0,
+   .dynamic= true,
+   },
+   {
+   /* Buffer configuration for the BM port ID 1-7 of EDMA. */
+   .port_id_start  = 1,
+   .port_id_end= 7,
+   .pre_alloc  = 0,
+   .in_fly_buf = 100,
+   .ceil   = 250,
+   .weight = 4,
+   .resume_offset  = 36,
+   .resume_ceil= 0,
+   .dynamic= true,
+   },
+   {
+   /* Buffer configuration for the BM port ID 8-13 of PPE ports. */
+   .port_id_start  = 8,
+   .port_id_end= 13,
+   .pre_alloc  = 0,
+   .in_fly_buf = 128,
+   .ceil   = 250,
+   .weight = 4,
+   .resume_offset  = 36,
+   .resume_ceil= 0,
+   .dynamic= true,
+   },
+   {
+   /* Buffer configuration for the BM port ID 14 of EIP. */
+   .port_id_start  = 14,
+   .port_id_end= 14,
+   .pre_alloc  = 0,
+   .in_fly_buf = 40,
+   .ceil   = 250,
+   .weight = 4,
+   .resume_offset  = 36,
+   .resume_ceil= 0,
+   .dynamic= true,
+   },
+};
+
+static int ppe_config_bm_threshold(struct ppe_device *ppe_dev, int bm_port_id,
+  struct ppe_bm_port_config port_cfg)
+{
+   u32 reg, val, bm_fc_val[2];
+   int ret;
+
+   /* Configure BM flow control related threshold. */
+   PPE_BM_PORT_FC_SET_WEIGHT(bm_fc_val, port_cfg.weight);


Hi Luo Jie,

When compiling with W=1 for x86_32 and ARM (32bit)
(but, curiously not x86_64 or arm64), gcc-14 complains that
bm_fc_val is uninitialised, I believe due to the line above and
similar lines below.

In file included from drivers/net/ethernet/qualcomm/ppe/ppe_config.c:10:
In function 'u32p_replace_bits',
 inlined from 'ppe_config_bm_threshold' at 
drivers/net/ethernet/qualcomm/ppe/ppe_config.c:112:2:
./include/linux/bitfield.h:189:15: warning: 'bm_fc_val' is used uninitialized 
[-Wuninitialized]
   189 | *p = (*p & ~to(field)) | type##_encode_bits(val, field);   
 \
   |   ^~
./include/linux/bitfield.h:198:9: note: in expansion of macro 'MAKE_OP'
   198 | MAKE_OP(u##size,u##size,,)
   | ^~~
./include/linux/bitfield.h:201:1: note: in expansion of macro '__MAKE_OP'
   201 | __MAKE_OP(32)
   | ^
drivers/net/ethernet/qualcomm/ppe/ppe_config.c: In function 
'ppe_config_bm_threshold':
drivers/net/ethernet/qualcomm/ppe/ppe_config.c:108:23: note: 'bm_fc_val' 
declared here
   108 | u32 reg, val, bm_fc_val[2];
   |   ^


+   PPE_BM_PORT_FC_SET_RESUME_OFFSET(bm_fc_val, port_cfg.resume_offset);
+   PPE_BM_PORT_FC_SET_RESUME_THRESHOLD(bm_fc_val, port_cfg.resume_ceil);
+   PPE_BM_PORT_FC_SET_DYNAMIC(bm_fc_val, port_cfg.dynamic);
+   PPE_BM_PORT_FC_SET_REACT_LIMIT(bm_fc_val, port_cfg.in_fly_buf);
+   PPE_BM_PORT_FC_SET_PRE_ALLOC(bm_fc_val, port_cfg.pre_alloc);
+
+   /* Configure low/high bits of the ceiling for the BM port. */
+   val = FIELD_PREP(GENMASK(2, 0), port_cfg.ceil);


The value of port_cfg.ceil is 250 or 1146, as set in
ipq9574_ppe_bm_port_config. clang-19 W=1 builds complain that this
value is too large for the field (3 bits).

drivers/net/ethernet/qualcomm/ppe/ppe_config.c:120:8: error: call to 
'__compiletime_assert_925' declared with 'error' attribute: FIELD_PREP: value 
too large for the field
   120
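
For illustration, a sketch of what fixes for the two warnings could look
like; the register layout (a 3-bit low field for the ceiling) is assumed
from the quoted code, not verified against the hardware documentation:

	u32 reg, val, bm_fc_val[2] = {};	/* zero-init: the *_SET_*
						 * helpers read-modify-write
						 * via u32p_replace_bits() */
	...
	/* Mask to the field width so the compile-time FIELD_PREP check
	 * passes when ceil values like 250 or 1146 are constant-folded.
	 */
	val = FIELD_PREP(GENMASK(2, 0), port_cfg.ceil & GENMASK(2, 0));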

Re: [PATCH net-next v2 06/14] net: ethernet: qualcomm: Initialize the PPE scheduler settings

2025-01-10 Thread Jie Luo




On 1/10/2025 1:42 AM, Simon Horman wrote:

On Wed, Jan 08, 2025 at 09:47:13PM +0800, Luo Jie wrote:

The PPE scheduler settings determine the priority of scheduling the
packet across the different hardware queues per PPE port.

Signed-off-by: Luo Jie 
---
  drivers/net/ethernet/qualcomm/ppe/ppe_config.c | 789 -
  drivers/net/ethernet/qualcomm/ppe/ppe_config.h |  37 ++
  drivers/net/ethernet/qualcomm/ppe/ppe_regs.h   |  97 +++
  3 files changed, 922 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qualcomm/ppe/ppe_config.c 
b/drivers/net/ethernet/qualcomm/ppe/ppe_config.c


...


+/**
+ * ppe_queue_scheduler_set - Configure scheduler for PPE hardware queue
+ * @ppe_dev: PPE device
+ * @node_id: PPE queue ID or flow ID
+ * @flow_level: Flow level scheduler or queue level scheduler
+ * @port: PPE port ID set scheduler configuration
+ * @scheduler_cfg: PPE scheduler configuration
+ *
+ * PPE scheduler configuration supports queue level and flow level on
+ * the PPE egress port.
+ *
+ * Return 0 on success, negative error code on failure.


Nit: The tooling would prefer this last line formatted as:

 * Return: ...

or

 * Returns: ...

Flagged by ./scripts/kernel-doc -none -Wall


OK, I will update the kernel-doc comments and run the kernel-doc script to
verify proper formatting of the documentation comments on all patches.
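
For instance, the corrected last line of the comment block would be:

 * Return: 0 on success, negative error code on failure.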




+ */
+int ppe_queue_scheduler_set(struct ppe_device *ppe_dev,
+   int node_id, bool flow_level, int port,
+   struct ppe_scheduler_cfg scheduler_cfg)
+{
+   if (flow_level)
+   return ppe_scheduler_l1_queue_map_set(ppe_dev, node_id,
+ port, scheduler_cfg);
+
+   return ppe_scheduler_l0_queue_map_set(ppe_dev, node_id,
+ port, scheduler_cfg);
+}


...




Re: [PATCH net-next v2 01/14] dt-bindings: net: Add PPE for Qualcomm IPQ9574 SoC

2025-01-10 Thread Jie Luo




On 1/9/2025 5:15 PM, Krzysztof Kozlowski wrote:

On Wed, Jan 08, 2025 at 09:47:08PM +0800, Luo Jie wrote:

+required:
+  - clocks
+  - clock-names
+  - resets
+  - interrupts
+  - interrupt-names
+
+  ethernet-ports:


This device really looks like DSA or other ethernet switch, so I would
really expect proper $ref in top-level.


Sure, I agree that the PPE is better modeled as an Ethernet switch. I
will add the $ref to ethernet-switch.yaml at the top level.




+type: object
+additionalProperties: false
+properties:
+  '#address-cells':
+const: 1
+  '#size-cells':
+const: 0
+
+patternProperties:
+  "^port@[1-6]$":
+type: object
+$ref: ethernet-controller.yaml#


Everything here is duplicating DSA or ethernet-switch, so that's
surprising.


I will remove the current 'ethernet-ports' node and the "$ref: ethernet-
controller.yaml#" from the port node, and use ethernet-switch.yaml as
the top-level $ref instead.


The PPE Ethernet port node requires the additional DT properties clocks
and resets, which will be added into the switch port node. Thanks.




+unevaluatedProperties: false
+description:
+  PPE port that includes the MAC used to connect the external
+  switch or PHY via the PCS.


Best regards,
Krzysztof






Re: [PATCH net-next v2 08/14] net: ethernet: qualcomm: Initialize PPE service code settings

2025-01-10 Thread Jie Luo




On 1/10/2025 1:58 AM, Simon Horman wrote:

On Wed, Jan 08, 2025 at 09:47:15PM +0800, Luo Jie wrote:

PPE service code is a special code (0-255) defined by the PPE for its
packet processing stages, according to the network functions required
for the packet.

For packets sent out by the ARM cores on Ethernet ports, service code 1
is used as the default. This service code bypasses most of the PPE's
packet processing stages before the packet is transmitted out of the
PPE port, since the software network stack has already processed the
packet.

Signed-off-by: Luo Jie 


...


diff --git a/drivers/net/ethernet/qualcomm/ppe/ppe_config.h 
b/drivers/net/ethernet/qualcomm/ppe/ppe_config.h


...


+/**
+ * struct ppe_sc_bypss - PPE service bypass bitmaps


nit: struct ppe_sc_bypass


OK. Will correct it.




+ * @ingress: Bitmap of features that can be bypassed on the ingress packet.
+ * @egress: Bitmap of features that can be bypassed on the egress packet.
+ * @counter: Bitmap of features that can be bypassed on the counter type.
+ * @tunnel: Bitmap of features that can be bypassed on the tunnel packet.
+ */
+struct ppe_sc_bypass {
+   DECLARE_BITMAP(ingress, PPE_SC_BYPASS_INGRESS_SIZE);
+   DECLARE_BITMAP(egress, PPE_SC_BYPASS_EGRESS_SIZE);
+   DECLARE_BITMAP(counter, PPE_SC_BYPASS_COUNTER_SIZE);
+   DECLARE_BITMAP(tunnel, PPE_SC_BYPASS_TUNNEL_SIZE);
+};


...





Re: [PATCH v6 07/26] fs/dax: Ensure all pages are idle prior to filesystem unmount

2025-01-10 Thread Darrick J. Wong
On Fri, Jan 10, 2025 at 05:00:35PM +1100, Alistair Popple wrote:
> File systems call dax_break_mapping() prior to reallocating file
> system blocks to ensure the page is not undergoing any DMA or other
> accesses. Generally this is needed when a file is truncated, to ensure
> that if a block is reallocated nothing is writing to it. However,
> filesystems currently don't call this when an FS DAX inode is evicted.
> 
> This can cause problems when the file system is unmounted, as a page
> can continue to be undergoing DMA or other remote access after
> unmount. This means that if the file system is remounted, any truncate
> or other operation which requires the underlying file system block to
> be freed will not wait for the remote access to complete. Therefore a
> busy block may be reallocated to a new file, leading to corruption.
> 
> Signed-off-by: Alistair Popple 
> 
> ---
> 
> Changes for v5:
> 
>  - Don't wait for pages to be idle in non-DAX mappings
> ---
>  fs/dax.c| 29 +
>  fs/ext4/inode.c | 32 ++--
>  fs/xfs/xfs_inode.c  |  9 +
>  fs/xfs/xfs_inode.h  |  1 +
>  fs/xfs/xfs_super.c  | 18 ++
>  include/linux/dax.h |  2 ++
>  6 files changed, 73 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 7008a73..4e49cc4 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -883,6 +883,14 @@ static int wait_page_idle(struct page *page,
>   TASK_INTERRUPTIBLE, 0, 0, cb(inode));
>  }
>  
> +static void wait_page_idle_uninterruptible(struct page *page,
> + void (cb)(struct inode *),
> + struct inode *inode)
> +{
> + ___wait_var_event(page, page_ref_count(page) == 1,
> + TASK_UNINTERRUPTIBLE, 0, 0, cb(inode));
> +}
> +
>  /*
>   * Unmaps the inode and waits for any DMA to complete prior to deleting the
>   * DAX mapping entries for the range.
> @@ -911,6 +919,27 @@ int dax_break_mapping(struct inode *inode, loff_t start, 
> loff_t end,
>  }
>  EXPORT_SYMBOL_GPL(dax_break_mapping);
>  
> +void dax_break_mapping_uninterruptible(struct inode *inode,
> + void (cb)(struct inode *))
> +{
> + struct page *page;
> +
> + if (!dax_mapping(inode->i_mapping))
> + return;
> +
> + do {
> + page = dax_layout_busy_page_range(inode->i_mapping, 0,
> + LLONG_MAX);
> + if (!page)
> + break;
> +
> + wait_page_idle_uninterruptible(page, cb, inode);
> + } while (true);
> +
> + dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
> +}
> +EXPORT_SYMBOL_GPL(dax_break_mapping_uninterruptible);
> +
>  /*
>   * Invalidate DAX entry if it is clean.
>   */
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ee8e83f..fa35161 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -163,6 +163,18 @@ int ext4_inode_is_fast_symlink(struct inode *inode)
>  (inode->i_size < EXT4_N_BLOCKS * 4);
>  }
>  
> +static void ext4_wait_dax_page(struct inode *inode)
> +{
> + filemap_invalidate_unlock(inode->i_mapping);
> + schedule();
> + filemap_invalidate_lock(inode->i_mapping);
> +}
> +
> +int ext4_break_layouts(struct inode *inode)
> +{
> + return dax_break_mapping_inode(inode, ext4_wait_dax_page);
> +}
> +
>  /*
>   * Called at the last iput() if i_nlink is zero.
>   */
> @@ -181,6 +193,8 @@ void ext4_evict_inode(struct inode *inode)
>  
>   trace_ext4_evict_inode(inode);
>  
> + dax_break_mapping_uninterruptible(inode, ext4_wait_dax_page);
> +
>   if (EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL)
>   ext4_evict_ea_inode(inode);
>   if (inode->i_nlink) {
> @@ -3902,24 +3916,6 @@ int ext4_update_disksize_before_punch(struct inode 
> *inode, loff_t offset,
>   return ret;
>  }
>  
> -static void ext4_wait_dax_page(struct inode *inode)
> -{
> - filemap_invalidate_unlock(inode->i_mapping);
> - schedule();
> - filemap_invalidate_lock(inode->i_mapping);
> -}
> -
> -int ext4_break_layouts(struct inode *inode)
> -{
> - struct page *page;
> - int error;
> -
> - if (WARN_ON_ONCE(!rwsem_is_locked(&inode->i_mapping->invalidate_lock)))
> - return -EINVAL;
> -
> - return dax_break_mapping_inode(inode, ext4_wait_dax_page);
> -}
> -
>  /*
>   * ext4_punch_hole: punches a hole in a file by releasing the blocks
>   * associated with the given offset and length
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 4410b42..c7ec5ab 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2997,6 +2997,15 @@ xfs_break_dax_layouts(
>   return dax_break_mapping_inode(inode, xfs_wait_dax_page);
>  }
>  
> +void
> +xfs_break_dax_layouts_uninterruptible(
> + struct inode*inode)
> +{
> + xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL);
> +
> + d

Re: [PATCH v5 04/14] iommufd: Abstract an iommufd_eventq from iommufd_fault

2025-01-10 Thread Jason Gunthorpe
On Tue, Jan 07, 2025 at 09:10:07AM -0800, Nicolin Chen wrote:

> @@ -433,32 +434,35 @@ void iopt_remove_access(struct io_pagetable *iopt,
>   u32 iopt_access_list_id);
>  void iommufd_access_destroy_object(struct iommufd_object *obj);
>  
> -/*
> - * An iommufd_fault object represents an interface to deliver I/O page faults
> - * to the user space. These objects are created/destroyed by the user space 
> and
> - * associated with hardware page table objects during page-table allocation.
> - */
> -struct iommufd_fault {
> +struct iommufd_eventq_ops {
> + ssize_t (*read)(struct iommufd_eventq *eventq, char __user *buf,
> + size_t count, loff_t *ppos); /* Mandatory op */
> + ssize_t (*write)(struct iommufd_eventq *eventq, const char __user *buf,
> +  size_t count, loff_t *ppos); /* Optional op */
> +};

I think I'd recommend avoiding this; more like this:

diff --git a/drivers/iommu/iommufd/eventq.c b/drivers/iommu/iommufd/eventq.c
index b5be629f38eda7..73f5e8a6b17f54 100644
--- a/drivers/iommu/iommufd/eventq.c
+++ b/drivers/iommu/iommufd/eventq.c
@@ -341,11 +341,6 @@ static ssize_t iommufd_fault_fops_write(struct 
iommufd_eventq *eventq,
return done == 0 ? rc : done;
 }
 
-static const struct iommufd_eventq_ops iommufd_fault_ops = {
-   .read = &iommufd_fault_fops_read,
-   .write = &iommufd_fault_fops_write,
-};
-
 /* IOMMUFD_OBJ_VEVENTQ Functions */
 
 void iommufd_veventq_abort(struct iommufd_object *obj)
@@ -409,31 +404,8 @@ static ssize_t iommufd_veventq_fops_read(struct 
iommufd_eventq *eventq,
return done == 0 ? rc : done;
 }
 
-static const struct iommufd_eventq_ops iommufd_veventq_ops = {
-   .read = &iommufd_veventq_fops_read,
-};
-
 /* Common Event Queue Functions */
 
-static ssize_t iommufd_eventq_fops_read(struct file *filep, char __user *buf,
-   size_t count, loff_t *ppos)
-{
-   struct iommufd_eventq *eventq = filep->private_data;
-
-   return eventq->ops->read(eventq, buf, count, ppos);
-}
-
-static ssize_t iommufd_eventq_fops_write(struct file *filep,
-const char __user *buf, size_t count,
-loff_t *ppos)
-{
-   struct iommufd_eventq *eventq = filep->private_data;
-
-   if (!eventq->ops->write)
-   return -EOPNOTSUPP;
-   return eventq->ops->write(eventq, buf, count, ppos);
-}
-
 static __poll_t iommufd_eventq_fops_poll(struct file *filep,
 struct poll_table_struct *wait)
 {
@@ -458,34 +430,31 @@ static int iommufd_eventq_fops_release(struct inode 
*inode, struct file *filep)
return 0;
 }
 
-static const struct file_operations iommufd_eventq_fops = {
-   .owner  = THIS_MODULE,
-   .open   = nonseekable_open,
-   .read   = iommufd_eventq_fops_read,
-   .write  = iommufd_eventq_fops_write,
-   .poll   = iommufd_eventq_fops_poll,
-   .release= iommufd_eventq_fops_release,
-};
+#define INIT_EVENTQ_FOPS(read_op, write_op) \
+   (struct file_operations){   \
+   .owner = THIS_MODULE,   \
+   .open = nonseekable_open,   \
+   .read = read_op,\
+   .write = write_op,  \
+   .poll = iommufd_eventq_fops_poll,   \
+   .release = iommufd_eventq_fops_release, \
+   }
 
 static int iommufd_eventq_init(struct iommufd_eventq *eventq, char *name,
   struct iommufd_ctx *ictx,
-  const struct iommufd_eventq_ops *ops)
+  const struct file_operations *fops)
 {
struct file *filep;
int fdno;
 
-   if (WARN_ON_ONCE(!ops || !ops->read))
-   return -EINVAL;
-
mutex_init(&eventq->mutex);
INIT_LIST_HEAD(&eventq->deliver);
init_waitqueue_head(&eventq->wait_queue);
 
-   filep = anon_inode_getfile(name, &iommufd_eventq_fops, eventq, O_RDWR);
+   filep = anon_inode_getfile(name, fops, eventq, O_RDWR);
if (IS_ERR(filep))
return PTR_ERR(filep);
 
-   eventq->ops = ops;
eventq->ictx = ictx;
iommufd_ctx_get(eventq->ictx);
refcount_inc(&eventq->obj.users);
@@ -497,6 +466,9 @@ static int iommufd_eventq_init(struct iommufd_eventq 
*eventq, char *name,
return fdno;
 }
 
+static const struct file_operations iommufd_pgfault_fops =
+   INIT_EVENTQ_FOPS(iommufd_fault_fops_read, iommufd_fault_fops_write);
+
 int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
 {
struct iommu_fault_alloc *cmd = ucmd->cmd;
@@ -515,7 +487,7 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
xa_init_flags(&fault->response, XA_

Re: [PATCH v5 02/14] iommufd/fault: Add an iommufd_fault_init() helper

2025-01-10 Thread Jason Gunthorpe
On Tue, Jan 07, 2025 at 09:10:05AM -0800, Nicolin Chen wrote:
> The infrastructure of a fault object will be shared with a new vEVENTQ
> object in a following change. Add a helper for a vEVENTQ allocator to
> call it too.
> 
> Reorder the iommufd_ctx_get and refcount_inc, to keep them symmetrical
> with the iommufd_fault_fops_release().
> 
> Since the new vEVENTQ doesn't need "response", leave the xa_init_flags
> in its original location.
> 
> Reviewed-by: Kevin Tian 
> Reviewed-by: Lu Baolu 
> Signed-off-by: Nicolin Chen 
> ---
>  drivers/iommu/iommufd/fault.c | 48 ---
>  1 file changed, 28 insertions(+), 20 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason



Re: [PATCH v5 03/14] iommufd/fault: Move iommufd_fault_iopf_handler() to header

2025-01-10 Thread Jason Gunthorpe
On Tue, Jan 07, 2025 at 09:10:06AM -0800, Nicolin Chen wrote:
> The new vEVENTQ object will need a similar function for drivers to report
> the vIOMMU related events. Split the common part out to a smaller helper,
> and place it in the header so that CONFIG_IOMMUFD_DRIVER_CORE can include
> that in the driver.c file for drivers to use.
> 
> Then keep iommufd_fault_iopf_handler() in the header too, since it's quite
> simple after all.
> 
> Reviewed-by: Kevin Tian 
> Reviewed-by: Lu Baolu 
> Signed-off-by: Nicolin Chen 
> ---
>  drivers/iommu/iommufd/iommufd_private.h | 20 +++-
>  drivers/iommu/iommufd/fault.c   | 17 -
>  2 files changed, 19 insertions(+), 18 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason



Re: [PATCH v5 01/14] iommufd: Keep OBJ/IOCTL lists in an alphabetical order

2025-01-10 Thread Jason Gunthorpe
On Tue, Jan 07, 2025 at 09:10:04AM -0800, Nicolin Chen wrote:
> Reorder the existing OBJ/IOCTL lists.
> 
> Also run clang-format for the same coding style at line wrappings.
> 
> No functional change.
> 
> Reviewed-by: Lu Baolu 
> Signed-off-by: Nicolin Chen 
> ---
>  drivers/iommu/iommufd/main.c | 30 ++
>  1 file changed, 14 insertions(+), 16 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason



Re: [PATCH v5 05/14] iommufd: Rename fault.c to eventq.c

2025-01-10 Thread Jason Gunthorpe
On Tue, Jan 07, 2025 at 09:10:08AM -0800, Nicolin Chen wrote:
> Rename the file, aligning with the new eventq object.
> 
> Reviewed-by: Kevin Tian 
> Reviewed-by: Lu Baolu 
> Signed-off-by: Nicolin Chen 
> ---
>  drivers/iommu/iommufd/Makefile  | 2 +-
>  drivers/iommu/iommufd/{fault.c => eventq.c} | 0
>  2 files changed, 1 insertion(+), 1 deletion(-)
>  rename drivers/iommu/iommufd/{fault.c => eventq.c} (100%)

Reviewed-by: Jason Gunthorpe 

Jason



Re: [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts

2025-01-10 Thread Darrick J. Wong
On Fri, Jan 10, 2025 at 05:00:33PM +1100, Alistair Popple wrote:
> Prior to freeing a block, file systems supporting FS DAX must check
> that the associated pages are both unmapped from user-space and not
> undergoing DMA or other access from e.g. get_user_pages(). This is
> achieved by unmapping the file range and scanning the FS DAX
> page-cache to see if any pages within the mapping have an elevated
> refcount.
> 
> This is done using dax_layout_busy_page_range(), which returns a page
> on which to wait for the refcount to become idle. Rather than
> open-coding this, introduce a common implementation to both unmap and
> wait for the page to become idle.
> 
> Signed-off-by: Alistair Popple 

So now that Dan Carpenter has complained, I guess I should look at
this...

> ---
> 
> Changes for v5:
> 
>  - Don't wait for idle pages on non-DAX mappings
> 
> Changes for v4:
> 
>  - Fixed some build breakage due to missing symbol exports reported by
>John Hubbard (thanks!).
> ---
>  fs/dax.c| 33 +
>  fs/ext4/inode.c | 10 +-
>  fs/fuse/dax.c   | 27 +++
>  fs/xfs/xfs_inode.c  | 23 +--
>  fs/xfs/xfs_inode.h  |  2 +-
>  include/linux/dax.h | 21 +
>  mm/madvise.c|  8 
>  7 files changed, 68 insertions(+), 56 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index d010c10..9c3bd07 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -845,6 +845,39 @@ int dax_delete_mapping_entry(struct address_space 
> *mapping, pgoff_t index)
>   return ret;
>  }
>  
> +static int wait_page_idle(struct page *page,
> + void (cb)(struct inode *),
> + struct inode *inode)
> +{
> + return ___wait_var_event(page, page_ref_count(page) == 1,
> + TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> +}
> +
> +/*
> + * Unmaps the inode and waits for any DMA to complete prior to deleting the
> + * DAX mapping entries for the range.
> + */
> +int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> + void (cb)(struct inode *))
> +{
> + struct page *page;
> + int error;
> +
> + if (!dax_mapping(inode->i_mapping))
> + return 0;
> +
> + do {
> + page = dax_layout_busy_page_range(inode->i_mapping, start, end);
> + if (!page)
> + break;
> +
> + error = wait_page_idle(page, cb, inode);
> + } while (error == 0);

You didn't initialize error to 0, so it could be any value.  What if
dax_layout_busy_page_range returns null the first time through the loop?
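
One way to fix both, sketched against the quoted loop:

	int error = 0;

	...

	do {
		page = dax_layout_busy_page_range(inode->i_mapping, start, end);
		if (!page)
			break;

		error = wait_page_idle(page, cb, inode);
	} while (error == 0);

	return error;

With error initialised to 0, the !page exit on the first pass returns
success instead of stack garbage.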

> +
> + return error;
> +}
> +EXPORT_SYMBOL_GPL(dax_break_mapping);
> +
>  /*
>   * Invalidate DAX entry if it is clean.
>   */



> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 42ea203..295730a 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2715,21 +2715,17 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
>   struct xfs_inode*ip2)
>  {
>   int error;
> - boolretry;
>   struct page *page;
>  
>   if (ip1->i_ino > ip2->i_ino)
>   swap(ip1, ip2);
>  
>  again:
> - retry = false;
>   /* Lock the first inode */
>   xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
> - error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
> - if (error || retry) {
> + error = xfs_break_dax_layouts(VFS_I(ip1));
> + if (error) {
>   xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> - if (error == 0 && retry)
> - goto again;

Hmm, so the retry loop has moved into xfs_break_dax_layouts, which means
that we no longer cycle the MMAPLOCK.  Why was the lock cycling
unnecessary?

>   return error;
>   }
>  
> @@ -2988,19 +2984,11 @@ xfs_wait_dax_page(
>  
>  int
>  xfs_break_dax_layouts(
> - struct inode*inode,
> - bool*retry)
> + struct inode*inode)
>  {
> - struct page *page;
> -
>   xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL);
>  
> - page = dax_layout_busy_page(inode->i_mapping);
> - if (!page)
> - return 0;
> -
> - *retry = true;
> - return dax_wait_page_idle(page, xfs_wait_dax_page, inode);
> + return dax_break_mapping_inode(inode, xfs_wait_dax_page);
>  }
>  
>  int
> @@ -3018,8 +3006,7 @@ xfs_break_layouts(
>   retry = false;
>   switch (reason) {
>   case BREAK_UNMAP:
> - error = xfs_break_dax_layouts(inode, &retry);
> - if (error || retry)
> + if (xfs_break_dax_layouts(inode))

dax_break_mapping can return -ERESTARTSYS, right?  So doesn't this need
to be:
error = xfs_break_dax_layouts(inode);
if (error)
break;

Hm?

--D

>  

Re: [PATCH v6 21/26] fs/dax: Properly refcount fs dax pages

2025-01-10 Thread Darrick J. Wong
On Fri, Jan 10, 2025 at 05:00:49PM +1100, Alistair Popple wrote:
> Currently fs dax pages are considered free when the refcount drops to
> one and their refcounts are not increased when mapped via PTEs or
> decreased when unmapped. This requires special logic in mm paths to
> detect that these pages should not be properly refcounted, and to
> detect when the refcount drops to one instead of zero.
> 
> On the other hand get_user_pages(), etc. will properly refcount fs dax
> pages by taking a reference and dropping it when the page is
> unpinned.
> 
> Tracking this special behaviour requires extra PTE bits
> (eg. pte_devmap) and introduces rules that are potentially confusing
> and specific to FS DAX pages. To fix this, and to possibly allow
> removal of the special PTE bits in future, convert the fs dax page
> refcounts to be zero based and instead take a reference on the page
> each time it is mapped as is currently the case for normal pages.
> 
> This may also allow a future clean-up to remove the pgmap refcounting
> that is currently done in mm/gup.c.
> 
> Signed-off-by: Alistair Popple 
> 
> ---
> 
> Changes since v2:
> 
> Based on some questions from Dan I attempted to have the FS DAX page
> cache (ie. address space) hold a reference to the folio whilst it was
> mapped. However I came to the strong conclusion that this was not the
> right thing to do.
> 
> If the page refcount == 0 it means the page is:
> 
> 1. not mapped into user-space
> 2. not subject to other access via DMA/GUP/etc.
> 
> Ie. From the core MM perspective the page is not in use.
> 
> The fact a page may or may not be present in one or more address space
> mappings is irrelevant for core MM. It just means the page is still in
> use or valid from the file system perspective, and it's a
> responsibility of the file system to remove these mappings if the pfn
> mapping becomes invalid (along with first making sure the MM state,
> ie. page->refcount, is idle). So we shouldn't be trying to track that
> lifetime with MM refcounts.
> 
> Doing so just makes DMA-idle tracking more complex because there is
> now another thing (one or more address spaces) which can hold
> references on a page. And FS DAX can't even keep track of all the
> address spaces which might contain a reference to the page in the
> XFS/reflink case anyway.
> 
> We could do this if we made file systems invalidate all address space
> mappings prior to calling dax_break_layouts(), but that isn't
> currently necessary and would lead to increased faults just so we
> could do some superfluous refcounting which the file system already
> does.
> 
> I have however put the page sharing checks and WARN_ON's back which
> also turned out to be useful for figuring out when to re-initialising
> a folio.
> ---
>  drivers/nvdimm/pmem.c|   4 +-
>  fs/dax.c | 212 +++-
>  fs/fuse/virtio_fs.c  |   3 +-
>  fs/xfs/xfs_inode.c   |   2 +-
>  include/linux/dax.h  |   6 +-
>  include/linux/mm.h   |  27 +-
>  include/linux/mm_types.h |   7 +-
>  mm/gup.c |   9 +--
>  mm/huge_memory.c |   6 +-
>  mm/internal.h|   2 +-
>  mm/memory-failure.c  |   6 +-
>  mm/memory.c  |   6 +-
>  mm/memremap.c|  47 -
>  mm/mm_init.c |   9 +--
>  mm/swap.c|   2 +-
>  15 files changed, 183 insertions(+), 165 deletions(-)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index d81faa9..785b2d2 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -513,7 +513,7 @@ static int pmem_attach_disk(struct device *dev,
>  
>   pmem->disk = disk;
>   pmem->pgmap.owner = pmem;
> - pmem->pfn_flags = PFN_DEV;
> + pmem->pfn_flags = 0;
>   if (is_nd_pfn(dev)) {
>   pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
>   pmem->pgmap.ops = &fsdax_pagemap_ops;
> @@ -522,7 +522,6 @@ static int pmem_attach_disk(struct device *dev,
>   pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
>   pmem->pfn_pad = resource_size(res) -
>   range_len(&pmem->pgmap.range);
> - pmem->pfn_flags |= PFN_MAP;
>   bb_range = pmem->pgmap.range;
>   bb_range.start += pmem->data_offset;
>   } else if (pmem_should_map_pages(dev)) {
> @@ -532,7 +531,6 @@ static int pmem_attach_disk(struct device *dev,
>   pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
>   pmem->pgmap.ops = &fsdax_pagemap_ops;
>   addr = devm_memremap_pages(dev, &pmem->pgmap);
> - pmem->pfn_flags |= PFN_MAP;
>   bb_range = pmem->pgmap.range;
>   } else {
>   addr = devm_memremap(dev, pmem->phys_addr,
> diff --git a/fs/dax.c b/fs/dax.c
> index d35dbe1..19f444e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -71,6 +71,11 @@ static unsigned long dax_to_pfn(void *entry)
>   return xa_to_value(entry) >> DAX_SHIFT;
>

Re: [PATCH v13 0/5] Extended MODVERSIONS Support

2025-01-10 Thread Masahiro Yamada
On Sat, Jan 4, 2025 at 2:37 AM Matthew Maurer  wrote:
>
> This patch series is intended for use alongside the Implement DWARF
> modversions series [1] to enable RUST and MODVERSIONS at the same
> time.
>
> Elsewhere, we've seen a desire for long symbol name support for LTO
> symbol names [2], and the previous series came up [3] as a possible
> solution rather than hashing, which some have objected [4] to.
>
> This series adds a MODVERSIONS format which uses a section per column.
> This avoids userspace tools breaking if we need to make a similar change
> to the format in the future - we would do so by adding a new section,
> rather than editing the struct definition. In the new format, the name
> section is formatted as a concatenated sequence of NUL-terminated
> strings, which allows for arbitrary length names.
>
> Emitting the extended format is guarded by CONFIG_EXTENDED_MODVERSIONS,
> but the kernel always knows how to validate both the original and
> extended formats.
>
> Emitting the existing format is now guarded by CONFIG_BASIC_MODVERSIONS,
> but it is enabled by default when MODVERSIONS is enabled and must be
> explicitly disabled by the user.
>
> Disabling CONFIG_BASIC_MODVERSIONS may cause some userspace tools to be
> unable to retrieve CRCs until they are patched to understand the new
> location. Even with CONFIG_BASIC_MODVERSIONS enabled, those tools will
> be unable to read the CRCs for long symbols until they are updated to
> read the new format. This is not expected to interfere with normal
> operation, as the primary use for CRCs embedded in the module is
> load-time verification by the kernel. Recording and monitoring of CRCs
> is typically done through Module.symvers.
>
> Selecting RUST and MODVERSIONS is now possible if GENDWARFKSYMS is
> selected, and will implicitly select EXTENDED_MODVERSIONS.
>
> This series depends upon DWARF-based versions [1] and Masahiro's u32
> fixup patch [5].
>
> [1] 
> https://lore.kernel.org/lkml/20241219210736.2990838-20-samitolva...@google.com/
> [2] https://lore.kernel.org/lkml/20240605032120.3179157-1-s...@kernel.org/
> [3] https://lore.kernel.org/lkml/zoxbeesk40asi...@bombadil.infradead.org/
> [4] 
> https://lore.kernel.org/lkml/0b2697fd-7ab4-469f-83a6-ec9ebc701...@suse.com/
> [5] 
> https://lore.kernel.org/linux-kbuild/20241228154603.2234284-1-masahi...@kernel.org
>
> Changes in v13:
> - Fixed up missed s32 usage (Thanks Sami).


Applied to linux-kbuild.
Thanks.




-- 
Best Regards
Masahiro Yamada



Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper

2025-01-10 Thread Jason Gunthorpe
On Tue, Jan 07, 2025 at 09:10:11AM -0800, Nicolin Chen wrote:
> +/*
> + * Typically called in driver's threaded IRQ handler.
> + * The @type and @event_data must be defined in include/uapi/linux/iommufd.h
> + */
> +int iommufd_viommu_report_event(struct iommufd_viommu *viommu,
> + enum iommu_veventq_type type, void *event_data,
> + size_t data_len)
> +{
> + struct iommufd_veventq *veventq;
> + struct iommufd_vevent *vevent;
> + int rc = 0;
> +
> + if (!viommu)
> + return -ENODEV;
> + if (WARN_ON_ONCE(!viommu->ops || !viommu->ops->supports_veventq ||
> +  !viommu->ops->supports_veventq(type)))
> + return -EOPNOTSUPP;
> + if (WARN_ON_ONCE(!data_len || !event_data))
> + return -EINVAL;
> +
> + down_read(&viommu->veventqs_rwsem);
> +
> + veventq = iommufd_viommu_find_veventq(viommu, type);
> + if (!veventq) {
> + rc = -EOPNOTSUPP;
> + goto out_unlock_veventqs;
> + }
> +
> + vevent = kmalloc(struct_size(vevent, event_data, data_len), GFP_KERNEL);
> + if (!vevent) {
> + rc = -ENOMEM;
> + goto out_unlock_veventqs;
> + }
> + memcpy(vevent->event_data, event_data, data_len);

The page fault path is self-limiting because endpoint devices are only
able to issue a certain number of PRIs before they have to stop.

But the async events generated by something like the SMMU are not self
limiting and we can have a huge barrage of them. I think you need to
add some kind of limiting here otherwise we will OOM the kernel and
crash, e.g. if the VM spams protection errors.

The virtual event queue should behave the same as if the physical
event queue overflows, and that logic should be in the smmu driver -
this should return some Exxx to indicate the queue is filled.

I suppose we will need a way to indicate lost events to userspace on
top of this?

Presumably userspace should specify the max queue size.

Jason



Re: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC

2025-01-10 Thread Jason Gunthorpe
On Tue, Jan 07, 2025 at 09:10:09AM -0800, Nicolin Chen wrote:

> +static ssize_t iommufd_veventq_fops_read(struct iommufd_eventq *eventq,
> +  char __user *buf, size_t count,
> +  loff_t *ppos)
> +{
> + size_t done = 0;
> + int rc = 0;
> +
> + if (*ppos)
> + return -ESPIPE;
> +
> + mutex_lock(&eventq->mutex);
> + while (!list_empty(&eventq->deliver) && count > done) {
> + struct iommufd_vevent *cur = list_first_entry(
> + &eventq->deliver, struct iommufd_vevent, node);
> +
> + if (cur->data_len > count - done)
> + break;
> +
> + if (copy_to_user(buf + done, cur->event_data, cur->data_len)) {
> + rc = -EFAULT;
> + break;
> + }

Now that I look at this more closely, the fault path this is copied
from is not great.

This copy_to_user() can block while waiting on a page fault, possibily
for a long time. While blocked the mutex is held and we can't add more
entries to the list.

That will cause the shared IRQ handler in the iommu driver to back up,
which would cause a global DOS.

This probably wants to be organized to look more like

while (itm = eventq_get_next_item(eventq)) {
   if (..) {
   eventq_restore_failed_item(eventq);
   return -1;
   }
}

Where the next_item would just be a simple spinlock across the linked
list manipulation.
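
A rough sketch of those two helpers (names from the pseudocode above;
the eventq->lock spinlock is an assumed field):

	static struct iommufd_vevent *
	eventq_get_next_item(struct iommufd_eventq *eventq)
	{
		struct iommufd_vevent *cur = NULL;

		spin_lock(&eventq->lock);
		if (!list_empty(&eventq->deliver)) {
			cur = list_first_entry(&eventq->deliver,
					       struct iommufd_vevent, node);
			list_del(&cur->node);
		}
		spin_unlock(&eventq->lock);
		return cur;
	}

	static void eventq_restore_failed_item(struct iommufd_eventq *eventq,
					       struct iommufd_vevent *cur)
	{
		spin_lock(&eventq->lock);
		list_add(&cur->node, &eventq->deliver);
		spin_unlock(&eventq->lock);
	}

Then copy_to_user() runs with no lock held, and a failed copy just puts
the item back at the head of the list.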

Jason



Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper

2025-01-10 Thread Jason Gunthorpe
On Fri, Jan 10, 2025 at 10:38:42AM -0800, Nicolin Chen wrote:
> > The virtual event queue should behave the same as if the physical
> > event queue overflows, and that logic should be in the smmu driver -
> > this should return some Exxx to indicate the queue is filled.
> 
> Hmm, the driver only screams...
> 
> static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
> {
> [...]
>   /*
>* Not much we can do on overflow, so scream and pretend we're
>* trying harder.
>*/
>   if (queue_sync_prod_in(q) == -EOVERFLOW)
>   dev_err(smmu->dev, "EVTQ overflow detected -- events 
> lost\n");

Well it must know from the HW somehow that the overflow has happened??

> > I suppose we will need a way to indicate lost events to userspace on
> > top of this?
> 
> Perhaps another u32 flag in the arm_smmuv3_vevent struct to report
> an overflow. That said, what userspace/VMM will need to do with it?

Trigger the above code in the VM somehow?

Jason



Re: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC

2025-01-10 Thread Jason Gunthorpe
On Fri, Jan 10, 2025 at 11:27:53AM -0800, Nicolin Chen wrote:
> On Fri, Jan 10, 2025 at 01:48:42PM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 07, 2025 at 09:10:09AM -0800, Nicolin Chen wrote:
> > 
> > > +static ssize_t iommufd_veventq_fops_read(struct iommufd_eventq *eventq,
> > > +  char __user *buf, size_t count,
> > > +  loff_t *ppos)
> > > +{
> > > + size_t done = 0;
> > > + int rc = 0;
> > > +
> > > + if (*ppos)
> > > + return -ESPIPE;
> > > +
> > > + mutex_lock(&eventq->mutex);
> > > + while (!list_empty(&eventq->deliver) && count > done) {
> > > + struct iommufd_vevent *cur = list_first_entry(
> > > + &eventq->deliver, struct iommufd_vevent, node);
> > > +
> > > + if (cur->data_len > count - done)
> > > + break;
> > > +
> > > + if (copy_to_user(buf + done, cur->event_data, cur->data_len)) {
> > > + rc = -EFAULT;
> > > + break;
> > > + }
> > 
> > Now that I look at this more closely, the fault path this is copied
> > from is not great.
> > 
> > This copy_to_user() can block while waiting on a page fault, possibily
> > for a long time. While blocked the mutex is held and we can't add more
> > entries to the list.
> >
> > That will cause the shared IRQ handler in the iommu driver to back up,
> > which would cause a global DOS.
> >
> > This probably wants to be organized to look more like
> > 
> > while (itm = eventq_get_next_item(eventq)) {
> >if (..) {
> >eventq_restore_failed_item(eventq);
> >return -1;
> >}
> > }
> > 
> > Where the next_item would just be a simple spinlock across the linked
> > list manipulation.
> 
> Would it be simpler to just limit it to one node per read(), i.e.
> no "while (!list_empty)" and no blocking?
> 
> The report() adds one node at a time, and wakes up the poll()
> each time a node is added. And user space could read one event
> at a time too?

That doesn't really help; the issue is that it holds the lock over the
copy_to_user(), which it is doing because it doesn't want to pull the item
off the list and then try to handle the failure and put it back.

Jason



Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper

2025-01-10 Thread Nicolin Chen
On Fri, Jan 10, 2025 at 03:51:14PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 10:38:42AM -0800, Nicolin Chen wrote:
> > > The virtual event queue should behave the same as if the physical
> > > event queue overflows, and that logic should be in the smmu driver -
> > > this should return some Exxx to indicate the queue is filled.
> > 
> > Hmm, the driver only screams...
> > 
> > static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
> > {
> > [...]
> > /*
> >  * Not much we can do on overflow, so scream and pretend we're
> >  * trying harder.
> >  */
> > if (queue_sync_prod_in(q) == -EOVERFLOW)
> > dev_err(smmu->dev, "EVTQ overflow detected -- events 
> > lost\n");
> 
> Well it must know from the HW somehow that the overflow has happened??
> 
> > > I suppose we will need a way to indicate lost events to userspace on
> > > top of this?
> > 
> > Perhaps another u32 flag in the arm_smmuv3_vevent struct to report
> > an overflow. That said, what userspace/VMM will need to do with it?
> 
> Trigger the above code in the VM somehow?

Oh, I see. I misunderstood somehow..

Thanks
Nicolin



Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper

2025-01-10 Thread Nicolin Chen
On Fri, Jan 10, 2025 at 01:41:32PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 07, 2025 at 09:10:11AM -0800, Nicolin Chen wrote:
> > +/*
> > + * Typically called in driver's threaded IRQ handler.
> > + * The @type and @event_data must be defined in 
> > include/uapi/linux/iommufd.h
> > + */
> > +int iommufd_viommu_report_event(struct iommufd_viommu *viommu,
> > +   enum iommu_veventq_type type, void *event_data,
> > +   size_t data_len)
> > +{
> > +   struct iommufd_veventq *veventq;
> > +   struct iommufd_vevent *vevent;
> > +   int rc = 0;
> > +
> > +   if (!viommu)
> > +   return -ENODEV;
> > +   if (WARN_ON_ONCE(!viommu->ops || !viommu->ops->supports_veventq ||
> > +!viommu->ops->supports_veventq(type)))
> > +   return -EOPNOTSUPP;
> > +   if (WARN_ON_ONCE(!data_len || !event_data))
> > +   return -EINVAL;
> > +
> > +   down_read(&viommu->veventqs_rwsem);
> > +
> > +   veventq = iommufd_viommu_find_veventq(viommu, type);
> > +   if (!veventq) {
> > +   rc = -EOPNOTSUPP;
> > +   goto out_unlock_veventqs;
> > +   }
> > +
> > +   vevent = kmalloc(struct_size(vevent, event_data, data_len), GFP_KERNEL);
> > +   if (!vevent) {
> > +   rc = -ENOMEM;
> > +   goto out_unlock_veventqs;
> > +   }
> > +   memcpy(vevent->event_data, event_data, data_len);
> 
> The page fault path is self-limiting because endpoint devices are only
> able to issue a certain number of PRIs before they have to stop.
> 
> But the async events generated by something like the SMMU are not self
> limiting and we can have a huge barrage of them. I think you need to
> add some kind of limiting here otherwise we will OOM the kernel and
> crash, e.g. if the VM spams protection errors.

Ack. I think we can just use an atomic counter in the producer
and consumer functions.
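
Roughly like this (the num_events/max_events fields are made up here;
max_events would come from userspace per your point below):

	/* producer, in iommufd_viommu_report_event() */
	if (atomic_inc_return(&veventq->num_events) > veventq->max_events) {
		atomic_dec(&veventq->num_events);
		rc = -EOVERFLOW;
		goto out_unlock_veventqs;
	}

	/* consumer, after an event is copied out in the read path */
	atomic_dec(&veventq->num_events);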

> The virtual event queue should behave the same as if the physical
> event queue overflows, and that logic should be in the smmu driver -
> this should return some Exxx to indicate the queue is filled.

Hmm, the driver only screams...

static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
{
[...]
/*
 * Not much we can do on overflow, so scream and pretend we're
 * trying harder.
 */
if (queue_sync_prod_in(q) == -EOVERFLOW)
dev_err(smmu->dev, "EVTQ overflow detected -- events 
lost\n");

> I suppose we will need a way to indicate lost events to userspace on
> top of this?

Perhaps another u32 flag in the arm_smmuv3_vevent struct to report
an overflow. That said, what userspace/VMM will need to do with it?

> Presumably userspace should specify the max queue size.

Yes. Similarly, vCMDQ has a vcmdq_log2size in the driver structure
for that. For veventq, this piece is core-managed, so we will need
a veventq_size or similar in the common iommufd_veventq_alloc structure.
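
i.e. something like this in the uAPI struct (a sketch; the field name is
a placeholder and the existing fields are elided):

	struct iommu_veventq_alloc {
		/* existing fields elided */
		__u32 veventq_size;	/* max number of events queued in kernel */
	};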

Thanks!
Nicolin



Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper

2025-01-10 Thread Nicolin Chen
On Fri, Jan 10, 2025 at 10:51:49AM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 07:12:46AM +, Tian, Kevin wrote:
> 
> > > + if (!viommu)
> > > + return -ENODEV;
> > > + if (WARN_ON_ONCE(!viommu->ops || !viommu->ops-
> > > >supports_veventq ||
> > > +  !viommu->ops->supports_veventq(type)))
> > > + return -EOPNOTSUPP;
> > 
> > Hmm the driver knows which type is supported by itself before
> > calling this helper. Why bother having the helper calling into
> > the driver again to verify?
> 
> Indeed, it might make sense to protect this with
> 
> if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
> 
> As a compiled-out assertion
> 
> Or drop it
> 
> We shouldn't have unnecessary argument validation on fast paths;
> the !viommu check should go too.

Ack. I will drop those two if-statments.

Nicolin



Re: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC

2025-01-10 Thread Nicolin Chen
On Fri, Jan 10, 2025 at 01:48:42PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 07, 2025 at 09:10:09AM -0800, Nicolin Chen wrote:
> 
> > +static ssize_t iommufd_veventq_fops_read(struct iommufd_eventq *eventq,
> > +char __user *buf, size_t count,
> > +loff_t *ppos)
> > +{
> > +   size_t done = 0;
> > +   int rc = 0;
> > +
> > +   if (*ppos)
> > +   return -ESPIPE;
> > +
> > +   mutex_lock(&eventq->mutex);
> > +   while (!list_empty(&eventq->deliver) && count > done) {
> > +   struct iommufd_vevent *cur = list_first_entry(
> > +   &eventq->deliver, struct iommufd_vevent, node);
> > +
> > +   if (cur->data_len > count - done)
> > +   break;
> > +
> > +   if (copy_to_user(buf + done, cur->event_data, cur->data_len)) {
> > +   rc = -EFAULT;
> > +   break;
> > +   }
> 
> Now that I look at this more closely, the fault path this is copied
> from is not great.
> 
> This copy_to_user() can block while waiting on a page fault, possibily
> for a long time. While blocked the mutex is held and we can't add more
> entries to the list.
>
> That will cause the shared IRQ handler in the iommu driver to back up,
> which would cause a global DOS.
>
> This probably wants to be organized to look more like
> 
> while (itm = eventq_get_next_item(eventq)) {
>if (..) {
>eventq_restore_failed_item(eventq);
>return -1;
>}
> }
> 
> Where the next_item would just be a simple spinlock across the linked
> list manipulation.

Would it be simpler to just limit it to one node per read(), i.e.
no "while (!list_empty)" and no blocking?

The report() adds one node at a time, and wakes up the poll()
each time a node is added. And user space could read one event
at a time too?

Thanks
Nicolin



Re: [PATCH v2 2/3] tun: Pad virtio header with zero

2025-01-10 Thread Akihiko Odaki

On 2025/01/10 17:33, Michael S. Tsirkin wrote:

On Fri, Jan 10, 2025 at 01:38:06PM +0900, Akihiko Odaki wrote:

On 2025/01/09 21:46, Willem de Bruijn wrote:

Akihiko Odaki wrote:

On 2025/01/09 16:31, Michael S. Tsirkin wrote:

On Thu, Jan 09, 2025 at 03:58:44PM +0900, Akihiko Odaki wrote:

tun used to simply advance the iov_iter when it needed to pad the virtio
header, which leaves the garbage in the buffer as is. This is especially
problematic when tun starts to allow enabling the hash reporting
feature; even if the feature is enabled, the packet may lack a hash
value and may contain a hole in the virtio header because the packet
arrived before the feature was enabled or does not contain the
header fields to be hashed. If the hole is not filled with zero, it is
impossible to tell if the packet lacks a hash value.


Zero is a valid hash value, so cannot be used as an indication that
hashing is inactive.


Zeroing will initialize the hash_report field to
VIRTIO_NET_HASH_REPORT_NONE, which indicates that the packet does not
have a hash value.




In theory, a user of tun can fill the buffer with zero before calling
read() to avoid such a problem, but leaving the garbage in the buffer is
awkward anyway, so fill the buffer in tun.

Signed-off-by: Akihiko Odaki 


But if the user did it, you have just overwritten his value,
did you not?


Yes, but that means the user expects some part of the buffer not to be
filled after read() or recvmsg(). I'm a bit worried that not filling the buffer
may break assumptions others (especially the filesystem and socket
infrastructures in the kernel) may have.


If this is user memory that is ignored by the kernel, just reflected
back, then there is no need in general to zero it. There are many such
instances, also in msg_control.


More specifically, is there any instance of recvmsg() implementation which
returns N and does not fill the complete N bytes of msg_iter?


The one in tun. It was a silly idea but it has been here for years now.


Except tun. If there is such an example of recvmsg() implementation and 
it is not accidental and people have agreed to keep it functioning, we 
can confidently say this construct is safe without fearing pushback from 
people maintaining filesystem/networking infrastructure. Ultimately I 
want those people to decide whether this can be supported in the future or not.







If not zeroing leads to ambiguity with the new feature, that would be
a reason to add it -- it is always safe to do so.

If we are really confident that it will not cause problems, this
behavior can be opt-in based on a flag or we can just write some
documentation warning userspace programmers to initialize the buffer.
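
For concreteness, the change under discussion boils down to something
like this in tun's header-writing path (a sketch, not the exact hunk;
vnet_hdr_sz and hdr are placeholder names):

	/* Zero the padding between the virtio header and the payload
	 * instead of merely skipping it, so an unwritten hash_report
	 * reads back as VIRTIO_NET_HASH_REPORT_NONE (0).
	 */
	if (iov_iter_zero(vnet_hdr_sz - sizeof(hdr), iter) !=
	    vnet_hdr_sz - sizeof(hdr))
		return -EFAULT;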







Re: [PATCH v5 04/14] iommufd: Abstract an iommufd_eventq from iommufd_fault

2025-01-10 Thread Nicolin Chen
On Fri, Jan 10, 2025 at 01:26:37PM -0400, Jason Gunthorpe wrote:
> +#define INIT_EVENTQ_FOPS(read_op, write_op) \
> + (struct file_operations){   \
> + .owner = THIS_MODULE,   \
> + .open = nonseekable_open,   \
> + .read = read_op,\
> + .write = write_op,  \
> + .poll = iommufd_eventq_fops_poll,   \
> + .release = iommufd_eventq_fops_release, \
> + }

There is an ERROR complained by checkpatch. So I changed a bit,
and squashed it to the previous patch adding iommufd_fault_init:

+#define INIT_FAULT_FOPS(read_op, write_op) 
\
+   ((const struct file_operations){   \
+   .owner = THIS_MODULE,  \
+   .open = nonseekable_open,  \
+   .read = read_op,   \
+   .write = write_op, \
+   .poll = iommufd_fault_fops_poll,   \
+   .release = iommufd_fault_fops_release, \
+   })

Thanks
Nicolin



Re: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC

2025-01-10 Thread Nicolin Chen
On Fri, Jan 10, 2025 at 07:06:49AM +, Tian, Kevin wrote:
> > From: Nicolin Chen 
> > Sent: Wednesday, January 8, 2025 1:10 AM
> > +
> > +int iommufd_veventq_alloc(struct iommufd_ucmd *ucmd)
> > +{
> > +   struct iommu_veventq_alloc *cmd = ucmd->cmd;
> > +   struct iommufd_veventq *veventq;
> > +   struct iommufd_viommu *viommu;
> > +   int fdno;
> > +   int rc;
> > +
> > +   if (cmd->flags || cmd->type == IOMMU_VEVENTQ_TYPE_DEFAULT)
> > +   return -EOPNOTSUPP;
> > +
> > +   viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
> > +   if (IS_ERR(viommu))
> > +   return PTR_ERR(viommu);
> > +
> > +   if (!viommu->ops || !viommu->ops->supports_veventq ||
> > +   !viommu->ops->supports_veventq(cmd->type))
> > +   return -EOPNOTSUPP;
> > +
> 
> I'm not sure about the necessity of the above check. The event queue
> is just a software struct with a user-specified format for the iommu
> driver to report viommu events. The struct itself is not constrained
> by the hardware capability, though I'm not sure of a real usage in
> which an smmu driver wants to report a vtd event. But legitimately
> a user can create any type of event queue which might just never
> be used.

Allowing a random type that a driver will never use for reporting
doesn't make a lot of sense to me...

That being said, yeah... I guess we could drop the limit here, since
it isn't going to break anything?

> It sounds clearer to do the check when the IOPF cap is actually enabled
> on a device contained in the viommu. At that point, check whether
> an event queue of the required type has been created. If not, then
> fail the iopf enabling.

Hmm, isn't IOPF a different channel?

And a vEVENTQ is per vIOMMU, not necessarily per vDEVICE/device..

> Then it probably reveals another todo in this series. Seems you still
> let the smmu driver statically enable iopf when probing the device. 
> Sounds like iommufd_viommu_alloc_hwpt_nested() may accept
> IOMMU_HWPT_FAULT_ID_VALID to refer to an event queue and
> later dynamically enable/disable iopf when attaching a device to the
> hwpt and check the event queue type there. Just like how the fault
> object is handled.

You've lost me here :-/

Thanks
Nicolin



Re: [PATCH v5 07/14] iommufd/viommu: Add iommufd_viommu_get_vdev_id helper

2025-01-10 Thread Nicolin Chen
On Fri, Jan 10, 2025 at 07:07:36AM +, Tian, Kevin wrote:
> > From: Nicolin Chen 
> > Sent: Wednesday, January 8, 2025 1:10 AM
> > +
> > +   xa_lock(&viommu->vdevs);
> > +   xa_for_each(&viommu->vdevs, index, vdev) {
> > +   if (vdev && vdev->dev == dev) {
> > +   vdev_id = (unsigned long)vdev->id;
> > +   break;
> > +   }
> > +   }
> 
> Nit. No need to check vdev pointer as commented in last review.

Oh, I missed that. Will drop it.

> Reviewed-by: Kevin Tian 

Thanks
Nicolin