Re: [PATCH v2 03/47] mm: shrinker: add infrastructure for dynamically allocating shrinker

2023-07-26 Thread Dave Chinner via Virtualization
On Mon, Jul 24, 2023 at 05:43:10PM +0800, Qi Zheng wrote:
> Currently, the shrinker instances can be divided into the following three
> types:
> 
> a) global shrinker instance statically defined in the kernel, such as
>workingset_shadow_shrinker.
> 
> b) global shrinker instance statically defined in the kernel modules, such
>as mmu_shrinker in x86.
> 
> c) shrinker instance embedded in other structures.
> 
> For case a, the memory of shrinker instance is never freed. For case b,
> the memory of shrinker instance will be freed after synchronize_rcu() when
> the module is unloaded. For case c, the memory of shrinker instance will
> be freed along with the structure it is embedded in.
> 
> In preparation for implementing lockless slab shrink, we need to
> dynamically allocate those shrinker instances in case c, so that their
> memory can be freed independently by calling kfree_rcu().
> 
> So this commit adds the following new APIs for dynamically allocating
> shrinkers, and adds a private_data field to struct shrinker to record and
> retrieve the original embedded structure.
> 
> 1. shrinker_alloc()
> 
> Used to allocate the shrinker instance itself and related memory. It
> returns a pointer to the shrinker instance on success and NULL on failure.
> 
> 2. shrinker_free_non_registered()
> 
> Used to destroy the non-registered shrinker instance.

This is a bit nasty

> 
> 3. shrinker_register()
> 
> Used to register the shrinker instance, which is the same as the current
> register_shrinker_prepared().
> 
> 4. shrinker_unregister()

Rename this "shrinker_free()" and key the two different freeing
cases on the SHRINKER_REGISTERED bit rather than mostly duplicating
the two paths.

void shrinker_free(struct shrinker *shrinker)
{
	struct dentry *debugfs_entry = NULL;
	int debugfs_id;

	if (!shrinker)
		return;

	down_write(&shrinker_rwsem);
	if (shrinker->flags & SHRINKER_REGISTERED) {
		list_del(&shrinker->list);
		debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
	} else if (IS_ENABLED(CONFIG_SHRINKER_DEBUG)) {
		kfree_const(shrinker->name);
	}

	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
		unregister_memcg_shrinker(shrinker);
	up_write(&shrinker_rwsem);

	if (debugfs_entry)
		shrinker_debugfs_remove(debugfs_entry, debugfs_id);

	kfree(shrinker->nr_deferred);
	kfree(shrinker);
}
EXPORT_SYMBOL_GPL(shrinker_free);
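For context, a hypothetical caller of the proposed lifecycle might look like the sketch below. The API names follow the suggestions above; the embedding structure, callback names, and the exact shrinker_alloc() signature are assumptions, not part of any posted patch.

```c
/* Hypothetical embedding structure using the proposed API. Everything
 * except shrinker_alloc/shrinker_register/shrinker_free is made up here.
 */
struct my_cache {
	struct shrinker *shrinker;
	/* ... other members ... */
};

static int my_cache_init(struct my_cache *cache)
{
	cache->shrinker = shrinker_alloc(SHRINKER_MEMCG_AWARE, "my-cache");
	if (!cache->shrinker)
		return -ENOMEM;

	cache->shrinker->count_objects = my_cache_count;
	cache->shrinker->scan_objects = my_cache_scan;
	cache->shrinker->private_data = cache;	/* recover cache in callbacks */

	shrinker_register(cache->shrinker);
	return 0;
}

static void my_cache_destroy(struct my_cache *cache)
{
	/* Single teardown path: with the SHRINKER_REGISTERED keying above,
	 * this works whether or not registration ever happened.
	 */
	shrinker_free(cache->shrinker);
}
```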

-- 
Dave Chinner
da...@fromorbit.com
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net-next v3 4/4] vsock/test: MSG_PEEK test for SOCK_SEQPACKET

2023-07-26 Thread Stefano Garzarella

On Tue, Jul 25, 2023 at 08:29:12PM +0300, Arseniy Krasnov wrote:

This adds MSG_PEEK test for SOCK_SEQPACKET. It works in the same way as
SOCK_STREAM test, except it also tests MSG_TRUNC flag.

Signed-off-by: Arseniy Krasnov 
---
tools/testing/vsock/vsock_test.c | 58 +---
1 file changed, 54 insertions(+), 4 deletions(-)


Reviewed-by: Stefano Garzarella 

Thanks,
Stefano



diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index 444a3ff0681f..90718c2fd4ea 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -257,14 +257,19 @@ static void test_stream_multiconn_server(const struct test_opts *opts)

#define MSG_PEEK_BUF_LEN 64

-static void test_stream_msg_peek_client(const struct test_opts *opts)
+static void test_msg_peek_client(const struct test_opts *opts,
+bool seqpacket)
{
unsigned char buf[MSG_PEEK_BUF_LEN];
ssize_t send_size;
int fd;
int i;

-   fd = vsock_stream_connect(opts->peer_cid, 1234);
+   if (seqpacket)
+   fd = vsock_seqpacket_connect(opts->peer_cid, 1234);
+   else
+   fd = vsock_stream_connect(opts->peer_cid, 1234);
+
if (fd < 0) {
perror("connect");
exit(EXIT_FAILURE);
@@ -290,7 +295,8 @@ static void test_stream_msg_peek_client(const struct test_opts *opts)
close(fd);
}

-static void test_stream_msg_peek_server(const struct test_opts *opts)
+static void test_msg_peek_server(const struct test_opts *opts,
+bool seqpacket)
{
unsigned char buf_half[MSG_PEEK_BUF_LEN / 2];
unsigned char buf_normal[MSG_PEEK_BUF_LEN];
@@ -298,7 +304,11 @@ static void test_stream_msg_peek_server(const struct test_opts *opts)
ssize_t res;
int fd;

-   fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+   if (seqpacket)
+   fd = vsock_seqpacket_accept(VMADDR_CID_ANY, 1234, NULL);
+   else
+   fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+
if (fd < 0) {
perror("accept");
exit(EXIT_FAILURE);
@@ -340,6 +350,21 @@ static void test_stream_msg_peek_server(const struct test_opts *opts)
exit(EXIT_FAILURE);
}

+   if (seqpacket) {
+   /* This type of socket supports MSG_TRUNC flag,
+* so check it with MSG_PEEK. We must get length
+* of the message.
+*/
+   res = recv(fd, buf_half, sizeof(buf_half), MSG_PEEK |
+  MSG_TRUNC);
+   if (res != sizeof(buf_peek)) {
+   fprintf(stderr,
+   "recv(2) + MSG_PEEK | MSG_TRUNC, exp %zu, got %zi\n",
+   sizeof(buf_half), res);
+   exit(EXIT_FAILURE);
+   }
+   }
+
res = recv(fd, buf_normal, sizeof(buf_normal), 0);
if (res != sizeof(buf_normal)) {
fprintf(stderr, "recv(2), expected %zu, got %zi\n",
@@ -356,6 +381,16 @@ static void test_stream_msg_peek_server(const struct test_opts *opts)
close(fd);
}

+static void test_stream_msg_peek_client(const struct test_opts *opts)
+{
+   return test_msg_peek_client(opts, false);
+}
+
+static void test_stream_msg_peek_server(const struct test_opts *opts)
+{
+   return test_msg_peek_server(opts, false);
+}
+
#define SOCK_BUF_SIZE (2 * 1024 * 1024)
#define MAX_MSG_SIZE (32 * 1024)

@@ -1125,6 +1160,16 @@ static void test_stream_virtio_skb_merge_server(const struct test_opts *opts)
close(fd);
}

+static void test_seqpacket_msg_peek_client(const struct test_opts *opts)
+{
+   return test_msg_peek_client(opts, true);
+}
+
+static void test_seqpacket_msg_peek_server(const struct test_opts *opts)
+{
+   return test_msg_peek_server(opts, true);
+}
+
static struct test_case test_cases[] = {
{
.name = "SOCK_STREAM connection reset",
@@ -1200,6 +1245,11 @@ static struct test_case test_cases[] = {
.run_client = test_stream_virtio_skb_merge_client,
.run_server = test_stream_virtio_skb_merge_server,
},
+   {
+   .name = "SOCK_SEQPACKET MSG_PEEK",
+   .run_client = test_seqpacket_msg_peek_client,
+   .run_server = test_seqpacket_msg_peek_server,
+   },
{},
};

--
2.25.1





Re: [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless

2023-07-26 Thread Dave Chinner via Virtualization
On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in shrinkers subsystem,
> which protects most operations such as slab shrink, registration and
> unregistration of shrinkers, etc. This can easily cause problems in the
> following cases.
> 
> 1) When the memory pressure is high and there are many filesystems
>mounted or unmounted at the same time, slab shrink will be affected
>(down_read_trylock() failed).
> 
>Such as the real workload mentioned by Kirill Tkhai:
> 
>```
>One of the real workloads from my experience is start
>of an overcommitted node containing many starting
>containers after node crash (or many resuming containers
>after reboot for kernel update). In these cases memory
>pressure is huge, and the node goes round in long reclaim.
>```
> 
> 2) If a shrinker is blocked (such as the case mentioned
>in [1]) and a writer comes in (such as mount a fs),
>then this writer will be blocked and cause all
>subsequent shrinker-related operations to be blocked.
> 
> Even if there is no competitor when shrinking slab, there may still be a
> problem. The down_read_trylock() may become a perf hotspot with frequent
> calls to shrink_slab(). Because of the poor multicore scalability of
> atomic operations, this can lead to a significant drop in IPC
> (instructions per cycle).
> 
> We used to implement the lockless slab shrink with SRCU [2], but then
> kernel test robot reported -88.8% regression in
> stress-ng.ramfs.ops_per_sec test case [3], so we reverted it [4].
> 
> This commit uses the refcount+RCU method [5] proposed by Dave Chinner
> to re-implement the lockless global slab shrink. The memcg slab shrink is
> handled in the subsequent patch.
> 
> For now, all shrinker instances are converted to dynamically allocated and
> will be freed by kfree_rcu(). So we can use rcu_read_{lock,unlock}() to
> ensure that the shrinker instance is valid.
> 
> And the shrinker instance will not be run again after unregistration. So
> the structure that records the pointer of shrinker instance can be safely
> freed without waiting for the RCU read-side critical section.
> 
> In this way, while we implement the lockless slab shrink, we don't need to
> be blocked in unregister_shrinker().
> 
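The iteration pattern this enables can be sketched roughly as follows. This is a hedged sketch, not the exact patch: the helper names (shrinker_try_get/shrinker_put) follow the refcount+RCU proposal cited as [5], and details such as the accounting are omitted.

```c
/* Rough sketch of lockless shrinker iteration: RCU protects the list walk,
 * while a per-shrinker refcount pins the instance so the RCU read-side
 * critical section can be dropped around the (possibly sleeping) scan.
 */
rcu_read_lock();
list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
	if (!shrinker_try_get(shrinker))	/* skip if being freed */
		continue;
	rcu_read_unlock();

	freed += do_shrink_slab(&sc, shrinker, priority);

	rcu_read_lock();
	shrinker_put(shrinker);		/* may drop the final reference */
}
rcu_read_unlock();
```

Because the instance itself is freed via kfree_rcu() once the refcount hits zero, no global lock is taken on this path, which is what removes the down_read_trylock() hotspot described above.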
> The following are the test results:
> 
> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
> 
> 1) Before applying this patchset:
> 
> setting to a 60 second run per stressor
> dispatching hogs: 9 ramfs
> stressor   bogo ops real time  usr time  sys time   bogo ops/s bogo ops/s
>   (secs)(secs)(secs)   (real time) (usr+sys time)
> ramfs735238 60.00 12.37363.70 12253.05 1955.08
> for a 60.01s run time:
>1440.27s available CPU time
>  12.36s user time   (  0.86%)
> 363.70s system time ( 25.25%)
> 376.06s total time  ( 26.11%)
> load average: 10.79 4.47 1.69
> passed: 9: ramfs (9)
> failed: 0
> skipped: 0
> successful run completed in 60.01s (1 min, 0.01 secs)
> 
> 2) After applying this patchset:
> 
> setting to a 60 second run per stressor
> dispatching hogs: 9 ramfs
> stressor   bogo ops real time  usr time  sys time   bogo ops/s bogo ops/s
>   (secs)(secs)(secs)   (real time) (usr+sys time)
> ramfs746677 60.00 12.22367.75 12443.70 1965.13
> for a 60.01s run time:
>1440.26s available CPU time
>  12.21s user time   (  0.85%)
> 367.75s system time ( 25.53%)
> 379.96s total time  ( 26.38%)
> load average: 8.37 2.48 0.86
> passed: 9: ramfs (9)
> failed: 0
> skipped: 0
> successful run completed in 60.01s (1 min, 0.01 secs)
> 
> We can see that the ops/s has hardly changed.
> 
> [1]. https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomi...@virtuozzo.com/
> [2]. https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.a...@bytedance.com/
> [3]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie@intel.com/
> [4]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zh...@linux.dev/
> [5]. https://lore.kernel.org/lkml/zijhou1d55d4h...@dread.disaster.area/
> 
> Signed-off-by: Qi Zheng 
> ---
>  include/linux/shrinker.h | 19 +++---
>  mm/shrinker.c| 75 ++--
>  mm/shrinker_debug.c  | 52 +---
>  3 files changed, 104 insertions(+), 42 deletions(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 36977a70bebb..335da93cccee 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  #define SHRINKER_UNIT_BITS   BITS_PER_LONG
>  
> @@ -86,6 +87,10 @@ struct shrinker {
>   long batch; /* reclaim batch size, 0 = default */
>   int seeks;  /* seeks to recreate an obj */
>   unsigned flags;
> + bo

Re: vdpa: use io_uring passthrough command for IOCTLs [was Re: [PATCH 1/2] Reduce vdpa initialization / startup overhead]

2023-07-26 Thread Jason Wang
On Tue, Jul 18, 2023 at 6:32 PM Stefano Garzarella  wrote:
>
> On Thu, Apr 20, 2023 at 6:20 AM Jason Wang  wrote:
> >
> > On Wed, Apr 19, 2023 at 11:33 PM Eugenio Perez Martin
> >  wrote:
> > >
> > > On Wed, Apr 19, 2023 at 12:56 AM  wrote:
> > > >
> > > > From: Pei Li 
> > > >
> > > > Currently, part of the vdpa initialization / startup process
> > > > needs to trigger many ioctls per vq, which is very inefficient
> > > > and causes unnecessary context switches between user mode and
> > > > kernel mode.
> > > >
> > > > This patch creates an additional ioctl() command, namely
> > > > VHOST_VDPA_GET_VRING_GROUP_BATCH, that batches
> > > > commands of VHOST_VDPA_GET_VRING_GROUP into a single
> > > > ioctl() call.
> >
> > I'd expect there's a kernel patch but I didn't see that?
> >
> > If we want to go this way, why not simply have a more generic way, that is
> > introducing something like:
> >
> > VHOST_CMD_BATCH which did something like
> >
> > struct vhost_cmd_batch {
> > int ncmds;
> > struct vhost_ioctls[];
> > };
> >
> > Then you can batch other ioctls other than GET_VRING_GROUP?
> >
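As written, the `struct vhost_ioctls[];` member is shorthand; one possible concrete layout is sketched below. Purely illustrative: every name and field here is an assumption, not an accepted uAPI.

```c
/* Illustrative only: a batched vhost command carried by a single ioctl().
 * Userspace fills ncmds entries; the kernel walks them in order.
 */
struct vhost_cmd {
	__u32 cmd;		/* e.g. VHOST_VDPA_GET_VRING_GROUP */
	__u32 index;		/* vq index the command applies to */
	__u64 arg;		/* in/out argument, command-specific */
};

struct vhost_cmd_batch {
	__u32 ncmds;
	__u32 padding;		/* keep the flexible array 8-byte aligned */
	struct vhost_cmd cmds[];	/* ncmds entries follow */
};
```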
>
> Just restarting this discussion, since I recently worked more with
> io_uring passthrough commands and I think it can help here.
>
> The NVMe guys had a similar problem (ioctl too slow for their use
> case)[1][2], so they developed a new feature in io_uring that
> basically allows you to do IOCTLs asynchronously and in batches using
> io_uring.
>
> The same feature is also used by ublk [3] and I recently talked about
> this at DevConf with German [4].
>
> Basically, there's a new callback in fops (struct file_operations.uring_cmd).
> IIUC for NVMe (drivers/nvme/host/ioctl.c) they used exactly the same
> values used for IOCTLs also for the new uring_cmd callback.
>
> We could do the same. The changes in the vhost-vdpa kernel module
> should be simple, and we could share the code for handling ioctl and
> uring_cmd.
> That way any new command can be supported with both for compatibility.
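The shared-dispatch idea could be wired up roughly as below. A hedged sketch: `vhost_vdpa_do_cmd()` is a made-up helper standing in for the existing ioctl body, and the SQE layout chosen for the argument pointer is an assumption.

```c
/* Sketch: route the same command numbers through both the classic ioctl()
 * path and the io_uring passthrough path (file_operations.uring_cmd),
 * the way the NVMe driver shares its command handling.
 */
static long vhost_vdpa_unlocked_ioctl(struct file *filep,
				      unsigned int cmd, unsigned long arg)
{
	return vhost_vdpa_do_cmd(filep->private_data, cmd,
				 (void __user *)arg);
}

static int vhost_vdpa_uring_cmd(struct io_uring_cmd *ioucmd,
				unsigned int issue_flags)
{
	struct vhost_vdpa *v = ioucmd->file->private_data;

	/* same command IDs as the ioctl path, driven asynchronously */
	return vhost_vdpa_do_cmd(v, ioucmd->cmd_op,
				 u64_to_user_ptr(READ_ONCE(ioucmd->sqe->addr)));
}

static const struct file_operations vhost_vdpa_fops = {
	.owner		= THIS_MODULE,
	.unlocked_ioctl	= vhost_vdpa_unlocked_ioctl,
	.uring_cmd	= vhost_vdpa_uring_cmd,
};
```

With this shape, userspace can queue many such commands on one ring and reap them in batches, which is what removes the per-vq context-switch overhead discussed above.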
>
> In QEMU then we can start using it to optimize the control path.
>
> What do you think?

This looks interesting.

>
> If it's interesting, I could throw down an RFC with the changes or if
> anyone is interested in working on it, I can help with the details.

Please do that.

Thanks


>
> Thanks,
> Stefano
>
> [1] https://lpc.events/event/11/contributions/989/
> [2] https://lpc.events/event/16/contributions/1382/
> [3] https://lwn.net/Articles/903855/
> [4] https://www.youtube.com/watch?v=6JqNPirreoY
>


Re: [PATCH net-next v3 0/4] virtio/vsock: some updates for MSG_PEEK flag

2023-07-26 Thread Michael S. Tsirkin
On Tue, Jul 25, 2023 at 08:29:08PM +0300, Arseniy Krasnov wrote:
> Hello,
> 
> This patchset does several things around MSG_PEEK flag support. In
> general terms, it reworks the MSG_PEEK test and adds support for this flag
> in SOCK_SEQPACKET logic. Here is per-patch description:
> 
> 1) This is cosmetic change for SOCK_STREAM implementation of MSG_PEEK:
>1) I think there is no need for a "safe" mode walk here as there is no
>   "unlink" of skbs inside loop (it is MSG_PEEK mode - we don't change
>   queue).
>2) Nested while loop is removed: in case of MSG_PEEK we just walk
>   over skbs and copy data from each one. I guess this nested loop
>   didn't even behave as a loop - it always executed just a single
>   iteration.
> 
> 2) This adds MSG_PEEK support for SOCK_SEQPACKET. It could be implemented
>by reworking the MSG_PEEK callback for SOCK_STREAM to also support
>SOCK_SEQPACKET, but I think it is simpler and less prone to bugs to
>implement it as a separate function, thus not mixing the logic for
>both types of socket. So I've added it as a dedicated function.
> 
> 3) This is a reworked MSG_PEEK test for SOCK_STREAM. The previous version
>just sent a single byte, then tried to read it with the MSG_PEEK flag,
>then read it in the normal way. The new version is more complex: the
>sender now uses a buffer instead of a single byte, and this buffer is
>initialized with random values. The receiver tests several things:
>1) Read empty socket with MSG_PEEK flag.
>2) Read part of buffer with MSG_PEEK flag.
>3) Read whole buffer with MSG_PEEK flag, then checks that it is same
>   as buffer from 2) (limited by size of buffer from 2) of course).
>4) Read whole buffer without any flags, then checks that it is same
>   as buffer from 3).
> 
> 4) This is MSG_PEEK test for SOCK_SEQPACKET. It works in the same way
>as for SOCK_STREAM, except it also checks the combination of MSG_TRUNC
>and MSG_PEEK.

Acked-by: Michael S. Tsirkin 



> Head is:
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=a5a91f546444940f3d75e2edf3c53b4d235f0557
> 
> Link to v1:
> https://lore.kernel.org/netdev/20230618062451.79980-1-avkras...@sberdevices.ru/
> Link to v2:
> https://lore.kernel.org/netdev/20230719192708.1775162-1-avkras...@sberdevices.ru/
> 
> Changelog:
>  v1 -> v2:
>  * Patchset is rebased on the new HEAD of net-next.
>  * 0001: R-b tag added.
>  * 0003: check return value of 'send()' call. 
>  v2 -> v3:
>  * Patchset is rebased (and tested) on the new HEAD of net-next.
>  * 'RFC' tag is replaced with 'net-next'.
>  * Small refactoring in 0004:
>'__test_msg_peek_client()' -> 'test_msg_peek_client()'.
>'__test_msg_peek_server()' -> 'test_msg_peek_server()'.
> 
> Arseniy Krasnov (4):
>   virtio/vsock: rework MSG_PEEK for SOCK_STREAM
>   virtio/vsock: support MSG_PEEK for SOCK_SEQPACKET
>   vsock/test: rework MSG_PEEK test for SOCK_STREAM
>   vsock/test: MSG_PEEK test for SOCK_SEQPACKET
> 
>  net/vmw_vsock/virtio_transport_common.c | 104 +-
>  tools/testing/vsock/vsock_test.c| 136 ++--
>  2 files changed, 208 insertions(+), 32 deletions(-)
> 
> -- 
> 2.25.1



Re: [PATCH net-next v4 2/2] virtio-net: add cond_resched() to the command waiting loop

2023-07-26 Thread Michael S. Tsirkin
On Wed, Jul 26, 2023 at 09:55:37AM +0800, Jason Wang wrote:
> On Tue, Jul 25, 2023 at 3:36 PM Michael S. Tsirkin  wrote:
> >
> > On Tue, Jul 25, 2023 at 11:07:40AM +0800, Jason Wang wrote:
> > > On Mon, Jul 24, 2023 at 3:17 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Mon, Jul 24, 2023 at 02:52:05PM +0800, Jason Wang wrote:
> > > > > On Sat, Jul 22, 2023 at 4:18 AM Maxime Coquelin
> > > > >  wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 7/21/23 17:10, Michael S. Tsirkin wrote:
> > > > > > > On Fri, Jul 21, 2023 at 04:58:04PM +0200, Maxime Coquelin wrote:
> > > > > > >>
> > > > > > >>
> > > > > > >> On 7/21/23 16:45, Michael S. Tsirkin wrote:
> > > > > > >>> On Fri, Jul 21, 2023 at 04:37:00PM +0200, Maxime Coquelin wrote:
> > > > > > 
> > > > > > 
> > > > > >  On 7/20/23 23:02, Michael S. Tsirkin wrote:
> > > > > > > On Thu, Jul 20, 2023 at 01:26:20PM -0700, Shannon Nelson 
> > > > > > > wrote:
> > > > > > >> On 7/20/23 1:38 AM, Jason Wang wrote:
> > > > > > >>>
> > > > > > >>> Adding cond_resched() to the command waiting loop for a 
> > > > > > >>> better
> > > > > > >>> co-operation with the scheduler. This allows to give CPU a 
> > > > > > >>> breath to
> > > > > > >>> run other task(workqueue) instead of busy looping when 
> > > > > > >>> preemption is
> > > > > > >>> not allowed on a device whose CVQ might be slow.
> > > > > > >>>
> > > > > > >>> Signed-off-by: Jason Wang 
> > > > > > >>
> > > > > > >> This still leaves hung processes, but at least it doesn't 
> > > > > > >> pin the CPU any
> > > > > > >> more.  Thanks.
> > > > > > >> Reviewed-by: Shannon Nelson 
> > > > > > >>
> > > > > > >
> > > > > > > I'd like to see a full solution
> > > > > > > 1- block until interrupt
> > > > >
> > > > > I remember in previous versions, you worried about the extra MSI
> > > > > vector. (Maybe I was wrong).
> > > > >
> > > > > > 
> > > > > >  Would it make sense to also have a timeout?
> > > > > >  And when timeout expires, set FAILED bit in device status?
> > > > > > >>>
> > > > > > >>> virtio spec does not set any limits on the timing of vq
> > > > > > >>> processing.
> > > > > > >>
> > > > > > >> Indeed, but I thought the driver could decide it is too long for 
> > > > > > >> it.
> > > > > > >>
> > > > > > >> The issue is we keep waiting with rtnl locked, it can quickly 
> > > > > > >> make the
> > > > > > >> system unusable.
> > > > > > >
> > > > > > > if this is a problem we should find a way not to keep rtnl
> > > > > > > locked indefinitely.
> > > > >
> > > > > Any ideas on this direction? Simply dropping rtnl during the busy loop
> > > > > will result in a lot of races. This seems to require non-trivial
> > > > > changes in the networking core.
> > > > >
> > > > > >
> > > > > >  From the tests I have done, I think it is. With OVS, a 
> > > > > > reconfiguration
> > > > > > is performed when the VDUSE device is added, and when a MLX5 device 
> > > > > > is
> > > > > > in the same bridge, it ends up doing an ioctl() that tries to take 
> > > > > > the
> > > > > > rtnl lock. In this configuration, it is not possible to kill OVS 
> > > > > > because
> > > > > > it is stuck trying to acquire rtnl lock for mlx5 that is held by 
> > > > > > virtio-
> > > > > > net.
> > > > >
> > > > > Yeah, basically, any RTNL users would be blocked forever.
> > > > >
> > > > > And the infinite loop has other side effects, like blocking the
> > > > > freezer from working.
> > > > >
> > > > > To summarize, there are three issues
> > > > >
> > > > > 1) busy polling
> > > > > 2) breaks freezer
> > > > > 3) hold RTNL during the loop
> > > > >
> > > > > Solving 3 may help somehow for 2 e.g some pm routine e.g wireguard or
> > > > > even virtnet_restore() itself may try to hold the lock.
> > > >
> > > > Yep. So my feeling currently is, the only real fix is to actually
> > > > queue up the work in software.
> > >
> > > Do you mean something like:
> > >
> > > rtnl_lock();
> > > queue up the work
> > > rtnl_unlock();
> > > return success;
> > >
> > > ?
> >
> > yes
> 
> We will lose the error reporting; is that a real problem or not?

Fundamental, isn't it? Maybe we want a per-device flag for async commands,
and vduse will set it while hardware virtio won't.
This way we only lose error reporting for vduse.
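One hedged sketch of what "queue up the work" could look like is below. All names except virtnet_send_command() are invented for illustration, and as discussed, the caller reports success before the device has actually completed the command.

```c
/* Hypothetical sketch only: defer the cvq command to a workqueue so the
 * caller can drop rtnl immediately instead of busy-polling under it.
 */
struct virtnet_cvq_req {
	struct work_struct work;
	struct virtnet_info *vi;
	u8 class, cmd;
	/* a snapshot of the command payload would live here; for rx mode
	 * and vids, a later request simply overwrites an earlier one
	 */
};

static void virtnet_cvq_work_fn(struct work_struct *work)
{
	struct virtnet_cvq_req *req =
		container_of(work, struct virtnet_cvq_req, work);

	/* wait for the device here, outside rtnl, freezer-friendly */
	virtnet_send_command(req->vi, req->class, req->cmd, NULL);
	kfree(req);
}

static int virtnet_queue_command(struct virtnet_info *vi, u8 class, u8 cmd)
{
	struct virtnet_cvq_req *req = kzalloc(sizeof(*req), GFP_KERNEL);

	if (!req)
		return -ENOMEM;
	INIT_WORK(&req->work, virtnet_cvq_work_fn);
	req->vi = vi;
	req->class = class;
	req->cmd = cmd;
	schedule_work(&req->work);
	return 0;	/* "success" is reported before completion */
}
```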

> >
> >
> > > > It's mostly trivial to limit
> > > > memory consumption, vid's is the
> > > > only one where it would make sense to have more than
> > > > 1 command of a given type outstanding.
> > >
> > > And rx mode so this implies we will fail any command if the previous
> > > work is not finished.
> >
> > don't fail it, store it.
> 
> Ok.
> 
> Thanks
> 
> >
> > > > have a tree
> > > > or a bitmap with vids to add/remove?
> > >
> > > Probably.
> > >
> > > Thanks
> > >
> > > >
> > > >
> > > >
> > > > > >
> > > > > > >
> > > > > > > 2- still handle surprise removal correctly by waking in that 
> > > > 

Re: [PATCH v1] vdpa: Complement vdpa_nl_policy for nlattr length check

2023-07-26 Thread Michael S. Tsirkin
On Tue, Jul 25, 2023 at 08:26:32AM +, Dragos Tatulea wrote:
> On Mon, 2023-07-24 at 16:08 -0400, Michael S. Tsirkin wrote:
> > On Mon, Jul 24, 2023 at 11:42:42AM +, Dragos Tatulea wrote:
> > > On Mon, 2023-07-24 at 05:16 -0400, Michael S. Tsirkin wrote:
> > > > On Mon, Jul 24, 2023 at 08:38:04AM +, Dragos Tatulea wrote:
> > > > > 
> > > > > On Mon, 2023-07-24 at 15:11 +0800, Jason Wang wrote:
> > > > > > On Sun, Jul 23, 2023 at 6:02 PM Michael S. Tsirkin 
> > > > > > wrote:
> > > > > > > 
> > > > > > > On Sun, Jul 23, 2023 at 05:48:46PM +0800, Lin Ma wrote:
> > > > > > > > 
> > > > > > > > > Sure, that is another ongoing task I'm working on. If the
> > > > > > > > > nlattr is parsed with NL_VALIDATE_UNSPEC (which is the
> > > > > > > > > default for modern nla_parse), any forgotten nlattr will
> > > > > > > > > be rejected.
> > > > > > > > 
> > > > > > > > For the general netlink interface, the deciding flag should be
> > > > > > > > genl_ops.validate defined in
> > > > > > > > each ops. The default validate flag is strict, while the 
> > > > > > > > developer
> > > > > > > > can
> > > > > > > > overwrite the flag
> > > > > > > > with GENL_DONT_VALIDATE_STRICT to ease the validation. That is 
> > > > > > > > to
> > > > > > > > say,
> > > > > > > > safer code should
> > > > > > > > enforce NL_VALIDATE_STRICT by not overwriting the validate flag.
> > > > > > > > 
> > > > > > > > Regrads
> > > > > > > > Lin
> > > > > > > 
> > > > > > > 
> > > > > > > Oh I see.
> > > > > > > 
> > > > > > > It started here:
> > > > > > > 
> > > > > > > commit 33b347503f014ebf76257327cbc7001c6b721956
> > > > > > > Author: Parav Pandit 
> > > > > > > Date:   Tue Jan 5 12:32:00 2021 +0200
> > > > > > > 
> > > > > > >     vdpa: Define vdpa mgmt device, ops and a netlink interface
> > > > > > > 
> > > > > > > which did:
> > > > > > > 
> > > > > > > +   .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
> > > > > > > 
> > > > > > > 
> > > > > > > which was most likely just a copy paste from somewhere, right 
> > > > > > > Parav?
> > > > > > > 
> > > > > > > and then everyone kept copying this around.
> > > > > > > 
> > > > > > > Parav, Eli can we drop these? There's a tiny chance of breaking
> > > > > > > something
> > > > > > > but I feel there aren't that many users outside mlx5 yet, so if 
> > > > > > > you
> > > > > > > guys can test on mlx5 and confirm no breakage, I think we are 
> > > > > > > good.
> > > > > > 
> > > > > > Adding Dragos.
> > > > > > 
> > > > > I will check. Just to make sure I understand correctly: you want me
> > > > > to drop the .validate flags altogether in all vdpa ops and check,
> > > > > right?
> > > > > 
> > > > > Thanks,
> > > > > Dragos
> > > > 
> > > > yes - I suspect you will then need this patch to make things work.
> > > > 
> > > Yep. Adding the patch and removing the .validate config on the vdpa_nl_ops
> > > seems to work just fine.
> > > 
> > > Thanks,
> > > Dragos
> > 
> > OK, post a patch?
> > 
> Sure, but how do I make it depend on this patch? Otherwise it will break 
> things.
> 
> Thanks,
> Dragos

Send a patch series with this as patch 1/2 and that one as 2/2.



[PATCH 0/2] vdpa: Enable strict validation for netlink ops

2023-07-26 Thread Dragos Tatulea via Virtualization
The original patch from Lin Ma adds the missing nla policies that enable
the vdpa driver to validate netlink ops.

The second patch simply disables the validation skip, which is no longer
necessary. The patchset started off from this discussion [0].

[0] https://lore.kernel.org/virtualization/20230726074710-mutt-send-email-...@kernel.org/T/#t

Dragos Tatulea (1):
  vdpa: Enable strict validation for netlink ops

Lin Ma (1):
  vdpa: Complement vdpa_nl_policy for nlattr length check

 drivers/vdpa/vdpa.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

-- 
2.41.0



[PATCH 1/2] vdpa: Complement vdpa_nl_policy for nlattr length check

2023-07-26 Thread Dragos Tatulea via Virtualization
From: Lin Ma 

The vdpa_nl_policy structure is used to validate the nlattr when parsing
the incoming nlmsg. It will ensure the attribute being described produces
a valid nlattr pointer in info->attrs before entering into each handler
in vdpa_nl_ops.

That is to say, entries missing from vdpa_nl_policy may lead to an illegal
nlattr after parsing, which could lead to an OOB read, just like CVE-2023-3773.

This patch adds three missing nla_policy entries to avoid such bugs.

Fixes: 90fea5a800c3 ("vdpa: device feature provisioning")
Fixes: 13b00b135665 ("vdpa: Add support for querying vendor statistics")
Fixes: ad69dd0bf26b ("vdpa: Introduce query of device config layout")
Signed-off-by: Lin Ma 
---
 drivers/vdpa/vdpa.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index 965e32529eb8..f2f654fd84e5 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -1247,8 +1247,11 @@ static const struct nla_policy vdpa_nl_policy[VDPA_ATTR_MAX + 1] = {
[VDPA_ATTR_MGMTDEV_DEV_NAME] = { .type = NLA_STRING },
[VDPA_ATTR_DEV_NAME] = { .type = NLA_STRING },
[VDPA_ATTR_DEV_NET_CFG_MACADDR] = NLA_POLICY_ETH_ADDR,
+   [VDPA_ATTR_DEV_NET_CFG_MAX_VQP] = { .type = NLA_U16 },
/* virtio spec 1.1 section 5.1.4.1 for valid MTU range */
[VDPA_ATTR_DEV_NET_CFG_MTU] = NLA_POLICY_MIN(NLA_U16, 68),
+   [VDPA_ATTR_DEV_QUEUE_INDEX] = { .type = NLA_U32 },
+   [VDPA_ATTR_DEV_FEATURES] = { .type = NLA_U64 },
 };
 
 static const struct genl_ops vdpa_nl_ops[] = {
-- 
2.41.0



[PATCH 2/2] vdpa: Enable strict validation for netlink ops

2023-07-26 Thread Dragos Tatulea via Virtualization
The previous patch added the missing nla policies that were required for
validation to work.

Now strict validation on netlink ops can be enabled. This patch does it.

Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/vdpa.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index f2f654fd84e5..a7612e0783b3 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -1257,37 +1257,31 @@ static const struct nla_policy vdpa_nl_policy[VDPA_ATTR_MAX + 1] = {
 static const struct genl_ops vdpa_nl_ops[] = {
{
.cmd = VDPA_CMD_MGMTDEV_GET,
-   .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = vdpa_nl_cmd_mgmtdev_get_doit,
.dumpit = vdpa_nl_cmd_mgmtdev_get_dumpit,
},
{
.cmd = VDPA_CMD_DEV_NEW,
-   .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = vdpa_nl_cmd_dev_add_set_doit,
.flags = GENL_ADMIN_PERM,
},
{
.cmd = VDPA_CMD_DEV_DEL,
-   .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = vdpa_nl_cmd_dev_del_set_doit,
.flags = GENL_ADMIN_PERM,
},
{
.cmd = VDPA_CMD_DEV_GET,
-   .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = vdpa_nl_cmd_dev_get_doit,
.dumpit = vdpa_nl_cmd_dev_get_dumpit,
},
{
.cmd = VDPA_CMD_DEV_CONFIG_GET,
-   .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = vdpa_nl_cmd_dev_config_get_doit,
.dumpit = vdpa_nl_cmd_dev_config_get_dumpit,
},
{
.cmd = VDPA_CMD_DEV_VSTATS_GET,
-   .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = vdpa_nl_cmd_dev_stats_get_doit,
.flags = GENL_ADMIN_PERM,
},
-- 
2.41.0



Re: [PATCH 0/2] vdpa: Enable strict validation for netlink ops

2023-07-26 Thread Michael S. Tsirkin
On Wed, Jul 26, 2023 at 09:30:48PM +0300, Dragos Tatulea wrote:
> The original patch from Lin Ma adds the missing nla policies that enable
> the vdpa driver to validate netlink ops.
> 
> The second patch simply disables the validation skip, which is no longer
> necessary. The patchset started off from this discussion [0].
> 
> [0] https://lore.kernel.org/virtualization/20230726074710-mutt-send-email-...@kernel.org/T/#t

Cc stable with at least 1/2 ?

> Dragos Tatulea (1):
>   vdpa: Enable strict validation for netlink ops
> 
> Lin Ma (1):
>   vdpa: Complement vdpa_nl_policy for nlattr length check
> 
>  drivers/vdpa/vdpa.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> -- 
> 2.41.0



Re: [PATCH RFC net-next v5 10/14] virtio/vsock: add VIRTIO_VSOCK_F_DGRAM feature bit

2023-07-26 Thread Michael S. Tsirkin
On Wed, Jul 19, 2023 at 12:50:14AM +, Bobby Eshleman wrote:
> This commit adds a feature bit for virtio vsock to support datagrams.
> 
> Signed-off-by: Jiang Wang 
> Signed-off-by: Bobby Eshleman 
> ---
>  include/uapi/linux/virtio_vsock.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> index 331be28b1d30..27b4b2b8bf13 100644
> --- a/include/uapi/linux/virtio_vsock.h
> +++ b/include/uapi/linux/virtio_vsock.h
> @@ -40,6 +40,7 @@
>  
>  /* The feature bitmap for virtio vsock */
>  #define VIRTIO_VSOCK_F_SEQPACKET 1   /* SOCK_SEQPACKET supported */
> +#define VIRTIO_VSOCK_F_DGRAM 3   /* SOCK_DGRAM supported */
>  
>  struct virtio_vsock_config {
>   __le64 guest_cid;

pls do not add interface without first getting it accepted in the
virtio spec.

> -- 
> 2.30.2



Re: [PATCH RFC net-next v5 11/14] vhost/vsock: implement datagram support

2023-07-26 Thread Michael S. Tsirkin
On Wed, Jul 19, 2023 at 12:50:15AM +, Bobby Eshleman wrote:
> This commit implements datagram support for vhost/vsock by teaching
> vhost to use the common virtio transport datagram functions.
> 
> If the virtio RX buffer is too small, then the transmission is
> abandoned, the packet dropped, and EHOSTUNREACH is added to the socket's
> error queue.
> 
> Signed-off-by: Bobby Eshleman 

EHOSTUNREACH?


> ---
>  drivers/vhost/vsock.c| 62 
> +---
>  net/vmw_vsock/af_vsock.c |  5 +++-
>  2 files changed, 63 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index d5d6a3c3f273..da14260c6654 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -8,6 +8,7 @@
>   */
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -32,7 +33,8 @@
>  enum {
>   VHOST_VSOCK_FEATURES = VHOST_FEATURES |
>  (1ULL << VIRTIO_F_ACCESS_PLATFORM) |
> -(1ULL << VIRTIO_VSOCK_F_SEQPACKET)
> +(1ULL << VIRTIO_VSOCK_F_SEQPACKET) |
> +(1ULL << VIRTIO_VSOCK_F_DGRAM)
>  };
>  
>  enum {
> @@ -56,6 +58,7 @@ struct vhost_vsock {
>   atomic_t queued_replies;
>  
>   u32 guest_cid;
> + bool dgram_allow;
>   bool seqpacket_allow;
>  };
>  
> @@ -86,6 +89,32 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>   return NULL;
>  }
>  
> +/* Claims ownership of the skb, do not free the skb after calling! */
> +static void
> +vhost_transport_error(struct sk_buff *skb, int err)
> +{
> + struct sock_exterr_skb *serr;
> + struct sock *sk = skb->sk;
> + struct sk_buff *clone;
> +
> + serr = SKB_EXT_ERR(skb);
> + memset(serr, 0, sizeof(*serr));
> + serr->ee.ee_errno = err;
> + serr->ee.ee_origin = SO_EE_ORIGIN_NONE;
> +
> + clone = skb_clone(skb, GFP_KERNEL);
> + if (!clone)
> + return;
> +
> + if (sock_queue_err_skb(sk, clone))
> + kfree_skb(clone);
> +
> + sk->sk_err = err;
> + sk_error_report(sk);
> +
> + kfree_skb(skb);
> +}
> +
>  static void
>  vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>   struct vhost_virtqueue *vq)
> @@ -160,9 +189,15 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>   hdr = virtio_vsock_hdr(skb);
>  
>   /* If the packet is greater than the space available in the
> -  * buffer, we split it using multiple buffers.
> +  * buffer, we split it using multiple buffers for connectible
> +  * sockets and drop the packet for datagram sockets.
>*/

won't this break things like recently proposed zerocopy?
I think splitup has to be supported for all types.


>   if (payload_len > iov_len - sizeof(*hdr)) {
> + if (le16_to_cpu(hdr->type) == VIRTIO_VSOCK_TYPE_DGRAM) {
> + vhost_transport_error(skb, EHOSTUNREACH);
> + continue;
> + }
> +
>   payload_len = iov_len - sizeof(*hdr);
>  
>   /* As we are copying pieces of large packet's buffer to
> @@ -394,6 +429,7 @@ static bool vhost_vsock_more_replies(struct vhost_vsock 
> *vsock)
>   return val < vq->num;
>  }
>  
> +static bool vhost_transport_dgram_allow(u32 cid, u32 port);
>  static bool vhost_transport_seqpacket_allow(u32 remote_cid);
>  
>  static struct virtio_transport vhost_transport = {
> @@ -410,7 +446,8 @@ static struct virtio_transport vhost_transport = {
>   .cancel_pkt   = vhost_transport_cancel_pkt,
>  
>   .dgram_enqueue= virtio_transport_dgram_enqueue,
> - .dgram_allow  = virtio_transport_dgram_allow,
> + .dgram_allow  = vhost_transport_dgram_allow,
> + .dgram_addr_init  = virtio_transport_dgram_addr_init,
>  
>   .stream_enqueue   = virtio_transport_stream_enqueue,
>   .stream_dequeue   = virtio_transport_stream_dequeue,
> @@ -443,6 +480,22 @@ static struct virtio_transport vhost_transport = {
>   .send_pkt = vhost_transport_send_pkt,
>  };
>  
> +static bool vhost_transport_dgram_allow(u32 cid, u32 port)
> +{
> + struct vhost_vsock *vsock;
> + bool dgram_allow = false;
> +
> + rcu_read_lock();
> + vsock = vhost_vsock_get(cid);
> +
> + if (vsock)
> + dgram_allow = vsock->dgram_allow;
> +
> + rcu_read_unlock();
> +
> + return dgram_allow;
> +}
> +
>  static bool vhost_transport_seqpacket_allow(u32 remote_cid)
>  {
>   struct vhost_vsock *vsock;
> @@ -799,6 +852,9 @@ static int vhost_vsock_set_features(struct vhost_vsock 
> *vsock, u64 features)
>   if (features & (1ULL << VIRTIO_VSOCK_F_SEQPACKET))
>   vsock->seqpacket_allow = true;
>  
> +

[PATCH v2 0/2] vdpa: Enable strict validation for netlink ops

2023-07-26 Thread Dragos Tatulea via Virtualization
The original patch from Lin Ma enables the vdpa driver to use validation
for netlink ops.

The second patch simply disables the validation skip, which is no longer
necessary. The patchset started off from this discussion [0].

[0] 
https://lore.kernel.org/virtualization/20230726074710-mutt-send-email-...@kernel.org/T/#t

v2: cc'ed stable

Dragos Tatulea (1):
  vdpa: Enable strict validation for netlinks ops

Lin Ma (1):
  vdpa: Complement vdpa_nl_policy for nlattr length check

 drivers/vdpa/vdpa.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

-- 
2.41.0



[PATCH 1/2] vdpa: Complement vdpa_nl_policy for nlattr length check

2023-07-26 Thread Dragos Tatulea via Virtualization
Author: Lin Ma 
The vdpa_nl_policy structure is used to validate each nlattr when parsing
an incoming nlmsg. It ensures that every attribute it describes produces a
valid nlattr pointer in info->attrs before control enters a handler in
vdpa_nl_ops.

Conversely, any attribute missing from vdpa_nl_policy is left unvalidated
after parsing, which can lead to an out-of-bounds read, just like
CVE-2023-3773.

This patch adds the three missing nla_policy entries to avoid such bugs.

Fixes: 90fea5a800c3 ("vdpa: device feature provisioning")
Fixes: 13b00b135665 ("vdpa: Add support for querying vendor statistics")
Fixes: ad69dd0bf26b ("vdpa: Introduce query of device config layout")
Signed-off-by: Lin Ma 
---
 drivers/vdpa/vdpa.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index 965e32529eb8..f2f654fd84e5 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -1247,8 +1247,11 @@ static const struct nla_policy 
vdpa_nl_policy[VDPA_ATTR_MAX + 1] = {
[VDPA_ATTR_MGMTDEV_DEV_NAME] = { .type = NLA_STRING },
[VDPA_ATTR_DEV_NAME] = { .type = NLA_STRING },
[VDPA_ATTR_DEV_NET_CFG_MACADDR] = NLA_POLICY_ETH_ADDR,
+   [VDPA_ATTR_DEV_NET_CFG_MAX_VQP] = { .type = NLA_U16 },
/* virtio spec 1.1 section 5.1.4.1 for valid MTU range */
[VDPA_ATTR_DEV_NET_CFG_MTU] = NLA_POLICY_MIN(NLA_U16, 68),
+   [VDPA_ATTR_DEV_QUEUE_INDEX] = { .type = NLA_U32 },
+   [VDPA_ATTR_DEV_FEATURES] = { .type = NLA_U64 },
 };
 
 static const struct genl_ops vdpa_nl_ops[] = {
-- 
2.41.0



[PATCH 2/2] vdpa: Enable strict validation for netlinks ops

2023-07-26 Thread Dragos Tatulea via Virtualization
The previous patch added the missing nla policies that are required for
validation to work.

Strict validation can now be enabled for the netlink ops; this patch does so.

Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/vdpa.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
index f2f654fd84e5..a7612e0783b3 100644
--- a/drivers/vdpa/vdpa.c
+++ b/drivers/vdpa/vdpa.c
@@ -1257,37 +1257,31 @@ static const struct nla_policy 
vdpa_nl_policy[VDPA_ATTR_MAX + 1] = {
 static const struct genl_ops vdpa_nl_ops[] = {
{
.cmd = VDPA_CMD_MGMTDEV_GET,
-   .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = vdpa_nl_cmd_mgmtdev_get_doit,
.dumpit = vdpa_nl_cmd_mgmtdev_get_dumpit,
},
{
.cmd = VDPA_CMD_DEV_NEW,
-   .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = vdpa_nl_cmd_dev_add_set_doit,
.flags = GENL_ADMIN_PERM,
},
{
.cmd = VDPA_CMD_DEV_DEL,
-   .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = vdpa_nl_cmd_dev_del_set_doit,
.flags = GENL_ADMIN_PERM,
},
{
.cmd = VDPA_CMD_DEV_GET,
-   .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = vdpa_nl_cmd_dev_get_doit,
.dumpit = vdpa_nl_cmd_dev_get_dumpit,
},
{
.cmd = VDPA_CMD_DEV_CONFIG_GET,
-   .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = vdpa_nl_cmd_dev_config_get_doit,
.dumpit = vdpa_nl_cmd_dev_config_get_dumpit,
},
{
.cmd = VDPA_CMD_DEV_VSTATS_GET,
-   .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
.doit = vdpa_nl_cmd_dev_stats_get_doit,
.flags = GENL_ADMIN_PERM,
},
-- 
2.41.0



Re: [PATCH 0/2] vdpa: Enable strict validation for netlink ops

2023-07-26 Thread Dragos Tatulea via Virtualization
On Wed, 2023-07-26 at 14:36 -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 26, 2023 at 09:30:48PM +0300, Dragos Tatulea wrote:
> > The original patch from Lin Ma enables the vdpa driver to use validation
> > for netlink ops.
> > 
> > The second patch simply disables the validation skip, which is no longer
> > necessary. The patchset started off from this discussion [0].
> > 
> > [0]
> > https://lore.kernel.org/virtualization/20230726074710-mutt-send-email-...@kernel.org/T/#t
> 
> Cc stable with at least 1/2 ?
> 
Sent a v2 with stable in cc. But looks like 1/2 breaks the "fix one thing only"
rule due to the many Fixes tags I guess...

> > Dragos Tatulea (1):
> >   vdpa: Enable strict validation for netlinks ops
> > 
> > Lin Ma (1):
> >   vdpa: Complement vdpa_nl_policy for nlattr length check
> > 
> >  drivers/vdpa/vdpa.c | 9 +++--
> >  1 file changed, 3 insertions(+), 6 deletions(-)
> > 
> > -- 
> > 2.41.0
> 


[PATCH] vdpa/mlx5: Fix crash on shutdown for when no ndev exists

2023-07-26 Thread Dragos Tatulea via Virtualization
On shutdown, ndev was accessed without checking whether it actually
exists, which triggered the crash pasted below. This patch simply adds a
check before using ndev.

 BUG: kernel NULL pointer dereference, address: 0300
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x) - not-present page
 PGD 0 P4D 0
 Oops:  [#1] SMP
 CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 
6.5.0-rc2_for_upstream_min_debug_2023_07_17_15_05 #1
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa]
 RSP: 0018:8881003bfdc0 EFLAGS: 00010286
 RAX: 888103befba0 RBX: 888109d28008 RCX: 0017
 RDX: 0001 RSI: 0212 RDI: 888109d28000
 RBP:  R08: 000d3a3a3882 R09: 0001
 R10:  R11:  R12: 888109d28000
 R13: 888109d28080 R14: fee1dead R15: 
 FS:  7f4969e0be40() GS:88852c80() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0300 CR3: 0001051cd006 CR4: 00370eb0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 Call Trace:
  
  ? __die+0x20/0x60
  ? page_fault_oops+0x14c/0x3c0
  ? exc_page_fault+0x75/0x140
  ? asm_exc_page_fault+0x22/0x30
  ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa]
  device_shutdown+0x13e/0x1e0
  kernel_restart+0x36/0x90
  __do_sys_reboot+0x141/0x210
  ? vfs_writev+0xcd/0x140
  ? handle_mm_fault+0x161/0x260
  ? do_writev+0x6b/0x110
  do_syscall_64+0x3d/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0
 RIP: 0033:0x7f496990fb56
 RSP: 002b:7fffc7bdde88 EFLAGS: 0206 ORIG_RAX: 00a9
 RAX: ffda RBX:  RCX: 7f496990fb56
 RDX: 01234567 RSI: 28121969 RDI: fee1dead
 RBP: 7fffc7bde1d0 R08:  R09: 
 R10:  R11: 0206 R12: 
 R13: 7fffc7bddf10 R14:  R15: 7fffc7bde2b8
  
 CR2: 0300
 ---[ end trace  ]---

Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing")
Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 9138ef2fb2c8..e2e7ebd71798 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct auxiliary_device 
*auxdev)
mgtdev = auxiliary_get_drvdata(auxdev);
ndev = mgtdev->ndev;
 
-   free_irqs(ndev);
+   if (ndev)
+   free_irqs(ndev);
 }
 
 static const struct auxiliary_device_id mlx5v_id_table[] = {
-- 
2.41.0



[PATCH] virtio-vdpa: Fix cpumask memory leak in virtio_vdpa_find_vqs()

2023-07-26 Thread Dragos Tatulea via Virtualization
From: Gal Pressman 

Free the cpumask allocated by create_affinity_masks() before returning
from the function.

Fixes: 3dad56823b53 ("virtio-vdpa: Support interrupt affinity spreading 
mechanism")
Signed-off-by: Gal Pressman 
Reviewed-by: Dragos Tatulea 
---
 drivers/virtio/virtio_vdpa.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 989e2d7184ce..961161da5900 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -393,11 +393,13 @@ static int virtio_vdpa_find_vqs(struct virtio_device 
*vdev, unsigned int nvqs,
cb.callback = virtio_vdpa_config_cb;
cb.private = vd_dev;
ops->set_config_cb(vdpa, &cb);
+   kfree(masks);
 
return 0;
 
 err_setup_vq:
virtio_vdpa_del_vqs(vdev);
+   kfree(masks);
return err;
 }
 
-- 
2.41.0



Re: [PATCH 0/2] vdpa: Enable strict validation for netlink ops

2023-07-26 Thread Dragos Tatulea via Virtualization
On Wed, 2023-07-26 at 20:56 +0200, Dragos Tatulea wrote:
> On Wed, 2023-07-26 at 14:36 -0400, Michael S. Tsirkin wrote:
> > On Wed, Jul 26, 2023 at 09:30:48PM +0300, Dragos Tatulea wrote:
> > > The original patch from Lin Ma enables the vdpa driver to use validation
> > > for netlink ops.
> > > 
> > > The second patch simply disables the validation skip, which is no longer
> > > necessary. The patchset started off from this discussion [0].
> > > 
> > > [0]
> > > https://lore.kernel.org/virtualization/20230726074710-mutt-send-email-...@kernel.org/T/#t
> > 
> > Cc stable with at least 1/2 ?
> > 
> Sent a v2 with stable in cc. But looks like 1/2 breaks the "fix one thing
> only"
> rule due to the many Fixes tags I guess...
> 
Or my lack of understanding: I only now realize that "Cc: stable" is a tag in
the patch. My bad. Will re-send.

> > > Dragos Tatulea (1):
> > >   vdpa: Enable strict validation for netlinks ops
> > > 
> > > Lin Ma (1):
> > >   vdpa: Complement vdpa_nl_policy for nlattr length check
> > > 
> > >  drivers/vdpa/vdpa.c | 9 +++--
> > >  1 file changed, 3 insertions(+), 6 deletions(-)
> > > 
> > > -- 
> > > 2.41.0
> > 
> 


Re: [PATCH] vdpa/mlx5: Fix crash on shutdown for when no ndev exists

2023-07-26 Thread Michael S. Tsirkin
On Wed, Jul 26, 2023 at 10:07:38PM +0300, Dragos Tatulea wrote:
> On shutdown, ndev was accessed without checking whether it actually
> exists, which triggered the crash pasted below. This patch simply adds a
> check before using ndev.
> 
>  BUG: kernel NULL pointer dereference, address: 0300
>  #PF: supervisor read access in kernel mode
>  #PF: error_code(0x) - not-present page
>  PGD 0 P4D 0
>  Oops:  [#1] SMP
>  CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 
> 6.5.0-rc2_for_upstream_min_debug_2023_07_17_15_05 #1
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
>  RIP: 0010:mlx5v_shutdown+0xe/0x50 [mlx5_vdpa]
>  RSP: 0018:8881003bfdc0 EFLAGS: 00010286
>  RAX: 888103befba0 RBX: 888109d28008 RCX: 0017
>  RDX: 0001 RSI: 0212 RDI: 888109d28000
>  RBP:  R08: 000d3a3a3882 R09: 0001
>  R10:  R11:  R12: 888109d28000
>  R13: 888109d28080 R14: fee1dead R15: 
>  FS:  7f4969e0be40() GS:88852c80() knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2: 0300 CR3: 0001051cd006 CR4: 00370eb0
>  DR0:  DR1:  DR2: 
>  DR3:  DR6: fffe0ff0 DR7: 0400
>  Call Trace:
>   
>   ? __die+0x20/0x60
>   ? page_fault_oops+0x14c/0x3c0
>   ? exc_page_fault+0x75/0x140
>   ? asm_exc_page_fault+0x22/0x30
>   ? mlx5v_shutdown+0xe/0x50 [mlx5_vdpa]
>   device_shutdown+0x13e/0x1e0
>   kernel_restart+0x36/0x90
>   __do_sys_reboot+0x141/0x210
>   ? vfs_writev+0xcd/0x140
>   ? handle_mm_fault+0x161/0x260
>   ? do_writev+0x6b/0x110
>   do_syscall_64+0x3d/0x90
>   entry_SYSCALL_64_after_hwframe+0x46/0xb0
>  RIP: 0033:0x7f496990fb56
>  RSP: 002b:7fffc7bdde88 EFLAGS: 0206 ORIG_RAX: 00a9
>  RAX: ffda RBX:  RCX: 7f496990fb56
>  RDX: 01234567 RSI: 28121969 RDI: fee1dead
>  RBP: 7fffc7bde1d0 R08:  R09: 
>  R10:  R11: 0206 R12: 
>  R13: 7fffc7bddf10 R14:  R15: 7fffc7bde2b8
>   
>  CR2: 0300
>  ---[ end trace  ]---
> 
> Fixes: bc9a2b3e686e ("vdpa/mlx5: Support interrupt bypassing")
> Signed-off-by: Dragos Tatulea 
> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 9138ef2fb2c8..e2e7ebd71798 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -3556,7 +3556,8 @@ static void mlx5v_shutdown(struct auxiliary_device 
> *auxdev)
>   mgtdev = auxiliary_get_drvdata(auxdev);
>   ndev = mgtdev->ndev;
>  
> - free_irqs(ndev);
> + if (ndev)
> + free_irqs(ndev);
>  }
>  

something I don't get:
irqs are allocated in mlx5_vdpa_dev_add
why are they not freed in mlx5_vdpa_dev_del?

this is what's creating all this mess.



>  static const struct auxiliary_device_id mlx5v_id_table[] = {
> -- 
> 2.41.0



Re: [PATCH 2/2] vdpa: Enable strict validation for netlinks ops

2023-07-26 Thread Greg KH
On Wed, Jul 26, 2023 at 09:49:44PM +0300, Dragos Tatulea wrote:
> The previous patch added the missing nla policies that were required for
> validation to work.
> 
> Now strict validation on netlink ops can be enabled. This patch does it.
> 
> Signed-off-by: Dragos Tatulea 
> ---
>  drivers/vdpa/vdpa.c | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> index f2f654fd84e5..a7612e0783b3 100644
> --- a/drivers/vdpa/vdpa.c
> +++ b/drivers/vdpa/vdpa.c
> @@ -1257,37 +1257,31 @@ static const struct nla_policy 
> vdpa_nl_policy[VDPA_ATTR_MAX + 1] = {
>  static const struct genl_ops vdpa_nl_ops[] = {
>   {
>   .cmd = VDPA_CMD_MGMTDEV_GET,
> - .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
>   .doit = vdpa_nl_cmd_mgmtdev_get_doit,
>   .dumpit = vdpa_nl_cmd_mgmtdev_get_dumpit,
>   },
>   {
>   .cmd = VDPA_CMD_DEV_NEW,
> - .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
>   .doit = vdpa_nl_cmd_dev_add_set_doit,
>   .flags = GENL_ADMIN_PERM,
>   },
>   {
>   .cmd = VDPA_CMD_DEV_DEL,
> - .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
>   .doit = vdpa_nl_cmd_dev_del_set_doit,
>   .flags = GENL_ADMIN_PERM,
>   },
>   {
>   .cmd = VDPA_CMD_DEV_GET,
> - .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
>   .doit = vdpa_nl_cmd_dev_get_doit,
>   .dumpit = vdpa_nl_cmd_dev_get_dumpit,
>   },
>   {
>   .cmd = VDPA_CMD_DEV_CONFIG_GET,
> - .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
>   .doit = vdpa_nl_cmd_dev_config_get_doit,
>   .dumpit = vdpa_nl_cmd_dev_config_get_dumpit,
>   },
>   {
>   .cmd = VDPA_CMD_DEV_VSTATS_GET,
> - .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
>   .doit = vdpa_nl_cmd_dev_stats_get_doit,
>   .flags = GENL_ADMIN_PERM,
>   },
> -- 
> 2.41.0
> 



This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.




Re: [PATCH 0/2] vdpa: Enable strict validation for netlink ops

2023-07-26 Thread Michael S. Tsirkin
On Wed, Jul 26, 2023 at 07:23:50PM +, Dragos Tatulea wrote:
> On Wed, 2023-07-26 at 20:56 +0200, Dragos Tatulea wrote:
> > On Wed, 2023-07-26 at 14:36 -0400, Michael S. Tsirkin wrote:
> > > On Wed, Jul 26, 2023 at 09:30:48PM +0300, Dragos Tatulea wrote:
> > > > The original patch from Lin Ma enables the vdpa driver to use validation
> > > > for netlink ops.
> > > > 
> > > > The second patch simply disables the validation skip, which is no longer
> > > > necessary. The patchset started off from this discussion [0].
> > > > 
> > > > [0]
> > > > https://lore.kernel.org/virtualization/20230726074710-mutt-send-email-...@kernel.org/T/#t
> > > 
> > > Cc stable with at least 1/2 ?
> > > 
> > Sent a v2 with stable in cc. But looks like 1/2 breaks the "fix one thing
> > only"
> > rule due to the many Fixes tags I guess...

I think it's ok.

> Or my lack of understanding: I only now realize that "Cc: stable" is a tag in
> the patch. My bad. Will re-send.

you also need v2 on subject of each patch.

> > > > Dragos Tatulea (1):
> > > >   vdpa: Enable strict validation for netlinks ops
> > > > 
> > > > Lin Ma (1):
> > > >   vdpa: Complement vdpa_nl_policy for nlattr length check
> > > > 
> > > >  drivers/vdpa/vdpa.c | 9 +++--
> > > >  1 file changed, 3 insertions(+), 6 deletions(-)
> > > > 
> > > > -- 
> > > > 2.41.0
> > > 
> > 
> 



Re: [PATCH 0/2] vdpa: Enable strict validation for netlink ops

2023-07-26 Thread Michael S. Tsirkin
On Wed, Jul 26, 2023 at 06:56:24PM +, Dragos Tatulea wrote:
> On Wed, 2023-07-26 at 14:36 -0400, Michael S. Tsirkin wrote:
> > On Wed, Jul 26, 2023 at 09:30:48PM +0300, Dragos Tatulea wrote:
> > > The original patch from Lin Ma enables the vdpa driver to use validation
> > > for netlink ops.
> > > 
> > > The second patch simply disables the validation skip, which is no longer
> > > necessary. The patchset started off from this discussion [0].
> > > 
> > > [0]
> > > https://lore.kernel.org/virtualization/20230726074710-mutt-send-email-...@kernel.org/T/#t
> > 
> > Cc stable with at least 1/2 ?
> > 
> Sent a v2 with stable in cc. But looks like 1/2 breaks the "fix one thing 
> only"
> rule due to the many Fixes tags I guess...

you can split it up to 3 patches to simplify backports if you like.



> > > Dragos Tatulea (1):
> > >   vdpa: Enable strict validation for netlinks ops
> > > 
> > > Lin Ma (1):
> > >   vdpa: Complement vdpa_nl_policy for nlattr length check
> > > 
> > >  drivers/vdpa/vdpa.c | 9 +++--
> > >  1 file changed, 3 insertions(+), 6 deletions(-)
> > > 
> > > -- 
> > > 2.41.0
> > 
> 



Re: [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless

2023-07-26 Thread Dave Chinner via Virtualization
On Wed, Jul 26, 2023 at 05:14:09PM +0800, Qi Zheng wrote:
> On 2023/7/26 16:08, Dave Chinner wrote:
> > On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote:
> > > @@ -122,6 +126,13 @@ void shrinker_free_non_registered(struct shrinker 
> > > *shrinker);
> > >   void shrinker_register(struct shrinker *shrinker);
> > >   void shrinker_unregister(struct shrinker *shrinker);
> > > +static inline bool shrinker_try_get(struct shrinker *shrinker)
> > > +{
> > > + return READ_ONCE(shrinker->registered) &&
> > > +refcount_inc_not_zero(&shrinker->refcount);
> > > +}
> > 
> > Why do we care about shrinker->registered here? If we don't set
> > the refcount to 1 until we have fully initialised everything, then
> > the shrinker code can key entirely off the reference count and
> > none of the lookup code needs to care about whether the shrinker is
> > registered or not.
> 
> The purpose of checking shrinker->registered here is to stop running
> shrinker after calling shrinker_free(), which can prevent the following
> situations from happening:
> 
> CPU 0 CPU 1
> 
> shrinker_try_get()
> 
>shrinker_try_get()
> 
> shrinker_put()
> shrinker_try_get()
>shrinker_put()

I don't see any race here? What is wrong with having multiple active
users at once?

> > 
> > This should use a completion, then it is always safe under
> > rcu_read_lock().  This also gets rid of the shrinker_lock spin lock,
> > which only exists because we can't take a blocking lock under
> > rcu_read_lock(). i.e:
> > 
> > 
> > void shrinker_put(struct shrinker *shrinker)
> > {
> > if (refcount_dec_and_test(&shrinker->refcount))
> > complete(&shrinker->done);
> > }
> > 
> > void shrinker_free()
> > {
> > .
> > refcount_dec(&shrinker->refcount);
> 
> I guess what you mean is shrinker_put(), because here may be the last
> refcount.

Yes, I did.

> > wait_for_completion(&shrinker->done);
> > /*
> >  * lookups on the shrinker will now all fail as refcount has
> >  * fallen to zero. We can now remove it from the lists and
> >  * free it.
> >  */
> > down_write(shrinker_rwsem);
> > list_del_rcu(&shrinker->list);
> > up_write(&shrinker_rwsem);
> > call_rcu(shrinker->rcu_head, shrinker_free_rcu_cb);
> > }
> > 
> > 
> > 
> > > @@ -686,11 +711,14 @@ EXPORT_SYMBOL(shrinker_free_non_registered);
> > >   void shrinker_register(struct shrinker *shrinker)
> > >   {
> > > - down_write(&shrinker_rwsem);
> > > - list_add_tail(&shrinker->list, &shrinker_list);
> > > - shrinker->flags |= SHRINKER_REGISTERED;
> > > + refcount_set(&shrinker->refcount, 1);
> > > +
> > > + spin_lock(&shrinker_lock);
> > > + list_add_tail_rcu(&shrinker->list, &shrinker_list);
> > > + spin_unlock(&shrinker_lock);
> > > +
> > >   shrinker_debugfs_add(shrinker);
> > > - up_write(&shrinker_rwsem);
> > > + WRITE_ONCE(shrinker->registered, true);
> > >   }
> > >   EXPORT_SYMBOL(shrinker_register);
> > 
> > This just looks wrong - you are trying to use WRITE_ONCE() as a
> > release barrier to indicate that the shrinker is now set up fully.
> > That's not necessary - the refcount is an atomic and along with the
> > rcu locks they should provides all the barriers we need. i.e.
> 
> The reason I used WRITE_ONCE() here is because the shrinker->registered
> will be read and written concurrently (read in shrinker_try_get() and
> written in shrinker_free()), which is why I added shrinker::registered
> field instead of using SHRINKER_REGISTERED flag (this can reduce the
> addition of WRITE_ONCE()/READ_ONCE()).

Using WRITE_ONCE/READ_ONCE doesn't provide memory barriers needed to
use the field like this. You need release/acquire memory ordering
here. i.e. smp_store_release()/smp_load_acquire().

As it is, the refcount_inc_not_zero() provides a control dependency,
as documented in include/linux/refcount.h, refcount_dec_and_test()
provides release memory ordering. The only thing I think we may need
is a write barrier before refcount_set(), such that if
refcount_inc_not_zero() sees a non-zero value, it is guaranteed to
see an initialised structure...

i.e. refcounts provide all the existence and initialisation
guarantees. Hence I don't see the need to use shrinker->registered
like this and it can remain a bit flag protected by the
shrinker_rwsem().


> > void shrinker_register(struct shrinker *shrinker)
> > {
> > down_write(&shrinker_rwsem);
> > list_add_tail_rcu(&shrinker->list, &shrinker_list);
> > shrinker->flags |= SHRINKER_REGISTERED;
> > shrinker_debugfs_add(shrinker);
> > up_write(&shrinker_rwsem);
> > 
> > /*
> >  * now the shrinker is fully set up, take the first
> >  * reference to it to indicate that lookup operations are
> >  * now allowed to use it via shrinker_try_get().
> >  */
> > refcount_set(&shrinker->refcount, 1);
> > }
> > 
> > > diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c
> > > index f1becf

[PATCH] x86/paravirt: Fix tlb_remove_table function callback prototype warning

2023-07-26 Thread Kees Cook
Under W=1, this warning is visible in Clang 16 and newer:

arch/x86/kernel/paravirt.c:337:4: warning: cast from 'void (*)(struct mmu_gather *, struct page *)' to 'void (*)(struct mmu_gather *, void *)' converts to incompatible function type [-Wcast-function-type-strict]
                  (void (*)(struct mmu_gather *, void *))tlb_remove_page,
                  ^~

Add a direct wrapper instead, which will make this warning (and
potential KCFI failures) go away.

Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202307260332.pjntwr6o-...@intel.com/
Cc: Juergen Gross 
Cc: Peter Zijlstra 
Cc: Sami Tolvanen 
Cc: Nathan Chancellor 
Cc: Ajay Kaher 
Cc: Alexey Makhalov 
Cc: VMware PV-Drivers Reviewers 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Kees Cook 
---
 arch/x86/kernel/paravirt.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ac10b46c5832..23d4d7114473 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -79,6 +79,11 @@ void __init native_pv_lock_init(void)
static_branch_disable(&virt_spin_lock_key);
 }
 
+static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+   tlb_remove_page(tlb, table);
+}
+
 unsigned int paravirt_patch(u8 type, void *insn_buff, unsigned long addr,
unsigned int len)
 {
@@ -295,8 +300,7 @@ struct paravirt_patch_template pv_ops = {
.mmu.flush_tlb_kernel   = native_flush_tlb_global,
.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
.mmu.flush_tlb_multi= native_flush_tlb_multi,
-   .mmu.tlb_remove_table   =
-   (void (*)(struct mmu_gather *, void *))tlb_remove_page,
+   .mmu.tlb_remove_table   = native_tlb_remove_table,
 
.mmu.exit_mmap  = paravirt_nop,
.mmu.notify_page_enc_status_changed = paravirt_nop,
-- 
2.34.1



Re: [PATCH] x86/paravirt: Fix tlb_remove_table function callback prototype warning

2023-07-26 Thread Juergen Gross via Virtualization

On 27.07.23 01:11, Kees Cook wrote:

Under W=1, this warning is visible in Clang 16 and newer:

arch/x86/kernel/paravirt.c:337:4: warning: cast from 'void (*)(struct 
mmu_gather *, struct page *)' to 'void (*)(struct mmu_gather *, void *)' 
converts to incompatible function type [-Wcast-function-type-strict]
(void (*)(struct mmu_gather *, void 
*))tlb_remove_page,

^~

Add a direct wrapper instead, which will make this warning (and
potential KCFI failures) go away.

Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202307260332.pjntwr6o-...@intel.com/
Cc: Juergen Gross 
Cc: Peter Zijlstra 
Cc: Sami Tolvanen 
Cc: Nathan Chancellor 
Cc: Ajay Kaher 
Cc: Alexey Makhalov 
Cc: VMware PV-Drivers Reviewers 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Kees Cook 


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH net-next v4 2/2] virtio-net: add cond_resched() to the command waiting loop

2023-07-26 Thread Jason Wang
On Wed, Jul 26, 2023 at 7:38 PM Michael S. Tsirkin  wrote:
>
> On Wed, Jul 26, 2023 at 09:55:37AM +0800, Jason Wang wrote:
> > On Tue, Jul 25, 2023 at 3:36 PM Michael S. Tsirkin  wrote:
> > >
> > > On Tue, Jul 25, 2023 at 11:07:40AM +0800, Jason Wang wrote:
> > > > On Mon, Jul 24, 2023 at 3:17 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Mon, Jul 24, 2023 at 02:52:05PM +0800, Jason Wang wrote:
> > > > > > On Sat, Jul 22, 2023 at 4:18 AM Maxime Coquelin
> > > > > >  wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 7/21/23 17:10, Michael S. Tsirkin wrote:
> > > > > > > > On Fri, Jul 21, 2023 at 04:58:04PM +0200, Maxime Coquelin wrote:
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On 7/21/23 16:45, Michael S. Tsirkin wrote:
> > > > > > > >>> On Fri, Jul 21, 2023 at 04:37:00PM +0200, Maxime Coquelin 
> > > > > > > >>> wrote:
> > > > > > > 
> > > > > > > 
> > > > > > >  On 7/20/23 23:02, Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Jul 20, 2023 at 01:26:20PM -0700, Shannon Nelson 
> > > > > > > > wrote:
> > > > > > > >> On 7/20/23 1:38 AM, Jason Wang wrote:
> > > > > > > >>>
> > > > > > > >>> Adding cond_resched() to the command waiting loop for a 
> > > > > > > >>> better
> > > > > > > >>> co-operation with the scheduler. This allows to give CPU 
> > > > > > > >>> a breath to
> > > > > > > >>> run other task(workqueue) instead of busy looping when 
> > > > > > > >>> preemption is
> > > > > > > >>> not allowed on a device whose CVQ might be slow.
> > > > > > > >>>
> > > > > > > >>> Signed-off-by: Jason Wang 
> > > > > > > >>
> > > > > > > >> This still leaves hung processes, but at least it doesn't 
> > > > > > > >> pin the CPU any
> > > > > > > >> more.  Thanks.
> > > > > > > >> Reviewed-by: Shannon Nelson 
> > > > > > > >>
> > > > > > > >
> > > > > > > > I'd like to see a full solution
> > > > > > > > 1- block until interrupt
> > > > > >
> > > > > > I remember in previous versions, you worried about the extra MSI
> > > > > > vector. (Maybe I was wrong).
> > > > > >
> > > > > > > 
> > > > > > >  Would it make sense to also have a timeout?
> > > > > > >  And when timeout expires, set FAILED bit in device status?
> > > > > > > >>>
> > > > > > > >>> virtio spec does not set any limits on the timing of vq
> > > > > > > >>> processing.
> > > > > > > >>
> > > > > > > >> Indeed, but I thought the driver could decide it is too long 
> > > > > > > >> for it.
> > > > > > > >>
> > > > > > > >> The issue is we keep waiting with rtnl locked, it can quickly 
> > > > > > > >> make the
> > > > > > > >> system unusable.
> > > > > > > >
> > > > > > > > if this is a problem we should find a way not to keep rtnl
> > > > > > > > locked indefinitely.
> > > > > >
> > > > > > Any ideas on this direction? Simply dropping rtnl during the busy 
> > > > > > loop
> > > > > > will result in a lot of races. This seems to require non-trivial
> > > > > > changes in the networking core.
> > > > > >
> > > > > > >
> > > > > > >  From the tests I have done, I think it is. With OVS, a 
> > > > > > > reconfiguration
> > > > > > > is performed when the VDUSE device is added, and when a MLX5 
> > > > > > > device is
> > > > > > > in the same bridge, it ends up doing an ioctl() that tries to 
> > > > > > > take the
> > > > > > > rtnl lock. In this configuration, it is not possible to kill OVS 
> > > > > > > because
> > > > > > > it is stuck trying to acquire rtnl lock for mlx5 that is held by 
> > > > > > > virtio-
> > > > > > > net.
> > > > > >
> > > > > > Yeah, basically, any RTNL users would be blocked forever.
> > > > > >
> > > > > > And the infinite loop has other side effects like it blocks the 
> > > > > > freezer to work.
> > > > > >
> > > > > > To summarize, there are three issues
> > > > > >
> > > > > > 1) busy polling
> > > > > > 2) breaks freezer
> > > > > > 3) hold RTNL during the loop
> > > > > >
> > > > > > Solving 3 may help somehow for 2 e.g some pm routine e.g wireguard 
> > > > > > or
> > > > > > even virtnet_restore() itself may try to hold the lock.
> > > > >
> > > > > Yep. So my feeling currently is, the only real fix is to actually
> > > > > queue up the work in software.
> > > >
> > > > Do you mean something like:
> > > >
> > > > rtnl_lock();
> > > > queue up the work
> > > > rtnl_unlock();
> > > > return success;
> > > >
> > > > ?
> > >
> > > yes
> >
> > We will lose the error reporting, is it a real problem or not?
>
> Fundamental, isn't it? Maybe we want a per-device flag for async commands,
> and VDUSE will set it while hardware virtio won't.
> This way we only lose error reporting for VDUSE.

This problem is not VDUSE specific; DPUs/vDPA may suffer from it as
well. This might require more thought.

Thanks

>
> > >
> > >
> > > > > It's mostly trivial to limit
> > > > > memory consumption, vid's is the
> > > > > only one where it would make sense to have more than
> 

Re: [PATCH] virtio-vdpa: Fix cpumask memory leak in virtio_vdpa_find_vqs()

2023-07-26 Thread Jason Wang
On Thu, Jul 27, 2023 at 3:11 AM Dragos Tatulea  wrote:
>
> From: Gal Pressman 
>
> Free the cpumask allocated by create_affinity_masks() before returning
> from the function.
>
> Fixes: 3dad56823b53 ("virtio-vdpa: Support interrupt affinity spreading 
> mechanism")
> Signed-off-by: Gal Pressman 
> Reviewed-by: Dragos Tatulea 

Acked-by: Jason Wang 

Thanks

> ---
>  drivers/virtio/virtio_vdpa.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
> index 989e2d7184ce..961161da5900 100644
> --- a/drivers/virtio/virtio_vdpa.c
> +++ b/drivers/virtio/virtio_vdpa.c
> @@ -393,11 +393,13 @@ static int virtio_vdpa_find_vqs(struct virtio_device 
> *vdev, unsigned int nvqs,
> cb.callback = virtio_vdpa_config_cb;
> cb.private = vd_dev;
> ops->set_config_cb(vdpa, &cb);
> +   kfree(masks);
>
> return 0;
>
>  err_setup_vq:
> virtio_vdpa_del_vqs(vdev);
> +   kfree(masks);
> return err;
>  }
>
> --
> 2.41.0
>


Re: [PATCH net-next v4 2/2] virtio-net: add cond_resched() to the command waiting loop

2023-07-26 Thread Michael S. Tsirkin
On Thu, Jul 27, 2023 at 02:03:59PM +0800, Jason Wang wrote:
> On Wed, Jul 26, 2023 at 7:38 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, Jul 26, 2023 at 09:55:37AM +0800, Jason Wang wrote:
> > > On Tue, Jul 25, 2023 at 3:36 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Tue, Jul 25, 2023 at 11:07:40AM +0800, Jason Wang wrote:
> > > > > On Mon, Jul 24, 2023 at 3:17 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Jul 24, 2023 at 02:52:05PM +0800, Jason Wang wrote:
> > > > > > > On Sat, Jul 22, 2023 at 4:18 AM Maxime Coquelin
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On 7/21/23 17:10, Michael S. Tsirkin wrote:
> > > > > > > > > On Fri, Jul 21, 2023 at 04:58:04PM +0200, Maxime Coquelin 
> > > > > > > > > wrote:
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> On 7/21/23 16:45, Michael S. Tsirkin wrote:
> > > > > > > > >>> On Fri, Jul 21, 2023 at 04:37:00PM +0200, Maxime Coquelin 
> > > > > > > > >>> wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > >  On 7/20/23 23:02, Michael S. Tsirkin wrote:
> > > > > > > > > On Thu, Jul 20, 2023 at 01:26:20PM -0700, Shannon Nelson 
> > > > > > > > > wrote:
> > > > > > > > >> On 7/20/23 1:38 AM, Jason Wang wrote:
> > > > > > > > >>>
> > > > > > > > >>> Adding cond_resched() to the command waiting loop for a 
> > > > > > > > >>> better
> > > > > > > > >>> co-operation with the scheduler. This allows to give 
> > > > > > > > >>> CPU a breath to
> > > > > > > > >>> run other task(workqueue) instead of busy looping when 
> > > > > > > > >>> preemption is
> > > > > > > > >>> not allowed on a device whose CVQ might be slow.
> > > > > > > > >>>
> > > > > > > > >>> Signed-off-by: Jason Wang 
> > > > > > > > >>
> > > > > > > > >> This still leaves hung processes, but at least it 
> > > > > > > > >> doesn't pin the CPU any
> > > > > > > > >> more.  Thanks.
> > > > > > > > >> Reviewed-by: Shannon Nelson 
> > > > > > > > >>
> > > > > > > > >
> > > > > > > > > I'd like to see a full solution
> > > > > > > > > 1- block until interrupt
> > > > > > >
> > > > > > > I remember in previous versions, you worried about the extra MSI
> > > > > > > vector. (Maybe I was wrong).
> > > > > > >
> > > > > > > > 
> > > > > > > >  Would it make sense to also have a timeout?
> > > > > > > >  And when timeout expires, set FAILED bit in device status?
> > > > > > > > >>>
> > > > > > > > >>> virtio spec does not set any limits on the timing of vq
> > > > > > > > >>> processing.
> > > > > > > > >>
> > > > > > > > >> Indeed, but I thought the driver could decide it is too long 
> > > > > > > > >> for it.
> > > > > > > > >>
> > > > > > > > >> The issue is we keep waiting with rtnl locked, it can 
> > > > > > > > >> quickly make the
> > > > > > > > >> system unusable.
> > > > > > > > >
> > > > > > > > > if this is a problem we should find a way not to keep rtnl
> > > > > > > > > locked indefinitely.
> > > > > > >
> > > > > > > Any ideas on this direction? Simply dropping rtnl during the busy 
> > > > > > > loop
> > > > > > > will result in a lot of races. This seems to require non-trivial
> > > > > > > changes in the networking core.
> > > > > > >
> > > > > > > >
> > > > > > > >  From the tests I have done, I think it is. With OVS, a 
> > > > > > > > reconfiguration
> > > > > > > > is performed when the VDUSE device is added, and when a MLX5 
> > > > > > > > device is
> > > > > > > > in the same bridge, it ends up doing an ioctl() that tries to 
> > > > > > > > take the
> > > > > > > > rtnl lock. In this configuration, it is not possible to kill 
> > > > > > > > OVS because
> > > > > > > > it is stuck trying to acquire rtnl lock for mlx5 that is held 
> > > > > > > > by virtio-
> > > > > > > > net.
> > > > > > >
> > > > > > > Yeah, basically, any RTNL users would be blocked forever.
> > > > > > >
> > > > > > > And the infinite loop has other side effects like it blocks the 
> > > > > > > freezer to work.
> > > > > > >
> > > > > > > To summarize, there are three issues
> > > > > > >
> > > > > > > 1) busy polling
> > > > > > > 2) breaks freezer
> > > > > > > 3) hold RTNL during the loop
> > > > > > >
> > > > > > > Solving 3 may help somehow for 2 e.g some pm routine e.g 
> > > > > > > wireguard or
> > > > > > > even virtnet_restore() itself may try to hold the lock.
> > > > > >
> > > > > > Yep. So my feeling currently is, the only real fix is to actually
> > > > > > queue up the work in software.
> > > > >
> > > > > Do you mean something like:
> > > > >
> > > > > rtnl_lock();
> > > > > queue up the work
> > > > > rtnl_unlock();
> > > > > return success;
> > > > >
> > > > > ?
> > > >
> > > > yes
> > >
> > > We will lose the error reporting, is it a real problem or not?
> >
> > Fundamental isn't it? Maybe we want a per-device flag for a asynch commands,
> > and vduse will set it while hardware virtio won't.
> > this