Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice

2023-08-13 Thread David Ahern
On 8/9/23 7:57 PM, Mina Almasry wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8e7d0cb540cd..02a25ccf771a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -151,6 +151,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #include "dev.h"
>  #include "net-sysfs.h"
> @@ -2037,6 +2039,182 @@ static int call_netdevice_notifiers_mtu(unsigned long val,
>   return call_netdevice_notifiers_info(val, &info.info);
>  }
> 
> +/* Device memory support */
> +
> +static void netdev_devmem_free_chunk_owner(struct gen_pool *genpool,
> +struct gen_pool_chunk *chunk,
> +void *not_used)
> +{
> + struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
> +
> + kvfree(owner->ppiovs);
> + kfree(owner);
> +}
> +
> +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding)
> +{
> + size_t size, avail;
> +
> + gen_pool_for_each_chunk(binding->chunk_pool,
> + netdev_devmem_free_chunk_owner, NULL);
> +
> + size = gen_pool_size(binding->chunk_pool);
> + avail = gen_pool_avail(binding->chunk_pool);
> +
> + if (!WARN(size != avail, "can't destroy genpool. size=%lu, avail=%lu",
> +   size, avail))
> + gen_pool_destroy(binding->chunk_pool);
> +
> + dma_buf_unmap_attachment(binding->attachment, binding->sgt,
> +  DMA_BIDIRECTIONAL);
> + dma_buf_detach(binding->dmabuf, binding->attachment);
> + dma_buf_put(binding->dmabuf);
> + kfree(binding);
> +}
> +
> +void netdev_unbind_dmabuf_to_queue(struct netdev_dmabuf_binding *binding)
> +{
> + struct netdev_rx_queue *rxq;
> + unsigned long xa_idx;
> +
> + list_del_rcu(&binding->list);
> +
> + xa_for_each(&binding->bound_rxq_list, xa_idx, rxq)
> + if (rxq->binding == binding)
> + rxq->binding = NULL;
> +
> + netdev_devmem_binding_put(binding);

This does a put on the binding but it does not notify the driver that
the dmabuf references need to be flushed from the rx queue.

Also, what about the device getting deleted - e.g., the driver is removed?

> +}
> +
> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, unsigned int dmabuf_fd,
> + u32 rxq_idx, struct netdev_dmabuf_binding **out)
> +{
> + struct netdev_dmabuf_binding *binding;
> + struct netdev_rx_queue *rxq;
> + struct scatterlist *sg;
> + struct dma_buf *dmabuf;
> + unsigned int sg_idx, i;
> + unsigned long virtual;
> + u32 xa_idx;
> + int err;
> +
> + rxq = __netif_get_rx_queue(dev, rxq_idx);
> +
> + if (rxq->binding)
> + return -EEXIST;

So this proposal is limiting a binding to a single dmabuf at a time? Is
this just for the RFC?

Also, this suggests that the Rx queue is unique to the flow. I do not
recall a netdev API to create H/W queues on the fly (only a passing
comment from Kuba), so how is the H/W queue (or queue set since a
completion queue is needed as well) created for the flow? And in turn if
it is unique to the flow, what deletes the queue if an app does not do a
proper cleanup? If the queue sticks around, the dmabuf references stick
around.

Also, if this is an app specific h/w queue, flow steering is not
mentioned in this RFC.

> +
> + dmabuf = dma_buf_get(dmabuf_fd);
> + if (IS_ERR_OR_NULL(dmabuf))
> + return -EBADFD;
> +
> + binding = kzalloc_node(sizeof(*binding), GFP_KERNEL,
> +dev_to_node(&dev->dev));
> + if (!binding) {
> + err = -ENOMEM;
> + goto err_put_dmabuf;
> + }
> +
> + xa_init_flags(&binding->bound_rxq_list, XA_FLAGS_ALLOC);
> +
> + refcount_set(&binding->ref, 1);
> +
> + binding->dmabuf = dmabuf;
> +
> + binding->attachment = dma_buf_attach(binding->dmabuf, dev->dev.parent);
> + if (IS_ERR(binding->attachment)) {
> + err = PTR_ERR(binding->attachment);
> + goto err_free_binding;
> + }
> +
> + binding->sgt = dma_buf_map_attachment(binding->attachment,
> +   DMA_BIDIRECTIONAL);
> + if (IS_ERR(binding->sgt)) {
> + err = PTR_ERR(binding->sgt);
> + goto err_detach;
> + }
> +
> + /* For simplicity we expect to make PAGE_SIZE allocations, but the
> +  * binding can be much more flexible than that. We may be able to
> +  * allocate MTU sized chunks here. Leave that for future work...
> +  */
> + binding->chunk_pool = gen_pool_create(PAGE_SHIFT,
> +   dev_to_node(&dev->dev));
> + if (!binding->chunk_pool) {
> + err = -ENOMEM;
> + goto err_unmap;
> + }
> +
> + virtual = 0;
> + for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) {
> + dma_addr_t dma_addr = sg_dma_address(sg);
> + ...

Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-13 Thread David Ahern
On 8/9/23 7:57 PM, Mina Almasry wrote:
> Changes in RFC v2:
> --
> 
> The sticking point in RFC v1[1] was the dma-buf pages approach we used to
> deliver the device memory to the TCP stack. RFC v2 is a proof-of-concept
> that attempts to resolve this by implementing scatterlist support in the
> networking stack, such that we can import the dma-buf scatterlist
> directly. This is the approach proposed at a high level here[2].
> 
> Detailed changes:
> 1. Replaced dma-buf pages approach with importing scatterlist into the
>page pool.
> 2. Replace the dma-buf pages centric API with a netlink API.
> 3. Removed the TX path implementation - there is no issue with
>implementing the TX path with scatterlist approach, but leaving
>out the TX path makes it easier to review.
> 4. Functionality is tested with this proposal, but I have not conducted
>perf testing yet. I'm not sure there are regressions, but I removed
>perf claims from the cover letter until they can be re-confirmed.
> 5. Added Signed-off-by: contributors to the implementation.
> 6. Fixed some bugs with the RX path since RFC v1.
> 
> Any feedback welcome, but specifically the biggest pending questions
> needing feedback IMO are:
> 
> 1. Feedback on the scatterlist-based approach in general.
> 2. Netlink API (Patch 1 & 2).
> 3. Approach to handle all the drivers that expect to receive pages from
>the page pool (Patch 6).
> 
> [1] 
> https://lore.kernel.org/netdev/dfe4bae7-13a0-3c5d-d671-f61b375cb...@gmail.com/T/
> [2] 
> https://lore.kernel.org/netdev/CAHS8izPm6XRS54LdCDZVd0C75tA1zHSu6jLVO8nzTLXCc=h...@mail.gmail.com/
> 
> --
> 
> * TL;DR:
> 
> Device memory TCP (devmem TCP) is a proposal for transferring data to and/or
> from device memory efficiently, without bouncing the data to a host memory
> buffer.
> 
> * Problem:
> 
> A large amount of data transfers have device memory as the source and/or
> destination. Accelerators drastically increased the volume of such transfers.
> Some examples include:
> - ML accelerators transferring large amounts of training data from storage into
>   GPU/TPU memory. In some cases ML training setup time can be as long as 50% of
>   TPU compute time; improving data transfer throughput & efficiency can help
>   improve GPU/TPU utilization.
> 
> - Distributed training, where ML accelerators, such as GPUs on different hosts,
>   exchange data among themselves.
> 
> - Distributed raw block storage applications transfer large amounts of data with
>   remote SSDs; much of this data does not require host processing.
> 
> Today, the majority of the Device-to-Device data transfers over the network are
> implemented as the following low level operations: Device-to-Host copy,
> Host-to-Host network transfer, and Host-to-Device copy.
> 
> The implementation is suboptimal, especially for bulk data transfers, and can
> put significant strains on system resources, such as host memory bandwidth,
> PCIe bandwidth, etc. One important reason behind the current state is the
> kernel’s lack of semantics to express device to network transfers.
> 
> * Proposal:
> 
> In this patch series we attempt to optimize this use case by implementing
> socket APIs that enable the user to:
> 
> 1. send device memory across the network directly, and
> 2. receive incoming network packets directly into device memory.
> 
> Packet _payloads_ go directly from the NIC to device memory for receive and 
> from
> device memory to NIC for transmit.
> Packet _headers_ go to/from host memory and are processed by the TCP/IP stack
> normally. The NIC _must_ support header split to achieve this.
> 
> Advantages:
> 
> - Alleviate host memory bandwidth pressure, compared to existing
>  network-transfer + device-copy semantics.
> 
> - Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
>   of the PCIe tree, compared to traditional path which sends data through the
>   root complex.
> 
> * Patch overview:
> 
> ** Part 1: netlink API
> 
> Gives user ability to bind dma-buf to an RX queue.
> 
> ** Part 2: scatterlist support
> 
> Currently the standard for device memory sharing is DMABUF, which doesn't
> generate struct pages. On the other hand, the networking stack (skbs, drivers,
> and the page pool) operates on pages. We have 2 options:
> 
> 1. Generate struct pages for dmabuf device memory, or,
> 2. Modify the networking stack to process scatterlist.
> 
> Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.
> 
> ** part 3: page pool support
> 
> We piggyback on the page pool memory providers proposal:
> https://github.com/kuba-moo/linux/tree/pp-providers
> 
> It allows the page pool to define a memory provider that provides the
> page allocation and freeing. It helps abstract most of the device memory
> TCP changes from the driver.
> 
> ** part 4: support for unreadable skb frags
> 
> Page pool iovs are not accessible by the host; we implement changes
> throughout the networking stack to ...

Re: [RFC PATCH v2 06/11] page-pool: add device memory support

2023-08-19 Thread David Ahern
On 8/19/23 9:22 AM, Jesper Dangaard Brouer wrote:
> 
> I do see the problem of depending on having a struct page, as the
> page_pool_iov isn't related to struct page.  Having "page" in the name
> of "page_pool_iov" is also confusing (the hardest problem in CS is naming,
> as we all know).
> 
> To support more allocator types, perhaps the skb->pp_recycle bit needs to
> grow another bit (and be renamed skb->recycle), so we can tell allocator
> types apart: those that are page based and those that are not.
> 
> 
>> I think the feedback has been strong to not multiplex yet another
>> memory type into that struct, that is not a real page. Which is why
>> we went into this direction. This latest series limits the impact largely
>> to networking structures and code.
>>
> 
> Somewhat related to what I'm objecting to: the "page_pool_iov" is not a
> real page, but it is getting recycled into something called "page_pool",
> which funnily enough deals with struct pages internally and depends on the
> struct-page refcount.
> 
> Given the approach moved away from using struct page, I also don't
> see the connection with the page_pool. Sorry.

I do not care for the page_pool_iov name either; I presumed it was the least
change to prove an idea and that the name and details would evolve.

How about something like buffer_pool or netdev_buf_pool that can operate
with either pages, dma addresses, or something else in the future?

> 
>> As for the LSB trick: that avoided adding a lot of boilerplate churn
>> with new type and helper functions.
>>
> 
> Says the lazy programmer :-P ... sorry could not resist ;-)

Use of the LSB (or bits depending on alignment expectations) is a common
trick and already done in quite a few places in the networking stack.
This trick is essential to any realistic change here to incorporate gpu
memory; way too much code will have unnecessary churn without it.

I do prefer my earlier suggestion though, where the skb_frag_t has a
union of relevant types. Instead of `struct page *page` it could
be `void *addr` with the helpers indicating page, iov, or other.
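
A minimal sketch of the LSB-tagging idiom riding in a `struct page *`
field; names are purely illustrative (PP_IOV here is an assumption,
mirroring the PP_DEVMEM check quoted elsewhere in this thread):

#define PP_IOV 0x1UL

/* Tag an iov so it can travel through existing struct page * fields;
 * pointers are at least word aligned, so bit 0 is always free.
 */
static inline struct page *page_pool_iov_to_page(struct page_pool_iov *ppiov)
{
	return (struct page *)((unsigned long)ppiov | PP_IOV);
}

static inline bool page_is_page_pool_iov(const struct page *page)
{
	return (unsigned long)page & PP_IOV;
}

/* Strip the tag to recover the iov before dereferencing. */
static inline struct page_pool_iov *page_to_page_pool_iov(struct page *page)
{
	return (struct page_pool_iov *)((unsigned long)page & ~PP_IOV);
}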



Re: [RFC PATCH v2 00/11] Device Memory TCP

2023-08-23 Thread David Ahern
On 8/23/23 3:52 PM, David Wei wrote:
> I'm also preparing a submission for NetDev conf. I wonder if you and others at
> Google plan to present there as well? If so, then we may want to coordinate 
> our
> submissions and talks (if accepted).

Personally, I see them as related but separate topics: Mina's proposal
is infra that io_uring builds on. Both are interesting and needed
discussions.


Re: [RFC PATCH 00/10] Device Memory TCP

2023-07-18 Thread David Ahern
On 7/18/23 12:15 PM, Jakub Kicinski wrote:
> On Tue, 18 Jul 2023 15:06:29 -0300 Jason Gunthorpe wrote:
>> netlink feels like a weird API choice for that, in particular it would
>> be really wrong to somehow bind the lifecycle of a netlink object to a
>> process.
> 
> Netlink is the right API, life cycle of objects can be easily tied to
> a netlink socket.

That is an unintuitive connection -- memory references, h/w queues, flow
steering should be tied to the datapath socket, not a control plane socket.


Re: [RFC PATCH 00/10] Device Memory TCP

2023-07-18 Thread David Ahern
On 7/18/23 12:29 PM, Jakub Kicinski wrote:
> On Tue, 18 Jul 2023 12:20:59 -0600 David Ahern wrote:
>> On 7/18/23 12:15 PM, Jakub Kicinski wrote:
>>> On Tue, 18 Jul 2023 15:06:29 -0300 Jason Gunthorpe wrote:  
>>>> netlink feels like a weird API choice for that, in particular it would
>>>> be really wrong to somehow bind the lifecycle of a netlink object to a
>>>> process.  
>>>
>>> Netlink is the right API, life cycle of objects can be easily tied to
>>> a netlink socket.  
>>
>> That is an unintuitive connection -- memory references, h/w queues, flow
>> steering should be tied to the datapath socket, not a control plane socket.
> 
> There's one RSS context for many datapath sockets. Plus a lot of the
> APIs already exist, and it's more of a question of packaging them up 
> at the user space level. For things which do not have an API, however,
> netlink, please.

I do not see how 1 RSS context (or more specifically a h/w Rx queue) can
be used properly with memory from different processes (or dma-buf
references). When the process dies, that memory needs to be flushed from
the H/W queues. Queues with interlaced submissions make that more
complicated.

I guess the devil is in the details; I look forward to the evolution of
the patches.


Re: [RFC PATCH net-next v6 01/15] queue_api: define queue api

2024-03-10 Thread David Ahern
On 3/8/24 4:47 PM, David Wei wrote:
> 
> I'm working to port bnxt over to using this API. What are your thoughts
> on maybe pulling this out and use bnxt to drive it?
> 

I would love to see a second nic implementation; this patch set and
overall design is driven by GVE limitations.



Re: [PATCH net-next v15 11/14] net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags

2024-07-02 Thread David Ahern
On 6/27/24 6:32 PM, Mina Almasry wrote:
> @@ -1049,6 +1050,62 @@ static int sock_reserve_memory(struct sock *sk, int bytes)
>   return 0;
>  }
>  
> +#ifdef CONFIG_PAGE_POOL
> +static noinline_for_stack int
> +sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen)
> +{
> + unsigned int num_tokens, i, j, k, netmem_num = 0;
> + struct dmabuf_token *tokens;
> + netmem_ref netmems[16];
> + int ret = 0;
> +
> + if (sk->sk_type != SOCK_STREAM || sk->sk_protocol != IPPROTO_TCP)
> + return -EBADF;
> +
> + if (optlen % sizeof(struct dmabuf_token) ||
> + optlen > sizeof(*tokens) * 128)
> + return -EINVAL;
> +
> + tokens = kvmalloc_array(128, sizeof(*tokens), GFP_KERNEL);

The '128' should be a named macro with a comment on why that value.
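
Something like the below, as a sketch (the macro name and comment are
illustrative, not from the patch):

/* Upper bound on tokens accepted per SO_DEVMEM_DONTNEED call; keeps
 * the kvmalloc_array() below to a single small allocation.
 */
#define MAX_DONTNEED_TOKENS 128

	if (optlen % sizeof(struct dmabuf_token) ||
	    optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS)
		return -EINVAL;

	tokens = kvmalloc_array(MAX_DONTNEED_TOKENS, sizeof(*tokens), GFP_KERNEL);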



Re: [PATCH v2 10/15] vrf: Remove the now superfluous sentinel element from ctl_table array

2023-10-02 Thread David Ahern
On 10/2/23 2:55 AM, Joel Granados via B4 Relay wrote:
> From: Joel Granados 
> 
> This commit comes at the tail end of a greater effort to remove the
> empty elements at the end of the ctl_table arrays (sentinels) which
> will reduce the overall build time size of the kernel and run time
> memory bloat by ~64 bytes per sentinel (further information Link :
> https://lore.kernel.org/all/zo5yx5jfoggi%2f...@bombadil.infradead.org/)
> 
> Remove sentinel from vrf_table
> 
> Signed-off-by: Joel Granados 
> ---
>  drivers/net/vrf.c | 1 -
>  1 file changed, 1 deletion(-)
> 

Reviewed-by: David Ahern 




Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread David Ahern
On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> On 11/05, Mina Almasry wrote:
>> For device memory TCP, we expect the skb headers to be available in host
>> memory for access, and we expect the skb frags to be in device memory
>> and unaccessible to the host. We expect there to be no mixing and
>> matching of device memory frags (unaccessible) with host memory frags
>> (accessible) in the same skb.
>>
>> Add a skb->devmem flag which indicates whether the frags in this skb
>> are device memory frags or not.
>>
>> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
>> and marks the skb as skb->devmem accordingly.
>>
>> Add checks through the network stack to avoid accessing the frags of
>> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
>>
>> Signed-off-by: Willem de Bruijn 
>> Signed-off-by: Kaiyuan Zhang 
>> Signed-off-by: Mina Almasry 
>>
>> ---
>>  include/linux/skbuff.h | 14 +++-
>>  include/net/tcp.h  |  5 +--
>>  net/core/datagram.c|  6 
>>  net/core/gro.c |  5 ++-
>>  net/core/skbuff.c  | 77 --
>>  net/ipv4/tcp.c |  6 
>>  net/ipv4/tcp_input.c   | 13 +--
>>  net/ipv4/tcp_output.c  |  5 ++-
>>  net/packet/af_packet.c |  4 +--
>>  9 files changed, 115 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 1fae276c1353..8fb468ff8115 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
>>   *  @csum_level: indicates the number of consecutive checksums found in
>>   *  the packet minus one that have been verified as
>>   *  CHECKSUM_UNNECESSARY (max 3)
>> + *  @devmem: indicates that all the fragments in this skb are backed by
>> + *  device memory.
>>   *  @dst_pending_confirm: need to confirm neighbour
>>   *  @decrypted: Decrypted SKB
>>   *  @slow_gro: state present at GRO time, slower prepare step required
>> @@ -991,7 +993,7 @@ struct sk_buff {
>>  #if IS_ENABLED(CONFIG_IP_SCTP)
>>  __u8csum_not_inet:1;
>>  #endif
>> -
>> +__u8devmem:1;
>>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>>  __u16   tc_index;   /* traffic control index */
>>  #endif
>> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
>>  __skb_zcopy_downgrade_managed(skb);
>>  }
>>  
>> +/* Return true if frags in this skb are not readable by the host. */
>> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
>> +{
>> +return skb->devmem;
> 
> bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> It better communicates the fact that the stack shouldn't dereference the
> frags (because it has 'devmem' fragments or for some other potential
> future reason).

+1.

Also, the flag on the skb is an optimization - a high level signal that
one or more frags are in unreadable memory. There is no requirement that
all of the frags are in the same memory type.


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread David Ahern
On 11/6/23 3:18 PM, Mina Almasry wrote:
>> @@ -991,7 +993,7 @@ struct sk_buff {
>>  #if IS_ENABLED(CONFIG_IP_SCTP)
>>  __u8csum_not_inet:1;
>>  #endif
>> -
>> +__u8devmem:1;
>>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
>>  __u16   tc_index;   /* traffic control index */
>>  #endif
>> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
>>  __skb_zcopy_downgrade_managed(skb);
>>  }
>>
>> +/* Return true if frags in this skb are not readable by the host. */
>> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
>> +{
>> +return skb->devmem;
>
> bikeshedding: should we also rename 'devmem' sk_buff flag to 
> 'not_readable'?
> It better communicates the fact that the stack shouldn't dereference the
> frags (because it has 'devmem' fragments or for some other potential
> future reason).

 +1.

 Also, the flag on the skb is an optimization - a high level signal that
 one or more frags are in unreadable memory. There is no requirement that
 all of the frags are in the same memory type.
>>
>> David: maybe there should be such a requirement (that they all are
>> unreadable)? Might be easier to support initially; we can relax later
>> on.
>>
> 
> Currently devmem == not_readable, and the restriction is that all the
> frags in the same skb must be either all readable or all unreadable
> (all devmem or all non-devmem).

What requires that restriction? In all of the uses of skb->devmem and
skb_frags_not_readable(), what matters is whether any frag is not readable;
if so, the frag list walk or collapse is avoided.



Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-06 Thread David Ahern
On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index eeeda849115c..1c351c138a5b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>  };
>  
>  #ifdef CONFIG_DMA_SHARED_BUFFER
> +struct page_pool_iov *
> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
> +void netdev_free_devmem(struct page_pool_iov *ppiov);

netdev_{alloc,free}_dmabuf?

I say that because a dmabuf can be host memory, at least I am not aware
of a restriction that a dmabuf is device memory.



Re: [RFC PATCH v3 06/12] memory-provider: dmabuf devmem memory provider

2023-11-06 Thread David Ahern
On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h
> index 78cbb040af94..b93243c2a640 100644
> --- a/include/net/page_pool/helpers.h
> +++ b/include/net/page_pool/helpers.h
> @@ -111,6 +112,45 @@ page_pool_iov_binding(const struct page_pool_iov *ppiov)
>   return page_pool_iov_owner(ppiov)->binding;
>  }
>  
> +static inline int page_pool_iov_refcount(const struct page_pool_iov *ppiov)
> +{
> + return refcount_read(&ppiov->refcount);
> +}
> +
> +static inline void page_pool_iov_get_many(struct page_pool_iov *ppiov,
> +   unsigned int count)
> +{
> + refcount_add(count, &ppiov->refcount);
> +}
> +
> +void __page_pool_iov_free(struct page_pool_iov *ppiov);
> +
> +static inline void page_pool_iov_put_many(struct page_pool_iov *ppiov,
> +   unsigned int count)
> +{
> + if (!refcount_sub_and_test(count, &ppiov->refcount))
> + return;
> +
> + __page_pool_iov_free(ppiov);
> +}
> +
> +/* page pool mm helpers */
> +
> +static inline bool page_is_page_pool_iov(const struct page *page)
> +{
> + return (unsigned long)page & PP_DEVMEM;

This is another one where the code can be more generic to not force a
lot of changes later.  e.g., PP_CUSTOM or PP_NO_PAGE. Then the io_uring
use case with host memory can leverage the iov pool in a similar manner.

That does mean skb->devmem needs to be a flag on the page pool and not
just assume iov == device memory.
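
As a sketch of that last point, assuming a hypothetical pool flag
(PP_FLAG_UNREADABLE is made up here; only the idea matters):

/* Readability is a property of the pool's memory provider, not of
 * "iov vs page"; io_uring host memory would leave this flag clear.
 */
#define PP_FLAG_UNREADABLE	BIT(3)

static inline bool page_pool_unreadable(const struct page_pool *pool)
{
	return pool->p.flags & PP_FLAG_UNREADABLE;
}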




Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP

2023-11-06 Thread David Ahern
On 11/6/23 4:32 PM, Stanislav Fomichev wrote:
>> The concise notification API returns tokens as a range for
>> compression, encoding as two 32-bit unsigned integers start + length.
>> It allows for even further batching by returning multiple such ranges
>> in a single call.
> 
> Tangential: should tokens be u64? Otherwise we can't have more than
> 4gb unacknowledged. Or that's a reasonable constraint?
> 

Was thinking the same and with bits reserved for a dmabuf id to allow
multiple dmabufs in a single rx queue (future extension, but build the
capability in now). e.g., something like a 37b offset (128GB dmabuf
size), 19b length (large GRO), 8b dmabuf id (lots of dmabufs to a queue).
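
A sketch of such an encoding (field names and exact widths are
illustrative; 37 + 19 + 8 uses all 64 bits):

#define TOKEN_OFF_BITS	37	/* up to 128GB dmabuf offset */
#define TOKEN_LEN_BITS	19	/* large GRO frags */
#define TOKEN_ID_BITS	8	/* dmabufs per queue */

static inline u64 token_pack(u64 off, u64 len, u64 id)
{
	return off | (len << TOKEN_OFF_BITS) |
	       (id << (TOKEN_OFF_BITS + TOKEN_LEN_BITS));
}

static inline u64 token_offset(u64 t)
{
	return t & ((1ULL << TOKEN_OFF_BITS) - 1);
}

static inline u32 token_len(u64 t)
{
	return (t >> TOKEN_OFF_BITS) & ((1ULL << TOKEN_LEN_BITS) - 1);
}

static inline u32 token_dmabuf_id(u64 t)
{
	return t >> (TOKEN_OFF_BITS + TOKEN_LEN_BITS);
}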


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread David Ahern
On 11/5/23 7:44 PM, Mina Almasry wrote:
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index 176eb5834746..cdd4fb129968 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -425,6 +425,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset,
>   return 0;
>   }
>  
> + if (skb_frags_not_readable(skb))
> + goto short_copy;
> +
>   /* Copy paged appendix. Hmm... why does this look so complicated? */
>   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
>   int end;
> @@ -616,6 +619,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>  {
>   int frag;
>  
> + if (skb_frags_not_readable(skb))
> + return -EFAULT;

This check 
> +
>   if (msg && msg->msg_ubuf && msg->sg_from_iter)
>   return msg->sg_from_iter(sk, skb, from, length);


... should go here. That allows a custom sg_from_iter to have access to
the skb. What matters is not expecting struct page (e.g., refcounting);
if the custom iter does not do that then all is well. io_uring's iter
does not look at the pages, so all good.
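
I.e., a sketch of the suggested ordering in __zerocopy_sg_from_iter():

	/* Let a custom sg_from_iter see the skb first ... */
	if (msg && msg->msg_ubuf && msg->sg_from_iter)
		return msg->sg_from_iter(sk, skb, from, length);

	/* ... and only reject unreadable frags for the generic
	 * page-based path below, which does expect struct pages.
	 */
	if (skb_frags_not_readable(skb))
		return -EFAULT;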

>  
> diff --git a/net/core/gro.c b/net/core/gro.c
> index 42d7f6755f32..56046d65386a 100644
> --- a/net/core/gro.c
> +++ b/net/core/gro.c
> @@ -390,6 +390,9 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
>  {
>   struct skb_shared_info *pinfo = skb_shinfo(skb);
>  
> + if (WARN_ON_ONCE(skb_frags_not_readable(skb)))
> + return;
> +
>   BUG_ON(skb->end - skb->tail < grow);
>  
>   memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
> @@ -411,7 +414,7 @@ static void gro_try_pull_from_frag0(struct sk_buff *skb)
>  {
>   int grow = skb_gro_offset(skb) - skb_headlen(skb);
>  
> - if (grow > 0)
> + if (grow > 0 && !skb_frags_not_readable(skb))
>   gro_pull_from_frag0(skb, grow);
>  }
>  
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 13eca4fd25e1..f01673ed2eff 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1230,6 +1230,14 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt)
>   struct page *p;
>   u8 *vaddr;
>  
> + if (skb_frag_is_page_pool_iov(frag)) {

Why skb_frag_is_page_pool_iov here vs skb_frags_not_readable?


Re: [RFC PATCH v3 09/12] net: add support for skbs with unreadable frags

2023-11-06 Thread David Ahern
On 11/6/23 5:20 PM, Mina Almasry wrote:
> The user is free to modify or delete flow steering rules outside of the
> lifetime of the socket. Technically it's possible for the user to
> reconfigure flow steering while the socket is simultaneously receiving,
> and the result will be packets switching from devmem to non-devmem.

Generically, from one page pool to another (i.e., the devmem piece of that
statement is not relevant).


Re: [RFC PATCH v3 00/12] Device Memory TCP

2023-11-07 Thread David Ahern
Is there a policy about cc'ing moderated lists on patch sets? I thought
there was, but I am not finding anything under Documentation/. Getting a
'needs moderator approval' response on every message is rather annoying.


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-07 Thread David Ahern
On 11/7/23 3:10 PM, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
>>
>> On 11/5/23 7:44 PM, Mina Almasry wrote:
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index eeeda849115c..1c351c138a5b 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>>>  };
>>>
>>>  #ifdef CONFIG_DMA_SHARED_BUFFER
>>> +struct page_pool_iov *
>>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
>>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
>>
>> netdev_{alloc,free}_dmabuf?
>>
> 
> Can do.
> 
>> I say that because a dmabuf can be host memory, at least I am not aware
>> of a restriction that a dmabuf is device memory.
>>
> 
> In my limited experience dma-buf is generally device memory, and
> that's really its use case. CONFIG_UDMABUF is a driver that mocks
> dma-buf with a memfd which I think is used for testing. But I can do
> the rename, it's more clear anyway, I think.

config UDMABUF
bool "userspace dmabuf misc driver"
default n
depends on DMA_SHARED_BUFFER
depends on MEMFD_CREATE || COMPILE_TEST
help
  A driver to let userspace turn memfd regions into dma-bufs.
  Qemu can use this to create host dmabufs for guest framebuffers.


Qemu is just a userspace process; it is in no way a special one.

Treating host memory as a dmabuf should radically simplify the io_uring
extension of this set. That the io_uring set needs to dive into
page_pools is just wrong - complicating the design and code and pushing
io_uring into a realm it does not need to be involved in.

Most (all?) of this patch set can work with any memory; only device
memory is unreadable.




Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP

2023-11-07 Thread David Ahern
On 11/7/23 4:55 PM, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 4:03 PM Willem de Bruijn
>  wrote:
>>
>> On Mon, Nov 6, 2023 at 3:55 PM David Ahern  wrote:
>>>
>>> On 11/6/23 4:32 PM, Stanislav Fomichev wrote:
>>>>> The concise notification API returns tokens as a range for
>>>>> compression, encoding as two 32-bit unsigned integers start + length.
>>>>> It allows for even further batching by returning multiple such ranges
>>>>> in a single call.
>>>>
>>>> Tangential: should tokens be u64? Otherwise we can't have more than
>>>> 4gb unacknowledged. Or that's a reasonable constraint?
>>>>
>>>
>>> Was thinking the same and with bits reserved for a dmabuf id to allow
>>> multiple dmabufs in a single rx queue (future extension, but build the
>>> capability in now). e.g., something like a 37b offset (128GB dmabuf
>>> size), 19b length (large GRO), 8b dmabuf id (lots of dmabufs to a queue).
>>
>> Agreed. Converting to 64b now sounds like a good forward looking revision.
> 
> The concept of IDing a dma-buf came up in a couple of different
> contexts. First, in the context of us giving the dma-buf ID to the
> user on recvmsg() to tell the user the data is in this specific
> dma-buf. The second context is here, to bind dma-bufs with multiple
> user-visible IDs to an rx queue.
> 
> My issue here is that I don't see anything in the struct dma_buf that
> can practically serve as an ID:
> 
> https://elixir.bootlin.com/linux/v6.6-rc7/source/include/linux/dma-buf.h#L302
> 
> Actually, from the userspace, only the name of the dma-buf seems
> queryable. That's only unique if the user sets it as such. The dmabuf
> FD can't serve as an ID. For our use case we need to support 1 process
> doing the dma-buf bind via netlink, sharing the dma-buf FD to another
> process, and that process receives the data.  In this case the FDs
> shown by the 2 processes may be different. Converting to 64b is a
> trivial change I can make now, but I'm not sure how to ID these
> dma-bufs. Suggestions welcome. I'm not sure the dma-buf guys will
> allow adding a new ID + APIs to query said dma-buf ID.
> 

The API can be unique to this usage: e.g., add a dmabuf id to the
netlink API. Userspace manages the ids (tells the kernel what value to
use with an instance), the kernel validates no 2 dmabufs have the same
id and then returns the value here.
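
A sketch of what that could look like on the kernel side (the xarray
and function names here are assumptions, for illustration only):

static DEFINE_XARRAY(netdev_dmabuf_ids);

/* Userspace picks the id; xa_insert() returns -EBUSY if it is
 * already taken, which is exactly the uniqueness check needed.
 */
static int netdev_dmabuf_reserve_id(u32 id,
				    struct netdev_dmabuf_binding *binding)
{
	return xa_insert(&netdev_dmabuf_ids, id, binding, GFP_KERNEL);
}

static void netdev_dmabuf_release_id(u32 id)
{
	xa_erase(&netdev_dmabuf_ids, id);
}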




Re: [RFC PATCH v3 06/12] memory-provider: dmabuf devmem memory provider

2023-11-07 Thread David Ahern
On 11/7/23 5:02 PM, Mina Almasry wrote:
> On Mon, Nov 6, 2023 at 1:02 PM Stanislav Fomichev  wrote:
>>
>> On 11/05, Mina Almasry wrote:
>>> +static inline bool page_is_page_pool_iov(const struct page *page)
>>> +{
>>> + return (unsigned long)page & PP_DEVMEM;
>>> +}
>>
>> Speaking of bpf: one thing that might be problematic with this PP_DEVMEM
>> bit is that it will make debugging with bpftrace a bit (more)
>> complicated. If somebody were trying to get to that page_pool_iov from
>> the frags, they will have to do the equivalent of page_is_page_pool_iov,
>> but probably not a big deal? (thinking out loud)
> 
> Good point, but that doesn't only apply to bpf I think. I'm guessing
> even debugger drgn access to the bv_page in the frag will have trouble
> if it's actually accessing an iov with LSB set.
> 
> But this is not specific to this use for LSB pointer trick. I think
> all code that currently uses LSB pointer trick will have similar
> troubles. In this context my humble vote is that we get such big
> upside from reducing code churn that it's reasonable to tolerate such
> side effects.

+1

> 
> I could alleviate some of the issues by teaching drgn to do the right
> thing for devmem/iovs... time permitting.
> 
Tools like drgn and crash have to know when the LSB trick is used -
e.g., dst_entry - and handle it when dereferencing pointers.


Re: [RFC PATCH v3 05/12] netdev: netdevice devmem allocator

2023-11-11 Thread David Ahern
On 11/10/23 7:26 AM, Pavel Begunkov wrote:
> On 11/7/23 23:03, Mina Almasry wrote:
>> On Tue, Nov 7, 2023 at 2:55 PM David Ahern  wrote:
>>>
>>> On 11/7/23 3:10 PM, Mina Almasry wrote:
>>>> On Mon, Nov 6, 2023 at 3:44 PM David Ahern  wrote:
>>>>>
>>>>> On 11/5/23 7:44 PM, Mina Almasry wrote:
>>>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>>>> index eeeda849115c..1c351c138a5b 100644
>>>>>> --- a/include/linux/netdevice.h
>>>>>> +++ b/include/linux/netdevice.h
>>>>>> @@ -843,6 +843,9 @@ struct netdev_dmabuf_binding {
>>>>>>   };
>>>>>>
>>>>>>   #ifdef CONFIG_DMA_SHARED_BUFFER
>>>>>> +struct page_pool_iov *
>>>>>> +netdev_alloc_devmem(struct netdev_dmabuf_binding *binding);
>>>>>> +void netdev_free_devmem(struct page_pool_iov *ppiov);
>>>>>
>>>>> netdev_{alloc,free}_dmabuf?
>>>>>
>>>>
>>>> Can do.
>>>>
>>>>> I say that because a dmabuf can be host memory, at least I am not
>>>>> aware
>>>>> of a restriction that a dmabuf is device memory.
>>>>>
>>>>
>>>> In my limited experience dma-buf is generally device memory, and
>>>> that's really its use case. CONFIG_UDMABUF is a driver that mocks
>>>> dma-buf with a memfd which I think is used for testing. But I can do
>>>> the rename, it's more clear anyway, I think.
>>>
>>> config UDMABUF
>>>  bool "userspace dmabuf misc driver"
>>>  default n
>>>  depends on DMA_SHARED_BUFFER
>>>  depends on MEMFD_CREATE || COMPILE_TEST
>>>  help
>>>    A driver to let userspace turn memfd regions into dma-bufs.
>>>    Qemu can use this to create host dmabufs for guest
>>> framebuffers.
>>>
>>>
>>> Qemu is just a userspace process; it is no way a special one.
>>>
>>> Treating host memory as a dmabuf should radically simplify the io_uring
>>> extension of this set.
>>
>> I agree actually, and I was about to make that comment to David Wei's
>> series once I have the time.
>>
>> David, your io_uring RX zerocopy proposal actually works with devmem
>> TCP, if you're inclined to do that instead, what you'd do roughly is
>> (I think):
> That would be a Frankenstein's monster api with no good reason for it.

It brings a consistent API from a networking perspective.

io_uring should not need to be in the page pool and memory management
business. Have you or David coded up the re-use of the socket APIs with
dmabuf to see how much smaller it makes the io_uring change - or even
walked through from a theoretical perspective?




Re: [net-next v1 06/16] netdev: support binding dma-buf to netdevice

2023-12-08 Thread David Ahern
On 12/7/23 5:52 PM, Mina Almasry wrote:
> +
> +static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx)
> +{
> + void *new_mem;
> + void *old_mem;
> + int err;
> +
> + if (!dev || !dev->netdev_ops)
> + return -EINVAL;
> +
> + if (!dev->netdev_ops->ndo_queue_stop ||
> + !dev->netdev_ops->ndo_queue_mem_free ||
> + !dev->netdev_ops->ndo_queue_mem_alloc ||
> + !dev->netdev_ops->ndo_queue_start)
> + return -EOPNOTSUPP;
> +
> + new_mem = dev->netdev_ops->ndo_queue_mem_alloc(dev, rxq_idx);
> + if (!new_mem)
> + return -ENOMEM;
> +
> + err = dev->netdev_ops->ndo_queue_stop(dev, rxq_idx, &old_mem);
> + if (err)
> + goto err_free_new_mem;
> +
> + err = dev->netdev_ops->ndo_queue_start(dev, rxq_idx, new_mem);
> + if (err)
> + goto err_start_queue;
> +
> + dev->netdev_ops->ndo_queue_mem_free(dev, old_mem);
> +
> + return 0;
> +
> +err_start_queue:
> + dev->netdev_ops->ndo_queue_start(dev, rxq_idx, old_mem);
> +
> +err_free_new_mem:
> + dev->netdev_ops->ndo_queue_mem_free(dev, new_mem);
> +
> + return err;
> +}
> +
> +/* Protected by rtnl_lock() */
> +static DEFINE_XARRAY_FLAGS(netdev_dmabuf_bindings, XA_FLAGS_ALLOC1);
> +
> +void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding)
> +{
> + struct netdev_rx_queue *rxq;
> + unsigned long xa_idx;
> + unsigned int rxq_idx;
> +
> + if (!binding)
> + return;
> +
> + if (binding->list.next)
> + list_del(&binding->list);
> +
> + xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) {
> + if (rxq->binding == binding) {
> + /* We hold the rtnl_lock while binding/unbinding
> +  * dma-buf, so we can't race with another thread that
> +  * is also modifying this value. However, the driver
> +  * may read this config while it's creating its
> +  * rx-queues. WRITE_ONCE() here to match the
> +  * READ_ONCE() in the driver.
> +  */
> + WRITE_ONCE(rxq->binding, NULL);
> +
> + rxq_idx = get_netdev_rx_queue_index(rxq);
> +
> + netdev_restart_rx_queue(binding->dev, rxq_idx);

Blindly restarting a queue when unbinding a dmabuf is heavy handed. If the dmabuf
has no outstanding references (ie., no references in the RxQ), then no
restart is needed.
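
As a sketch, assuming a hypothetical way for the driver to report
outstanding buffers (ndo_queue_outstanding_refs is made up here):

	rxq_idx = get_netdev_rx_queue_index(rxq);

	/* Only bounce the queue if the h/w still holds buffers from
	 * this binding; an idle queue keeps running untouched.
	 */
	if (dev->netdev_ops->ndo_queue_outstanding_refs(dev, rxq_idx))
		netdev_restart_rx_queue(binding->dev, rxq_idx);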

> + }
> + }
> +
> + xa_erase(&netdev_dmabuf_bindings, binding->id);
> +
> + netdev_dmabuf_binding_put(binding);
> +}
> +
> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> + struct netdev_dmabuf_binding *binding)
> +{
> + struct netdev_rx_queue *rxq;
> + u32 xa_idx;
> + int err;
> +
> + rxq = __netif_get_rx_queue(dev, rxq_idx);
> +
> + if (rxq->binding)
> + return -EEXIST;
> +
> + err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
> +GFP_KERNEL);
> + if (err)
> + return err;
> +
> + /* We hold the rtnl_lock while binding/unbinding dma-buf, so we can't
> +  * race with another thread that is also modifying this value. However,
> +  * the driver may read this config while it's creating its * rx-queues.
> +  * WRITE_ONCE() here to match the READ_ONCE() in the driver.
> +  */
> + WRITE_ONCE(rxq->binding, binding);
> +
> + err = netdev_restart_rx_queue(dev, rxq_idx);

Similarly here, binding a dmabuf to a queue. I was expecting the dmabuf
binding to add entries to the page pool for the queue. If the pool was
previously empty, then maybe the queue needs to be "started" in the
sense of creating it with the h/w, or just pushing buffers into the queue
and moving the pidx.




Re: [net-next v1 13/16] tcp: RX path for devmem TCP

2023-12-08 Thread David Ahern
On 12/7/23 5:52 PM, Mina Almasry wrote:
> In tcp_recvmsg_locked(), detect if the skb being received by the user
> is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM
> flag - pass it to tcp_recvmsg_devmem() for custom handling.
> 
> tcp_recvmsg_devmem() copies any data in the skb header to the linear
> buffer, and returns a cmsg to the user indicating the number of bytes
> returned in the linear buffer.
> 
> tcp_recvmsg_devmem() then loops over the inaccessible devmem skb frags,
> and returns to the user a cmsg_devmem indicating the location of the
> data in the dmabuf device memory. cmsg_devmem contains this information:
> 
> 1. the offset into the dmabuf where the payload starts. 'frag_offset'.
> 2. the size of the frag. 'frag_size'.
> 3. an opaque token 'frag_token' to return to the kernel when the buffer
> is to be released.
> 
> The pages awaiting freeing are stored in the newly added
> sk->sk_user_pages, and each page passed to userspace is get_page()'d.
> This reference is dropped once the userspace indicates that it is
> done reading this page.  All pages are released when the socket is
> destroyed.
> 
> Signed-off-by: Willem de Bruijn 
> Signed-off-by: Kaiyuan Zhang 
> Signed-off-by: Mina Almasry 
> 
> ---
> 
> Changes in v1:
> - Added dmabuf_id to dmabuf_cmsg (David/Stan).
> - Devmem -> dmabuf (David).
> - Change tcp_recvmsg_dmabuf() check to skb->dmabuf (Paolo).
> - Use __skb_frag_ref() & napi_pp_put_page() for refcounting (Yunsheng).
> 
> RFC v3:
> - Fixed issue with put_cmsg() failing silently.
> 

What happens if a retransmitted packet is received, or an rx window is
closed and a probe is received where the kernel drops the skb - are the
iov reference(s) in the skb returned to the pool by the stack and ready
for use again?


Re: [net-next v1 07/16] netdev: netdevice devmem allocator

2023-12-08 Thread David Ahern
On 12/7/23 5:52 PM, Mina Almasry wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b8c8be5a912e..30667e4c3b95 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2120,6 +2120,41 @@ static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx)
>   return err;
>  }
>  
> +struct page_pool_iov *netdev_alloc_dmabuf(struct netdev_dmabuf_binding *binding)
> +{
> + struct dmabuf_genpool_chunk_owner *owner;
> + struct page_pool_iov *ppiov;
> + unsigned long dma_addr;
> + ssize_t offset;
> + ssize_t index;
> +
> + dma_addr = gen_pool_alloc_owner(binding->chunk_pool, PAGE_SIZE,

Any reason not to allow allocation sizes other than PAGE_SIZE? e.g.,
2048 for smaller MTUs or 8192 for larger ones. It can be a property of the
page_pool and constant across allocations, vs allowing a different size for
each allocation.
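
As a sketch, with the chunk size made a property of the binding
(rx_buf_len is a hypothetical field set once at creation time; the
gen_pool would be created with the matching min alloc order):

	/* e.g. 2048 for small MTUs, 8192 for large ones */
	dma_addr = gen_pool_alloc_owner(binding->chunk_pool,
					binding->rx_buf_len,
					(void **)&owner);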


Re: [net-next v1 00/16] Device Memory TCP

2023-12-08 Thread David Ahern
On 12/7/23 5:52 PM, Mina Almasry wrote:
> Major changes in v1:
> --
> 
> 1. Implemented MVP queue API ndos to remove the userspace-visible
>driver reset.
> 
> 2. Fixed issues in the napi_pp_put_page() devmem frag unref path.
> 
> 3. Removed RFC tag.
> 
> Many smaller addressed comments across all the patches (patches have
> individual change log).
> 
> Full tree including the rest of the GVE driver changes:
> https://github.com/mina/linux/commits/tcpdevmem-v1
> 

Still a lot of DEVMEM references (e.g., socket API). Any reason not to
move those to DMABUF?



Re: [net-next v1 06/16] netdev: support binding dma-buf to netdevice

2023-12-09 Thread David Ahern
On 12/8/23 12:22 PM, Mina Almasry wrote:
> On Fri, Dec 8, 2023 at 9:48 AM David Ahern  wrote:
>>
>> On 12/7/23 5:52 PM, Mina Almasry wrote:
> ...
>>> +
>>> + xa_for_each(&binding->bound_rxq_list, xa_idx, rxq) {
>>> + if (rxq->binding == binding) {
>>> + /* We hold the rtnl_lock while binding/unbinding
>>> +  * dma-buf, so we can't race with another thread that
>>> +  * is also modifying this value. However, the driver
>>> +  * may read this config while it's creating its
>>> +  * rx-queues. WRITE_ONCE() here to match the
>>> +  * READ_ONCE() in the driver.
>>> +  */
>>> + WRITE_ONCE(rxq->binding, NULL);
>>> +
>>> + rxq_idx = get_netdev_rx_queue_index(rxq);
>>> +
>>> + netdev_restart_rx_queue(binding->dev, rxq_idx);
>>
>> Blindly restarting a queue when unbinding a dmabuf is heavy handed. If the dmabuf
>> has no outstanding references (ie., no references in the RxQ), then no
>> restart is needed.
>>
> 
> I think I need to stop the queue while binding to a dmabuf for the
> sake of concurrency, no? I.e. the softirq thread may be delivering a
> packet, and in parallel a separate thread holds rtnl_lock and tries to
> bind the dma-buf. At that point the page_pool recreation will race
> with the driver doing page_pool_alloc_page(). I don't think I can
> insert a lock to handle this into the rx fast path, no?

I think it depends on the details of how entries are added and removed
from the pool. I am behind on the pp details at this point, so I do need
to do some homework.

> 
> Also, this sounds like it requires (lots of) more changes. The
> page_pool + driver need to report how many pending references there
> are (with locking so we don't race with incoming packets), and have
> them reported via an ndo so that we can skip restarting the queue.
> Implementing the changes isn't a huge issue, but handling the
> concurrency may be a genuine blocker. Not sure it's worth the upside
> of not restarting the single rx queue?

It has to do with the usability of this overall solution. As I mentioned,
most ML use cases can (and will want to) use many memory allocations for
receiving packets - e.g., allocations per message and receiving multiple
messages per socket connection.

> 
>>> + }
>>> + }
>>> +
>>> + xa_erase(&netdev_dmabuf_bindings, binding->id);
>>> +
>>> + netdev_dmabuf_binding_put(binding);
>>> +}
>>> +
>>> +int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
>>> + struct netdev_dmabuf_binding *binding)
>>> +{
>>> + struct netdev_rx_queue *rxq;
>>> + u32 xa_idx;
>>> + int err;
>>> +
>>> + rxq = __netif_get_rx_queue(dev, rxq_idx);
>>> +
>>> + if (rxq->binding)
>>> + return -EEXIST;
>>> +
>>> + err = xa_alloc(&binding->bound_rxq_list, &xa_idx, rxq, xa_limit_32b,
>>> +GFP_KERNEL);
>>> + if (err)
>>> + return err;
>>> +
>>> + /* We hold the rtnl_lock while binding/unbinding dma-buf, so we can't
>>> +  * race with another thread that is also modifying this value. 
>>> However,
>>> +  * the driver may read this config while it's creating its * 
>>> rx-queues.
>>> +  * WRITE_ONCE() here to match the READ_ONCE() in the driver.
>>> +  */
>>> + WRITE_ONCE(rxq->binding, binding);
>>> +
>>> + err = netdev_restart_rx_queue(dev, rxq_idx);
>>
>> Similarly, here binding a dmabuf to a queue. I was expecting the dmabuf
>> binding to add entries to the page pool for the queue.
> 
> To be honest, I think maybe there's a slight disconnect between how
> you think the page_pool works, and my primitive understanding of how
> it works. Today, I see a 1:1 mapping between rx-queue and page_pool in
> the code. I don't see 1:many or many:1 mappings.

I am not referring to 1:N or N:1 for page pool and queues. I am
referring to entries within a single page pool for a single Rx queue.


> 
> In theory mapping 1 rx-queue to n page_pools is trivial: the driver
> can call page_pool_create() multiple times to generate n page_pools and
> decide for incoming packets which one to use.
> 
> However, mapping n rx-queues to ...

Re: [net-next v1 08/16] memory-provider: dmabuf devmem memory provider

2023-12-12 Thread David Ahern
On 12/12/23 6:09 PM, Mina Almasry wrote:
> OK, I imagine this is not that hard to implement - it's really whether
> the change is acceptable to reviewers.
> 
> I figure I can start by implementing a no-op abstraction to page*:
> 
> typedef struct page netmem_t
> 
> and replace the page* in the following places with netmem_t*:
> 
> 1. page_pool API (not internals)
> 2. drivers using the page_pool.
> 3. skb_frag_t.
> 

Accessors to the skb_frag_t fields are now consolidated in
include/linux/skbuff.h (the one IB driver was fixed in Sept by
4ececeb83986), so changing skb_frag_t from bio_vec to something like:

typedef struct skb_frag {
	void *addr;
	unsigned int length;
	unsigned int offset;
} skb_frag_t;

is trivial. From there, addr can default to `struct page *`. If the LSB is
set, strip it and return `struct page_pool_iov *` or `struct buffer_pool *`.
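
A sketch of those helpers over the struct above (PP_IOV and the
function names are illustrative):

#define PP_IOV 0x1UL

static inline bool skb_frag_is_page(const skb_frag_t *frag)
{
	return !((unsigned long)frag->addr & PP_IOV);
}

static inline struct page *skb_frag_to_page(const skb_frag_t *frag)
{
	return skb_frag_is_page(frag) ? (struct page *)frag->addr : NULL;
}

static inline struct page_pool_iov *skb_frag_to_iov(const skb_frag_t *frag)
{
	return skb_frag_is_page(frag) ? NULL :
		(struct page_pool_iov *)((unsigned long)frag->addr & ~PP_IOV);
}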


Re: [PATCH net-next v10 02/14] net: page_pool: create hooks for custom page providers

2024-06-07 Thread David Ahern
On 6/7/24 7:42 AM, Pavel Begunkov wrote:
> I haven't seen any arguments against from the (net) maintainers so
> far. Nor I see any objection against callbacks from them (considering
> that either option adds an if).

I have said before I do not understand why the dmabuf paradigm is not
sufficient for both device memory and host memory. A less than ideal
control path to put hostmem in a dmabuf wrapper vs extra checks and
changes in the datapath. The former should always be preferred.

I also do not understand why the ifq cache and overloading xdp functions
have stuck around; I always thought both were added by Jonathan to
simplify kernel ports during early POC days.


Re: [PATCH net-next v10 02/14] net: page_pool: create hooks for custom page providers

2024-06-10 Thread David Ahern
On 6/10/24 6:16 AM, Jason Gunthorpe wrote:
> On Mon, Jun 10, 2024 at 02:07:01AM +0100, Pavel Begunkov wrote:
>> On 6/10/24 01:37, David Wei wrote:
>>> On 2024-06-07 17:52, Jason Gunthorpe wrote:
 IMHO it seems to compose poorly if you can only use the io_uring
 lifecycle model with io_uring registered memory, and not with DMABUF
 memory registered through Mina's mechanism.
>>>
>>> By this, do you mean io_uring must be exclusively used to use this
>>> feature?
>>>
>>> And you'd rather see the two decoupled, so userspace can register w/ say
>>> dmabuf then pass it to io_uring?
>>
>> Personally, I have no clue what Jason means. You can just as
>> well say that it's poorly composable that write(2) to a disk
>> cannot post a completion into a XDP ring, or a netlink socket,
>> or io_uring's main completion queue, or name any other API.
> 
> There is no reason you shouldn't be able to use your fast io_uring
> completion and lifecycle flow with DMABUF backed memory. Those are not
> widly different things and there is good reason they should work
> together.
> 
> Pretending they are totally different just because two different
> people wrote them is a very siloed view.
> 
>> The devmem TCP callback can implement it in a way feasible to
>> the project, but it cannot directly post events to an unrelated
>> API like io_uring. And devmem attaches buffers to a socket,
>> for which a ring for returning buffers might even be a nuisance.
> 
> If you can't compose your io_uring completion mechanism with a DMABUF
> provided backing store then I think it needs more work.
> 

exactly. io_uring, page_pool, dmabuf - all kernel building blocks for
solutions. This is why I was pushing for Mina's set not to be using the
name `devmem` - it is but one type of memory, and with dmabuf it should
not matter if it is gpu or host (or something else later on - cxl?).



Re: [PATCH net-next v10 02/14] net: page_pool: create hooks for custom page providers

2024-06-12 Thread David Ahern
On 6/12/24 6:06 AM, Jason Gunthorpe wrote:
> On Tue, Jun 11, 2024 at 11:09:15AM -0700, Mina Almasry wrote:
> 
>> Just curious: in Pavel's effort, io_uring - which is not a device - is
>> trying to share memory with the page_pool, which is also not a device.
>> And Pavel is being asked to wrap the memory in a dmabuf. Is dmabuf
>> going to be the kernel's standard for any memory sharing between any 2
>> components in the future, even when they're not devices?
> 
> dmabuf is how we are refcounting non-struct page memory, there is
> nothing about it that says it has to be MMIO memory, or even that the
> memory doesn't have struct pages.
> 
> All it says is that the memory is alive according to dmabuf
> refcounting rules. And the importer obviously don't get to touch the
> underlying folios, if any.
> 

In addition, the io_uring developers should be considering the use case
of device memory. There is no reason for this design to be limited to
host memory. io_uring should not care (it is not peeking inside the
memory buffers); it is just memory references.

One of io_uring's primary benefits is avoiding system calls. io_uring
works with TCP sockets. Let it work with any dmabuf without concern of
memory type. The performance benefits the Google crowd sees with system
call based apps should be even better with io_uring.

Focus on primitives, building blocks with solid APIs for other
subsystems to leverage and let them be wired up in ways you cannot
imagine today.