Re: DPDK packaging and OpenWrt

2023-05-16 Thread Garrett D'Amore
On May 16, 2023 at 12:19 PM -0700, Philip Prindeville 
, wrote:
> Hi,
>
> I'm a packaging maintainer for some of the OpenWrt ancillary packages, and 
> I'm looking at upstreaming DPDK support into OpenWrt. Apologies in advance if 
> this is a bit dense (a brain dump) or hops between too many topics.
>
> I was looking at what's been done to date, and it seems there are a few 
> shortcomings which I hope to address.
>
> Amongst them are:
>
>
> I have some related questions.
>
> 1. OpenWrt already bakes "vfio.ko" and "vfio-pci.ko" here:
>
> https://github.com/openwrt/openwrt/blob/master/package/kernel/linux/modules/virt.mk#L77-L117
>
> but does so with "CONFIG_VFIO_PCI_IGD=y", which seems to be specifically for 
> VGA acceleration of guests in a virtualized environment. That seems to be an 
> odd corner case, and unlikely given that OpenWrt almost always runs on 
> headless hardware. Should this be reverted?
>
> 2. "vfio.ko" is built with "CONFIG_VFIO_NOIOMMU=y", which seems insecure. Can 
> this be reverted?
>
> 3. Is "uio_pci_generic.ko" worth the potential insecurity/instability of a 
> misbehaving application? My understanding is that it's only needed on SR-IOV 
> hardware where MSI/MSI-X interrupts aren't supported: is there even any 
> current hardware that doesn't support MSI/MSI-X? Or am I misunderstanding the 
> use case?

*Either* uio_pci_generic *or* vfio with NOIOMMU is required for the very large 
number of systems that either lack an IOMMU (btw, that will be nearly all 
OpenWrt platforms!) or elect to run with the IOMMU unconfigured. (One 
justification for doing that, apart from numerous software bugs and 
limitations, is that the IOMMU can slow down I/O. We actually recommend that 
most of our customers who run dedicated systems run with the IOMMU disabled 
for this reason.)

vfio with noiommu is preferable.
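
For anyone setting this up on such a platform, the usual recipe is to enable 
vfio's unsafe no-IOMMU mode and then bind the NIC to vfio-pci. Below is a 
minimal sketch of checking that state from a C program; it assumes the 
standard vfio sysfs parameter path and is an illustration only, not OpenWrt- 
or DPDK-specific code.

/*
 * Sketch: report whether vfio's unsafe no-IOMMU mode is enabled.
 * Assumes the standard module parameter path; adjust if your kernel differs.
 */
#include <stdio.h>

static int vfio_noiommu_enabled(void)
{
    FILE *f = fopen("/sys/module/vfio/parameters/enable_unsafe_noiommu_mode", "r");
    int c;

    if (f == NULL)
        return -1;              /* vfio not loaded, or parameter missing */
    c = fgetc(f);
    fclose(f);
    return (c == 'Y' || c == '1') ? 1 : 0;
}

int main(void)
{
    int r = vfio_noiommu_enabled();

    if (r < 0)
        printf("vfio module not loaded (or no such parameter)\n");
    else
        printf("vfio unsafe no-IOMMU mode: %s\n", r ? "enabled" : "disabled");
    return 0;
}
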
>
> 4. Can most functionality be achieved with VFIO + IOMMU support?

*If* you have an IOMMU, and you aren’t trying to eke out the very last bits of 
performance, yes.  But as many systems don’t have an IOMMU, and as one of the 
main points of DPDK is extremely performance-sensitive applications, I think 
the answer is more broadly, “no”.
> 11. What is the user-space TCP/IP stack of choice (or reference) for use with 
> DPDK?

IMO, if you’re using DPDK to run *TCP* applications then you’re probably 
misusing it — there isn’t a userland TCP stack that I would trust.  IP/UDP is 
something we do, and it works well, but I can tell you already that it’s a pain 
to do *just* IP, because e.g. routing tables, ARP, etc. all have to be handled.
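
To give a flavor of what "handling ARP yourself" means in practice, here is a 
rough sketch of answering an ARP request with DPDK's helpers. This is 
illustrative only, not anyone's production code; it assumes a recent DPDK 
where the Ethernet header fields are named dst_addr/src_addr (older releases 
use d_addr/s_addr), and it omits VLANs and validation.

/*
 * Illustrative only: answer an ARP request for my_ip by rewriting the
 * request in place.
 */
#include <rte_arp.h>
#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_mbuf.h>

static int
maybe_answer_arp(struct rte_mbuf *m, const struct rte_ether_addr *my_mac,
                 rte_be32_t my_ip)
{
    struct rte_ether_hdr *eth = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
    struct rte_arp_hdr *arp;

    if (eth->ether_type != rte_cpu_to_be_16(RTE_ETHER_TYPE_ARP))
        return 0;

    arp = (struct rte_arp_hdr *)(eth + 1);
    if (arp->arp_opcode != rte_cpu_to_be_16(RTE_ARP_OP_REQUEST) ||
        arp->arp_data.arp_tip != my_ip)
        return 0;

    /* Turn the request into a reply: swap sender/target, fill in our MAC. */
    arp->arp_opcode = rte_cpu_to_be_16(RTE_ARP_OP_REPLY);
    arp->arp_data.arp_tha = arp->arp_data.arp_sha;
    arp->arp_data.arp_tip = arp->arp_data.arp_sip;
    rte_ether_addr_copy(my_mac, &arp->arp_data.arp_sha);
    arp->arp_data.arp_sip = my_ip;

    rte_ether_addr_copy(&eth->src_addr, &eth->dst_addr);
    rte_ether_addr_copy(my_mac, &eth->src_addr);
    return 1;   /* caller transmits the mbuf back out the same port */
}
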

• Garrett



Re: meson option to customize RTE_PKTMBUF_HEADROOM patch

2024-03-23 Thread Garrett D'Amore
So we right now (at WEKA) have a somewhat older version of DPDK that we have 
customized heavily, and I am going to need to make the headroom *dynamic* 
(passed in at run time, and per port).

We have this requirement because we need the payload to be at a specific 
offset, but have to deal with different header lengths for IPv4 and now IPv6.

My reason for pointing this out is that I would dearly like it if we could 
collaborate on this -- this change is going to touch pretty much every PMD. 
(We don't need it on all of them, as we only support a subset of PMDs, but 
it's still a significant set.)

I'm not sure if anyone else has considered such a need -- this particular 
message caught my eye as I'm looking specifically in this area right now.
On Feb 15, 2024 at 11:02 AM -0800, Parthakumar Roy , 
wrote:
> Hello,
> Bruce Richardson suggested that I submit this patch - at IBM we needed to 
> adjust the pkt_mbuf_headroom​ for our application to work. This is my first 
> ever patch through a mailing list, I have only done it through Pull Requests 
> before, so let me know if I need to correct something.
>
> Message:
>     Add meson configuration option to adjust RTE_PKTMBUF_HEADROOM
>
> diff --git a/config/meson.build b/config/meson.build
> index 7cd375e991..43b765ade1 100644
> --- a/config/meson.build
> +++ b/config/meson.build
> @@ -304,6 +304,7 @@ endforeach
>  dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
>  dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
>  dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
> +dpdk_conf.set('RTE_PKTMBUF_HEADROOM', get_option('pkt_mbuf_headroom'))
>  # values which have defaults which may be overridden
>  dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
>  dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
> diff --git a/config/rte_config.h b/config/rte_config.h
> index 7b8c85e948..a2bb4ea61b 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -51,7 +51,6 @@
>
>  /* mbuf defines */
>  #define RTE_MBUF_DEFAULT_MEMPOOL_OPS "ring_mp_mc"
> -#define RTE_PKTMBUF_HEADROOM 128
>
>  /* ether defines */
>  #define RTE_MAX_QUEUES_PER_PORT 1024
> diff --git a/meson_options.txt b/meson_options.txt
> index 08528492f7..169fcc94c7 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -36,6 +36,8 @@ option('machine', type: 'string', value: 'auto', 
> description:
>         'Alias of cpu_instruction_set.')
>  option('max_ethports', type: 'integer', value: 32, description:
>         'maximum number of Ethernet devices')
> +option('pkt_mbuf_headroom', type: 'integer', value: 128, description:
> +       'number of bytes skipped on Rx at the start of the packet buffer to 
> leave room for additional packet headers')
>  option('max_lcores', type: 'string', value: 'default', description:
>         'Set maximum number of cores/threads supported by EAL; "default" is 
> different per-arch, "detect" detects the number of cores on the build 
> machine.')
>  option('max_numa_nodes', type: 'string', value: 'default', description:
>
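
For context on what the option controls: RTE_PKTMBUF_HEADROOM remains a single 
build-time constant, and every freshly allocated mbuf starts with that much 
headroom. A small sketch (not part of the patch) of how the constant surfaces 
at runtime:

/*
 * Sketch: with -Dpkt_mbuf_headroom=N at meson configure time, a fresh mbuf
 * reports that many bytes of headroom, and data_off reflects it.
 */
#include <stdio.h>
#include <rte_mbuf.h>

static void
show_headroom(struct rte_mempool *mp)
{
    struct rte_mbuf *m = rte_pktmbuf_alloc(mp);

    if (m == NULL)
        return;
    /* For a fresh mbuf: rte_pktmbuf_headroom(m) == RTE_PKTMBUF_HEADROOM
     * and m->data_off == RTE_PKTMBUF_HEADROOM. */
    printf("headroom=%u data_off=%u\n",
           (unsigned)rte_pktmbuf_headroom(m), (unsigned)m->data_off);
    rte_pktmbuf_free(m);
}
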


Re: meson option to customize RTE_PKTMBUF_HEADROOM patch

2024-03-25 Thread Garrett D'Amore
So we need (for reasons that I don't want to get into in too much detail) our 
UDP payload headers to be at a specific offset in the packet.

This was not a problem as long as we only used IPv4.  (We have configured 40 
bytes of headroom, which is more than any of our PMDs need by a hefty margin.)

Now that we're extending to support IPv6, we need to reduce that headroom by 20 
bytes, to preserve our UDP payload offset.
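
For concreteness, the arithmetic behind that 20-byte reduction (a sketch 
assuming plain Ethernet + UDP, with no VLAN tags, IPv4 options, or IPv6 
extension headers; 40 is the configured headroom mentioned above, the rest are 
standard header sizes):

/* Keeping the UDP payload at a fixed offset means the headroom has to absorb
 * the 20-byte difference between the IPv4 and IPv6 headers. */
#define ETH_HDR_LEN   14
#define IPV4_HDR_LEN  20
#define IPV6_HDR_LEN  40
#define UDP_HDR_LEN    8

#define HEADROOM_V4   40                       /* compiled-in default     */
#define HEADROOM_V6   (HEADROOM_V4 - 20)       /* per-port value for IPv6 */

/* UDP payload offset from the start of the mbuf data buffer: */
#define PAYLOAD_OFF_V4  (HEADROOM_V4 + ETH_HDR_LEN + IPV4_HDR_LEN + UDP_HDR_LEN) /* 82 */
#define PAYLOAD_OFF_V6  (HEADROOM_V6 + ETH_HDR_LEN + IPV6_HDR_LEN + UDP_HDR_LEN) /* 82 */
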

This has big ramifications for how we fragment our own upper layer messages, 
and it has been determined that updating the PMDs to allow us to change the 
headroom for this use case (on a per port basis, as we will have some ports on 
IPv4 and others on IPv6) is the least effort, by a large margin.  (Well, 
copying the frames via memcpy would be less development effort, but would be a 
performance catastrophe.)

For the transmit side we don't need this, as we can simply adjust the packet as 
needed.  But for the receive side, we are kind of stuck, as the PMDs rely on 
the hard-coded RTE_PKTMBUF_HEADROOM to program receive locations.
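
As a rough paraphrase of that coupling (no specific PMD, just the common shape 
of an RX refill loop), this is why a per-port headroom means touching every 
driver's refill path:

#include <rte_mbuf.h>

/* Sketch of the usual PMD RX-refill pattern: the DMA address written into
 * each RX descriptor is derived from the compile-time RTE_PKTMBUF_HEADROOM. */
static void
refill_rx_ring(uint64_t *rxd_addr, struct rte_mbuf **mbufs, uint16_t n)
{
    for (uint16_t i = 0; i < n; i++) {
        mbufs[i]->data_off = RTE_PKTMBUF_HEADROOM;
        /* buffer IOVA + RTE_PKTMBUF_HEADROOM */
        rxd_addr[i] = rte_mbuf_data_iova_default(mbufs[i]);
    }
}
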

As far as header splitting, that would indeed be a much much nicer solution.

I haven't looked at the latest code to see if header splitting is even an 
option -- the version of DPDK I'm working with is a little older (20.11) -- we 
have to update, but we have other local changes, so updating is one of the 
things that we still have to do.

At any rate, the version I did look at doesn't seem to support header splits on 
any device other than FM10K.  That's not terrifically interesting for us.  We 
use Mellanox, E810 (ICE), bnxt, cloud NICs (all of them really -- ENA, 
virtio-net, etc.)  We also have a fair amount of ixgbe and i40e on client 
systems in the field.

We also, unfortunately, have an older DPDK 18 with Mellanox contributions for 
IPoverIB, though I'm not sure we will try to support IPv6 there.  (We are 
working towards replacing that part of the stack with UCX.)

Unless header splitting will work on all of this (excepting the IPoIB piece), 
then it's not something we can really use.
On Mar 25, 2024 at 10:20 AM -0700, Stephen Hemminger 
, wrote:
> On Mon, 25 Mar 2024 10:01:52 +
> Bruce Richardson  wrote:
>
> > On Sat, Mar 23, 2024 at 01:51:25PM -0700, Garrett D'Amore wrote:
> > > > So we right now (at WEKA) have a somewhat older version of DPDK that we
> > > > have customized heavily, and I am going to to need to to make the
> > > > headroom *dynamic* (passed in at run time, and per port.)
> > > > We have this requirement because we need payload to be at a specific
> > > > offset, but have to deal with different header lengths for IPv4 and now
> > > > IPv6.
> > > > My reason for pointing this out, is that I would dearly like if we
> > > > could collaborate on this -- this change is going to touch pretty much
> > > > every PMD (we don't need it on all of them as we only support a subset
> > > > of PMDs, but its still a significant set.)
> > > > I'm not sure if anyone else has considered such a need -- this
> > > > particular message caught my eye as I'm looking specifically in this
> > > > area right now.
> > > >
> > Hi
> >
> > thanks for reaching out. Can you clarify a little more as to the need for
> > this requirement? Can you not just set the headroom value to the max needed
> > value for any port and use that? Is there an issue with having blank space
> > at the start of a buffer?
> >
> > Thanks,
> > /Bruce
>
> If you have to make such a deep change across all PMD's then maybe
> it is not the best solution. What about being able to do some form of buffer
> chaining or pullup.


RE: meson option to customize RTE_PKTMBUF_HEADROOM patch

2024-03-26 Thread Garrett D'Amore
This could work. Note that we would like to have the exceptional case of IPv6 
use less headroom.  So we would say 40 is our compiled-in default, and then we 
reduce it by 20 on IPv6, which doesn’t have to support all the same devices 
that IPv4 does. This would give the lowest disruption to the existing IPv4 
stack and allow PMDs to be updated incrementally.
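
To make the idea concrete, a purely hypothetical sketch of the kind of 
per-queue knob being discussed (and which Morten describes below). The struct 
and field names here are invented for illustration; nothing like them exists 
in mainline DPDK's rte_eth_rxconf today.

#include <rte_mbuf.h>

/* Hypothetical per-queue configuration: how many bytes to shave off the
 * compile-time headroom for this queue. */
struct rxq_headroom_conf {
    uint16_t headroom_reduce; /* bytes to subtract from RTE_PKTMBUF_HEADROOM */
};

/* Conceptual driver-side effect on each received mbuf: keep data_off in sync
 * with where DMA actually landed, so rte_pktmbuf_mtod() still points at the
 * Ethernet header. */
static inline void
apply_headroom_reduce(struct rte_mbuf *m, uint16_t reduce)
{
    m->data_off = RTE_PKTMBUF_HEADROOM - reduce;
}
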
On Mar 26, 2024 at 1:05 AM -0700, Morten Brørup , 
wrote:
> Interesting requirement. I can easily imagine how a (non-forwarding, i.e. 
> traffic terminating) application, which doesn’t really care about the 
> preceding headers, can benefit from having its actual data at a specific 
> offset for alignment purposes. I don’t consider this very exotic. (Even the 
> Linux kernel uses this trick to achieve improved IP header alignment on RX.)
>
> I think the proper solution would be to add a new offload parameter to 
> rte_eth_rxconf to specify how many bytes the driver should subtract from 
> RTE_PKTMBUF_HEADROOM when writing the RX descriptor to the NIC hardware. 
> Depending on driver support, this would make it configurable per device and 
> per RX queue.
>
> If this parameter is set, the driver should adjust m->data_off accordingly on 
> RX, so rte_pktmbuf_mtod[_offset]() and rte_pktmbuf_iova[_offset]() still 
> point to the Ethernet header.
>
>
> Med venlig hilsen / Kind regards,
> -Morten Brørup
>
> From: Garrett D'Amore [mailto:garr...@damore.org]
> Sent: Monday, 25 March 2024 23.56
>
> So we need (for reasons that I don't want to get to into in too much detail) 
> that our UDP payload headers are at a specific offset in the packet.
>
> This was not a problem as long as we only used IPv4.  (We have configured 40 
> bytes of headroom, which is more than any of our PMDs need by a hefty margin.)
>
> Now that we're extending to support IPv6, we need to reduce that headroom by 
> 20 bytes, to preserve our UDP payload offset.
>
> This has big ramifications for how we fragment our own upper layer messages, 
> and it has been determined that updating the PMDs to allow us to change the 
> headroom for this use case (on a per port basis, as we will have some ports 
> on IPv4 and others on IPv6) is the least effort, but a large margin.  (Well, 
> copying the frames via memcpy would be less development effort, but would be 
> a performance catastrophe.)
>
> For transmit side we don't need this, as we can simply adjust the packet as 
> needed.  But for the receive side, we are kind of stuck, as the PMDs rely on 
> the hard coded RTE_PKTMBUF_HEADROOM to program receive locations.
>
> As far as header splitting, that would indeed be a much much nicer solution.
>
> I haven't looked in the latest code to see if header splitting is even an 
> option -- the version of the DPDK I'm working with is a little older (20.11) 
> -- we have to update but we have other local changes and so updating is one 
> of the things that we still have to do.
>
> At any rate, the version I did look at doesn't seem to support header splits 
> on any device other than FM10K.  That's not terrifically interesting for us.  
> We use Mellanox, E810 (ICE), bnxt, cloud NICs (all of them really -- ENA, 
> virtio-net, etc.)   We also have a fair amount of ixgbe and i40e on client 
> systems in the field.
>
> We also, unfortunately, have an older DPDK 18 with Mellanox contributions for 
> IPoverIB though I'm not sure we will try to support IPv6 there.  (We are 
> working towards replacing that part of stack with UCX.)
>
> Unless header splitting will work on all of this (excepting the IPoIB piece), 
> then it's not something we can really use.
> On Mar 25, 2024 at 10:20 AM -0700, Stephen Hemminger 
> , wrote:
>
> On Mon, 25 Mar 2024 10:01:52 +
> Bruce Richardson  wrote:
>
>
> On Sat, Mar 23, 2024 at 01:51:25PM -0700, Garrett D'Amore wrote:
>
> > So we right now (at WEKA) have a somewhat older version of DPDK that we
> > have customized heavily, and I am going to to need to to make the
> > headroom *dynamic* (passed in at run time, and per port.)
> > We have this requirement because we need payload to be at a specific
> > offset, but have to deal with different header lengths for IPv4 and now
> > IPv6.
> > My reason for pointing this out, is that I would dearly like if we
> > could collaborate on this -- this change is going to touch pretty much
> > every PMD (we don't need it on all of them as we only support a subset
> > of PMDs, but its still a significant set.)
> > I'm not sure if anyone else has considered such a need -- this
> > particular message caught my eye as I'm looking specifically in this
> > area right now.
> >
>

RE: meson option to customize RTE_PKTMBUF_HEADROOM patch

2024-03-26 Thread Garrett D'Amore
This had occurred to me as well.  I think most hardware DMA engines can align 
on 32-bit boundaries.  I've yet to see a device that actually requires 64-bit 
DMA alignment.  (But I have only looked at a subset of devices, and most of the 
ones I have looked at are not ones that would be considered 'modern'.)
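
A tiny illustration of the alignment concern raised below; the 4- and 8-byte 
requirements are example values, not taken from any datasheet.

#include <assert.h>
#include <stdint.h>

/* If a NIC's RX descriptor reserves the low bits of the buffer address, the
 * DMA address (buffer start + headroom) must satisfy that alignment. */
static inline int
headroom_aligned(uintptr_t buf_addr, uint16_t headroom, uint16_t required_align)
{
    return ((buf_addr + headroom) % required_align) == 0;
}

/* A 40-byte headroom preserves 8-byte alignment of an 8-byte-aligned buffer;
 * dropping to 20 only preserves 4-byte alignment. */
static_assert(40 % 8 == 0, "40-byte headroom keeps 8-byte alignment");
static_assert(20 % 8 != 0 && 20 % 4 == 0, "20-byte headroom only keeps 4-byte alignment");
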
On Mar 26, 2024 at 8:06 AM -0700, Morten Brørup , 
wrote:
> Something just struck me…
> The buffer address field in the RX descriptor of some NICs may have alignment 
> requirements, i.e. the lowest bits in the buffer address field of the NIC’s 
> RX descriptor may be used for other purposes (and assumed zero for buffer 
> address purposes). 40 is divisible by 8, but offset 20 requires that the NIC 
> hardware supports 4-byte aligned addresses (so only the 2 lowest bits may be 
> used for other purposes).
>
> Here’s an example of what I mean:
> https://docs.amd.com/r/en-US/am011-versal-acap-trm/RX-Descriptor-Words
>
> If any of your supported NICs have that restriction, i.e. requires an 8 byte 
> aligned buffer address, your concept of having the UDP payload at the same 
> fixed offset for both IPv4 and IPv6 is not going to be possible. (And you 
> were lucky that the offset happens to be sufficiently aligned to work for 
> IPv4 to begin with.)
>
> It seems you need to read a bunch of datasheets before proceeding.
>
>
> Med venlig hilsen / Kind regards,
> -Morten Brørup
>
> From: Garrett D'Amore [mailto:garr...@damore.org]
> Sent: Tuesday, 26 March 2024 15.19
>
> This could work. Not that we would like to have the exceptional case of IPv6 
> use less headroom.   So we would say 40 is our compiled in default and then 
> we reduce it by 20 on IPv6 which doesn’t have to support all the same devices 
> that IPv4 does. This would give the lowest disruption to the existing IPv4 
> stack and allow PMDs to updated incrementally.
> On Mar 26, 2024 at 1:05 AM -0700, Morten Brørup , 
> wrote:
>
> Interesting requirement. I can easily imagine how a (non-forwarding, i.e. 
> traffic terminating) application, which doesn’t really care about the 
> preceding headers, can benefit from having its actual data at a specific 
> offset for alignment purposes. I don’t consider this very exotic. (Even the 
> Linux kernel uses this trick to achieve improved IP header alignment on RX.)
>
> I think the proper solution would be to add a new offload parameter to 
> rte_eth_rxconf to specify how many bytes the driver should subtract from 
> RTE_PKTMBUF_HEADROOM when writing the RX descriptor to the NIC hardware. 
> Depending on driver support, this would make it configurable per device and 
> per RX queue.
>
> If this parameter is set, the driver should adjust m->data_off accordingly on 
> RX, so rte_pktmbuf_mtod[_offset]() and rte_pktmbuf_iova[_offset]() still 
> point to the Ethernet header.
>
>
> Med venlig hilsen / Kind regards,
> -Morten Brørup
>
> From: Garrett D'Amore [mailto:garr...@damore.org]
> Sent: Monday, 25 March 2024 23.56
> So we need (for reasons that I don't want to get to into in too much detail) 
> that our UDP payload headers are at a specific offset in the packet.
>
> This was not a problem as long as we only used IPv4.  (We have configured 40 
> bytes of headroom, which is more than any of our PMDs need by a hefty margin.)
>
> Now that we're extending to support IPv6, we need to reduce that headroom by 
> 20 bytes, to preserve our UDP payload offset.
>
> This has big ramifications for how we fragment our own upper layer messages, 
> and it has been determined that updating the PMDs to allow us to change the 
> headroom for this use case (on a per port basis, as we will have some ports 
> on IPv4 and others on IPv6) is the least effort, but a large margin.  (Well, 
> copying the frames via memcpy would be less development effort, but would be 
> a performance catastrophe.)
>
> For transmit side we don't need this, as we can simply adjust the packet as 
> needed.  But for the receive side, we are kind of stuck, as the PMDs rely on 
> the hard coded RTE_PKTMBUF_HEADROOM to program receive locations.
>
> As far as header splitting, that would indeed be a much much nicer solution.
>
> I haven't looked in the latest code to see if header splitting is even an 
> option -- the version of the DPDK I'm working with is a little older (20.11) 
> -- we have to update but we have other local changes and so updating is one 
> of the things that we still have to do.
>
> At any rate, the version I did look at doesn't seem to support header splits 
> on any device other than FM10K.  That's not terrifically interesting for us.  
> We use Mellanox, E810 (ICE), bnxt, cloud NICs (all of them really -- ENA, 

RE: meson option to customize RTE_PKTMBUF_HEADROOM patch

2024-03-26 Thread Garrett D'Amore
This ETH_RX_OFFLOAD_BUFFER_SPLIT sounds promising indeed.
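
For reference, a minimal sketch of configuring that offload with the 
rte_eth_rxseg_split API (current API names; as Konstantin notes below, only 
the mlx5 PMD supported it at the time, so the capability check matters). The 
64-byte split point is an example value.

#include <string.h>
#include <rte_ethdev.h>

static int
setup_split_rxq(uint16_t port, uint16_t q, uint16_t nb_desc,
                struct rte_mempool *hdr_pool, struct rte_mempool *pay_pool)
{
    struct rte_eth_dev_info dev_info;
    struct rte_eth_rxconf rxconf;
    union rte_eth_rxseg segs[2];

    if (rte_eth_dev_info_get(port, &dev_info) != 0)
        return -1;
    if (!(dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT))
        return -1;                      /* PMD can't split buffers */

    memset(segs, 0, sizeof(segs));
    segs[0].split.mp = hdr_pool;        /* first segment: headers */
    segs[0].split.length = 64;          /* split point (example)  */
    segs[1].split.mp = pay_pool;        /* second segment: payload */
    segs[1].split.length = 0;           /* rest of the packet      */

    rxconf = dev_info.default_rxconf;
    rxconf.offloads |= RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT;
    rxconf.rx_seg = segs;
    rxconf.rx_nseg = 2;

    /* With rx_seg/rx_nseg set, the mempool argument must be NULL. */
    return rte_eth_rx_queue_setup(port, q, nb_desc,
                                  rte_eth_dev_socket_id(port),
                                  &rxconf, NULL);
}
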
On Mar 26, 2024 at 9:14 AM -0700, Konstantin Ananyev 
, wrote:
> Just wonder what would happen if you’ll receive an ipv6 packet with options 
> or some fancy encapsulation IP-IP or so?
> BTW, there is an  RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT offload:
> https://doc.dpdk.org/api/structrte__eth__rxseg__split.html
> Which might be close to what you are looking for, but right now it is 
> supported by mlx5 PMD only.
>
> From: Garrett D'Amore 
> Sent: Tuesday, March 26, 2024 2:19 PM
> To: Bruce Richardson ; Stephen Hemminger 
> ; Morten Brørup 
> Cc: dev@dpdk.org; Parthakumar Roy 
> Subject: RE: meson option to customize RTE_PKTMBUF_HEADROOM patch
>
> This could work. Not that we would like to have the exceptional case of IPv6 
> use less headroom.   So we would say 40 is our compiled in default and then 
> we reduce it by 20 on IPv6 which doesn’t have to support all the same devices 
> that IPv4 does. This would give the lowest disruption to the existing IPv4 
> stack and allow PMDs to updated incrementally.
> On Mar 26, 2024 at 1:05 AM -0700, Morten Brørup , 
> wrote:
>
> > Interesting requirement. I can easily imagine how a (non-forwarding, i.e. 
> > traffic terminating) application, which doesn’t really care about the 
> > preceding headers, can benefit from having its actual data at a specific 
> > offset for alignment purposes. I don’t consider this very exotic. (Even the 
> > Linux kernel uses this trick to achieve improved IP header alignment on RX.)
> >
> > I think the proper solution would be to add a new offload parameter to 
> > rte_eth_rxconf to specify how many bytes the driver should subtract from 
> > RTE_PKTMBUF_HEADROOM when writing the RX descriptor to the NIC hardware. 
> > Depending on driver support, this would make it configurable per device and 
> > per RX queue.
> >
> > If this parameter is set, the driver should adjust m->data_off accordingly 
> > on RX, so rte_pktmbuf_mtod[_offset]() and rte_pktmbuf_iova[_offset]() still 
> > point to the Ethernet header.
> >
> >
> > Med venlig hilsen / Kind regards,
> > -Morten Brørup
> >
> > From: Garrett D'Amore [mailto:garr...@damore.org]
> > Sent: Monday, 25 March 2024 23.56
> > So we need (for reasons that I don't want to get to into in too much 
> > detail) that our UDP payload headers are at a specific offset in the packet.
> >
> > This was not a problem as long as we only used IPv4.  (We have configured 
> > 40 bytes of headroom, which is more than any of our PMDs need by a hefty 
> > margin.)
> >
> > Now that we're extending to support IPv6, we need to reduce that headroom 
> > by 20 bytes, to preserve our UDP payload offset.
> >
> > This has big ramifications for how we fragment our own upper layer 
> > messages, and it has been determined that updating the PMDs to allow us to 
> > change the headroom for this use case (on a per port basis, as we will have 
> > some ports on IPv4 and others on IPv6) is the least effort, but a large 
> > margin.  (Well, copying the frames via memcpy would be less development 
> > effort, but would be a performance catastrophe.)
> >
> > For transmit side we don't need this, as we can simply adjust the packet as 
> > needed.  But for the receive side, we are kind of stuck, as the PMDs rely 
> > on the hard coded RTE_PKTMBUF_HEADROOM to program receive locations.
> >
> > As far as header splitting, that would indeed be a much much nicer solution.
> >
> > I haven't looked in the latest code to see if header splitting is even an 
> > option -- the version of the DPDK I'm working with is a little older 
> > (20.11) -- we have to update but we have other local changes and so 
> > updating is one of the things that we still have to do.
> >
> > At any rate, the version I did look at doesn't seem to support header 
> > splits on any device other than FM10K.  That's not terrifically interesting 
> > for us.  We use Mellanox, E810 (ICE), bnxt, cloud NICs (all of them really 
> > -- ENA, virtio-net, etc.)   We also have a fair amount of ixgbe and i40e on 
> > client systems in the field.
> >
> > We also, unfortunately, have an older DPDK 18 with Mellanox contributions 
> > for IPoverIB though I'm not sure we will try to support IPv6 there.  
> > (We are working towards replacing that part of stack with UCX.)
> >
> > Unless header splitting will work on all of this (excepting the IPoIB 
> > piece), then it's not something we can really use.
> > On Mar

Re: meson option to customize RTE_PKTMBUF_HEADROOM patch

2024-03-26 Thread Garrett D'Amore
Fair enough... I think that is just something we're going to have to live with.

The other solutions are either much more painful, or much more work.

If we can use header/buffer splitting, that would be superior.  Right now we 
can't use that everywhere because it isn't available everywhere.
On Mar 26, 2024 at 1:35 PM -0700, Stephen Hemminger 
, wrote:
> On Tue, 26 Mar 2024 10:43:30 -0700
> Garrett D'Amore  wrote:
>
> > This had occurred to me as well.  I think most hardware DMA engines can 
> > align on 32-bit boundaries.  I've yet to see a device that actually 
> > requires 64-bit DMA alignment.  (But I have only looked at a subset  of 
> > devices, and most of the  ones I have looked at are not ones that would be 
> > considered 'modern'.)
>
> There may be a PCI transaction performance penalty if not aligned.


Re: [RFC 0/3] introduce coroutine library

2023-04-24 Thread Garrett D'Amore
First time poster here:

I worry a bit about a coroutine approach, as it may be challenging for some 
uses like ours.  We have a purely event-driven loop with a Reactor model 
written in D.  The details are not specifically needed here, except to point 
out that an approach based on ucontext.h or something like that would very 
likely be utterly incompatible with our environment.  While we don’t currently 
plan to integrate support for your hns3 device, I would have grave reservations 
about a general coroutine library making its way into DPDK drivers — it would 
almost certainly cause no end of grief for us at Weka.

I’m doubtful that we’re the only DPDK users in this situation.

• Garrett

On Apr 24, 2023 at 7:50 PM -0700, fengchengwen , wrote:
> On 2023/4/25 10:16, Stephen Hemminger wrote:
> > On Tue, 25 Apr 2023 10:11:43 +0800
> > fengchengwen  wrote:
> >
> > > On 2023/4/25 0:08, Stephen Hemminger wrote:
> > > > On Mon, 24 Apr 2023 13:02:05 +
> > > > Chengwen Feng  wrote:
> > > >
> > > > > This patchset introduces the coroutine library which will help 
> > > > > refactor
> > > > > the hns3 PMD's reset process.
> > > > >
> > > > > The hns3 single function reset process consists of the following 
> > > > > steps:
> > > > > 1.stop_service();
> > > > > 2.prepare_reset();
> > > > > 3.delay(100ms);
> > > > > 4.notify_hw();
> > > > > 5.wait_hw_reset_done(); // multiple sleep waits are involved.
> > > > > 6.reinit();
> > > > > 7.restore_conf();
> > > > >
> > > > > If the DPDK process take over multiple hns3 functions (e.g. 100),
> > > > > it's impractical to reset and restore functions in sequence:
> > > > > 1.proc_func(001); // will completed in 100+ms range.
> > > > > 2.proc_func(002); // will completed in 100~200+ms range.
> > > > > ...
> > > > > x.proc_func(100); // will completed in 9900~1+ms range.
> > > > > The later functions will process fail because it's too late to deal 
> > > > > with.
> > > > >
> > > > > One solution is that create a reset thread for each function, and it
> > > > > will lead to large number of threads if the DPDK process take over
> > > > > multiple hns3 functions.
> > > > >
> > > > > So the current hns3 driver uses asynchronous mechanism, for examples, 
> > > > > it
> > > > > use rte_eal_alarm_set() when process delay(100ms), it splits a serial
> > > > > process into multiple asynchronous processes, and the code is complex
> > > > > and difficult to understand.
> > > > >
> > > > > The coroutine is a good mechanism to provide programmers with the
> > > > > simplicity of keeping serial processes within a limited number of
> > > > > threads.
> > > > >
> > > > > This patchset use  to build the coroutine framework, and 
> > > > > it
> > > > > just provides a demo. More APIs maybe added in the future.
> > > > >
> > > > > In addition, we would like to ask the community whether it it possible
> > > > > to accept the library. If not, whether it is allowed to provide the
> > > > > library in hns3 PMD.
> > > > >
> > > > > Chengwen Feng (3):
> > > > > lib/coroutine: add coroutine library
> > > > > examples/coroutine: support coroutine examples
> > > > > net/hns3: refactor reset process with coroutine
> > > >
> > > > Interesting, but the DPDK really is not the right place for this.
> > > > Also, why so much sleeping. Can't this device be handled with an event 
> > > > based
> > > > model. Plus any complexity like this introduces more bugs into already 
> > > > fragile
> > > > interaction of DPDK userspace applications and threads.
> > >
> > > A event base model will function as:
> > > event-handler() {
> > > for (...) {
> > > event = get_next_event();
> > > proc_event();
> > > }
> > > }
> > > The root cause is that the proc_event() take too many time, and it will 
> > > lead to other
> > > function can't be processed timely.
> > >
> > > For which proc_event() may wait a lot of time, the coroutine could also 
> > > used to optimize
> > > it.
> > >
> > > >
> > > > Not only that, coroutines add to the pre-existing problems with locking.
> > > > If coroutine 1 acquires a lock, the coroutine 2 will deadlock itself.
> > > > And someone will spend days figuring that out. And the existing analyzer
> > > > tools will not know about the magic coroutine library.
> > >
> > > Analyzer tools like lock annotations maybe a problem.
> > >
> > > Locks in DPDK APIs are mostly no-blocking. We can add some 
> > > restrictions(by reviewer), such
> > > as once holding a lock, you can't invoke rte_co_yield() or rte_co_delay() 
> > > API.
> >
> > > In addition, any technology has two sides, the greatest advantage of 
> > > coroutine I think is
> > > removes a large number of callbacks in asychronous programming. And also 
> > > high-level languages
> > > generally provide coroutines (e.g. C++/Python). With the development, the 
> > > analyzer tools maybe
> > > evolved to support detect.
> > >
> > >
> > > And one more, if not acceptable as public library, whether it is allowed 
> > > intergration of this
> > > l

Incoming changes for GVNIC

2024-09-14 Thread Garrett D'Amore
This is mostly a heads-up.

We (WEKA) use various drivers with specific alignment requirements, which 
causes us to need to use multibuffer (scatter/gather) functionality in various 
drivers.
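
For reference, "multibuffer" here is DPDK's scattered-RX path, where a received 
frame is chained across several mbuf segments. A generic sketch of enabling it 
(not GVNIC-specific code, and only one of several pieces a real setup needs):

#include <rte_ethdev.h>

static int
enable_scattered_rx(uint16_t port, struct rte_eth_conf *conf)
{
    struct rte_eth_dev_info dev_info;

    if (rte_eth_dev_info_get(port, &dev_info) != 0)
        return -1;
    if (!(dev_info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_SCATTER))
        return -1;                      /* PMD lacks multi-segment RX */

    /* Request multi-segment receive before rte_eth_dev_configure(). */
    conf->rxmode.offloads |= RTE_ETH_RX_OFFLOAD_SCATTER;
    return 0;
}
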

Unfortunately, the GVNIC driver was ... very... buggy in this regard (in fact 
it doesn't work at all in this case!), especially for the DQO mode used on gen 
3 (C3) instances.  It turns out it was also buggy in that it was not properly 
resetting the device on teardown, creating a situation where DMA could be 
occurring to memory regions after process exit (and thus to invalid memory!).

I've fixed this in our code (it's still under review and testing internally) -- 
but I'd like to upstream these fixes too.  (For the benefit of anyone who may 
have concerns about my "credentials" -- I'm well known for my work on NIC 
drivers (and many others) in Solaris/illumos, and to a lesser extent (and 
longer ago) NetBSD.)

Anyway, I think I have the detailed instructions for submitting changes to 
DPDK, but as this is code that is associated with a vendor (Google), I thought 
I'd reach out first -- if there is a specific code owner here I'd be happy to 
work with them.

My changes are based on a cherry-pick of this driver's code from upstream 
24.07, but mostly we (WEKA) are using DPDK 24.03.

Thanks in advance.