[dpdk-dev] eventdev: rte_event_dev_start() all queues are linked requirement
Hi, rte_event_dev_start() requires that all queues have to be linked, which makes writing applications which link/unlink queues at runtime cumbersome. E.g. the application has to dummy link all queues before rte_event_dev_start() and then unlink them after the function call. This alone wouldn't be a big issue but rte_event_dev_start() may also be called inside rte_event_eth_rx_adapter_create() implementation causing additional complexity. To me this check seems more like eventdev implementation specific limitation, which should be solved by the particular implementation and not enforced by the API to all applications. From an application point of view enqueueing events to an unlinked queue and expecting something meaningful to happen is an error anyway. So, would it be conceivable to remove this particular requirement? -Matias
Re: [dpdk-dev] eventdev: method for finding out unlink status
-Original Message- > Date: Mon, 30 Jul 2018 06:39:45 + > From: "Elo, Matias (Nokia - FI/Espoo)" > To: "dev@dpdk.org" > CC: "Van Haaren, Harry" > Subject: [dpdk-dev] eventdev: method for finding out unlink status > x-mailer: Apple Mail (2.3445.9.1) > > > Hi, > > In bug report https://bugs.dpdk.org/show_bug.cgi?id=60 we have been discussing > issues related to events ending up in wrong ports after calling > rte_event_port_unlink(). In addition of finding few bugs we have identified a > need for a new API call (or documentation extension) for an application to be >From HW perspective, documentation extension should be enough. adding "there may be pre-scheduled events and the application is responsible to process them" on unlink(). Since dequeue() has which queue it is dequeue-ed from, the application can allays make action based on that(i.e, Is the event post/pre to unlink) > able to find out when an unlink() call has finished and no new events are > scheduled anymore to the particular event port. This is required e.g. when > doing > clean-up after an application thread stops processing events. If thread stopping then it better to call dev_stop(). At least in HW implementation, A given event port assigned to a new lcore other than it previous one then we need to do some clean up at port level. > > The bug report discussion provides more background on the subject and Harry > has > already proposed a new 'int32_t rte_event_unlinks_in_progress()' API as one > possible solution. Assuming stale event(s) can go the new linked port, Does rte_event_unlink() takes consider amount of time in SW implementation. > > -Matias >
Re: [dpdk-dev] eventdev: rte_event_dev_start() all queues are linked requirement
-Original Message- > Date: Mon, 30 Jul 2018 07:38:27 + > From: "Elo, Matias (Nokia - FI/Espoo)" > To: "dev@dpdk.org" > CC: "jerin.ja...@caviumnetworks.com" , "Van > Haaren, Harry" > Subject: eventdev: rte_event_dev_start() all queues are linked requirement > x-mailer: Apple Mail (2.3445.9.1) > + mattias.ronnb...@ericsson.com as his SW driver is scheduled for next release. > > Hi, > > rte_event_dev_start() requires that all queues have to be linked, which makes > writing applications which link/unlink queues at runtime cumbersome. > E.g. the application has to dummy link all queues before rte_event_dev_start() > and then unlink them after the function call. This alone wouldn't be a big > issue > but rte_event_dev_start() may also be called inside > rte_event_eth_rx_adapter_create() implementation causing additional > complexity. > > To me this check seems more like eventdev implementation specific limitation, > which should be solved by the particular implementation and not enforced by > the > API to all applications. From an application point of view enqueueing events > to > an unlinked queue and expecting something meaningful to happen is an error > anyway. So, would it be conceivable to remove this particular requirement? For HW drivers, It is OK remove the particular requirement. But, If there is an issue(performance/functionality) for enabling such feature in SW driver. I would like keep that constraint to keep eventdev as abstraction for both SW and HW driver. Harry and/or Mattias.ronnblom can comment from SW driver perspective. > > -Matias
[dpdk-dev] [PATCH] net/virtio-user: fix the param string
Add the missing param "server" to param string. Also add the missing spaces after params. Fixes: bd8f50a45d0f ("net/virtio-user: support server mode") Cc: sta...@dpdk.org Signed-off-by: Tiwei Bie --- drivers/net/virtio/virtio_user_ethdev.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 5ee59d5e5..525d16cab 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -681,6 +681,7 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, "cq= " "queue_size= " "queues= " - "iface=" - "mrg_rxbuf=<0|1>" + "iface= " + "server=<0|1> " + "mrg_rxbuf=<0|1> " "in_order=<0|1>"); -- 2.18.0
Re: [dpdk-dev] [PATCH] net/virtio-user: fix the param string
On 07/30/2018 10:28 AM, Tiwei Bie wrote: Add the missing param "server" to param string. Also add the missing spaces after params. Fixes: bd8f50a45d0f ("net/virtio-user: support server mode") Cc: sta...@dpdk.org Signed-off-by: Tiwei Bie --- drivers/net/virtio/virtio_user_ethdev.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/virtio/virtio_user_ethdev.c b/drivers/net/virtio/virtio_user_ethdev.c index 5ee59d5e5..525d16cab 100644 --- a/drivers/net/virtio/virtio_user_ethdev.c +++ b/drivers/net/virtio/virtio_user_ethdev.c @@ -681,6 +681,7 @@ RTE_PMD_REGISTER_PARAM_STRING(net_virtio_user, "cq= " "queue_size= " "queues= " - "iface=" - "mrg_rxbuf=<0|1>" + "iface= " + "server=<0|1> " + "mrg_rxbuf=<0|1> " "in_order=<0|1>"); Reviewed-by: Maxime Coquelin Thanks! Maxime
Re: [dpdk-dev] [PATCH v5] vfio: fix workaround of BAR mapping
On 29-Jul-18 9:44 AM, Jerin Jacob wrote: -Original Message- Date: Thu, 26 Jul 2018 11:35:43 +0200 From: Thomas Monjalon To: Takeshi Yoshimura Cc: dev@dpdk.org, Anatoly Burakov Subject: Re: [dpdk-dev] [PATCH v5] vfio: fix workaround of BAR mapping 20/07/2018 10:13, Takeshi Yoshimura: Currently, VFIO will try to map around MSI-X table in the BARs. When MSI-X table (page-aligned) size is equal to (page-aligned) size of BAR, VFIO will just skip the BAR. Recent kernel versions will allow VFIO to map the entire BAR containing MSI-X tables (*), so instead of trying to map around the MSI-X vector or skipping the BAR entirely if it's not possible, we can now try mapping the entire BAR first. If mapping the entire BAR doesn't succeed, fall back to the old behavior of mapping around MSI-X table or skipping the BAR. (*): "vfio-pci: Allow mapping MSIX BAR", https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/ commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6 Fixes: 90a1633b2347 ("eal/linux: allow to map BARs with MSI-X tables") Signed-off-by: Takeshi Yoshimura Reviewed-by: Anatoly Burakov This change set breaks thunderx/octeontx platform with following error.(Tested with 4.9.0 kernel) EAL: probe driver: 177d:a034 net_thunderx EAL: using IOMMU type 1 (Type 1) EAL: pci_map_resource(): cannot mmap(44, 0x6020, 0x20, 0x400): Invalid argument (0x) EAL: PCI device 0001:01:00.2 on NUMA socket 0 EAL: probe driver: 177d:a034 net_thunderx EAL: pci_map_resource(): cannot mmap(47, 0x6060, 0x20, 0x400): Invalid argument (0x) According Linux kernel change, user space application suppose to use VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability to detect this feature to work < 4.15 kernel. Right? if so, Why we are doing this retry based logic? I don't think anything's broken there - just a gratuitous error message. But yes, i seem to have missed the region info flag. It was my suggestion to use the retry logic. I'll submit a patch fixing this. Applied, thanks -- Thanks, Anatoly
Re: [dpdk-dev] [PATCH] examples/multi_process: remove l2fwd fork example
On 27-Jul-18 7:56 PM, Gage Eads wrote: l2fwd_fork relies on a multiprocess model that that DPDK does not support (calling rte_eal_init() before fork()), in particular in light of recent EAL changes like the multiproess communication channel. This example can mislead users into thinking this is a supported multiprocess model; hence, this commit removes this example and the corresponding user guide documentation as well. Signed-off-by: Gage Eads --- Hi Gage, Maybe the commit message should reference the earlier discussion on the mailing list. -- Thanks, Anatoly
Re: [dpdk-dev] eventdev: rte_event_dev_start() all queues are linked requirement
> From: Jerin Jacob [mailto:jerin.ja...@caviumnetworks.com] > Sent: Monday, July 30, 2018 9:05 AM > To: Elo, Matias (Nokia - FI/Espoo) > Cc: dev@dpdk.org; Van Haaren, Harry ; > mattias.ronnb...@ericsson.com > Subject: Re: eventdev: rte_event_dev_start() all queues are linked requirement > > -Original Message- > > Date: Mon, 30 Jul 2018 07:38:27 + > > From: "Elo, Matias (Nokia - FI/Espoo)" > > To: "dev@dpdk.org" > > CC: "jerin.ja...@caviumnetworks.com" , "Van > > Haaren, Harry" > > Subject: eventdev: rte_event_dev_start() all queues are linked requirement > > x-mailer: Apple Mail (2.3445.9.1) > > > > + mattias.ronnb...@ericsson.com as his SW driver is scheduled for next > release. > > > > > Hi, > > > > rte_event_dev_start() requires that all queues have to be linked, which > makes > > writing applications which link/unlink queues at runtime cumbersome. > > E.g. the application has to dummy link all queues before > rte_event_dev_start() > > and then unlink them after the function call. This alone wouldn't be a big > issue > > but rte_event_dev_start() may also be called inside > > rte_event_eth_rx_adapter_create() implementation causing additional > complexity. > > > > To me this check seems more like eventdev implementation specific > limitation, > > which should be solved by the particular implementation and not enforced by > the > > API to all applications. From an application point of view enqueueing events > to > > an unlinked queue and expecting something meaningful to happen is an error > > anyway. So, would it be conceivable to remove this particular requirement? > > For HW drivers, It is OK remove the particular requirement. But, If > there is an issue(performance/functionality) for enabling such feature > in SW driver. I would like keep that constraint to keep eventdev as > abstraction for both SW and HW driver. Harry and/or Mattias.ronnblom can > comment from SW driver perspective. I don't have any objection to removing this restriction. Initially it was added to the event/sw driver as it avoids a potential deadlock situation if no ports are mapped to a queue, and the application enqueues events to the queue. As a result the queue will backpressure until the whole thing stops. I'm aware that this is incorrect usage - if events are enqueued to a queue the application is responsible for configuring the ports to pull those events out again... we thought this would help the user - but it looks to do the opposite. Currently this is NOT enforced by the lib/eventdev layer (as far as I can see)? I don't think the header file suggests such behavior either... In event/sw we have a check that fails - but we can relax that check. I'll wait a day or two for some more input, and send a patch for event/sw then. Thanks @Matias Elo for bringing attention to this! -Harry
Re: [dpdk-dev] eventdev: method for finding out unlink status
>> >> In bug report https://bugs.dpdk.org/show_bug.cgi?id=60 we have been >> discussing >> issues related to events ending up in wrong ports after calling >> rte_event_port_unlink(). In addition of finding few bugs we have identified a >> need for a new API call (or documentation extension) for an application to be > > From HW perspective, documentation extension should be enough. adding > "there may be pre-scheduled events and the application is responsible to > process them" > on unlink(). Since dequeue() has which queue it is dequeue-ed from, the > application can allays make action based on that(i.e, Is the event > post/pre to unlink) At least in case of SW eventdev the problem is how the application can know that it has processed all pre-scheduled events. E.g. dequeue may return nothing but since the scheduler is running as a separate process events may still end up to the unlinked port asynchronously. > >> able to find out when an unlink() call has finished and no new events are >> scheduled anymore to the particular event port. This is required e.g. when >> doing >> clean-up after an application thread stops processing events. > > If thread stopping then it better to call dev_stop(). At least in HW > implementation, For an application doing dynamic load balancing stopping the whole eventdev is not an option. > A given event port assigned to a new lcore other than > it previous one then we need to do some clean up at port level. In my case I'm mapping an event port per thread statically (basically thread_id == port_id), so this shouldn't be an issue.
Re: [dpdk-dev] [PATCH] examples: remove Rx checksum offload
Hi Jerin, > > As of now, application does not check PKT_RX_*_CKSUM_* flags per > packet, so it does not matter DEV_RX_OFFLOAD_CHECKSUM enabled or not. > > Removing DEV_RX_OFFLOAD_CHECKSUM offload so that driver can save a few > cycles if possible. Personally, I'd move in other direction: keep RX checksum offload and add checks inside sample apps to handle (drop) packets with invalid checksum. Konstantin > > Signed-off-by: Jerin Jacob > --- > examples/ip_fragmentation/main.c| 3 +-- > examples/ip_reassembly/main.c | 3 +-- > examples/ipsec-secgw/ipsec-secgw.c | 3 +-- > examples/l3fwd-acl/main.c | 3 +-- > examples/l3fwd-power/main.c | 4 ++-- > examples/l3fwd-vf/main.c| 3 +-- > examples/l3fwd/main.c | 3 +-- > examples/load_balancer/init.c | 3 +-- > examples/multi_process/symmetric_mp/main.c | 3 +-- > examples/performance-thread/l3fwd-thread/main.c | 3 +-- > examples/qos_meter/main.c | 3 +-- > 11 files changed, 12 insertions(+), 22 deletions(-) > > diff --git a/examples/ip_fragmentation/main.c > b/examples/ip_fragmentation/main.c > index 5306d7672..7cbd5dc10 100644 > --- a/examples/ip_fragmentation/main.c > +++ b/examples/ip_fragmentation/main.c > @@ -140,8 +140,7 @@ static struct rte_eth_conf port_conf = { > .rxmode = { > .max_rx_pkt_len = JUMBO_FRAME_MAX_SIZE, > .split_hdr_size = 0, > - .offloads = (DEV_RX_OFFLOAD_CHECKSUM | > - DEV_RX_OFFLOAD_JUMBO_FRAME | > + .offloads = (DEV_RX_OFFLOAD_JUMBO_FRAME | >DEV_RX_OFFLOAD_CRC_STRIP), > }, > .txmode = { > diff --git a/examples/ip_reassembly/main.c b/examples/ip_reassembly/main.c > index b830f67a5..7acc6b7b5 100644 > --- a/examples/ip_reassembly/main.c > +++ b/examples/ip_reassembly/main.c > @@ -164,8 +164,7 @@ static struct rte_eth_conf port_conf = { > .mq_mode= ETH_MQ_RX_RSS, > .max_rx_pkt_len = JUMBO_FRAME_MAX_SIZE, > .split_hdr_size = 0, > - .offloads = (DEV_RX_OFFLOAD_CHECKSUM | > - DEV_RX_OFFLOAD_JUMBO_FRAME | > + .offloads = (DEV_RX_OFFLOAD_JUMBO_FRAME | >DEV_RX_OFFLOAD_CRC_STRIP), > }, > .rx_adv_conf = { > diff --git a/examples/ipsec-secgw/ipsec-secgw.c > b/examples/ipsec-secgw/ipsec-secgw.c > index b45b87bde..63eef1f26 100644 > --- a/examples/ipsec-secgw/ipsec-secgw.c > +++ b/examples/ipsec-secgw/ipsec-secgw.c > @@ -197,8 +197,7 @@ static struct rte_eth_conf port_conf = { > .mq_mode= ETH_MQ_RX_RSS, > .max_rx_pkt_len = ETHER_MAX_LEN, > .split_hdr_size = 0, > - .offloads = DEV_RX_OFFLOAD_CHECKSUM | > - DEV_RX_OFFLOAD_CRC_STRIP, > + .offloads = DEV_RX_OFFLOAD_CRC_STRIP, > }, > .rx_adv_conf = { > .rss_conf = { > diff --git a/examples/l3fwd-acl/main.c b/examples/l3fwd-acl/main.c > index 7c063a8d0..c66b5b462 100644 > --- a/examples/l3fwd-acl/main.c > +++ b/examples/l3fwd-acl/main.c > @@ -127,8 +127,7 @@ static struct rte_eth_conf port_conf = { > .mq_mode= ETH_MQ_RX_RSS, > .max_rx_pkt_len = ETHER_MAX_LEN, > .split_hdr_size = 0, > - .offloads = (DEV_RX_OFFLOAD_CRC_STRIP | > - DEV_RX_OFFLOAD_CHECKSUM), > + .offloads = DEV_RX_OFFLOAD_CRC_STRIP, > }, > .rx_adv_conf = { > .rss_conf = { > diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c > index d15cd520e..b00a8ec45 100644 > --- a/examples/l3fwd-power/main.c > +++ b/examples/l3fwd-power/main.c > @@ -180,8 +180,8 @@ static struct rte_eth_conf port_conf = { > .mq_mode= ETH_MQ_RX_RSS, > .max_rx_pkt_len = ETHER_MAX_LEN, > .split_hdr_size = 0, > - .offloads = (DEV_RX_OFFLOAD_CRC_STRIP | > - DEV_RX_OFFLOAD_CHECKSUM), > + .offloads = DEV_RX_OFFLOAD_CRC_STRIP, > + > }, > .rx_adv_conf = { > .rss_conf = { > diff --git a/examples/l3fwd-vf/main.c b/examples/l3fwd-vf/main.c > index 5edd91a78..2a10e9d76 100644 > --- a/examples/l3fwd-vf/main.c > +++ b/examples/l3fwd-vf/main.c > @@ -161,8 +161,7 @@ static struct rte_eth_conf port_conf = { > .mq_mode= ETH_MQ_RX_RSS, > .max_rx_pkt_len = ETHER_MAX_LEN, > .split_hdr_size = 0, > - .offloads = (DEV_RX_OFFLOAD_CRC_STRIP | > - DEV_RX_OFFLOAD_CHECKSUM), > + .offloads = DEV_RX_OFFLOAD_CRC_STRIP, > }, > .rx_adv_conf = { > .rss_conf = { > diff --git a/examples/l3fwd/main.c b/examples/l3fwd/main.c > index ab019b9e4..d4a79b23f 10
Re: [dpdk-dev] eventdev: method for finding out unlink status
-Original Message- > Date: Mon, 30 Jul 2018 09:17:47 + > From: "Elo, Matias (Nokia - FI/Espoo)" > To: Jerin Jacob > CC: "dev@dpdk.org" , "Van Haaren, Harry" > > Subject: Re: [dpdk-dev] eventdev: method for finding out unlink status > x-mailer: Apple Mail (2.3445.9.1) > > > >> > >> In bug report https://bugs.dpdk.org/show_bug.cgi?id=60 we have been > >> discussing > >> issues related to events ending up in wrong ports after calling > >> rte_event_port_unlink(). In addition of finding few bugs we have > >> identified a > >> need for a new API call (or documentation extension) for an application to > >> be > > > > From HW perspective, documentation extension should be enough. adding > > "there may be pre-scheduled events and the application is responsible to > > process them" > > on unlink(). Since dequeue() has which queue it is dequeue-ed from, the > > application can allays make action based on that(i.e, Is the event > > post/pre to unlink) > > At least in case of SW eventdev the problem is how the application can know > that > it has processed all pre-scheduled events. E.g. dequeue may return nothing > but since > the scheduler is running as a separate process events may still end up to the > unlinked > port asynchronously. Can't we do, dequeue() in loop to get all the events from port. If dequeue returns with zero event then ports are drained up. Right? > > > > >> able to find out when an unlink() call has finished and no new events are > >> scheduled anymore to the particular event port. This is required e.g. when > >> doing > >> clean-up after an application thread stops processing events. > > > > If thread stopping then it better to call dev_stop(). At least in HW > > implementation, > > For an application doing dynamic load balancing stopping the whole eventdev > is not an > option. OK. Makes sense. Doing unlink() and link() in fastpath is not a problem. Changing core assignment to event port is problem without stop(). I guess, you application or general would be OK with that constraint. > > > A given event port assigned to a new lcore other than > > it previous one then we need to do some clean up at port level. > > In my case I'm mapping an event port per thread statically (basically > thread_id == port_id), > so this shouldn't be an issue. > >
Re: [dpdk-dev] [PATCH] examples: remove Rx checksum offload
-Original Message- > Date: Mon, 30 Jul 2018 09:27:48 + > From: "Ananyev, Konstantin" > To: Jerin Jacob , "dev@dpdk.org" > > CC: "tho...@monjalon.net" , "Yigit, Ferruh" > , "shah...@mellanox.com" > Subject: RE: [dpdk-dev] [PATCH] examples: remove Rx checksum offload > > > Hi Jerin, Hi Konstantin, > > > > > As of now, application does not check PKT_RX_*_CKSUM_* flags per > > packet, so it does not matter DEV_RX_OFFLOAD_CHECKSUM enabled or not. > > > > Removing DEV_RX_OFFLOAD_CHECKSUM offload so that driver can save a few > > cycles if possible. > > Personally, I'd move in other direction: keep RX checksum offload and add > checks inside sample apps to handle (drop) packets with invalid checksum. OK. Till someones add the DROP logic in application, Can we take this patch? Because there is no point in enabling DEV_RX_OFFLOAD_CHECKSUM without DROP or any meaning full action in application. > Konstantin > > > > > Signed-off-by: Jerin Jacob > > --- > > examples/ip_fragmentation/main.c| 3 +-- > > examples/ip_reassembly/main.c | 3 +-- > > examples/ipsec-secgw/ipsec-secgw.c | 3 +-- > > examples/l3fwd-acl/main.c | 3 +-- > > examples/l3fwd-power/main.c | 4 ++-- > > examples/l3fwd-vf/main.c| 3 +-- > > examples/l3fwd/main.c | 3 +-- > > examples/load_balancer/init.c | 3 +-- > > examples/multi_process/symmetric_mp/main.c | 3 +-- > > examples/performance-thread/l3fwd-thread/main.c | 3 +-- > > examples/qos_meter/main.c | 3 +-- > > 11 files changed, 12 insertions(+), 22 deletions(-) > > > > diff --git a/examples/ip_fragmentation/main.c > > b/examples/ip_fragmentation/main.c > > index 5306d7672..7cbd5dc10 100644 > > --- a/examples/ip_fragmentation/main.c > > +++ b/examples/ip_fragmentation/main.c > > @@ -140,8 +140,7 @@ static struct rte_eth_conf port_conf = { > > .rxmode = { > > .max_rx_pkt_len = JUMBO_FRAME_MAX_SIZE, > > .split_hdr_size = 0, > > - .offloads = (DEV_RX_OFFLOAD_CHECKSUM | > > - DEV_RX_OFFLOAD_JUMBO_FRAME | > > + .offloads = (DEV_RX_OFFLOAD_JUMBO_FRAME | > >DEV_RX_OFFLOAD_CRC_STRIP), > > }, > > .txmode = { > > diff --git a/examples/ip_reassembly/main.c b/examples/ip_reassembly/main.c > > index b830f67a5..7acc6b7b5 100644 > > --- a/examples/ip_reassembly/main.c > > +++ b/examples/ip_reassembly/main.c > > @@ -164,8 +164,7 @@ static struct rte_eth_conf port_conf = { > > .mq_mode= ETH_MQ_RX_RSS, > > .max_rx_pkt_len = JUMBO_FRAME_MAX_SIZE, > > .split_hdr_size = 0, > > - .offloads = (DEV_RX_OFFLOAD_CHECKSUM | > > - DEV_RX_OFFLOAD_JUMBO_FRAME | > > + .offloads = (DEV_RX_OFFLOAD_JUMBO_FRAME | > >DEV_RX_OFFLOAD_CRC_STRIP), > > }, > > .rx_adv_conf = { > > diff --git a/examples/ipsec-secgw/ipsec-secgw.c > > b/examples/ipsec-secgw/ipsec-secgw.c > > index b45b87bde..63eef1f26 100644 > > --- a/examples/ipsec-secgw/ipsec-secgw.c > > +++ b/examples/ipsec-secgw/ipsec-secgw.c > > @@ -197,8 +197,7 @@ static struct rte_eth_conf port_conf = { > > .mq_mode= ETH_MQ_RX_RSS, > > .max_rx_pkt_len = ETHER_MAX_LEN, > > .split_hdr_size = 0, > > - .offloads = DEV_RX_OFFLOAD_CHECKSUM | > > - DEV_RX_OFFLOAD_CRC_STRIP, > > + .offloads = DEV_RX_OFFLOAD_CRC_STRIP, > > }, > > .rx_adv_conf = { > > .rss_conf = { > > diff --git a/examples/l3fwd-acl/main.c b/examples/l3fwd-acl/main.c > > index 7c063a8d0..c66b5b462 100644 > > --- a/examples/l3fwd-acl/main.c > > +++ b/examples/l3fwd-acl/main.c > > @@ -127,8 +127,7 @@ static struct rte_eth_conf port_conf = { > > .mq_mode= ETH_MQ_RX_RSS, > > .max_rx_pkt_len = ETHER_MAX_LEN, > > .split_hdr_size = 0, > > - .offloads = (DEV_RX_OFFLOAD_CRC_STRIP | > > - DEV_RX_OFFLOAD_CHECKSUM), > > + .offloads = DEV_RX_OFFLOAD_CRC_STRIP, > > }, > > .rx_adv_conf = { > > .rss_conf = { > > diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c > > index d15cd520e..b00a8ec45 100644 > > --- a/examples/l3fwd-power/main.c > > +++ b/examples/l3fwd-power/main.c > > @@ -180,8 +180,8 @@ static struct rte_eth_conf port_conf = { > > .mq_mode= ETH_MQ_RX_RSS, > > .max_rx_pkt_len = ETHER_MAX_LEN, > > .split_hdr_size = 0, > > - .offloads = (DEV_RX_OFFLOAD_CRC_STRIP | > > - DEV_RX_OFFLOAD_CHECKSUM), > > + .offloads = DEV_RX_OFFLOAD_CRC_STRIP, > > + > > }, > > .rx_adv_conf = { > >
Re: [dpdk-dev] eventdev: method for finding out unlink status
> From: Jerin Jacob [mailto:jerin.ja...@caviumnetworks.com] > Sent: Monday, July 30, 2018 10:29 AM > To: Elo, Matias (Nokia - FI/Espoo) > Cc: dev@dpdk.org; Van Haaren, Harry > Subject: Re: [dpdk-dev] eventdev: method for finding out unlink status > > -Original Message- > > Date: Mon, 30 Jul 2018 09:17:47 + > > From: "Elo, Matias (Nokia - FI/Espoo)" > > To: Jerin Jacob > > CC: "dev@dpdk.org" , "Van Haaren, Harry" > > > > Subject: Re: [dpdk-dev] eventdev: method for finding out unlink status > > x-mailer: Apple Mail (2.3445.9.1) > > > > > > >> > > >> In bug report https://bugs.dpdk.org/show_bug.cgi?id=60 we have been > discussing > > >> issues related to events ending up in wrong ports after calling > > >> rte_event_port_unlink(). In addition of finding few bugs we have > identified a > > >> need for a new API call (or documentation extension) for an application > to be > > > > > > From HW perspective, documentation extension should be enough. adding > > > "there may be pre-scheduled events and the application is responsible to > process them" > > > on unlink(). Since dequeue() has which queue it is dequeue-ed from, the > > > application can allays make action based on that(i.e, Is the event > > > post/pre to unlink) > > > > At least in case of SW eventdev the problem is how the application can know > that > > it has processed all pre-scheduled events. E.g. dequeue may return nothing > but since > > the scheduler is running as a separate process events may still end up to > the unlinked > > port asynchronously. > > Can't we do, dequeue() in loop to get all the events from port. If > dequeue returns with zero event then ports are drained up. Right? Nope - because the scheduler might not have performed and "Acked" the unlink(), and internally it has *just* scheduled an event, but it wasn't available in the dequeue ring yet. Aka, its racy behavior - and we need a way to retrieve this "Unlink Ack" from the scheduler (which runs in another thread in event/sw). > > >> able to find out when an unlink() call has finished and no new events are > > >> scheduled anymore to the particular event port. This is required e.g. > when doing > > >> clean-up after an application thread stops processing events. > > > > > > If thread stopping then it better to call dev_stop(). At least in HW > > > implementation, > > > > For an application doing dynamic load balancing stopping the whole eventdev > is not an > > option. > > OK. Makes sense. Doing unlink() and link() in fastpath is not a > problem. Correct > Changing core assignment to event port is problem without stop(). I > guess, you > application or general would be OK with that constraint. I don't think that the eventdev API requires 1:1 Lcore / Port mapping, so really a PMD should be able to handle any thread calling any port. The event/sw PMD allows any thread to call dequeue/enqueue any port, so long as it is not being accessed by another thread. > > > A given event port assigned to a new lcore other than > > > it previous one then we need to do some clean up at port level. > > > > In my case I'm mapping an event port per thread statically (basically > thread_id == port_id), > > so this shouldn't be an issue. This is the common case - but I don't think we should demand it. There is a valid scale-down model which just polls *all* ports using a single lcore, instead of unlink() of multiple ports. For this "runtime scale down" use-case the missing information is being able to identify when an unlink is complete. After that (and ensuring the port buffer is empty) the application can be guaranteed that there are no more events going to be sent to that port, and the application can take the worker lcore out of its polling-loop and put it to sleep. As mentioned before, I think an "unlinks_in_progress()" function is perhaps the easiest way to achieve this functionality, as it allows relatively simple tracking of unlinks() using an atomic counter in sw. (Implementation details become complex when we have a separate core running event/sw, separate cores polling, and a control-plane thread calling unlink...) I think the end result we're hoping for is something like pseudo code below, (keep in mind that the event/sw has a service-core thread running it, so no application code there): int worker_poll = 1; worker() { while(worker_poll) { // eventdev_dequeue_burst() etc } go_to_sleep(1); } control_plane_scale_down() { unlink(evdev, worker, queue_id); while(unlinks_in_progress(evdev) > 0) usleep(100); /* here we know that the unlink is complete. * so we can now stop the worker from polling */ worker_poll = 0; } Hope my pseudo-code makes pseudo-sense :) -Harry
Re: [dpdk-dev] [PATCH] examples: remove Rx checksum offload
Monday, July 30, 2018 12:36 PM, Jerin Jacob: > > Subject: RE: [dpdk-dev] [PATCH] examples: remove Rx checksum offload > > > > > > Hi Jerin, > > Hi Konstantin, > > > > > > > > > As of now, application does not check PKT_RX_*_CKSUM_* flags per > > > packet, so it does not matter DEV_RX_OFFLOAD_CHECKSUM enabled or > not. > > > > > > Removing DEV_RX_OFFLOAD_CHECKSUM offload so that driver can save > a > > > few cycles if possible. > > > > Personally, I'd move in other direction: keep RX checksum offload and > > add checks inside sample apps to handle (drop) packets with invalid > checksum. > > OK. Till someones add the DROP logic in application, Can we take this patch? > Because there is no point in enabling DEV_RX_OFFLOAD_CHECKSUM without > DROP or any meaning full action in application. I agree w/ Constantine. If the l3fwd application targets to simulate vrouter which do IP forwarding validating the checksum is part of the required logic. The patch should not remove it rather add check on the mbuf checksum fields and drop the mbuf accordingly. > > > > Konstantin > > > > > > > > Signed-off-by: Jerin Jacob > > > --- > > > examples/ip_fragmentation/main.c| 3 +-- > > > examples/ip_reassembly/main.c | 3 +-- > > > examples/ipsec-secgw/ipsec-secgw.c | 3 +-- > > > examples/l3fwd-acl/main.c | 3 +-- > > > examples/l3fwd-power/main.c | 4 ++-- > > > examples/l3fwd-vf/main.c| 3 +-- > > > examples/l3fwd/main.c | 3 +-- > > > examples/load_balancer/init.c | 3 +-- > > > examples/multi_process/symmetric_mp/main.c | 3 +-- > > > examples/performance-thread/l3fwd-thread/main.c | 3 +-- > > > examples/qos_meter/main.c | 3 +-- > > > 11 files changed, 12 insertions(+), 22 deletions(-) > > > > > > diff --git a/examples/ip_fragmentation/main.c > > > b/examples/ip_fragmentation/main.c > > > index 5306d7672..7cbd5dc10 100644 > > > --- a/examples/ip_fragmentation/main.c > > > +++ b/examples/ip_fragmentation/main.c > > > @@ -140,8 +140,7 @@ static struct rte_eth_conf port_conf = { > > > .rxmode = { > > > .max_rx_pkt_len = JUMBO_FRAME_MAX_SIZE, > > > .split_hdr_size = 0, > > > - .offloads = (DEV_RX_OFFLOAD_CHECKSUM | > > > - DEV_RX_OFFLOAD_JUMBO_FRAME | > > > + .offloads = (DEV_RX_OFFLOAD_JUMBO_FRAME | > > >DEV_RX_OFFLOAD_CRC_STRIP), > > > }, > > > .txmode = { > > > diff --git a/examples/ip_reassembly/main.c > > > b/examples/ip_reassembly/main.c index b830f67a5..7acc6b7b5 100644 > > > --- a/examples/ip_reassembly/main.c > > > +++ b/examples/ip_reassembly/main.c > > > @@ -164,8 +164,7 @@ static struct rte_eth_conf port_conf = { > > > .mq_mode= ETH_MQ_RX_RSS, > > > .max_rx_pkt_len = JUMBO_FRAME_MAX_SIZE, > > > .split_hdr_size = 0, > > > - .offloads = (DEV_RX_OFFLOAD_CHECKSUM | > > > - DEV_RX_OFFLOAD_JUMBO_FRAME | > > > + .offloads = (DEV_RX_OFFLOAD_JUMBO_FRAME | > > >DEV_RX_OFFLOAD_CRC_STRIP), > > > }, > > > .rx_adv_conf = { > > > diff --git a/examples/ipsec-secgw/ipsec-secgw.c > > > b/examples/ipsec-secgw/ipsec-secgw.c > > > index b45b87bde..63eef1f26 100644 > > > --- a/examples/ipsec-secgw/ipsec-secgw.c > > > +++ b/examples/ipsec-secgw/ipsec-secgw.c > > > @@ -197,8 +197,7 @@ static struct rte_eth_conf port_conf = { > > > .mq_mode= ETH_MQ_RX_RSS, > > > .max_rx_pkt_len = ETHER_MAX_LEN, > > > .split_hdr_size = 0, > > > - .offloads = DEV_RX_OFFLOAD_CHECKSUM | > > > - DEV_RX_OFFLOAD_CRC_STRIP, > > > + .offloads = DEV_RX_OFFLOAD_CRC_STRIP, > > > }, > > > .rx_adv_conf = { > > > .rss_conf = { > > > diff --git a/examples/l3fwd-acl/main.c b/examples/l3fwd-acl/main.c > > > index 7c063a8d0..c66b5b462 100644 > > > --- a/examples/l3fwd-acl/main.c > > > +++ b/examples/l3fwd-acl/main.c > > > @@ -127,8 +127,7 @@ static struct rte_eth_conf port_conf = { > > > .mq_mode= ETH_MQ_RX_RSS, > > > .max_rx_pkt_len = ETHER_MAX_LEN, > > > .split_hdr_size = 0, > > > - .offloads = (DEV_RX_OFFLOAD_CRC_STRIP | > > > - DEV_RX_OFFLOAD_CHECKSUM), > > > + .offloads = DEV_RX_OFFLOAD_CRC_STRIP, > > > }, > > > .rx_adv_conf = { > > > .rss_conf = { > > > diff --git a/examples/l3fwd-power/main.c > > > b/examples/l3fwd-power/main.c index d15cd520e..b00a8ec45 100644 > > > --- a/examples/l3fwd-power/main.c > > > +++ b/examples/l3fwd-power/main.c > > > @@ -180,8 +180,8 @@ static struct rte_eth_conf port_conf = { > > > .mq_mode= ETH_MQ_RX_RSS, > > >
Re: [dpdk-dev] [PATCH] examples: remove Rx checksum offload
30/07/2018 11:35, Jerin Jacob: > From: "Ananyev, Konstantin" > > > > > > As of now, application does not check PKT_RX_*_CKSUM_* flags per > > > packet, so it does not matter DEV_RX_OFFLOAD_CHECKSUM enabled or not. > > > > > > Removing DEV_RX_OFFLOAD_CHECKSUM offload so that driver can save a few > > > cycles if possible. > > > > Personally, I'd move in other direction: keep RX checksum offload and add > > checks inside sample apps to handle (drop) packets with invalid checksum. > > OK. Till someones add the DROP logic in application, Can we take > this patch? Because there is no point in enabling DEV_RX_OFFLOAD_CHECKSUM > without DROP or any meaning full action in application. If there is no patch sent to use this offload on August 1st, then I will apply this patch to remove the offload request.
Re: [dpdk-dev] [PATCH v5] vfio: fix workaround of BAR mapping
On 30-Jul-18 9:51 AM, Burakov, Anatoly wrote: On 29-Jul-18 9:44 AM, Jerin Jacob wrote: -Original Message- Date: Thu, 26 Jul 2018 11:35:43 +0200 From: Thomas Monjalon To: Takeshi Yoshimura Cc: dev@dpdk.org, Anatoly Burakov Subject: Re: [dpdk-dev] [PATCH v5] vfio: fix workaround of BAR mapping 20/07/2018 10:13, Takeshi Yoshimura: Currently, VFIO will try to map around MSI-X table in the BARs. When MSI-X table (page-aligned) size is equal to (page-aligned) size of BAR, VFIO will just skip the BAR. Recent kernel versions will allow VFIO to map the entire BAR containing MSI-X tables (*), so instead of trying to map around the MSI-X vector or skipping the BAR entirely if it's not possible, we can now try mapping the entire BAR first. If mapping the entire BAR doesn't succeed, fall back to the old behavior of mapping around MSI-X table or skipping the BAR. (*): "vfio-pci: Allow mapping MSIX BAR", https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/ commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6 Fixes: 90a1633b2347 ("eal/linux: allow to map BARs with MSI-X tables") Signed-off-by: Takeshi Yoshimura Reviewed-by: Anatoly Burakov This change set breaks thunderx/octeontx platform with following error.(Tested with 4.9.0 kernel) EAL: probe driver: 177d:a034 net_thunderx EAL: using IOMMU type 1 (Type 1) EAL: pci_map_resource(): cannot mmap(44, 0x6020, 0x20, 0x400): Invalid argument (0x) EAL: PCI device 0001:01:00.2 on NUMA socket 0 EAL: probe driver: 177d:a034 net_thunderx EAL: pci_map_resource(): cannot mmap(47, 0x6060, 0x20, 0x400): Invalid argument (0x) According Linux kernel change, user space application suppose to use VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability to detect this feature to work < 4.15 kernel. Right? if so, Why we are doing this retry based logic? I don't think anything's broken there - just a gratuitous error message. But yes, i seem to have missed the region info flag. It was my suggestion to use the retry logic. I'll submit a patch fixing this. The patch to fix it involves way more work than i am comfortable with submitting to rc3, so i believe we should revert this patch and postpone the change to 18.11. -- Thanks, Anatoly
Re: [dpdk-dev] eventdev: method for finding out unlink status
> I don't think that the eventdev API requires 1:1 Lcore / Port mapping, so > really a > PMD should be able to handle any thread calling any port. > > The event/sw PMD allows any thread to call dequeue/enqueue any port, > so long as it is not being accessed by another thread. > > A given event port assigned to a new lcore other than it previous one then we need to do some clean up at port level. >>> >>> In my case I'm mapping an event port per thread statically (basically >> thread_id == port_id), >>> so this shouldn't be an issue. > > This is the common case - but I don't think we should demand it. > There is a valid scale-down model which just polls *all* ports using > a single lcore, instead of unlink() of multiple ports. I agree, 1 : 1 Lcore / Port mapping shouldn't be required. > I think the end result we're hoping for is something like pseudo code below, > (keep in mind that the event/sw has a service-core thread running it, so no > application code there): > > int worker_poll = 1; > > worker() { > while(worker_poll) { > // eventdev_dequeue_burst() etc > } > go_to_sleep(1); > } > > control_plane_scale_down() { > unlink(evdev, worker, queue_id); > while(unlinks_in_progress(evdev) > 0) > usleep(100); > > /* here we know that the unlink is complete. > * so we can now stop the worker from polling */ > worker_poll = 0; > } > > Hope my pseudo-code makes pseudo-sense :) Makes sense =) One use case this API wouldn't support is if an application would unlink only a subset of linked queues. In this case events could be still arriving constantly after unlinks_in_progress() returns zero, so there is no way for the application to know when all events from unlinked queues have been processed. However, at least in our application this information is not needed.
Re: [dpdk-dev] eventdev: method for finding out unlink status
-Original Message- > Date: Mon, 30 Jul 2018 09:38:01 + > From: "Van Haaren, Harry" > To: Jerin Jacob , "Elo, Matias (Nokia - > FI/Espoo)" > CC: "dev@dpdk.org" > Subject: RE: [dpdk-dev] eventdev: method for finding out unlink status > > > > From: Jerin Jacob [mailto:jerin.ja...@caviumnetworks.com] > > Sent: Monday, July 30, 2018 10:29 AM > > To: Elo, Matias (Nokia - FI/Espoo) > > Cc: dev@dpdk.org; Van Haaren, Harry > > Subject: Re: [dpdk-dev] eventdev: method for finding out unlink status > > > > -Original Message- > > > Date: Mon, 30 Jul 2018 09:17:47 + > > > From: "Elo, Matias (Nokia - FI/Espoo)" > > > To: Jerin Jacob > > > CC: "dev@dpdk.org" , "Van Haaren, Harry" > > > > > > Subject: Re: [dpdk-dev] eventdev: method for finding out unlink status > > > x-mailer: Apple Mail (2.3445.9.1) > > > > > > > > > >> > > > >> In bug report https://bugs.dpdk.org/show_bug.cgi?id=60 we have been > > discussing > > > >> issues related to events ending up in wrong ports after calling > > > >> rte_event_port_unlink(). In addition of finding few bugs we have > > identified a > > > >> need for a new API call (or documentation extension) for an application > > to be > > > > > > > > From HW perspective, documentation extension should be enough. adding > > > > "there may be pre-scheduled events and the application is responsible to > > process them" > > > > on unlink(). Since dequeue() has which queue it is dequeue-ed from, the > > > > application can allays make action based on that(i.e, Is the event > > > > post/pre to unlink) > > > > > > At least in case of SW eventdev the problem is how the application can > > > know > > that > > > it has processed all pre-scheduled events. E.g. dequeue may return nothing > > but since > > > the scheduler is running as a separate process events may still end up to > > the unlinked > > > port asynchronously. > > > > Can't we do, dequeue() in loop to get all the events from port. If > > dequeue returns with zero event then ports are drained up. Right? > > Nope - because the scheduler might not have performed and "Acked" the > unlink(), and internally it has *just* scheduled an event, but it wasn't > available in the dequeue ring yet. > > Aka, its racy behavior - and we need a way to retrieve this "Unlink Ack" > from the scheduler (which runs in another thread in event/sw). OK. Some bits specific to event/sw. We will address it. > > > > > >> able to find out when an unlink() call has finished and no new events > > > >> are > > > >> scheduled anymore to the particular event port. This is required e.g. > > when doing > > > >> clean-up after an application thread stops processing events. > > > > > > > > If thread stopping then it better to call dev_stop(). At least in HW > > > > implementation, > > > > > > For an application doing dynamic load balancing stopping the whole > > > eventdev > > is not an > > > option. > > > > OK. Makes sense. Doing unlink() and link() in fastpath is not a > > problem. > > Correct > > > > Changing core assignment to event port is problem without stop(). I > > guess, you > > application or general would be OK with that constraint. > > > I don't think that the eventdev API requires 1:1 Lcore / Port mapping, so > really a > PMD should be able to handle any thread calling any port. > > The event/sw PMD allows any thread to call dequeue/enqueue any port, > so long as it is not being accessed by another thread. Yes. True. Eventdev API does not required 1:1 Lcore/Port mapping. Just like event/sw requires some bits to clear "Unlink Ack". At least, our HW implementation we need some bit clear when we change lcore to port mapping. Currently we are doing it in stop() call, If there is a real valid use case to change lcore to port mapping without stop, we would like to propose and API to flush/clear state on Lcore/port mapping change. It can be NOP for event/sw. > > > > > > A given event port assigned to a new lcore other than > > > > it previous one then we need to do some clean up at port level. > > > > > > In my case I'm mapping an event port per thread statically (basically > > thread_id == port_id), > > > so this shouldn't be an issue. > > This is the common case - but I don't think we should demand it. > There is a valid scale-down model which just polls *all* ports using > a single lcore, instead of unlink() of multiple ports. > > > For this "runtime scale down" use-case the missing information is being > able to identify when an unlink is complete. After that (and ensuring the > port buffer is empty) the application can be guaranteed that there are no > more events going to be sent to that port, and the application can take > the worker lcore out of its polling-loop and put it to sleep. > > As mentioned before, I think an "unlinks_in_progress()" function is perhaps > the easiest way to achieve this functionality, as it allows relatively simple > tracking of unlinks() using an atomic counter in sw. (Implementatio
[dpdk-dev] [PATCH] vfio: revert retry logic for MSI-X BAR mapping
This reverts commit d4774a568ba0a5923229974a002972c83eb04570. The patch is incomplete because kernel 4.16+, while being capable of mapping MSI-X BARs, will also report if such a capability is available. Without checking this capability, gratuitous errors are displayed on kernels <4.16 while VFIO is attempting to mmap MSI-X BAR and fails, which can be confusing to the user. Signed-off-by: Anatoly Burakov --- drivers/bus/pci/linux/pci_vfio.c | 93 ++-- 1 file changed, 41 insertions(+), 52 deletions(-) diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c index bb6ef7b67..686386d6a 100644 --- a/drivers/bus/pci/linux/pci_vfio.c +++ b/drivers/bus/pci/linux/pci_vfio.c @@ -332,59 +332,50 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct mapped_pci_resource *vfio_res, void *bar_addr; struct pci_msix_table *msix_table = &vfio_res->msix_table; struct pci_map *bar = &vfio_res->maps[bar_index]; - bool again = false; if (bar->size == 0) /* Skip this BAR */ return 0; + if (msix_table->bar_index == bar_index) { + /* +* VFIO will not let us map the MSI-X table, +* but we can map around it. +*/ + uint32_t table_start = msix_table->offset; + uint32_t table_end = table_start + msix_table->size; + table_end = (table_end + ~PAGE_MASK) & PAGE_MASK; + table_start &= PAGE_MASK; + + if (table_start == 0 && table_end >= bar->size) { + /* Cannot map this BAR */ + RTE_LOG(DEBUG, EAL, "Skipping BAR%d\n", bar_index); + bar->size = 0; + bar->addr = 0; + return 0; + } + + memreg[0].offset = bar->offset; + memreg[0].size = table_start; + memreg[1].offset = bar->offset + table_end; + memreg[1].size = bar->size - table_end; + + RTE_LOG(DEBUG, EAL, + "Trying to map BAR%d that contains the MSI-X " + "table. Trying offsets: " + "0x%04lx:0x%04lx, 0x%04lx:0x%04lx\n", bar_index, + memreg[0].offset, memreg[0].size, + memreg[1].offset, memreg[1].size); + } else { + memreg[0].offset = bar->offset; + memreg[0].size = bar->size; + } + /* reserve the address using an inaccessible mapping */ bar_addr = mmap(bar->addr, bar->size, 0, MAP_PRIVATE | MAP_ANONYMOUS | additional_flags, -1, 0); - if (bar_addr == MAP_FAILED) { - RTE_LOG(ERR, EAL, - "Failed to create inaccessible mapping for BAR%d\n", - bar_index); - return -1; - } - - memreg[0].offset = bar->offset; - memreg[0].size = bar->size; - do { + if (bar_addr != MAP_FAILED) { void *map_addr = NULL; - if (again) { - /* -* VFIO did not let us map the MSI-X table, -* but we can map around it. -*/ - uint32_t table_start = msix_table->offset; - uint32_t table_end = table_start + msix_table->size; - table_end = (table_end + ~PAGE_MASK) & PAGE_MASK; - table_start &= PAGE_MASK; - - if (table_start == 0 && table_end >= bar->size) { - /* Cannot map this BAR */ - RTE_LOG(DEBUG, EAL, "Skipping BAR%d\n", - bar_index); - munmap(bar_addr, bar->size); - bar->size = 0; - bar->addr = 0; - return 0; - } - - memreg[0].offset = bar->offset; - memreg[0].size = table_start; - memreg[1].offset = bar->offset + table_end; - memreg[1].size = bar->size - table_end; - - RTE_LOG(DEBUG, EAL, - "Trying to map BAR%d that contains the MSI-X " - "table. Trying offsets: " - "0x%04lx:0x%04lx, 0x%04lx:0x%04lx\n", bar_index, - memreg[0].offset, memreg[0].size, - memreg[1].offset, memreg[1].size); - } - if (memreg[0].size) { /* actual map of first part */ map_addr = pci_map_resource(bar_addr, vfio_dev_fd, @@ -393,12 +384,6 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct mapped_pci_resource *vfio_
Re: [dpdk-dev] [PATCH] examples: remove Rx checksum offload
> -Original Message- > From: Thomas Monjalon [mailto:tho...@monjalon.net] > Sent: Monday, July 30, 2018 10:51 AM > To: Jerin Jacob ; Ananyev, Konstantin > > Cc: dev@dpdk.org; Yigit, Ferruh ; shah...@mellanox.com > Subject: Re: [dpdk-dev] [PATCH] examples: remove Rx checksum offload > > 30/07/2018 11:35, Jerin Jacob: > > From: "Ananyev, Konstantin" > > > > > > > > As of now, application does not check PKT_RX_*_CKSUM_* flags per > > > > packet, so it does not matter DEV_RX_OFFLOAD_CHECKSUM enabled or not. > > > > > > > > Removing DEV_RX_OFFLOAD_CHECKSUM offload so that driver can save a few > > > > cycles if possible. > > > > > > Personally, I'd move in other direction: keep RX checksum offload and add > > > checks inside sample apps to handle (drop) packets with invalid checksum. > > > > OK. Till someones add the DROP logic in application, Can we take > > this patch? Because there is no point in enabling DEV_RX_OFFLOAD_CHECKSUM > > without DROP or any meaning full action in application. Probably, but at least it gives users a right estimation how long the proper RX/TX routine would take. >From other side what the point to disable these flags now, if we know that we are doing wrong thing and will have to re-enable them again in future? > > If there is no patch sent to use this offload on August 1st, > then I will apply this patch to remove the offload request. > Isn't it too late to do such things right now? We are in RC3 stage and doesn't look like a critical issue. Konstantin
[dpdk-dev] [PATCH 18.11] pci/vfio: allow mapping MSI-X BARs if kernel allows it
Currently, DPDK will skip mapping some areas (or even an entire BAR) if MSI-X happens to be in it but is smaller than page address. Kernels 4.16+ will allow mapping MSI-X BARs [1], and will report this as a capability flag. Capability flags themselves are also only supported since kernel 4.6 [2]. This commit will introduce support for checking VFIO capabilities, and will use it to check if we are allowed to map BARs with MSI-X tables in them, along with backwards compatibility for older kernels, including a workaround for a variable rename in VFIO region info structure [3]. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/ linux.git/commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6 [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/ linux.git/commit/?id=c84982adb23bcf3b99b79ca33527cd2625fbe279 [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/ linux.git/commit/?id=ff63eb638d63b95e489f976428f1df01391e15e4 Signed-off-by: Anatoly Burakov --- drivers/bus/pci/linux/pci_vfio.c | 127 --- lib/librte_eal/common/include/rte_vfio.h | 26 + 2 files changed, 140 insertions(+), 13 deletions(-) diff --git a/drivers/bus/pci/linux/pci_vfio.c b/drivers/bus/pci/linux/pci_vfio.c index 686386d6a..e7765ee11 100644 --- a/drivers/bus/pci/linux/pci_vfio.c +++ b/drivers/bus/pci/linux/pci_vfio.c @@ -415,6 +415,88 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct mapped_pci_resource *vfio_res, return 0; } +/* + * region info may contain capability headers, so we need to keep reallocating + * the memory until we match allocated memory size with argsz. + */ +static int +pci_vfio_get_region_info(int vfio_dev_fd, struct vfio_region_info **info, + int region) +{ + struct vfio_region_info *ri; + size_t argsz = sizeof(*ri); + int ret; + + ri = malloc(sizeof(*ri)); + if (ri == NULL) { + RTE_LOG(ERR, EAL, "Cannot allocate memory for region info\n"); + return -1; + } +again: + memset(ri, 0, argsz); + ri->argsz = argsz; + ri->index = region; + + ret = ioctl(vfio_dev_fd, VFIO_DEVICE_GET_REGION_INFO, info); + if (ret) { + free(ri); + return ret; + } + if (ri->argsz != argsz) { + argsz = ri->argsz; + ri = realloc(ri, argsz); + + if (ri == NULL) { + RTE_LOG(ERR, EAL, "Cannot reallocate memory for region info\n"); + return -1; + } + goto again; + } + *info = ri; + + return 0; +} + +static struct vfio_info_cap_header * +pci_vfio_info_cap(struct vfio_region_info *info, int cap) +{ + struct vfio_info_cap_header *h; + size_t offset; + + if ((info->flags & RTE_VFIO_INFO_FLAG_CAPS) == 0) { + /* VFIO info does not advertise capabilities */ + return NULL; + } + + offset = VFIO_CAP_OFFSET(info); + while (offset != 0) { + h = RTE_PTR_ADD(info, offset); + if (h->id == cap) + return h; + offset = h->next; + } + return NULL; +} + +static int +pci_vfio_msix_is_mappable(int vfio_dev_fd, int msix_region) +{ + struct vfio_region_info *info; + int ret; + + ret = pci_vfio_get_region_info(vfio_dev_fd, &info, msix_region); + if (ret < 0) + return -1; + + ret = pci_vfio_info_cap(info, RTE_VFIO_CAP_MSIX_MAPPABLE) != NULL; + + /* cleanup */ + free(info); + + return ret; +} + + static int pci_vfio_map_resource_primary(struct rte_pci_device *dev) { @@ -464,56 +546,75 @@ pci_vfio_map_resource_primary(struct rte_pci_device *dev) if (ret < 0) { RTE_LOG(ERR, EAL, " %s cannot get MSI-X BAR number!\n", pci_addr); - goto err_vfio_dev_fd; + goto err_vfio_res; + } + /* if we found our MSI-X BAR region, check if we can mmap it */ + if (vfio_res->msix_table.bar_index != -1) { + int ret = pci_vfio_msix_is_mappable(vfio_dev_fd, + vfio_res->msix_table.bar_index); + if (ret < 0) { + RTE_LOG(ERR, EAL, "Couldn't check if MSI-X BAR is mappable\n"); + goto err_vfio_res; + } else if (ret != 0) { + /* we can map it, so we don't care where it is */ + RTE_LOG(DEBUG, EAL, "VFIO reports MSI-X BAR as mappable\n"); + vfio_res->msix_table.bar_index = -1; + } } for (i = 0; i < (int) vfio_res->nb_maps; i++) { - struct vfio_region_info reg = { .argsz = sizeof(reg) }; + struct vfio_region_info *reg; void *bar_addr; - reg.index = i; - - ret = ioctl(vfio_dev_fd,
Re: [dpdk-dev] [PATCH] examples: remove Rx checksum offload
-Original Message- > Date: Mon, 30 Jul 2018 11:00:02 + > From: "Ananyev, Konstantin" > To: Thomas Monjalon , Jerin Jacob > > CC: "dev@dpdk.org" , "Yigit, Ferruh" > , "shah...@mellanox.com" > Subject: RE: [dpdk-dev] [PATCH] examples: remove Rx checksum offload > > External Email > > > -Original Message- > > From: Thomas Monjalon [mailto:tho...@monjalon.net] > > Sent: Monday, July 30, 2018 10:51 AM > > To: Jerin Jacob ; Ananyev, Konstantin > > > > Cc: dev@dpdk.org; Yigit, Ferruh ; > > shah...@mellanox.com > > Subject: Re: [dpdk-dev] [PATCH] examples: remove Rx checksum offload > > > > 30/07/2018 11:35, Jerin Jacob: > > > From: "Ananyev, Konstantin" > > > > > > > > > > As of now, application does not check PKT_RX_*_CKSUM_* flags per > > > > > packet, so it does not matter DEV_RX_OFFLOAD_CHECKSUM enabled or not. > > > > > > > > > > Removing DEV_RX_OFFLOAD_CHECKSUM offload so that driver can save a few > > > > > cycles if possible. > > > > > > > > Personally, I'd move in other direction: keep RX checksum offload and > > > > add > > > > checks inside sample apps to handle (drop) packets with invalid > > > > checksum. > > > > > > OK. Till someones add the DROP logic in application, Can we take > > > this patch? Because there is no point in enabling DEV_RX_OFFLOAD_CHECKSUM > > > without DROP or any meaning full action in application. > > Probably, but at least it gives users a right estimation how long the proper > RX/TX routine would take. For estimation, application can add any flag they want in local setup. It does not need to be upstream with out feature complete. > From other side what the point to disable these flags now, if we know that At least nicvf Rx routines are crafted based DEV_RX_OFFLOAD_CHECKSUM flags. If driver Rx routine crafted such case it will be useful. > we are doing wrong thing and will have to re-enable them again in future? But it is not correct now either. Right? > > > > > If there is no patch sent to use this offload on August 1st, > > then I will apply this patch to remove the offload request. > > > > Isn't it too late to do such things right now? > We are in RC3 stage and doesn't look like a critical issue. Yes. We can add it when have we proper fix. Currently, it signaling a wrong interpretation to application. > Konstantin > >
Re: [dpdk-dev] eventdev: method for finding out unlink status
>> For this "runtime scale down" use-case the missing information is being >> able to identify when an unlink is complete. After that (and ensuring the >> port buffer is empty) the application can be guaranteed that there are no >> more events going to be sent to that port, and the application can take >> the worker lcore out of its polling-loop and put it to sleep. >> >> As mentioned before, I think an "unlinks_in_progress()" function is perhaps >> the easiest way to achieve this functionality, as it allows relatively simple >> tracking of unlinks() using an atomic counter in sw. (Implementation details >> become complex when we have a separate core running event/sw, separate cores >> polling, and a control-plane thread calling unlink...) >> >> I think the end result we're hoping for is something like pseudo code below, >> (keep in mind that the event/sw has a service-core thread running it, so no >> application code there): >> >> int worker_poll = 1; >> >> worker() { >> while(worker_poll) { >> // eventdev_dequeue_burst() etc >> } >> go_to_sleep(1); >> } >> >> control_plane_scale_down() { >> unlink(evdev, worker, queue_id); >> while(unlinks_in_progress(evdev) > 0) >> usleep(100); >> >> /* here we know that the unlink is complete. >> * so we can now stop the worker from polling */ >> worker_poll = 0; >> } > > > Make sense. Instead of rte_event_is_unlink_in_progress(), How about > adding a callback in rte_event_port_unlink() which will be called on > unlink completion. It will reduce the need for ONE more API. > > Anyway it RC2 now, so we can not accept a new feature. So we will have > time for deprecation notice. > Both solutions should work but I would perhaps favor Harry's approach as it requires less code in the application side and doesn't break backward compatibility.
Re: [dpdk-dev] [PATCH] examples: remove Rx checksum offload
> > -Original Message- > > Date: Mon, 30 Jul 2018 11:00:02 + > > From: "Ananyev, Konstantin" > > To: Thomas Monjalon , Jerin Jacob > > > > CC: "dev@dpdk.org" , "Yigit, Ferruh" > > , "shah...@mellanox.com" > > Subject: RE: [dpdk-dev] [PATCH] examples: remove Rx checksum offload > > > > External Email > > > > > -Original Message- > > > From: Thomas Monjalon [mailto:tho...@monjalon.net] > > > Sent: Monday, July 30, 2018 10:51 AM > > > To: Jerin Jacob ; Ananyev, Konstantin > > > > > > Cc: dev@dpdk.org; Yigit, Ferruh ; > > > shah...@mellanox.com > > > Subject: Re: [dpdk-dev] [PATCH] examples: remove Rx checksum offload > > > > > > 30/07/2018 11:35, Jerin Jacob: > > > > From: "Ananyev, Konstantin" > > > > > > > > > > > > As of now, application does not check PKT_RX_*_CKSUM_* flags per > > > > > > packet, so it does not matter DEV_RX_OFFLOAD_CHECKSUM enabled or > > > > > > not. > > > > > > > > > > > > Removing DEV_RX_OFFLOAD_CHECKSUM offload so that driver can save a > > > > > > few > > > > > > cycles if possible. > > > > > > > > > > Personally, I'd move in other direction: keep RX checksum offload and > > > > > add > > > > > checks inside sample apps to handle (drop) packets with invalid > > > > > checksum. > > > > > > > > OK. Till someones add the DROP logic in application, Can we take > > > > this patch? Because there is no point in enabling > > > > DEV_RX_OFFLOAD_CHECKSUM > > > > without DROP or any meaning full action in application. > > > > Probably, but at least it gives users a right estimation how long the proper > > RX/TX routine would take. > > For estimation, application can add any flag they want in local setup. > It does not need to be upstream with out feature complete. > > > From other side what the point to disable these flags now, if we know that > > At least nicvf Rx routines are crafted based DEV_RX_OFFLOAD_CHECKSUM > flags. If driver Rx routine crafted such case it will be useful. > > > we are doing wrong thing and will have to re-enable them again in future? > > But it is not correct now either. Right? Yes, right now invalid cksum information is simply ignored. As you pointed - some PMD select RX routine based on checksum offload flags. Yes, removing these flags might produce better performance numbers. But from my perspective - it would be an artificial and temporary improvement, as for l3fwd like apps we'll need to revert it back and add code to drop invalid packets. Konstantin > > > > > > > > > If there is no patch sent to use this offload on August 1st, > > > then I will apply this patch to remove the offload request. > > > > > > > Isn't it too late to do such things right now? > > We are in RC3 stage and doesn't look like a critical issue. > > Yes. We can add it when have we proper fix. Currently, it signaling a wrong > interpretation to application. > > > > Konstantin > > > >
Re: [dpdk-dev] eventdev: method for finding out unlink status
-Original Message- > Date: Mon, 30 Jul 2018 13:36:35 + > From: "Elo, Matias (Nokia - FI/Espoo)" > To: Jerin Jacob > CC: "Van Haaren, Harry" , "dev@dpdk.org" > > Subject: Re: [dpdk-dev] eventdev: method for finding out unlink status > x-mailer: Apple Mail (2.3445.9.1) > > > >> For this "runtime scale down" use-case the missing information is being > >> able to identify when an unlink is complete. After that (and ensuring the > >> port buffer is empty) the application can be guaranteed that there are no > >> more events going to be sent to that port, and the application can take > >> the worker lcore out of its polling-loop and put it to sleep. > >> > >> As mentioned before, I think an "unlinks_in_progress()" function is perhaps > >> the easiest way to achieve this functionality, as it allows relatively > >> simple > >> tracking of unlinks() using an atomic counter in sw. (Implementation > >> details > >> become complex when we have a separate core running event/sw, separate > >> cores > >> polling, and a control-plane thread calling unlink...) > >> > >> I think the end result we're hoping for is something like pseudo code > >> below, > >> (keep in mind that the event/sw has a service-core thread running it, so no > >> application code there): > >> > >> int worker_poll = 1; > >> > >> worker() { > >> while(worker_poll) { > >> // eventdev_dequeue_burst() etc > >> } > >> go_to_sleep(1); > >> } > >> > >> control_plane_scale_down() { > >> unlink(evdev, worker, queue_id); > >> while(unlinks_in_progress(evdev) > 0) > >> usleep(100); > >> > >> /* here we know that the unlink is complete. > >> * so we can now stop the worker from polling */ > >> worker_poll = 0; > >> } > > > > > > Make sense. Instead of rte_event_is_unlink_in_progress(), How about > > adding a callback in rte_event_port_unlink() which will be called on > > unlink completion. It will reduce the need for ONE more API. > > > > Anyway it RC2 now, so we can not accept a new feature. So we will have > > time for deprecation notice. > > > > Both solutions should work but I would perhaps favor Harry's approach as it > requires less code in the application side and doesn't break backward > compatibility. OK. Does rte_event_port_unlink() returning -EBUSY will help? while (rte_event_port_unlink() != nr_links) usleep(100); I am trying to think, how can address this requirements without creating new API and/or less impact to other drivers which don't have this requirements? Are we calling this API in fastpath? or it is control thread as mentioned in harry's pseudo code.
Re: [dpdk-dev] [PATCH] examples: remove Rx checksum offload
-Original Message- > Date: Mon, 30 Jul 2018 14:12:12 + > From: "Ananyev, Konstantin" > To: Jerin Jacob > CC: Thomas Monjalon , "dev@dpdk.org" , > "Yigit, Ferruh" , "shah...@mellanox.com" > > Subject: RE: [dpdk-dev] [PATCH] examples: remove Rx checksum offload > > > > > > -Original Message- > > > Date: Mon, 30 Jul 2018 11:00:02 + > > > From: "Ananyev, Konstantin" > > > To: Thomas Monjalon , Jerin Jacob > > > > > > CC: "dev@dpdk.org" , "Yigit, Ferruh" > > > , "shah...@mellanox.com" > > > Subject: RE: [dpdk-dev] [PATCH] examples: remove Rx checksum offload > > > > > > External Email > > > > > > > -Original Message- > > > > From: Thomas Monjalon [mailto:tho...@monjalon.net] > > > > Sent: Monday, July 30, 2018 10:51 AM > > > > To: Jerin Jacob ; Ananyev, Konstantin > > > > > > > > Cc: dev@dpdk.org; Yigit, Ferruh ; > > > > shah...@mellanox.com > > > > Subject: Re: [dpdk-dev] [PATCH] examples: remove Rx checksum offload > > > > > > > > 30/07/2018 11:35, Jerin Jacob: > > > > > From: "Ananyev, Konstantin" > > > > > > > > > > > > > > As of now, application does not check PKT_RX_*_CKSUM_* flags per > > > > > > > packet, so it does not matter DEV_RX_OFFLOAD_CHECKSUM enabled or > > > > > > > not. > > > > > > > > > > > > > > Removing DEV_RX_OFFLOAD_CHECKSUM offload so that driver can save > > > > > > > a few > > > > > > > cycles if possible. > > > > > > > > > > > > Personally, I'd move in other direction: keep RX checksum offload > > > > > > and add > > > > > > checks inside sample apps to handle (drop) packets with invalid > > > > > > checksum. > > > > > > > > > > OK. Till someones add the DROP logic in application, Can we take > > > > > this patch? Because there is no point in enabling > > > > > DEV_RX_OFFLOAD_CHECKSUM > > > > > without DROP or any meaning full action in application. > > > > > > Probably, but at least it gives users a right estimation how long the > > > proper > > > RX/TX routine would take. > > > > For estimation, application can add any flag they want in local setup. > > It does not need to be upstream with out feature complete. > > > > > From other side what the point to disable these flags now, if we know that > > > > At least nicvf Rx routines are crafted based DEV_RX_OFFLOAD_CHECKSUM > > flags. If driver Rx routine crafted such case it will be useful. > > > > > we are doing wrong thing and will have to re-enable them again in future? > > > > But it is not correct now either. Right? > > Yes, right now invalid cksum information is simply ignored. > As you pointed - some PMD select RX routine based on checksum offload flags. > Yes, removing these flags might produce better performance numbers. > But from my perspective - it would be an artificial and temporary improvement, > as for l3fwd like apps we'll need to revert it back and add code to drop > invalid packets. IMO, It is OK get a performance hit when do that support in l3fwd. There is no harm in removing the DEV_RX_OFFLOAD_CHECKSUM flag for now and it is correct from application perspective.(you are enabling an offload when you are using it, else don't enable it. I believe, this was philosophy for enabling Rx/Tx offloads) Since it is going in circles, I leave decision to ethdev maintainers. > Konstantin > > > > > > > > > > > > > > If there is no patch sent to use this offload on August 1st, > > > > then I will apply this patch to remove the offload request. > > > > > > > > > > Isn't it too late to do such things right now? > > > We are in RC3 stage and doesn't look like a critical issue. > > > > Yes. We can add it when have we proper fix. Currently, it signaling a wrong > > interpretation to application. > > > > > > > Konstantin > > > > > >
Re: [dpdk-dev] [PATCH] examples/multi_process: remove l2fwd fork example
> -Original Message- > From: Burakov, Anatoly > Sent: Monday, July 30, 2018 3:53 AM > To: Eads, Gage ; dev@dpdk.org > Subject: Re: [PATCH] examples/multi_process: remove l2fwd fork example > > On 27-Jul-18 7:56 PM, Gage Eads wrote: > > l2fwd_fork relies on a multiprocess model that that DPDK does not > > support (calling rte_eal_init() before fork()), in particular in light > > of recent EAL changes like the multiproess communication channel. > > > > This example can mislead users into thinking this is a supported > > multiprocess model; hence, this commit removes this example and the > > corresponding user guide documentation as well. > > > > Signed-off-by: Gage Eads > > --- > > Hi Gage, > > Maybe the commit message should reference the earlier discussion on the > mailing list. Certainly, will do. Thanks, Gage > > -- > Thanks, > Anatoly
Re: [dpdk-dev] [PATCH] examples: remove Rx checksum offload
On 30.07.2018 17:40, Jerin Jacob wrote: -Original Message- Date: Mon, 30 Jul 2018 14:12:12 + From: "Ananyev, Konstantin" To: Jerin Jacob CC: Thomas Monjalon , "dev@dpdk.org" , "Yigit, Ferruh" , "shah...@mellanox.com" Subject: RE: [dpdk-dev] [PATCH] examples: remove Rx checksum offload -Original Message- Date: Mon, 30 Jul 2018 11:00:02 + From: "Ananyev, Konstantin" To: Thomas Monjalon , Jerin Jacob CC: "dev@dpdk.org" , "Yigit, Ferruh" , "shah...@mellanox.com" Subject: RE: [dpdk-dev] [PATCH] examples: remove Rx checksum offload External Email -Original Message- From: Thomas Monjalon [mailto:tho...@monjalon.net] Sent: Monday, July 30, 2018 10:51 AM To: Jerin Jacob ; Ananyev, Konstantin Cc: dev@dpdk.org; Yigit, Ferruh ; shah...@mellanox.com Subject: Re: [dpdk-dev] [PATCH] examples: remove Rx checksum offload 30/07/2018 11:35, Jerin Jacob: From: "Ananyev, Konstantin" As of now, application does not check PKT_RX_*_CKSUM_* flags per packet, so it does not matter DEV_RX_OFFLOAD_CHECKSUM enabled or not. Removing DEV_RX_OFFLOAD_CHECKSUM offload so that driver can save a few cycles if possible. Personally, I'd move in other direction: keep RX checksum offload and add checks inside sample apps to handle (drop) packets with invalid checksum. OK. Till someones add the DROP logic in application, Can we take this patch? Because there is no point in enabling DEV_RX_OFFLOAD_CHECKSUM without DROP or any meaning full action in application. Probably, but at least it gives users a right estimation how long the proper RX/TX routine would take. For estimation, application can add any flag they want in local setup. It does not need to be upstream with out feature complete. From other side what the point to disable these flags now, if we know that At least nicvf Rx routines are crafted based DEV_RX_OFFLOAD_CHECKSUM flags. If driver Rx routine crafted such case it will be useful. we are doing wrong thing and will have to re-enable them again in future? But it is not correct now either. Right? Yes, right now invalid cksum information is simply ignored. As you pointed - some PMD select RX routine based on checksum offload flags. Yes, removing these flags might produce better performance numbers. But from my perspective - it would be an artificial and temporary improvement, as for l3fwd like apps we'll need to revert it back and add code to drop invalid packets. IMO, It is OK get a performance hit when do that support in l3fwd. There is no harm in removing the DEV_RX_OFFLOAD_CHECKSUM flag for now and it is correct from application perspective.(you are enabling an offload when you are using it, else don't enable it. I believe, this was philosophy for enabling Rx/Tx offloads) Since it is going in circles, I leave decision to ethdev maintainers. I think that IPv4 checksum offload is essential for l3fwd. So, it should be enabled and taken into account. I'm not sure about TCP and UDP checksum offloads. It is not l3fwd business to take a look at upper layers. In any case, there is no agreement on the patch and it is already RC3 stage of the release. There is no rush to apply it since it is not a critical bug fix. I agree with Konstantin here. Andrew Konstantin If there is no patch sent to use this offload on August 1st, then I will apply this patch to remove the offload request. Isn't it too late to do such things right now? We are in RC3 stage and doesn't look like a critical issue. Yes. We can add it when have we proper fix. Currently, it signaling a wrong interpretation to application. Konstantin
Re: [dpdk-dev] eventdev: method for finding out unlink status
On 30 Jul 16:06, Jerin Jacob wrote: > -Original Message- > > Date: Mon, 30 Jul 2018 09:38:01 + > > From: "Van Haaren, Harry" > > To: Jerin Jacob , "Elo, Matias (Nokia - > > FI/Espoo)" > > CC: "dev@dpdk.org" > > Subject: RE: [dpdk-dev] eventdev: method for finding out unlink status > > > > > > > From: Jerin Jacob [mailto:jerin.ja...@caviumnetworks.com] > > > Sent: Monday, July 30, 2018 10:29 AM > > > To: Elo, Matias (Nokia - FI/Espoo) > > > Cc: dev@dpdk.org; Van Haaren, Harry > > > Subject: Re: [dpdk-dev] eventdev: method for finding out unlink status > > > > > > -Original Message- > > > > Date: Mon, 30 Jul 2018 09:17:47 + > > > > From: "Elo, Matias (Nokia - FI/Espoo)" > > > > To: Jerin Jacob > > > > CC: "dev@dpdk.org" , "Van Haaren, Harry" > > > > > > > > Subject: Re: [dpdk-dev] eventdev: method for finding out unlink status > > > > x-mailer: Apple Mail (2.3445.9.1) > > > > > > > > > > > > >> > > > > >> In bug report https://bugs.dpdk.org/show_bug.cgi?id=60 we have been > > > discussing > > > > >> issues related to events ending up in wrong ports after calling > > > > >> rte_event_port_unlink(). In addition of finding few bugs we have > > > identified a > > > > >> need for a new API call (or documentation extension) for an > > > > >> application > > > to be > > > > > > > > > > From HW perspective, documentation extension should be enough. adding > > > > > "there may be pre-scheduled events and the application is responsible > > > > > to > > > process them" > > > > > on unlink(). Since dequeue() has which queue it is dequeue-ed from, > > > > > the > > > > > application can allays make action based on that(i.e, Is the event > > > > > post/pre to unlink) > > > > > > > > At least in case of SW eventdev the problem is how the application can > > > > know > > > that > > > > it has processed all pre-scheduled events. E.g. dequeue may return > > > > nothing > > > but since > > > > the scheduler is running as a separate process events may still end up > > > > to > > > the unlinked > > > > port asynchronously. > > > > > > Can't we do, dequeue() in loop to get all the events from port. If > > > dequeue returns with zero event then ports are drained up. Right? > > > > Nope - because the scheduler might not have performed and "Acked" the > > unlink(), and internally it has *just* scheduled an event, but it wasn't > > available in the dequeue ring yet. > > > > Aka, its racy behavior - and we need a way to retrieve this "Unlink Ack" > > from the scheduler (which runs in another thread in event/sw). > > OK. Some bits specific to event/sw. We will address it. BTW: OPDL is not support unlink in runtime. so if we need suggest user do a query to the CAP bits first. > > > > > > > > > >> able to find out when an unlink() call has finished and no new > > > > >> events are > > > > >> scheduled anymore to the particular event port. This is required e.g. > > > when doing > > > > >> clean-up after an application thread stops processing events. > > > > > > > > > > If thread stopping then it better to call dev_stop(). At least in HW > > > > > implementation, > > > > > > > > For an application doing dynamic load balancing stopping the whole > > > > eventdev > > > is not an > > > > option. > > > > > > OK. Makes sense. Doing unlink() and link() in fastpath is not a > > > problem. > > > > Correct > > > > > > > Changing core assignment to event port is problem without stop(). I > > > guess, you > > > application or general would be OK with that constraint. > > > > > > I don't think that the eventdev API requires 1:1 Lcore / Port mapping, so > > really a > > PMD should be able to handle any thread calling any port. > > > > The event/sw PMD allows any thread to call dequeue/enqueue any port, > > so long as it is not being accessed by another thread. > > Yes. True. Eventdev API does not required 1:1 Lcore/Port mapping. > Just like event/sw requires some bits to clear "Unlink Ack". At least, > our HW implementation we need some bit clear when we change lcore to port > mapping. Currently we are doing it in stop() call, If there is a real valid > use > case to change lcore to port mapping without stop, we would like to > propose and API to flush/clear state on Lcore/port mapping change. > It can be NOP for event/sw. > > > > > > > > > > A given event port assigned to a new lcore other than > > > > > it previous one then we need to do some clean up at port level. > > > > > > > > In my case I'm mapping an event port per thread statically (basically > > > thread_id == port_id), > > > > so this shouldn't be an issue. > > > > This is the common case - but I don't think we should demand it. > > There is a valid scale-down model which just polls *all* ports using > > a single lcore, instead of unlink() of multiple ports. > > > > > > For this "runtime scale down" use-case the missing information is being > > able to identify when an unlink is complete. After that (and ensuring the > > po
[dpdk-dev] [RFC] ethdev: add generic L2/L3 tunnel encapsulation actions
Currenlty the encap/decap actions only support encapsulation of VXLAN and NVGRE L2 packets. There is a need to add more L2 tunnels and also L3 tunnels. One issue with the current approch is the duplication of code. For example the code for handling NVGRE and VXLAN are exactly the same, and each new tunnel will have the same exact structure. Last issue with the current approach is the use of rte_items. The most significant issue with that is that the PMD needs to convert the items and this hurts the insertion rate. Other issue is that the rte_item has 3 members while we only need the spec (last and mask are useless). I know that the extra member have only small memory impact but considering that we can have millions of rules, this became more important consideration, and it is bad practice to add a variable that is never used. My suggestion is to create 2 commands, one for encapsulation of L2 packets and one for encapsulation of L3 tunnels. The parameters for those functions will be a uint8_t buffer with a length parameter. The current approach is not implemented yet in drivers yet, and is marked as experimental, so it should be removed. Any comments will be hugely appreciated. Signed-off-by: Ori Kam --- lib/librte_ethdev/rte_flow.h | 111 ++ 1 files changed, 47 insertions(+), 64 deletions(-) diff --git a/lib/librte_ethdev/rte_flow.h b/lib/librte_ethdev/rte_flow.h index f8ba71c..3549d7d 100644 --- a/lib/librte_ethdev/rte_flow.h +++ b/lib/librte_ethdev/rte_flow.h @@ -1473,38 +1473,37 @@ enum rte_flow_action_type { RTE_FLOW_ACTION_TYPE_OF_PUSH_MPLS, /** -* Encapsulate flow in VXLAN tunnel as defined in -* rte_flow_action_vxlan_encap action structure. +* Encapsulate flow with header tunnel as defined in +* rte_flow_action_tunnel_encap action structure. * -* See struct rte_flow_action_vxlan_encap. +* See struct rte_flow_action_tunnel_encap. */ - RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP, + RTE_FLOW_ACTION_TYPE_TUNNEL_ENCAP, /** -* Decapsulate outer most VXLAN tunnel from matched flow. +* Decapsulate outer most tunnel from matched flow. * -* If flow pattern does not define a valid VXLAN tunnel (as specified by -* RFC7348) then the PMD should return a RTE_FLOW_ERROR_TYPE_ACTION -* error. +* The flow pattern must have a valid tunnel header. */ - RTE_FLOW_ACTION_TYPE_VXLAN_DECAP, + RTE_FLOW_ACTION_TYPE_TUNNEL_DECAP, /** -* Encapsulate flow in NVGRE tunnel defined in the -* rte_flow_action_nvgre_encap action structure. +* Remove L2 header and encapsulate with header tunnel as defined in +* rte_flow_action_tunnel_encap_l3 action structure. * -* See struct rte_flow_action_nvgre_encap. +* See struct rte_flow_action_tunnel_encap_l3. */ - RTE_FLOW_ACTION_TYPE_NVGRE_ENCAP, + RTE_FLOW_ACTION_TYPE_TUNNEL_ENCAP_L3, /** -* Decapsulate outer most NVGRE tunnel from matched flow. +* Decapsulate outer most tunnel from matched flow, +* and encap the remaining header with the given one. * -* If flow pattern does not define a valid NVGRE tunnel (as specified by -* RFC7637) then the PMD should return a RTE_FLOW_ERROR_TYPE_ACTION -* error. +* The flow pattern must have a valid tunnel header. +* +* See struct rte_flow_action_tunnel_decap_l3. */ - RTE_FLOW_ACTION_TYPE_NVGRE_DECAP, + RTE_FLOW_ACTION_TYPE_TUNNEL_DECAP_L3, }; /** @@ -1803,69 +1802,53 @@ struct rte_flow_action_of_push_mpls { * @warning * @b EXPERIMENTAL: this structure may change without prior notice * - * RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP - * - * VXLAN tunnel end-point encapsulation data definition - * - * The tunnel definition is provided through the flow item pattern, the - * provided pattern must conform to RFC7348 for the tunnel specified. The flow - * definition must be provided in order from the RTE_FLOW_ITEM_TYPE_ETH - * definition up the end item which is specified by RTE_FLOW_ITEM_TYPE_END. - * - * The mask field allows user to specify which fields in the flow item - * definitions can be ignored and which have valid data and can be used - * verbatim. + * RTE_FLOW_ACTION_TYPE_TUNNEL_ENCAP * - * Note: the last field is not used in the definition of a tunnel and can be - * ignored. - * - * Valid flow definition for RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP include: - * - * - ETH / IPV4 / UDP / VXLAN / END - * - ETH / IPV6 / UDP / VXLAN / END - * - ETH / VLAN / IPV4 / UDP / VXLAN / END + * Tunnel end-point encapsulation data definition * + * The tunnel definition is provided through raw buffer that holds + * the headers that should encapsulate the packet. + * The given encapsulation should be a valid packet header. */ -struct rte_flow_action_vxl
Re: [dpdk-dev] [RFC] ethdev: add generic L2/L3 tunnel encapsulation actions
On Mon, 30 Jul 2018 19:19:25 +0300 Ori Kam wrote: > Currenlty the encap/decap actions only support encapsulation > of VXLAN and NVGRE L2 packets. > There is a need to add more L2 tunnels and also L3 tunnels. > > One issue with the current approch is the duplication of code. > For example the code for handling NVGRE and VXLAN are exactly the same, > and each new tunnel will have the same exact structure. > > Last issue with the current approach is the use of rte_items. > The most significant issue with that is that the PMD needs to convert > the items and this hurts the insertion rate. Other issue is that > the rte_item has 3 members while we only need the spec (last and mask > are useless). I know that the extra member have only small memory > impact but considering that we can have millions of rules, this became > more important consideration, and it is bad practice to add a variable > that is never used. > > My suggestion is to create 2 commands, one for encapsulation of L2 > packets and one for encapsulation of L3 tunnels. > The parameters for those functions will be a uint8_t buffer with > a length parameter. > > The current approach is not implemented yet in drivers yet, and > is marked as experimental, so it should be removed. > > Any comments will be hugely appreciated. > > Signed-off-by: Ori Kam What about binary and source compatibilities with older release?
[dpdk-dev] [PATCH] net/i40e: fix avx2 driver check for rx rearm
This commit fixes an infinite loop bug that could occur if the i40e AVX2 driver is used, and high traffic rates cause the mempool from which the rxq pulls mbufs to become empty. The result would be an infinite loop of checking if we should perform an rx rearm, calling the function and an error return due the the mempool being emtpy. The fix is to align the code in the AVX2 driver with the SSE driver, where an if() is used instead of a while(), allowing the thread to return from i40e rx funtion even if the mempool is empty. Fixes: dafadd73762e ("net/i40e: add AVX2 Rx function") Cc: bruce.richard...@intel.com Cc: sta...@dpdk.org Reported-by: David Coyle Signed-off-by: Harry van Haaren --- Cc: tho...@monjalon.net @Thomas, please consider this fix for inclusion in 18.08-rc3 assuming it gets verified as a good fix and Acked. Reporter please verify: Cc: david.co...@intel.com i40e maintainers, please Review/Ack: Cc: beilei.x...@intel.com Cc: qi.z.zh...@intel.com --- drivers/net/i40e/i40e_rxtx_vec_avx2.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/i40e/i40e_rxtx_vec_avx2.c b/drivers/net/i40e/i40e_rxtx_vec_avx2.c index dbcb61f..23179b3 100644 --- a/drivers/net/i40e/i40e_rxtx_vec_avx2.c +++ b/drivers/net/i40e/i40e_rxtx_vec_avx2.c @@ -188,7 +188,7 @@ _recv_raw_pkts_vec_avx2(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts, /* See if we need to rearm the RX queue - gives the prefetch a bit * of time to act */ - while (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) + if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) i40e_rxq_rearm(rxq); /* Before we start moving massive data around, check to see if -- 2.7.4
[dpdk-dev] [PATCH v2] net/i40e: fix avx2 driver check for rx rearm
This commit fixes an infinite loop bug that could occur if the i40e AVX2 driver is used, and high traffic rates cause the mempool from which the rxq pulls mbufs to become empty. The result would be an infinite loop of checking if we should perform an rx rearm, calling the function and an error return due the the mempool being emtpy. The fix is to align the code in the AVX2 driver with the SSE driver, where an if() is used instead of a while(), allowing the thread to return from i40e rx function even if the mempool is empty. Fixes: dafadd73762e ("net/i40e: add AVX2 Rx function") Cc: bruce.richard...@intel.com Cc: sta...@dpdk.org Reported-by: David Coyle Signed-off-by: Harry van Haaren --- v2: - Fix typo, sorry for email flood :) Cc: tho...@monjalon.net @Thomas, please consider this fix for inclusion in 18.08-rc3 assuming it gets verified as a good fix and Acked. Reporter please verify: Cc: david.co...@intel.com i40e maintainers, please Review/Ack: Cc: beilei.x...@intel.com Cc: qi.z.zh...@intel.com --- drivers/net/i40e/i40e_rxtx_vec_avx2.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/i40e/i40e_rxtx_vec_avx2.c b/drivers/net/i40e/i40e_rxtx_vec_avx2.c index dbcb61f..23179b3 100644 --- a/drivers/net/i40e/i40e_rxtx_vec_avx2.c +++ b/drivers/net/i40e/i40e_rxtx_vec_avx2.c @@ -188,7 +188,7 @@ _recv_raw_pkts_vec_avx2(struct i40e_rx_queue *rxq, struct rte_mbuf **rx_pkts, /* See if we need to rearm the RX queue - gives the prefetch a bit * of time to act */ - while (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) + if (rxq->rxrearm_nb > RTE_I40E_RXQ_REARM_THRESH) i40e_rxq_rearm(rxq); /* Before we start moving massive data around, check to see if -- 2.7.4
Re: [dpdk-dev] [RFC] ethdev: add generic L2/L3 tunnel encapsulation actions
> -Original Message- > From: Stephen Hemminger [mailto:step...@networkplumber.org] > Sent: Monday, July 30, 2018 8:29 PM > To: Ori Kam > Cc: Xueming(Steven) Li ; Dekel Peled > ; Shahaf Shuler ; Adrien > Mazarguil ; Thomas Monjalon > ; Yongseok Koh ; > ferruh.yi...@intel.com; arybche...@solarflare.com; dev@dpdk.org > Subject: Re: [dpdk-dev] [RFC] ethdev: add generic L2/L3 tunnel > encapsulation actions > > On Mon, 30 Jul 2018 19:19:25 +0300 > Ori Kam wrote: > > > Currenlty the encap/decap actions only support encapsulation > > of VXLAN and NVGRE L2 packets. > > There is a need to add more L2 tunnels and also L3 tunnels. > > > > One issue with the current approch is the duplication of code. > > For example the code for handling NVGRE and VXLAN are exactly the > same, > > and each new tunnel will have the same exact structure. > > > > Last issue with the current approach is the use of rte_items. > > The most significant issue with that is that the PMD needs to convert > > the items and this hurts the insertion rate. Other issue is that > > the rte_item has 3 members while we only need the spec (last and mask > > are useless). I know that the extra member have only small memory > > impact but considering that we can have millions of rules, this became > > more important consideration, and it is bad practice to add a variable > > that is never used. > > > > My suggestion is to create 2 commands, one for encapsulation of L2 > > packets and one for encapsulation of L3 tunnels. > > The parameters for those functions will be a uint8_t buffer with > > a length parameter. > > > > The current approach is not implemented yet in drivers yet, and > > is marked as experimental, so it should be removed. > > > > Any comments will be hugely appreciated. > > > > Signed-off-by: Ori Kam > > What about binary and source compatibilities with older release? I'm not sure what you mean, currently this feature is not implemented In any PMD (as far as I can see) so no one uses it, and it is marked as experimental. In any case if this is an issue we can keep the old one and just add the new one. Best, Ori
[dpdk-dev] Questions about TX descriptors run out occasionally
Hi Experts, I'm developing my own dpdk-based application via Intel 82599ES port. My Application is doing a job to send ICMP requests (packet size varies from 64 bytes to 1472 bytes, 200,000 pps, 1.1Gbps) and receive responses, with ARP request/response and ICMP response handling when necessary. It was working pretty fine in 5 hours to 10 days randomly and then TX descriptors run out and cannot be freed by ixgbe_tx_free_bufs() due to DD bit is not set: /* check DD bit on threshold descriptor */ status = txq->tx_ring[txq->tx_next_dd].wb.status; if (!(status & IXGBE_ADVTXD_STAT_DD)) return 0; My tx queue setup is: tx_conf->tx_thresh.pthresh = 64; tx_conf->tx_thresh.hthresh = 0; tx_conf->tx_thresh.wthresh = 0; tx_conf->tx_free_thresh = 256; tx_conf->tx_rs_thresh = 32; tx_conf->txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS | ETH_TXQ_FLAGS_NOOFFLOADS; I tried to read code to see if there is any case to take these descriptors and never set IXGBE_ADVTXD_STAT_DD back but no luck yet. And I have not even found the related code when IXGBE_ADVTXD_STAT_DD is set/unset when descriptor is taken/released other than reset queues... So may I ask: 1. where do we set/unset IXGBE_ADVTXD_STAT_DD when descriptor is taken/released? 2. any suggestion or information I should focus on to debug this issue? Is this typically because of my upper application not alloc/free correctly or any other problem? 3. another friend in dpdk-user list raised same issue in fm10k driver, but later he mentioned his problem was because of overheating of NIC (temperature was close to 85 degree Celsius). After setting system FAN to full speed, he made it work perfectly. Since in my system I don't have fan/temp sensors so I could not check this. Might this problem be caused by high temperature in case? Regards, Hui from Alimail macOS
Re: [dpdk-dev] [PATCH] hash table: add a bucket iterator function
Hi, Qiaobin, Thanks for the patch. If I understand correctly your use case is to use hash table as a "cache" that new entries should evict stale ones automatically. Could you be more specific on your use scenarios so that I can have more context? We are actually working on an extendable version of rte_hash which will link extra buckets to current table if the table is full. As long as the table is sized appropriately, there will be no conflict issues thus you don’t need to replace old entries. Will this solve your issue? Also, if the “cache” mode is what you want, we have the membership library “rte_member”. Is it applicable to your case? w.r.t. the patch set you proposed, my concern is that the new API will expose too much internals to the user, e.g. the bucket/entries structure should be implementation dependent. Exposing internal structures would prevent improving the implementation down the road unless we keep changing API which is not recommended. >-Original Message- >From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Wiles, Keith >Sent: Sunday, July 29, 2018 6:17 AM >To: Qiaobin Fu >Cc: Richardson, Bruce ; De Lara Guarch, Pablo >; dev@dpdk.org; >mic...@digirati.com.br; douce...@bu.edu >Subject: Re: [dpdk-dev] [PATCH] hash table: add a bucket iterator function > > > >> On Jul 28, 2018, at 12:48 PM, Qiaobin Fu wrote: >> >> Function rte_hash_bucket_iterate() enables callers to >> incrementally iterate over the hash table bucket by bucket, >> so that it can avoid creating hiccups and thrashing the cache >> of the processor. >> >> This patch mainly deals with cases in which >> the hash table is full and one needs to decide if the incoming >> entry is more valuable than the entries already in the hash table. >> >> Signed-off-by: Qiaobin Fu >> Reviewed-by: Cody Doucette >> Reviewed-by: Michel Machado >> --- >> lib/librte_hash/rte_cuckoo_hash.c| 60 + >> lib/librte_hash/rte_hash.h | 66 >> lib/librte_hash/rte_hash_version.map | 4 ++ >> 3 files changed, 130 insertions(+) >> >> diff --git a/lib/librte_hash/rte_cuckoo_hash.c >> b/lib/librte_hash/rte_cuckoo_hash.c >> index a07543a29..2b90f3593 100644 >> --- a/lib/librte_hash/rte_cuckoo_hash.c >> +++ b/lib/librte_hash/rte_cuckoo_hash.c >> @@ -1160,3 +1160,63 @@ rte_hash_iterate(const struct rte_hash *h, const void >> **key, void **data, uint32 >> >> return position - 1; >> } >> + >> +int >> +rte_hash_get_num_buckets(struct rte_hash *h) >> +{ >> +RETURN_IF_TRUE((h == NULL), -EINVAL); >> +return h->num_buckets; >> +} >> + >> +uint32_t >> +rte_hash_get_primary_bucket(const struct rte_hash *h, hash_sig_t sig) >> +{ >> +return sig & h->bucket_bitmask; >> +} >> + >> +uint32_t >> +rte_hash_get_secondary_bucket(const struct rte_hash *h, hash_sig_t sig) >> +{ >> +return rte_hash_secondary_hash(sig) & h->bucket_bitmask; >> +} > >The above two functions should have the RETURN_IF_TRUE((h == NULL), -EINVAL); >statements just like the first one. > >> + >> +int32_t >> +rte_hash_bucket_iterate(const struct rte_hash *h, >> +const void **key, void **data, uint32_t *bidx, uint32_t *next) >> +{ >> +uint32_t position; >> +struct rte_hash_key *next_key; >> + >> +RETURN_IF_TRUE(((h == NULL) || (key == NULL) || (data == NULL) || >> +(bidx == NULL) || (next == NULL)), -EINVAL); >> + >> +/* Out of bounds. */ >> +if (*bidx >= h->num_buckets || *next >= RTE_HASH_BUCKET_ENTRIES) >> +goto next_bucket; >> + >> +/* If current position is empty, go to the next one. */ >> +while (h->buckets[*bidx].key_idx[*next] == EMPTY_SLOT) { >> +(*next)++; >> +/* End of bucket. */ >> +if (*next >= RTE_HASH_BUCKET_ENTRIES) >> +goto next_bucket; >> +} >> + >> +/* Get position of entry in key table. */ >> +position = h->buckets[*bidx].key_idx[*next]; >> +next_key = (struct rte_hash_key *) ((char *)h->key_store + >> +position * h->key_entry_size); >> +/* Return key and data. */ >> +*key = next_key->key; >> +*data = next_key->pdata; >> + >> +/* Increment iterator. */ >> +(*next)++; >> + >> +return position - 1; >> + >> +next_bucket: >> +*bidx = (*bidx + 1) >= h->num_buckets ? 0 : *bidx + 1; >> +*next = 0; > >If this is an error case why are we setting *next to zero and incrementing >*bidx ? >I would expect we should not change the values on an error because they cannot >debug the values on return. > >> +return -ENOENT; >> +} >> diff --git a/lib/librte_hash/rte_hash.h b/lib/librte_hash/rte_hash.h >> index f71ca9fbf..329bf42d6 100644 >> --- a/lib/librte_hash/rte_hash.h >> +++ b/lib/librte_hash/rte_hash.h >> @@ -94,6 +94,17 @@ rte_hash_create(const struct rte_hash_parameters *params); >> */ >> void rte_hash_set_cmp_func(struct rte_hash *h, rte_hash_cmp_eq_t func); >> >> +/** >> + * Get the number of
Re: [dpdk-dev] [PATCH] vfio: revert retry logic for MSI-X BAR mapping
-Original Message- > Date: Mon, 30 Jul 2018 11:59:06 +0100 > From: Anatoly Burakov > To: dev@dpdk.org > CC: tho...@monjalon.net, jerin.ja...@caviumnetworks.com, > t.yoshimura8...@gmail.com > Subject: [PATCH] vfio: revert retry logic for MSI-X BAR mapping > X-Mailer: git-send-email 1.7.0.7 > > External Email > > This reverts commit d4774a568ba0a5923229974a002972c83eb04570. > > The patch is incomplete because kernel 4.16+, while being capable > of mapping MSI-X BARs, will also report if such a capability is > available. Without checking this capability, gratuitous errors > are displayed on kernels <4.16 while VFIO is attempting to mmap > MSI-X BAR and fails, which can be confusing to the user. > > Signed-off-by: Anatoly Burakov Acked-by: Jerin Jacob > --- > drivers/bus/pci/linux/pci_vfio.c | 93 ++-- > 1 file changed, 41 insertions(+), 52 deletions(-) > > diff --git a/drivers/bus/pci/linux/pci_vfio.c > b/drivers/bus/pci/linux/pci_vfio.c > index bb6ef7b67..686386d6a 100644 > --- a/drivers/bus/pci/linux/pci_vfio.c > +++ b/drivers/bus/pci/linux/pci_vfio.c > @@ -332,59 +332,50 @@ pci_vfio_mmap_bar(int vfio_dev_fd, struct > mapped_pci_resource *vfio_res, > void *bar_addr; > struct pci_msix_table *msix_table = &vfio_res->msix_table; > struct pci_map *bar = &vfio_res->maps[bar_index]; > - bool again = false; > > if (bar->size == 0) > /* Skip this BAR */ > return 0; > > + if (msix_table->bar_index == bar_index) { > + /* > +* VFIO will not let us map the MSI-X table, > +* but we can map around it. > +*/ > + uint32_t table_start = msix_table->offset; > + uint32_t table_end = table_start + msix_table->size; > + table_end = (table_end + ~PAGE_MASK) & PAGE_MASK; > + table_start &= PAGE_MASK; > + > + if (table_start == 0 && table_end >= bar->size) { > + /* Cannot map this BAR */ > + RTE_LOG(DEBUG, EAL, "Skipping BAR%d\n", bar_index); > + bar->size = 0; > + bar->addr = 0; > + return 0; > + } > + > + memreg[0].offset = bar->offset; > + memreg[0].size = table_start; > + memreg[1].offset = bar->offset + table_end; > + memreg[1].size = bar->size - table_end; > + > + RTE_LOG(DEBUG, EAL, > + "Trying to map BAR%d that contains the MSI-X " > + "table. Trying offsets: " > + "0x%04lx:0x%04lx, 0x%04lx:0x%04lx\n", bar_index, > + memreg[0].offset, memreg[0].size, > + memreg[1].offset, memreg[1].size); > + } else { > + memreg[0].offset = bar->offset; > + memreg[0].size = bar->size; > + } > + > /* reserve the address using an inaccessible mapping */ > bar_addr = mmap(bar->addr, bar->size, 0, MAP_PRIVATE | > MAP_ANONYMOUS | additional_flags, -1, 0); > - if (bar_addr == MAP_FAILED) { > - RTE_LOG(ERR, EAL, > - "Failed to create inaccessible mapping for BAR%d\n", > - bar_index); > - return -1; > - } > - > - memreg[0].offset = bar->offset; > - memreg[0].size = bar->size; > - do { > + if (bar_addr != MAP_FAILED) { > void *map_addr = NULL; > - if (again) { > - /* > -* VFIO did not let us map the MSI-X table, > -* but we can map around it. > -*/ > - uint32_t table_start = msix_table->offset; > - uint32_t table_end = table_start + msix_table->size; > - table_end = (table_end + ~PAGE_MASK) & PAGE_MASK; > - table_start &= PAGE_MASK; > - > - if (table_start == 0 && table_end >= bar->size) { > - /* Cannot map this BAR */ > - RTE_LOG(DEBUG, EAL, "Skipping BAR%d\n", > - bar_index); > - munmap(bar_addr, bar->size); > - bar->size = 0; > - bar->addr = 0; > - return 0; > - } > - > - memreg[0].offset = bar->offset; > - memreg[0].size = table_start; > - memreg[1].offset = bar->offset + table_end; > - memreg[1].size = bar->size - table_end; > - > - RTE_LOG(DEBUG, EAL, > - "Trying to map BAR%d that contains the M
Re: [dpdk-dev] [PATCH] hash table: add a bucket iterator function
Hi Yipeng, Thanks for the feedbacks! > On Jul 30, 2018, at 4:24 PM, Wang, Yipeng1 wrote: > > Hi, Qiaobin, > > Thanks for the patch. If I understand correctly your use case is to use hash > table as a "cache" that new entries should evict stale ones automatically. > Could you be more specific on your use scenarios so that I can have more > context? > Actually, we didn’t use hash table as a “cache”, and there is no automation to evict stale ones. Instead, the functionrte_hash_bucket_iterate() allows the callers to iterate over the hash table in an incremental way, i.e., one bucket by another bucket, so that the caller doesn’t have to iterate over the whole hash table in one time. This way can avoid creating hiccups and achieve better cache performance. One possible use scenario is when DDoS attacks happen, one may want to take more CPU time to process the packets, thus iterating over the hash table incrementally is desirable. > We are actually working on an extendable version of rte_hash which will link > extra buckets to current table if the table is full. As long as the table is > sized appropriately, there will be no conflict issues thus you don’t need to > replace old entries. Will this solve your issue? Also, if the “cache” mode is > what you want, we have the membership library “rte_member”. Is it applicable > to your case? I agree that adding extra buckets to current table when the table is full can alleviate this scenario. Indeed, we also considered this solution before coming up our proposed solution. However, it’s still highly desirable to have a bucket iterator function. Considering the scenario where the machine’s memory is limited and cannot always allocate new memory to the hash table (e.g., DDoS attacks, switch measurement tasks, etc.), a solution is to allow the callers evict some less important (the criteria for the eviction is based on the caller’s needs) entries. Indeed, we don’t have the “cache” mode, our implementation is a way to achieve better cache performance. So, the rte_member won’t help here. > > w.r.t. the patch set you proposed, my concern is that the new API will expose > too much internals to the user, e.g. the bucket/entries structure should be > implementation dependent. Exposing internal structures would prevent > improving the implementation down the road unless we keep changing API which > is not recommended. > The functions we add here give a way for the callers to iterate over the specific bucket, which potentially contains the entry. These functions can be made general enough to allow callers to heuristically evict some entries, and add the new ones to the hash table. Otherwise, there is no way to evict some less important entries. Hope this clarifies the patch. Best, Qiaobin > >> -Original Message- >> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Wiles, Keith >> Sent: Sunday, July 29, 2018 6:17 AM >> To: Qiaobin Fu >> Cc: Richardson, Bruce ; De Lara Guarch, Pablo >> ; dev@dpdk.org; >> mic...@digirati.com.br; douce...@bu.edu >> Subject: Re: [dpdk-dev] [PATCH] hash table: add a bucket iterator function >> >> >> >>> On Jul 28, 2018, at 12:48 PM, Qiaobin Fu wrote: >>> >>> Function rte_hash_bucket_iterate() enables callers to >>> incrementally iterate over the hash table bucket by bucket, >>> so that it can avoid creating hiccups and thrashing the cache >>> of the processor. >>> >>> This patch mainly deals with cases in which >>> the hash table is full and one needs to decide if the incoming >>> entry is more valuable than the entries already in the hash table. >>> >>> Signed-off-by: Qiaobin Fu >>> Reviewed-by: Cody Doucette >>> Reviewed-by: Michel Machado >>> --- >>> lib/librte_hash/rte_cuckoo_hash.c| 60 + >>> lib/librte_hash/rte_hash.h | 66 >>> lib/librte_hash/rte_hash_version.map | 4 ++ >>> 3 files changed, 130 insertions(+) >>> >>> diff --git a/lib/librte_hash/rte_cuckoo_hash.c >>> b/lib/librte_hash/rte_cuckoo_hash.c >>> index a07543a29..2b90f3593 100644 >>> --- a/lib/librte_hash/rte_cuckoo_hash.c >>> +++ b/lib/librte_hash/rte_cuckoo_hash.c >>> @@ -1160,3 +1160,63 @@ rte_hash_iterate(const struct rte_hash *h, const >>> void **key, void **data, uint32 >>> >>> return position - 1; >>> } >>> + >>> +int >>> +rte_hash_get_num_buckets(struct rte_hash *h) >>> +{ >>> + RETURN_IF_TRUE((h == NULL), -EINVAL); >>> + return h->num_buckets; >>> +} >>> + >>> +uint32_t >>> +rte_hash_get_primary_bucket(const struct rte_hash *h, hash_sig_t sig) >>> +{ >>> + return sig & h->bucket_bitmask; >>> +} >>> + >>> +uint32_t >>> +rte_hash_get_secondary_bucket(const struct rte_hash *h, hash_sig_t sig) >>> +{ >>> + return rte_hash_secondary_hash(sig) & h->bucket_bitmask; >>> +} >> >> The above two functions should have the RETURN_IF_TRUE((h == NULL), >> -EINVAL); statements just like the first one. >> >>> + >>> +int32_t >>> +rte_hash