RE: [PATCH] vhost: fix unnecessary dirty page logging
Hi, > -Original Message- > From: Maxime Coquelin > Sent: 2022年7月7日 19:31 > To: Xia, Chenbo ; Ding, Xuan > Cc: dev@dpdk.org; Hu, Jiayu ; He, Xingguang > ; Yang, YvonneX ; Jiang, > Cheng1 > Subject: Re: [PATCH] vhost: fix unnecessary dirty page logging > > > > On 7/7/22 11:51, Xia, Chenbo wrote: > >> -Original Message- > >> From: Ding, Xuan > >> Sent: Thursday, July 7, 2022 2:55 PM > >> To: maxime.coque...@redhat.com; Xia, Chenbo > >> Cc: dev@dpdk.org; Hu, Jiayu ; He, Xingguang > >> ; Yang, YvonneX ; > >> Jiang, > >> Cheng1 ; Ding, Xuan > >> Subject: [PATCH] vhost: fix unnecessary dirty page logging > >> > >> From: Xuan Ding > >> > >> The dirty page logging is only required in vhost enqueue direction > >> for live migration. This patch removes the unnecessary dirty page > >> logging in vhost dequeue direction. Otherwise, it will result in a > >> performance drop. Some if-else judgements are also optimized to improve > performance. > >> > >> Fixes: 6d823bb302c7 ("vhost: prepare sync for descriptor to mbuf > >> refactoring") > >> Fixes: b6eee3e83402 ("vhost: fix sync dequeue offload") > >> > >> Signed-off-by: Xuan Ding > >> --- > > > > Reviewed-by: Chenbo Xia > > > > Although it's late in release, we can consider to merge this as I see > > it impacts the performance by 5%: > > > http://inbox.dpdk.org/dev/BYAPR11MB2711F13CDA2B0A4535A6591EFE839@B > YAPR > > 11MB2711.namprd11.prod.outlook.com/T/#t > > Yes, I raised we need it in -rc4 at today's Release status meeting. > I'll review it today. > > > But also, it will be good to know the performance issue is solved by > > sharing the test results. > > Yes, Intel performance results would be appreciated. This fix patch is for the issue reported at 22.07-rc3. The refactoring patch brings a 3%~5% perf drop in vhost sync path. With fix patch, the perf drop introduced by refactoring is solved. However, the testing result shows there still exists ~5% packed ring perf drop compared with 22.03. We find the improving checksum offload patch series in 22.07 may contribute to the packed ring perf drop. Because we always do checksum checks in PMD. Could you help to double check this patch series, is it as expected? Your assistance is really appreciated. Regards, Xuan > > > Thanks, > > Chenbo > > > > Thanks, > Maxime
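For context, dirty page logging marks guest pages the host has written so that QEMU can re-copy them during live migration; in the dequeue direction the host only reads guest buffers, so nothing is dirtied and logging is pure overhead. Below is a minimal sketch of the idea behind the fix, with illustrative names rather than the actual lib/vhost/virtio_net.c code; it assumes the internal vhost headers for vhost_log_cache_write_iova() and the mbuf helpers.

static __rte_always_inline void
sync_copy_seg(struct virtio_net *dev, struct vhost_virtqueue *vq,
              struct rte_mbuf *m, uint32_t mbuf_offset,
              uint64_t buf_addr, uint64_t buf_iova,
              uint32_t cpy_len, bool to_desc)
{
        void *desc = (void *)(uintptr_t)buf_addr;

        if (to_desc) {
                /* Enqueue: the host writes into the guest buffer, so the
                 * touched guest page must be logged for live migration. */
                rte_memcpy(desc,
                           rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
                           cpy_len);
                vhost_log_cache_write_iova(dev, vq, buf_iova, cpy_len);
        } else {
                /* Dequeue: the host only reads guest memory, no page is
                 * dirtied, so no logging is needed on this path. */
                rte_memcpy(rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
                           desc, cpy_len);
        }
}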
[PATCH] common/sfc_efx/base: remove VQ index check during VQ start
From: Abhimanyu Saini The used/avail queue indexes are not bound by queue size, because the descriptor entry index is calculated by a simple modulo between queue index and queue_size So, do not check initial used and avail queue indexes against queue size because it is possible for these indexes to be greater than queue size in the following cases: 1) The queue is created to be migrated into, or 2) The client issues a qstop/qstart after running datapath Fixes: 4dda72dbdeab3 ("common/sfc_efx/base: add base virtio support for vDPA") Cc: sta...@dpdk.org Signed-off-by: Abhimanyu Saini --- drivers/common/sfc_efx/base/rhead_virtio.c | 12 +--- 1 file changed, 1 insertion(+), 11 deletions(-) diff --git a/drivers/common/sfc_efx/base/rhead_virtio.c b/drivers/common/sfc_efx/base/rhead_virtio.c index 335cb74..7f08717 100644 --- a/drivers/common/sfc_efx/base/rhead_virtio.c +++ b/drivers/common/sfc_efx/base/rhead_virtio.c @@ -47,14 +47,6 @@ goto fail2; } - if (evvdp != NULL) { - if ((evvdp->evvd_vq_cidx > evvcp->evvc_vq_size) || - (evvdp->evvd_vq_pidx > evvcp->evvc_vq_size)) { - rc = EINVAL; - goto fail3; - } - } - req.emr_cmd = MC_CMD_VIRTIO_INIT_QUEUE; req.emr_in_buf = payload; req.emr_in_length = MC_CMD_VIRTIO_INIT_QUEUE_REQ_LEN; @@ -116,15 +108,13 @@ if (req.emr_rc != 0) { rc = req.emr_rc; - goto fail4; + goto fail3; } evvp->evv_vi_index = vi_index; return (0); -fail4: - EFSYS_PROBE(fail4); fail3: EFSYS_PROBE(fail3); fail2: -- 1.8.3.1
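The reasoning above boils down to the fact that split-virtqueue avail/used indexes are free-running 16-bit counters and only their modulo selects a ring slot, so an initial index larger than the queue size is legal. A tiny, self-contained illustration of that relation (hypothetical helper, not sfc_efx code):

#include <stdint.h>

/* The avail/used index increments without wrapping at queue_size; only
 * this modulo picks the descriptor slot. After live migration or a
 * qstop/qstart cycle the restored index can therefore exceed queue_size
 * and still be perfectly valid. */
static inline uint16_t
vq_desc_slot(uint16_t free_running_idx, uint16_t queue_size)
{
        return free_running_idx % queue_size;
}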
RE: [PATCH] vhost: fix unnecessary dirty page logging
> -Original Message- > From: Ding, Xuan > Sent: Friday, July 8, 2022 3:04 PM > To: Maxime Coquelin ; Xia, Chenbo > > Cc: dev@dpdk.org; Hu, Jiayu ; He, Xingguang > ; Yang, YvonneX ; > Jiang, Cheng1 > Subject: RE: [PATCH] vhost: fix unnecessary dirty page logging > > Hi, > > > -Original Message- > > From: Maxime Coquelin > > Sent: 2022年7月7日 19:31 > > To: Xia, Chenbo ; Ding, Xuan > > > > Cc: dev@dpdk.org; Hu, Jiayu ; He, Xingguang > > ; Yang, YvonneX ; > > Jiang, > > Cheng1 > > Subject: Re: [PATCH] vhost: fix unnecessary dirty page logging > > > > > > > > On 7/7/22 11:51, Xia, Chenbo wrote: > > >> -Original Message- > > >> From: Ding, Xuan > > >> Sent: Thursday, July 7, 2022 2:55 PM > > >> To: maxime.coque...@redhat.com; Xia, Chenbo > > >> Cc: dev@dpdk.org; Hu, Jiayu ; He, Xingguang > > >> ; Yang, YvonneX ; > > >> Jiang, > > >> Cheng1 ; Ding, Xuan > > >> Subject: [PATCH] vhost: fix unnecessary dirty page logging > > >> > > >> From: Xuan Ding > > >> > > >> The dirty page logging is only required in vhost enqueue direction > > >> for live migration. This patch removes the unnecessary dirty page > > >> logging in vhost dequeue direction. Otherwise, it will result in a > > >> performance drop. Some if-else judgements are also optimized to > > >> improve > > performance. > > >> > > >> Fixes: 6d823bb302c7 ("vhost: prepare sync for descriptor to mbuf > > >> refactoring") > > >> Fixes: b6eee3e83402 ("vhost: fix sync dequeue offload") > > >> > > >> Signed-off-by: Xuan Ding > > >> --- > > > > > > Reviewed-by: Chenbo Xia > > > > > > Although it's late in release, we can consider to merge this as I > > > see it impacts the performance by 5%: > > > > > > http://inbox.dpdk.org/dev/BYAPR11MB2711F13CDA2B0A4535A6591EFE839 > @B > > YAPR > > > 11MB2711.namprd11.prod.outlook.com/T/#t > > > > Yes, I raised we need it in -rc4 at today's Release status meeting. > > I'll review it today. > > > > > But also, it will be good to know the performance issue is solved by > > > sharing the test results. > > > > Yes, Intel performance results would be appreciated. > > This fix patch is for the issue reported at 22.07-rc3. The refactoring patch > brings a 3%~5% perf drop in vhost sync path. > With fix patch, the perf drop introduced by refactoring is solved. The fix is for a critical issue, I tested dpdk22.07-rc3 with the patch, the performance drop introduced by refactoring have been fixed and no other regressions were seen as a result of the patch, but the packed ring still exist drop around 5%. Tested-by: Xingguang He > > However, the testing result shows there still exists ~5% packed ring perf drop > compared with 22.03. > We find the improving checksum offload patch series in 22.07 may contribute > to the packed ring perf drop. > Because we always do checksum checks in PMD. I tested the performance of Vhost before and after checksum offload patch series, there exist ~5% performance drop. > > Could you help to double check this patch series, is it as expected? > Your assistance is really appreciated. > > Regards, > Xuan > > > > > > Thanks, > > > Chenbo > > > > > > > Thanks, > > Maxime
Re: [PATCH] vhost: fix unnecessary dirty page logging
Hi, On 7/8/22 09:04, Ding, Xuan wrote: Hi, -Original Message- From: Maxime Coquelin Sent: 2022年7月7日 19:31 To: Xia, Chenbo ; Ding, Xuan Cc: dev@dpdk.org; Hu, Jiayu ; He, Xingguang ; Yang, YvonneX ; Jiang, Cheng1 Subject: Re: [PATCH] vhost: fix unnecessary dirty page logging On 7/7/22 11:51, Xia, Chenbo wrote: -Original Message- From: Ding, Xuan Sent: Thursday, July 7, 2022 2:55 PM To: maxime.coque...@redhat.com; Xia, Chenbo Cc: dev@dpdk.org; Hu, Jiayu ; He, Xingguang ; Yang, YvonneX ; Jiang, Cheng1 ; Ding, Xuan Subject: [PATCH] vhost: fix unnecessary dirty page logging From: Xuan Ding The dirty page logging is only required in vhost enqueue direction for live migration. This patch removes the unnecessary dirty page logging in vhost dequeue direction. Otherwise, it will result in a performance drop. Some if-else judgements are also optimized to improve performance. Fixes: 6d823bb302c7 ("vhost: prepare sync for descriptor to mbuf refactoring") Fixes: b6eee3e83402 ("vhost: fix sync dequeue offload") Signed-off-by: Xuan Ding --- Reviewed-by: Chenbo Xia Although it's late in release, we can consider to merge this as I see it impacts the performance by 5%: http://inbox.dpdk.org/dev/BYAPR11MB2711F13CDA2B0A4535A6591EFE839@B YAPR 11MB2711.namprd11.prod.outlook.com/T/#t Yes, I raised we need it in -rc4 at today's Release status meeting. I'll review it today. But also, it will be good to know the performance issue is solved by sharing the test results. Yes, Intel performance results would be appreciated. This fix patch is for the issue reported at 22.07-rc3. The refactoring patch brings a 3%~5% perf drop in vhost sync path. With fix patch, the perf drop introduced by refactoring is solved. Good. However, the testing result shows there still exists ~5% packed ring perf drop compared with 22.03. We find the improving checksum offload patch series in 22.07 may contribute to the packed ring perf drop. Because we always do checksum checks in PMD. Could you help to double check this patch series, is it as expected? Your assistance is really appreciated. I will look again, but if your analysis is right, there's not much we can do. The series is filling gaps in the checksum offload support. I will not have time to work on it for v22.07, we're too late in the cycle. What surprises me is that only packed ring is impacted. Regards, Maxime Regards, Xuan Thanks, Chenbo Thanks, Maxime
[PATCH] vdpa/sfc: enable support for multi-queue
From: Abhimanyu Saini Increase the number to defaut RX/TX queue pairs to 8, and add MQ feature flag to vDPA protocol features. Signed-off-by: Abhimanyu Saini --- drivers/vdpa/sfc/sfc_vdpa_hw.c | 2 ++ drivers/vdpa/sfc/sfc_vdpa_ops.c | 6 -- drivers/vdpa/sfc/sfc_vdpa_ops.h | 2 +- 3 files changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/vdpa/sfc/sfc_vdpa_hw.c b/drivers/vdpa/sfc/sfc_vdpa_hw.c index a7018b1..edb7e35 100644 --- a/drivers/vdpa/sfc/sfc_vdpa_hw.c +++ b/drivers/vdpa/sfc/sfc_vdpa_hw.c @@ -286,6 +286,8 @@ SFC_VDPA_ASSERT(max_queue_cnt > 0); sva->max_queue_count = max_queue_cnt; + sfc_vdpa_log_init(sva, "NIC init done with %u pair(s) of queues", + max_queue_cnt); return 0; diff --git a/drivers/vdpa/sfc/sfc_vdpa_ops.c b/drivers/vdpa/sfc/sfc_vdpa_ops.c index b84699d..e4cde34 100644 --- a/drivers/vdpa/sfc/sfc_vdpa_ops.c +++ b/drivers/vdpa/sfc/sfc_vdpa_ops.c @@ -24,14 +24,16 @@ (1ULL << VHOST_USER_PROTOCOL_F_SLAVE_REQ) | \ (1ULL << VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD) | \ (1ULL << VHOST_USER_PROTOCOL_F_HOST_NOTIFIER) | \ -(1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD)) +(1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD) | \ +(1ULL << VHOST_USER_PROTOCOL_F_MQ)) /* * Set of features which are enabled by default. * Protocol feature bit is needed to enable notification notifier ctrl. */ #define SFC_VDPA_DEFAULT_FEATURES \ - (1ULL << VHOST_USER_F_PROTOCOL_FEATURES) + ((1ULL << VHOST_USER_F_PROTOCOL_FEATURES) | \ +(1ULL << VIRTIO_NET_F_MQ)) #define SFC_VDPA_MSIX_IRQ_SET_BUF_LEN \ (sizeof(struct vfio_irq_set) + \ diff --git a/drivers/vdpa/sfc/sfc_vdpa_ops.h b/drivers/vdpa/sfc/sfc_vdpa_ops.h index 9dbd5b8..5c8e352 100644 --- a/drivers/vdpa/sfc/sfc_vdpa_ops.h +++ b/drivers/vdpa/sfc/sfc_vdpa_ops.h @@ -7,7 +7,7 @@ #include -#define SFC_VDPA_MAX_QUEUE_PAIRS 1 +#define SFC_VDPA_MAX_QUEUE_PAIRS 8 enum sfc_vdpa_context { SFC_VDPA_AS_VF -- 1.8.3.1
RE: [PATCH] vhost: fix unnecessary dirty page logging
> -Original Message- > From: Maxime Coquelin > Sent: 2022年7月8日 15:58 > To: Ding, Xuan ; Xia, Chenbo > Cc: dev@dpdk.org; Hu, Jiayu ; He, Xingguang > ; Yang, YvonneX ; Jiang, > Cheng1 > Subject: Re: [PATCH] vhost: fix unnecessary dirty page logging > > Hi, > > On 7/8/22 09:04, Ding, Xuan wrote: > > Hi, > > > >> -Original Message- > >> From: Maxime Coquelin > >> Sent: 2022年7月7日 19:31 > >> To: Xia, Chenbo ; Ding, Xuan > >> > >> Cc: dev@dpdk.org; Hu, Jiayu ; He, Xingguang > >> ; Yang, YvonneX ; > >> Jiang, > >> Cheng1 > >> Subject: Re: [PATCH] vhost: fix unnecessary dirty page logging > >> > >> > >> > >> On 7/7/22 11:51, Xia, Chenbo wrote: > -Original Message- > From: Ding, Xuan > Sent: Thursday, July 7, 2022 2:55 PM > To: maxime.coque...@redhat.com; Xia, Chenbo > Cc: dev@dpdk.org; Hu, Jiayu ; He, Xingguang > ; Yang, YvonneX ; > Jiang, > Cheng1 ; Ding, Xuan > Subject: [PATCH] vhost: fix unnecessary dirty page logging > > From: Xuan Ding > > The dirty page logging is only required in vhost enqueue direction > for live migration. This patch removes the unnecessary dirty page > logging in vhost dequeue direction. Otherwise, it will result in a > performance drop. Some if-else judgements are also optimized to > improve > >> performance. > > Fixes: 6d823bb302c7 ("vhost: prepare sync for descriptor to mbuf > refactoring") > Fixes: b6eee3e83402 ("vhost: fix sync dequeue offload") > > Signed-off-by: Xuan Ding > --- > >>> > >>> Reviewed-by: Chenbo Xia > >>> > >>> Although it's late in release, we can consider to merge this as I > >>> see it impacts the performance by 5%: > >>> > >> > http://inbox.dpdk.org/dev/BYAPR11MB2711F13CDA2B0A4535A6591EFE839@B > >> YAPR > >>> 11MB2711.namprd11.prod.outlook.com/T/#t > >> > >> Yes, I raised we need it in -rc4 at today's Release status meeting. > >> I'll review it today. > >> > >>> But also, it will be good to know the performance issue is solved by > >>> sharing the test results. > >> > >> Yes, Intel performance results would be appreciated. > > > > This fix patch is for the issue reported at 22.07-rc3. The refactoring patch > brings a 3%~5% perf drop in vhost sync path. > > With fix patch, the perf drop introduced by refactoring is solved. > > Good. > > > However, the testing result shows there still exists ~5% packed ring perf > > drop > compared with 22.03. > > We find the improving checksum offload patch series in 22.07 may contribute > to the packed ring perf drop. > > Because we always do checksum checks in PMD. > > > > Could you help to double check this patch series, is it as expected? > > Your assistance is really appreciated. > > I will look again, but if your analysis is right, there's not much we can do. > The > series is filling gaps in the checksum offload support. Yes, I think so. > > I will not have time to work on it for v22.07, we're too late in the cycle. > What > surprises me is that only packed ring is impacted. From the testing results, the packed ring is more sensitive to the patch series, while split ring shows a much lower drop in performance. Thanks for your support! Regards, Xuan > > Regards, > Maxime > > > > > Regards, > > Xuan > > > >> > >>> Thanks, > >>> Chenbo > >>> > >> > >> Thanks, > >> Maxime > >
Re: [PATCH] vdpa/sfc: resolve race between libvhost and dev_conf
Hello Abhimanyu, Vijay, On Thu, Jul 7, 2022 at 2:38 PM Maxime Coquelin wrote: > On 7/6/22 11:24, abhimanyu.sa...@xilinx.com wrote: > > From: Abhimanyu Saini > > > > libvhost calls dev_conf() before prosessing the > > VHOST_USER_SET_VRING_CALL message for the last VQ. So > > this message is processed after dev_conf() returns. > > > > However, the dev_conf() function spawns a thread to set > > rte_vhost_host_notifier_ctrl() before returning control to > > libvhost. This parallel thread in turn invokes get_notify_area(). > > To get the notify_area, the vdpa driver needs to query the HW and > > for this query it needs an enabled VQ. > > > > But at the same time libvhost is processing the last > > VHOST_USER_SET_VRING_CALL, and to do that it disables the last VQ. > > > > Hence there is a race b/w the libvhost and the vdpa driver. > > > > To resolve this race condition, query the HW and cache notify_area > > inside dev_conf() instead of doing it the parallel thread. > > > > Signed-off-by: Abhimanyu Saini > > --- > > drivers/vdpa/sfc/sfc_vdpa_ops.c | 36 ++-- > > drivers/vdpa/sfc/sfc_vdpa_ops.h | 1 + > > 2 files changed, 19 insertions(+), 18 deletions(-) > > > > During today's Release status meeting, Andrew mentioned that this patch > has been for a log time already in your internal tree. > > So it gives a bit of confidence in taking it in -rc4. - But it is neither reviewed, nor acked by the driver maintainer. Vijay, as this driver maintainer, your opinion matters. We are in rc4 stage and we merge only critical fixes now. There won't be much time to test this fix once merged (and I am not talking about fixing a regression). Are you confident with this fix? is it required for the 22.07 release? If we don't get an answer, the safer is to let those fixes slip to a next release. - Besides, I see there is a new fix for some sfc driver. https://patches.dpdk.org/project/dpdk/patch/20220708073702.29391-1-asa...@xilinx.com/ The same questions will be asked. -- David Marchand
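A rough sketch of the caching pattern the patch description refers to, with purely illustrative names and structures (the real sfc_vdpa code keeps the area in its per-virtqueue state and fills it from the actual HW query):

#include <stdint.h>

struct vq_notify_cache {
        uint64_t offset;
        uint64_t size;
};

/* Called from dev_conf(), while the VQ is still enabled, so querying
 * the HW for the notify/doorbell area is safe at this point. */
static int
vq_cache_notify_area(struct vq_notify_cache *cache, int vid, int qid)
{
        (void)vid;
        /* ... query the HW here and fill cache[qid] ... */
        cache[qid].offset = 0;
        cache[qid].size = 0;
        return 0;
}

/* vDPA .get_notify_area callback: returns the cached values only, so it
 * no longer races with libvhost disabling the last VQ while processing
 * the final VHOST_USER_SET_VRING_CALL message. */
static int
vq_get_notify_area(struct vq_notify_cache *cache, int qid,
                   uint64_t *offset, uint64_t *size)
{
        *offset = cache[qid].offset;
        *size = cache[qid].size;
        return 0;
}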
RE: [PATCH] vhost: fix unnecessary dirty page logging
> -Original Message- > From: He, Xingguang > Sent: Friday, July 8, 2022 3:53 PM > To: Ding, Xuan ; Maxime Coquelin > ; Xia, Chenbo > Cc: dev@dpdk.org; Hu, Jiayu ; Yang, YvonneX > ; Jiang, Cheng1 > Subject: RE: [PATCH] vhost: fix unnecessary dirty page logging > > > -Original Message- > > From: Ding, Xuan > > Sent: Friday, July 8, 2022 3:04 PM > > To: Maxime Coquelin ; Xia, Chenbo > > > > Cc: dev@dpdk.org; Hu, Jiayu ; He, Xingguang > > ; Yang, YvonneX ; > > Jiang, Cheng1 > > Subject: RE: [PATCH] vhost: fix unnecessary dirty page logging > > > > Hi, > > > > > -Original Message- > > > From: Maxime Coquelin > > > Sent: 2022年7月7日 19:31 > > > To: Xia, Chenbo ; Ding, Xuan > > > > > > Cc: dev@dpdk.org; Hu, Jiayu ; He, Xingguang > > > ; Yang, YvonneX ; > > > Jiang, > > > Cheng1 > > > Subject: Re: [PATCH] vhost: fix unnecessary dirty page logging > > > > > > > > > > > > On 7/7/22 11:51, Xia, Chenbo wrote: > > > >> -Original Message- > > > >> From: Ding, Xuan > > > >> Sent: Thursday, July 7, 2022 2:55 PM > > > >> To: maxime.coque...@redhat.com; Xia, Chenbo > > > >> > > > >> Cc: dev@dpdk.org; Hu, Jiayu ; He, Xingguang > > > >> ; Yang, YvonneX > ; > > > >> Jiang, > > > >> Cheng1 ; Ding, Xuan > > > >> Subject: [PATCH] vhost: fix unnecessary dirty page logging > > > >> > > > >> From: Xuan Ding > > > >> > > > >> The dirty page logging is only required in vhost enqueue > > > >> direction for live migration. This patch removes the unnecessary > > > >> dirty page logging in vhost dequeue direction. Otherwise, it will > > > >> result in a performance drop. Some if-else judgements are also > > > >> optimized to improve > > > performance. > > > >> > > > >> Fixes: 6d823bb302c7 ("vhost: prepare sync for descriptor to mbuf > > > >> refactoring") > > > >> Fixes: b6eee3e83402 ("vhost: fix sync dequeue offload") > > > >> > > > >> Signed-off-by: Xuan Ding > > > >> --- > > > > > > > > Reviewed-by: Chenbo Xia > > > > > > > > Although it's late in release, we can consider to merge this as I > > > > see it impacts the performance by 5%: > > > > > > > > > > http://inbox.dpdk.org/dev/BYAPR11MB2711F13CDA2B0A4535A6591EFE839 > > @B > > > YAPR > > > > 11MB2711.namprd11.prod.outlook.com/T/#t > > > > > > Yes, I raised we need it in -rc4 at today's Release status meeting. > > > I'll review it today. > > > > > > > But also, it will be good to know the performance issue is solved > > > > by sharing the test results. > > > > > > Yes, Intel performance results would be appreciated. > > > > This fix patch is for the issue reported at 22.07-rc3. The refactoring > > patch brings a 3%~5% perf drop in vhost sync path. > > With fix patch, the perf drop introduced by refactoring is solved. > > The fix is for a critical issue, I tested dpdk22.07-rc3 with the patch, the > performance drop introduced by refactoring have been fixed and no other > regressions were seen as a result of the patch, but the packed ring still > exist > drop around 5%. > > Tested-by: Xingguang He > > > > > However, the testing result shows there still exists ~5% packed ring > > perf drop compared with 22.03. > > We find the improving checksum offload patch series in 22.07 may > > contribute to the packed ring perf drop. > > Because we always do checksum checks in PMD. > > I tested the performance of Vhost before and after checksum offload patch > series, there exist ~5% performance drop. Sorry, the drop of checksum offload patch series is ~3%. The drop between DPDK22.03 and DPDK22.07-rc3 is ~5%. 
> > > > > Could you help to double check this patch series, is it as expected? > > Your assistance is really appreciated. > > > > Regards, > > Xuan > > > > > > > > > Thanks, > > > > Chenbo > > > > > > > > > > Thanks, > > > Maxime
RE: [PATCH V2] app/testpmd: fix display types failure when query RSS rule
> -Original Message- > From: Li, WeiyuanX > Sent: Friday, July 8, 2022 1:32 PM > To: Huisong Li ; ferruh.yi...@xilinx.com; > andrew.rybche...@oktetlabs.ru; dev@dpdk.org > Cc: tho...@monjalon.net; huangda...@huawei.com; > liudongdo...@huawei.com > Subject: RE: [PATCH V2] app/testpmd: fix display types failure when query > RSS rule > > > -Original Message- > > From: Huisong Li > > Sent: Friday, July 8, 2022 9:42 AM > > To: ferruh.yi...@xilinx.com; andrew.rybche...@oktetlabs.ru; > > dev@dpdk.org > > Cc: tho...@monjalon.net; Li, WeiyuanX ; > > huangda...@huawei.com; liudongdo...@huawei.com; > lihuis...@huawei.com > > Subject: [PATCH V2] app/testpmd: fix display types failure when query > > RSS rule > > > > Now testpmd fails to display types when query RSS rule. The failure is > > because the '\n' character is missing at the end of the function > > 'rss_config_display()'. Actually, all places calling 'xxx_types_display()' > > need to '\n'. So this patch moves '\n' to the inside of these function. > > > > Fixes: 534988c490f1 ("app/testpmd: unify RSS types display") > > Fixes: 44a37f3cffe0 ("app/testpmd: compact RSS types output") > > > > --- > > v2: > > - move '\n' to the inside of 'xxx_types_display()'. > > > > Signed-off-by: Huisong Li > > Tested-by: Weiyuan Li Thanks Huisong. Hi Huisong, Ferruh, May I know that this patch can be merged into RC4? Best regards, Yu Jiang
[PATCH v2] vdpa/sfc: enable support for multi-queue
From: Abhimanyu Saini Increase the number to default RX/TX queue pairs to 8, and add MQ feature flag to vDPA protocol features. Signed-off-by: Abhimanyu Saini --- v2: Fix checkpatch warnings drivers/vdpa/sfc/sfc_vdpa_hw.c | 2 ++ drivers/vdpa/sfc/sfc_vdpa_ops.c | 6 -- drivers/vdpa/sfc/sfc_vdpa_ops.h | 2 +- 3 files changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/vdpa/sfc/sfc_vdpa_hw.c b/drivers/vdpa/sfc/sfc_vdpa_hw.c index a7018b1..edb7e35 100644 --- a/drivers/vdpa/sfc/sfc_vdpa_hw.c +++ b/drivers/vdpa/sfc/sfc_vdpa_hw.c @@ -286,6 +286,8 @@ SFC_VDPA_ASSERT(max_queue_cnt > 0); sva->max_queue_count = max_queue_cnt; + sfc_vdpa_log_init(sva, "NIC init done with %u pair(s) of queues", + max_queue_cnt); return 0; diff --git a/drivers/vdpa/sfc/sfc_vdpa_ops.c b/drivers/vdpa/sfc/sfc_vdpa_ops.c index b84699d..e4cde34 100644 --- a/drivers/vdpa/sfc/sfc_vdpa_ops.c +++ b/drivers/vdpa/sfc/sfc_vdpa_ops.c @@ -24,14 +24,16 @@ (1ULL << VHOST_USER_PROTOCOL_F_SLAVE_REQ) | \ (1ULL << VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD) | \ (1ULL << VHOST_USER_PROTOCOL_F_HOST_NOTIFIER) | \ -(1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD)) +(1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD) | \ +(1ULL << VHOST_USER_PROTOCOL_F_MQ)) /* * Set of features which are enabled by default. * Protocol feature bit is needed to enable notification notifier ctrl. */ #define SFC_VDPA_DEFAULT_FEATURES \ - (1ULL << VHOST_USER_F_PROTOCOL_FEATURES) + ((1ULL << VHOST_USER_F_PROTOCOL_FEATURES) | \ +(1ULL << VIRTIO_NET_F_MQ)) #define SFC_VDPA_MSIX_IRQ_SET_BUF_LEN \ (sizeof(struct vfio_irq_set) + \ diff --git a/drivers/vdpa/sfc/sfc_vdpa_ops.h b/drivers/vdpa/sfc/sfc_vdpa_ops.h index 9dbd5b8..5c8e352 100644 --- a/drivers/vdpa/sfc/sfc_vdpa_ops.h +++ b/drivers/vdpa/sfc/sfc_vdpa_ops.h @@ -7,7 +7,7 @@ #include -#define SFC_VDPA_MAX_QUEUE_PAIRS 1 +#define SFC_VDPA_MAX_QUEUE_PAIRS 8 enum sfc_vdpa_context { SFC_VDPA_AS_VF -- 1.8.3.1
Re: [PATCH] vdpa/ifc/base: fix null pointer dereference
On 7/8/22 07:57, Andy Pei wrote: Fix null pointer dereference reported in coverity scan. Output some log information when lm_cfg is null. Make lm_cfg is not null before operate on lm_cfg. Make sure* Coverity issue: 378882 Fixes: d7fe5a2861e7 ("net/ifc: support live migration") Signed-off-by: Andy Pei --- drivers/vdpa/ifc/base/ifcvf.c | 31 --- drivers/vdpa/ifc/base/ifcvf_osdep.h | 1 + 2 files changed, 21 insertions(+), 11 deletions(-) Reviewed-by: Maxime Coquelin Thanks, Maxime
[Bug 1050] Burst packet receive from Mellanox gives "Segmentation fault" on CQE decompression
https://bugs.dpdk.org/show_bug.cgi?id=1050 Bug ID: 1050 Summary: Burst packet receive from Mellanox gives "Segmentation fault" on CQE decompression Product: DPDK Version: unspecified Hardware: All OS: Linux Status: UNCONFIRMED Severity: critical Priority: Normal Component: ethdev Assignee: dev@dpdk.org Reporter: omer.ya...@ceng.metu.edu.tr Target Milestone: --- When I try to capture heavy traffic, I get this segmentation fault. I have two RX queues per port. In total I have two ports and both carry traffic. If the traffic rate is greater than 80Gbps (21Mpps), I get the segmentation fault. While checking where the error comes from, I found that in the function `rxq_cq_decompress_v`, line 141 gives the segmentation fault due to a null value of elts. I also shared a screenshot that includes the call stack, the elts values and the segmentation fault. -- You are receiving this mail because: You are the assignee for the bug.

[PATCH v2] vdpa/ifc/base: fix null pointer dereference
Fix null pointer dereference reported in coverity scan. Output some log information when lm_cfg is null. Make sure lm_cfg is not null before operate on lm_cfg. Coverity issue: 378882 Fixes: d7fe5a2861e7 ("net/ifc: support live migration") Signed-off-by: Andy Pei --- drivers/vdpa/ifc/base/ifcvf.c | 31 --- drivers/vdpa/ifc/base/ifcvf_osdep.h | 1 + 2 files changed, 21 insertions(+), 11 deletions(-) diff --git a/drivers/vdpa/ifc/base/ifcvf.c b/drivers/vdpa/ifc/base/ifcvf.c index 0a9f71a..f1e1474 100644 --- a/drivers/vdpa/ifc/base/ifcvf.c +++ b/drivers/vdpa/ifc/base/ifcvf.c @@ -87,6 +87,8 @@ } hw->lm_cfg = hw->mem_resource[4].addr; + if (!hw->lm_cfg) + WARNINGOUT("HW support live migration not support!\n"); if (hw->common_cfg == NULL || hw->notify_base == NULL || hw->isr == NULL || hw->dev_cfg == NULL) { @@ -218,17 +220,19 @@ &cfg->queue_used_hi); IFCVF_WRITE_REG16(hw->vring[i].size, &cfg->queue_size); - if (hw->device_type == IFCVF_BLK) - *(u32 *)(lm_cfg + IFCVF_LM_RING_STATE_OFFSET + - i * IFCVF_LM_CFG_SIZE) = - (u32)hw->vring[i].last_avail_idx | - ((u32)hw->vring[i].last_used_idx << 16); - else - *(u32 *)(lm_cfg + IFCVF_LM_RING_STATE_OFFSET + - (i / 2) * IFCVF_LM_CFG_SIZE + - (i % 2) * 4) = - (u32)hw->vring[i].last_avail_idx | - ((u32)hw->vring[i].last_used_idx << 16); + if (lm_cfg) { + if (hw->device_type == IFCVF_BLK) + *(u32 *)(lm_cfg + IFCVF_LM_RING_STATE_OFFSET + + i * IFCVF_LM_CFG_SIZE) = + (u32)hw->vring[i].last_avail_idx | + ((u32)hw->vring[i].last_used_idx << 16); + else + *(u32 *)(lm_cfg + IFCVF_LM_RING_STATE_OFFSET + + (i / 2) * IFCVF_LM_CFG_SIZE + + (i % 2) * 4) = + (u32)hw->vring[i].last_avail_idx | + ((u32)hw->vring[i].last_used_idx << 16); + } IFCVF_WRITE_REG16(i + 1, &cfg->queue_msix_vector); if (IFCVF_READ_REG16(&cfg->queue_msix_vector) == @@ -320,6 +324,8 @@ u8 *lm_cfg; lm_cfg = hw->lm_cfg; + if (!lm_cfg) + return; *(u32 *)(lm_cfg + IFCVF_LM_BASE_ADDR_LOW) = log_base & IFCVF_32_BIT_MASK; @@ -342,6 +348,9 @@ u8 *lm_cfg; lm_cfg = hw->lm_cfg; + if (!lm_cfg) + return; + *(u32 *)(lm_cfg + IFCVF_LM_LOGGING_CTRL) = IFCVF_LM_DISABLE; } diff --git a/drivers/vdpa/ifc/base/ifcvf_osdep.h b/drivers/vdpa/ifc/base/ifcvf_osdep.h index 6aef25e..8a47fcb 100644 --- a/drivers/vdpa/ifc/base/ifcvf_osdep.h +++ b/drivers/vdpa/ifc/base/ifcvf_osdep.h @@ -14,6 +14,7 @@ #include #include +#define WARNINGOUT(S, args...)RTE_LOG(WARNING, PMD, S, ##args) #define DEBUGOUT(S, args...)RTE_LOG(DEBUG, PMD, S, ##args) #define STATIC static -- 1.8.3.1
Re: [PATCH v2] vdpa/ifc/base: fix null pointer dereference
On 7/8/22 11:10, Andy Pei wrote: Fix null pointer dereference reported in coverity scan. Output some log information when lm_cfg is null. Make sure lm_cfg is not null before operate on lm_cfg. Coverity issue: 378882 Fixes: d7fe5a2861e7 ("net/ifc: support live migration") Signed-off-by: Andy Pei --- drivers/vdpa/ifc/base/ifcvf.c | 31 --- drivers/vdpa/ifc/base/ifcvf_osdep.h | 1 + 2 files changed, 21 insertions(+), 11 deletions(-) Thanks, but I already applied v1 locally and did the change. Rejecting this v2 in patchwork. Thanks, Maxime
Re: [PATCH] vdpa/ifc/base: fix null pointer dereference
On 7/8/22 07:57, Andy Pei wrote: Fix null pointer dereference reported in coverity scan. Output some log information when lm_cfg is null. Make lm_cfg is not null before operate on lm_cfg. Coverity issue: 378882 Fixes: d7fe5a2861e7 ("net/ifc: support live migration") Signed-off-by: Andy Pei --- drivers/vdpa/ifc/base/ifcvf.c | 31 --- drivers/vdpa/ifc/base/ifcvf_osdep.h | 1 + 2 files changed, 21 insertions(+), 11 deletions(-) Applied to dpdk-next-virtio/main. Thanks, Maxime
Re: [PATCH] vhost: fix unnecessary dirty page logging
On 7/7/22 08:55, xuan.d...@intel.com wrote: From: Xuan Ding The dirty page logging is only required in vhost enqueue direction for live migration. This patch removes the unnecessary dirty page logging in vhost dequeue direction. Otherwise, it will result in a performance drop. Some if-else judgements are also optimized to improve performance. Fixes: 6d823bb302c7 ("vhost: prepare sync for descriptor to mbuf refactoring") Fixes: b6eee3e83402 ("vhost: fix sync dequeue offload") Signed-off-by: Xuan Ding --- lib/vhost/virtio_net.c | 31 +-- 1 file changed, 13 insertions(+), 18 deletions(-) Applied to dpdk-next-virtio/main. Thanks, Maxime
Re: [PATCH v3] doc: add release notes for async vhost dequeue data-path
On 6/28/22 04:06, Cheng Jiang wrote: Add release notes for asynchronous vhost dequeue data-path. Emphasize that split virtqueue and packed virtqueue are both supported in asynchronous vhost dequeue data-path. Signed-off-by: Cheng Jiang --- v3: code rebased. v2: fixed a full stop missing in the commit message. doc/guides/rel_notes/release_22_07.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Applied to dpdk-next-virtio/main. Thanks, Maxime
Re: [PATCH] vdpa/mlx5: fix leak on event thread creation
On 6/20/22 15:10, David Marchand wrote: As stated in the manual, pthread_attr_init return value should be checked. Besides, a pthread_attr_t should be destroyed once unused. In practice, we may have no leak (from what I read in glibc current code), but this may change in the future. Stick to a correct use of the API. Fixes: 5cf3fd3af4df ("vdpa/mlx5: add CPU core parameter to bind polling thread") Cc: sta...@dpdk.org Signed-off-by: David Marchand --- Note: this is only compile tested. --- drivers/vdpa/mlx5/mlx5_vdpa_event.c | 30 +++-- 1 file changed, 20 insertions(+), 10 deletions(-) Applied to dpdk-next-virtio/main. Thanks, Maxime
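The fix follows the standard pthread_attr lifecycle: check the init return value, use the attribute object, then destroy it regardless of whether thread creation succeeded. A generic, self-contained illustration of that pattern (not the mlx5_vdpa_event.c code itself):

#include <pthread.h>

static void *worker(void *arg) { (void)arg; return NULL; }

static int
spawn_worker(pthread_t *tid)
{
        pthread_attr_t attr;
        int ret;

        /* pthread_attr_init() can fail (e.g. ENOMEM), so check it. */
        ret = pthread_attr_init(&attr);
        if (ret != 0)
                return ret;

        /* ... optionally set affinity or stack size on attr here ... */

        ret = pthread_create(tid, &attr, worker, NULL);

        /* Destroy the attribute object whether or not creation
         * succeeded, so nothing attached to it can leak. */
        pthread_attr_destroy(&attr);

        return ret;
}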
Re: [PATCH v3 0/4] Enhance docs on virtio-user as KNI replacement
On 6/10/22 17:35, Bruce Richardson wrote: Since use of KNI is no longer advised (due to known issues and the fact it's an out-of-tree module), virtio-user is recommended for new developments. To help encourage this, we improve the doc HOWTO section on using virtio-user and add a link to that document from the main KNI section in the programmers guide. v3: * Added two extra patches: - rename document files to match new section name - add note to KNI programmers guide section * minor updates to patch 2, and added stable on CC for that patch Bruce Richardson (4): doc/howto: rework section on virtio-user as exception path doc/howto: rename files to match new content name doc/howto: add code example to virtio-user exception path doc doc/prog_guide: add reference to virtio-user from KNI doc svg => virtio_user_as_exception_path.svg} | 0 doc/guides/howto/index.rst| 2 +- .../howto/virtio_user_as_exception_path.rst | 214 ++ .../howto/virtio_user_as_exceptional_path.rst | 118 -- .../prog_guide/kernel_nic_interface.rst | 6 + 5 files changed, 221 insertions(+), 119 deletions(-) rename doc/guides/howto/img/{virtio_user_as_exceptional_path.svg => virtio_user_as_exception_path.svg} (100%) create mode 100644 doc/guides/howto/virtio_user_as_exception_path.rst delete mode 100644 doc/guides/howto/virtio_user_as_exceptional_path.rst -- 2.34.1 Applied to dpdk-next-virtio/main. Thanks, Maxime
RE: [PATCH] vdpa/ifc/base: fix null pointer dereference
Hi Maxime, Thanks for your efforts. > -Original Message- > From: Maxime Coquelin > Sent: Friday, July 8, 2022 5:11 PM > To: Pei, Andy ; dev@dpdk.org > Cc: Xia, Chenbo ; Wang, Xiao W > > Subject: Re: [PATCH] vdpa/ifc/base: fix null pointer dereference > > > > On 7/8/22 07:57, Andy Pei wrote: > > Fix null pointer dereference reported in coverity scan. > > Output some log information when lm_cfg is null. > > Make lm_cfg is not null before operate on lm_cfg. > > > > Coverity issue: 378882 > > Fixes: d7fe5a2861e7 ("net/ifc: support live migration") > > > > Signed-off-by: Andy Pei > > --- > > drivers/vdpa/ifc/base/ifcvf.c | 31 --- > > drivers/vdpa/ifc/base/ifcvf_osdep.h | 1 + > > 2 files changed, 21 insertions(+), 11 deletions(-) > > > > Applied to dpdk-next-virtio/main. > > Thanks, > Maxime
Re: [PATCH] vdpa/sfc: resolve race between libvhost and dev_conf
On 7/6/22 11:24, abhimanyu.sa...@xilinx.com wrote: From: Abhimanyu Saini libvhost calls dev_conf() before prosessing the VHOST_USER_SET_VRING_CALL message for the last VQ. So this message is processed after dev_conf() returns. However, the dev_conf() function spawns a thread to set rte_vhost_host_notifier_ctrl() before returning control to libvhost. This parallel thread in turn invokes get_notify_area(). To get the notify_area, the vdpa driver needs to query the HW and for this query it needs an enabled VQ. But at the same time libvhost is processing the last VHOST_USER_SET_VRING_CALL, and to do that it disables the last VQ. Hence there is a race b/w the libvhost and the vdpa driver. To resolve this race condition, query the HW and cache notify_area inside dev_conf() instead of doing it the parallel thread. Signed-off-by: Abhimanyu Saini --- drivers/vdpa/sfc/sfc_vdpa_ops.c | 36 ++-- drivers/vdpa/sfc/sfc_vdpa_ops.h | 1 + 2 files changed, 19 insertions(+), 18 deletions(-) Applied to dpdk-next-virtio/main. Thanks, Maxime
Re: [PATCH 1/2] common: add safe version of foreach-list to Linux
08/07/2022 10:56, Khan, Hamza: > From: Thomas Monjalon > > 07/07/2022 17:59, Khan, Hamza: > > > I have sent v2 patch with the aforementioned fix. > > > However Is being held until the list moderator can review it for > > > approval > > > > I've unblocked it. > > This is blocked because you are not registered in the mailing list, so it is > > considered as spam. > > > The maintainers will only accept fixes into RC4 if they can be confident in > the patch. > So I have edited the Commit Message and created a V3 > I registered with the mailing list yesterday but it seems that it is being > held again. > Do you have any suggestions on how to avoid this for future patches ? Once you register, you receive an email with a confirmation link. Look in your spam folder.
Re: [PATCH] vdpa/sfc: resolve race between libvhost and dev_conf
On 7/8/22 11:23, Maxime Coquelin wrote: On 7/6/22 11:24, abhimanyu.sa...@xilinx.com wrote: From: Abhimanyu Saini libvhost calls dev_conf() before prosessing the VHOST_USER_SET_VRING_CALL message for the last VQ. So this message is processed after dev_conf() returns. However, the dev_conf() function spawns a thread to set rte_vhost_host_notifier_ctrl() before returning control to libvhost. This parallel thread in turn invokes get_notify_area(). To get the notify_area, the vdpa driver needs to query the HW and for this query it needs an enabled VQ. But at the same time libvhost is processing the last VHOST_USER_SET_VRING_CALL, and to do that it disables the last VQ. Hence there is a race b/w the libvhost and the vdpa driver. To resolve this race condition, query the HW and cache notify_area inside dev_conf() instead of doing it the parallel thread. Signed-off-by: Abhimanyu Saini --- drivers/vdpa/sfc/sfc_vdpa_ops.c | 36 ++-- drivers/vdpa/sfc/sfc_vdpa_ops.h | 1 + 2 files changed, 19 insertions(+), 18 deletions(-) Applied to dpdk-next-virtio/main. Sorry, I notice it is missing the Fixes tag, and cc'ing stable. Can you confirm this is needed and provide the faulty commit? Thanks, Maxime Thanks, Maxime
RE: [PATCH 1/2] common: add safe version of foreach-list to Linux
> -Original Message- > From: Thomas Monjalon > Sent: Friday 8 July 2022 10:25 > To: Khan, Hamza > Cc: dev@dpdk.org > Subject: Re: [PATCH 1/2] common: add safe version of foreach-list to Linux > > 08/07/2022 10:56, Khan, Hamza: > > From: Thomas Monjalon > > > 07/07/2022 17:59, Khan, Hamza: > > > > I have sent v2 patch with the aforementioned fix. > > > > However Is being held until the list moderator can review it for > > > > approval > > > > > > I've unblocked it. > > > This is blocked because you are not registered in the mailing list, > > > so it is considered as spam. > > > > > The maintainers will only accept fixes into RC4 if they can be confident in > the patch. > > So I have edited the Commit Message and created a V3 I registered with > > the mailing list yesterday but it seems that it is being held again. > > Do you have any suggestions on how to avoid this for future patches ? > > Once you register, you receive an email with a confirmation link. > Look in your spam folder. > Yes, it was found in the junk folder. Thank you.

Re: [PATCH v3] examples/vm_power_manager: use safe version of list iterator
On 08/07/2022 09:51, Hamza Khan wrote: Currently, when vm_power_manager exits, we are using a LIST_FOREACH macro to iterate over VM info structures while freeing them. This leads to use-after-free error. To address this, replace all usages of LIST_* with TAILQ_* macros, and use the RTE_TAILQ_FOREACH_SAFE macro to iterate and delete VM info structures. * The change is small and doesn’t affect other code * Testing was performed on the patch Fixes: e8ae9b662506 ("examples/vm_power: channel manager and monitor in host") Cc: alan.ca...@intel.com Cc: sta...@dpdk.org Signed-off-by: Hamza Khan --- V3: Update commit message V2: Use RTE_TAILQ_* marcos --- examples/vm_power_manager/channel_manager.c | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/examples/vm_power_manager/channel_manager.c b/examples/vm_power_manager/channel_manager.c index 838465ab4b..e82c26ddca 100644 --- a/examples/vm_power_manager/channel_manager.c +++ b/examples/vm_power_manager/channel_manager.c @@ -29,6 +29,8 @@ #include "channel_monitor.h" #include "power_manager.h" +#include "rte_tailq.h" + #define RTE_LOGTYPE_CHANNEL_MANAGER RTE_LOGTYPE_USER1 @@ -58,16 +60,16 @@ struct virtual_machine_info { virDomainInfo info; rte_spinlock_t config_spinlock; int allow_query; - LIST_ENTRY(virtual_machine_info) vms_info; + RTE_TAILQ_ENTRY(virtual_machine_info) vms_info; }; -LIST_HEAD(, virtual_machine_info) vm_list_head; +RTE_TAILQ_HEAD(, virtual_machine_info) vm_list_head; static struct virtual_machine_info * find_domain_by_name(const char *name) { struct virtual_machine_info *info; - LIST_FOREACH(info, &vm_list_head, vms_info) { + RTE_TAILQ_FOREACH(info, &vm_list_head, vms_info) { if (!strncmp(info->name, name, CHANNEL_MGR_MAX_NAME_LEN-1)) return info; } @@ -878,7 +880,7 @@ add_vm(const char *vm_name) new_domain->allow_query = 0; rte_spinlock_init(&(new_domain->config_spinlock)); - LIST_INSERT_HEAD(&vm_list_head, new_domain, vms_info); + TAILQ_INSERT_HEAD(&vm_list_head, new_domain, vms_info); return 0; } @@ -900,7 +902,7 @@ remove_vm(const char *vm_name) rte_spinlock_unlock(&vm_info->config_spinlock); return -1; } - LIST_REMOVE(vm_info, vms_info); + TAILQ_REMOVE(&vm_list_head, vm_info, vms_info); rte_spinlock_unlock(&vm_info->config_spinlock); rte_free(vm_info); return 0; @@ -953,7 +955,7 @@ channel_manager_init(const char *path __rte_unused) { virNodeInfo info; - LIST_INIT(&vm_list_head); + TAILQ_INIT(&vm_list_head); if (connect_hypervisor(path) < 0) { global_n_host_cpus = 64; global_hypervisor_available = 0; @@ -1005,9 +1007,9 @@ channel_manager_exit(void) { unsigned i; char mask[RTE_MAX_LCORE]; - struct virtual_machine_info *vm_info; + struct virtual_machine_info *vm_info, *tmp; - LIST_FOREACH(vm_info, &vm_list_head, vms_info) { + RTE_TAILQ_FOREACH_SAFE(vm_info, &vm_list_head, vms_info, tmp) { rte_spinlock_lock(&(vm_info->config_spinlock)); @@ -1022,7 +1024,7 @@ channel_manager_exit(void) } rte_spinlock_unlock(&(vm_info->config_spinlock)); - LIST_REMOVE(vm_info, vms_info); + TAILQ_REMOVE(&vm_list_head, vm_info, vms_info); rte_free(vm_info); } Acked-by: David Hunt
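For readers unfamiliar with the _SAFE variants, here is a minimal stand-alone example of the pattern the patch adopts (illustrative types, not the vm_power_manager structures):

#include <stdlib.h>
#include <sys/queue.h>
#include <rte_tailq.h>

struct node {
        TAILQ_ENTRY(node) next;
};

TAILQ_HEAD(node_list, node);

static void
free_all(struct node_list *head)
{
        struct node *n, *tmp;

        /* A plain TAILQ_FOREACH() would read n's link to the next
         * element after n has already been freed; the _SAFE variant
         * saves the successor in tmp before the loop body runs,
         * avoiding the use-after-free. */
        RTE_TAILQ_FOREACH_SAFE(n, head, next, tmp) {
                TAILQ_REMOVE(head, n, next);
                free(n);
        }
}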
[PATCH V1] doc: add tested Intel platforms with Intel NICs
Add tested Intel platforms with Intel NICs to v22.07 release note. Signed-off-by: Lingli Chen --- doc/guides/rel_notes/release_22_07.rst | 104 + 1 file changed, 104 insertions(+) diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst index b40d6f9d84..abb787afc6 100644 --- a/doc/guides/rel_notes/release_22_07.rst +++ b/doc/guides/rel_notes/release_22_07.rst @@ -356,3 +356,107 @@ Tested Platforms This section is a comment. Do not overwrite or remove it. Also, make sure to start the actual text at the margin. === + + * Intel\ |reg| platforms with Intel\ |reg| NICs combinations + + * CPU + + * Intel\ |reg| Atom\ |trade| CPU C3758 @ 2.20GHz + * Intel\ |reg| Xeon\ |reg| CPU D-1553N @ 2.30GHz + * Intel\ |reg| Xeon\ |reg| CPU E5-2680 v2 @ 2.80GHz + * Intel\ |reg| Xeon\ |reg| CPU E5-2699 v3 @ 2.30GHz + * Intel\ |reg| Xeon\ |reg| CPU E5-2699 v4 @ 2.20GHz + * Intel\ |reg| Xeon\ |reg| Gold 6139 CPU @ 2.30GHz + * Intel\ |reg| Xeon\ |reg| Gold 6140M CPU @ 2.30GHz + * Intel\ |reg| Xeon\ |reg| Gold 6252N CPU @ 2.30GHz + * Intel\ |reg| Xeon\ |reg| Gold 6348 CPU @ 2.60GHz + * Intel\ |reg| Xeon\ |reg| Platinum 8180 CPU @ 2.50GHz + * Intel\ |reg| Xeon\ |reg| Platinum 8180M CPU @ 2.50GHz + * Intel\ |reg| Xeon\ |reg| Platinum 8280M CPU @ 2.70GHz + * Intel\ |reg| Xeon\ |reg| Platinum 8380 CPU @ 2.30GHz + + * OS: + + * Fedora 35 + * FreeBSD 13.0 + * Red Hat Enterprise Linux Server release 8.4 + * Red Hat Enterprise Linux Server release 8.5 + * CentOS7.9 + * Ubuntu 20.04.4 + * Ubuntu 22.04 + + * NICs: + + * Intel\ |reg| Ethernet Controller E810-C for SFP (4x25G) + + * Firmware version: 4.00 0x80011845 1.3236.0 + * Device id (pf/vf): 8086:1593 / 8086:1889 + * Driver version: 1.9.5_dirty (ice) + * OS Default DDP: 1.3.30.0 + * COMMS DDP: 1.3.37.0 + * Wireless Edge DDP: 1.3.10.0 + + * Intel\ |reg| Ethernet Controller E810-C for QSFP (2x100G) + + * Firmware version: 4.00 0x800117e8 1.3236.0 + * Device id (pf/vf): 8086:1592 / 8086:1889 + * Driver version: 1.9.5_dirty (ice) + * OS Default DDP: 1.3.30.0 + * COMMS DDP: 1.3.37.0 + * Wireless Edge DDP: 1.3.10.0 + + * Intel\ |reg| Ethernet Controller E810-XXV for SFP (2x25G) + + * Firmware version: 4.00 0x800117e5 1.3236.0 + * Device id (pf/vf): 8086:159b / 8086:1889 + * Driver version: 1.9.5_dirty (ice) + * OS Default DDP: 1.3.30.0 + * COMMS DDP: 1.3.37.0 + + * Intel\ |reg| 82599ES 10 Gigabit Ethernet Controller + + * Firmware version: 0x61bf0001 + * Device id (pf/vf): 8086:10fb / 8086:10ed + * Driver version(out-tree): 5.15.2 (ixgbe) + * Driver version(in-tree): 5.15.0-27-generic (ixgbe) + + * Intel\ |reg| Ethernet Converged Network Adapter X710-DA4 (4x10G) + + * Firmware version: 8.70 0x8000c3d5 1.3179.0 + * Device id (pf/vf): 8086:1572 / 8086:154c + * Driver version(out-tree): 2.19.3 (i40e) + * Driver version(in-tree): 5.15.0-27-generic (i40e) + + * Intel\ |reg| Corporation Ethernet Connection X722 for 10GbE SFP+ (2x10G) + + * Firmware version: 5.60 0x800035cb 1.3179.0 + * Device id (pf/vf): 8086:37d0 / 8086:37cd + * Driver version(out-tree): 2.19.3 (i40e) + * Driver version(in-tree): 5.13.0-30-generic (i40e) + + * Intel\ |reg| Corporation Ethernet Connection X722 for 10GBASE-T + + * Firmware version: 5.60 0x8000357f 1.2935.0 + * Device id (pf/vf): 8086:37d2 / 8086:37cd + * Driver version(out-tree): 2.19.3 (i40e) + * Driver version(in-tree): 5.13.0-30-generic (i40e) + + * Intel\ |reg| Ethernet Converged Network Adapter XXV710-DA2 (2x25G) + + * Firmware version: 8.70 0x8000c3eb 1.3179.0 + * Device id (pf/vf): 8086:158b / 8086:154c + * 
Driver version(out-tree): 2.19.3 (i40e) + * Driver version(in-tree): 5.15.0-27-generic (i40e) + + * Intel\ |reg| Ethernet Converged Network Adapter XL710-QDA2 (2X40G) + + * Firmware version(PF): 8.70 0x8000c40f 1.3179.0 + * Device id (pf/vf): 8086:1583 / 8086:154c + * Driver version(out-tree): 2.19.3 (i40e) + * Driver version(in-tree): 5.15.0-27-generic (i40e) + + * Intel\ |reg| Ethernet Converged Network Adapter X710-T2L + + * Firmware version: 8.70 0x8000c3e3 1.3179.0 + * Device id (pf): 8086:15ff + * Driver version: 2.19.3 (i40e) -- 2.17.1
RE: [PATCH] vdpa/sfc: resolve race between libvhost and dev_conf
[AMD Official Use Only - General] Sorry Maxime, I forgot to copy stable. Yes it's required, the faulty commit is : 630be406dcbfc26260e9d9688c40a381d0f012db -Original Message- From: Maxime Coquelin Sent: Friday, July 8, 2022 2:55 PM To: abhimanyu.sa...@xilinx.com; dev@dpdk.org Cc: chenbo@intel.com; andrew.rybche...@oktetlabs.ru; Saini, Abhimanyu Subject: Re: [PATCH] vdpa/sfc: resolve race between libvhost and dev_conf [CAUTION: External Email] On 7/8/22 11:23, Maxime Coquelin wrote: > > > On 7/6/22 11:24, abhimanyu.sa...@xilinx.com wrote: >> From: Abhimanyu Saini >> >> libvhost calls dev_conf() before prosessing the >> VHOST_USER_SET_VRING_CALL message for the last VQ. So >> this message is processed after dev_conf() returns. >> >> However, the dev_conf() function spawns a thread to set >> rte_vhost_host_notifier_ctrl() before returning control to >> libvhost. This parallel thread in turn invokes get_notify_area(). >> To get the notify_area, the vdpa driver needs to query the HW and >> for this query it needs an enabled VQ. >> >> But at the same time libvhost is processing the last >> VHOST_USER_SET_VRING_CALL, and to do that it disables the last VQ. >> >> Hence there is a race b/w the libvhost and the vdpa driver. >> >> To resolve this race condition, query the HW and cache notify_area >> inside dev_conf() instead of doing it the parallel thread. >> >> Signed-off-by: Abhimanyu Saini >> --- >> drivers/vdpa/sfc/sfc_vdpa_ops.c | 36 >> ++-- >> drivers/vdpa/sfc/sfc_vdpa_ops.h | 1 + >> 2 files changed, 19 insertions(+), 18 deletions(-) >> > > Applied to dpdk-next-virtio/main. Sorry, I notice it is missing the Fixes tag, and cc'ing stable. Can you confirm this is needed and provide the faulty commit? Thanks, Maxime > Thanks, > Maxime
Re: [PATCH V2] app/testpmd: fix display types failure when query RSS rule
On 7/8/2022 6:32 AM, Li, WeiyuanX wrote: -Original Message- From: Huisong Li Sent: Friday, July 8, 2022 9:42 AM To: ferruh.yi...@xilinx.com; andrew.rybche...@oktetlabs.ru; dev@dpdk.org Cc: tho...@monjalon.net; Li, WeiyuanX ; huangda...@huawei.com; liudongdo...@huawei.com; lihuis...@huawei.com Subject: [PATCH V2] app/testpmd: fix display types failure when query RSS rule Now testpmd fails to display types when query RSS rule. The failure is because the '\n' character is missing at the end of the function 'rss_config_display()'. Actually, all places calling 'xxx_types_display()' need to '\n'. So this patch moves '\n' to the inside of these function. Bugzilla ID: 1048 Fixes: 534988c490f1 ("app/testpmd: unify RSS types display") Fixes: 44a37f3cffe0 ("app/testpmd: compact RSS types output") --- v2: - move '\n' to the inside of 'xxx_types_display()'. Signed-off-by: Huisong Li Tested-by: Weiyuan Li Reviewed-by: Ferruh Yigit Applied to dpdk-next-net/main, thanks.
Re: [PATCH] vdpa/sfc: resolve race between libvhost and dev_conf
On 7/8/22 12:25, Saini, Abhimanyu wrote: [AMD Official Use Only - General] Sorry Maxime, I forgot to copy stable. Yes it's required, the faulty commit is : 630be406dcbfc26260e9d9688c40a381d0f012db Thanks, ammended the commit message and applied to dpdk-next-virtio/main. -Original Message- From: Maxime Coquelin Sent: Friday, July 8, 2022 2:55 PM To: abhimanyu.sa...@xilinx.com; dev@dpdk.org Cc: chenbo@intel.com; andrew.rybche...@oktetlabs.ru; Saini, Abhimanyu Subject: Re: [PATCH] vdpa/sfc: resolve race between libvhost and dev_conf [CAUTION: External Email] On 7/8/22 11:23, Maxime Coquelin wrote: On 7/6/22 11:24, abhimanyu.sa...@xilinx.com wrote: From: Abhimanyu Saini libvhost calls dev_conf() before prosessing the VHOST_USER_SET_VRING_CALL message for the last VQ. So this message is processed after dev_conf() returns. However, the dev_conf() function spawns a thread to set rte_vhost_host_notifier_ctrl() before returning control to libvhost. This parallel thread in turn invokes get_notify_area(). To get the notify_area, the vdpa driver needs to query the HW and for this query it needs an enabled VQ. But at the same time libvhost is processing the last VHOST_USER_SET_VRING_CALL, and to do that it disables the last VQ. Hence there is a race b/w the libvhost and the vdpa driver. To resolve this race condition, query the HW and cache notify_area inside dev_conf() instead of doing it the parallel thread. Signed-off-by: Abhimanyu Saini --- drivers/vdpa/sfc/sfc_vdpa_ops.c | 36 ++-- drivers/vdpa/sfc/sfc_vdpa_ops.h | 1 + 2 files changed, 19 insertions(+), 18 deletions(-) Applied to dpdk-next-virtio/main. Sorry, I notice it is missing the Fixes tag, and cc'ing stable. Can you confirm this is needed and provide the faulty commit? Thanks, Maxime Thanks, Maxime
Re: [PATCH 2/2] net: have checksum routines accept unaligned data
On 2022-07-07 23:44, Morten Brørup wrote: >> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se] >> Sent: Thursday, 7 July 2022 20.35 >> >> From: Mattias Rönnblom >> >> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its >> data through an uint16_t pointer, which allowed the compiler to assume >> the data was 16-bit aligned. This in turn would, with certain >> architectures and compiler flag combinations, result in code with SIMD >> load or store instructions with restrictions on data alignment. >> >> This patch keeps the old algorithm, but data is read using memcpy() >> instead of direct pointer access, forcing the compiler to always >> generate code that handles unaligned input. The __may_alias__ GCC >> attribute is no longer needed. >> >> The data on which the Internet checksum functions operates are almost >> always 16-bit aligned, but there are exceptions. In particular, the >> PDCP protocol header may (literally) have an odd size. >> >> Performance impact seems to range from none to a very slight >> regression. >> >> Bugzilla ID: 1035 >> Cc: sta...@dpdk.org >> >> Signed-off-by: Mattias Rönnblom >> --- >> lib/net/rte_ip.h | 19 --- >> 1 file changed, 12 insertions(+), 7 deletions(-) >> >> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h >> index b502481670..a9e6251f14 100644 >> --- a/lib/net/rte_ip.h >> +++ b/lib/net/rte_ip.h >> @@ -160,18 +160,23 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr >> *ipv4_hdr) >> static inline uint32_t >> __rte_raw_cksum(const void *buf, size_t len, uint32_t sum) >> { >> -/* extend strict-aliasing rules */ >> -typedef uint16_t __attribute__((__may_alias__)) u16_p; >> -const u16_p *u16_buf = (const u16_p *)buf; >> -const u16_p *end = u16_buf + len / sizeof(*u16_buf); >> +const void *end; > > I would set "end" here instead, possibly making the pointer const too. And > add spaces around '/'. > const void * const end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) * > sizeof(uint16_t)); > I don't think that makes the code more readable. >> >> -for (; u16_buf != end; ++u16_buf) >> -sum += *u16_buf; >> +for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) * >> sizeof(uint16_t)); >> + buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) { >> +uint16_t v; >> + >> +memcpy(&v, buf, sizeof(uint16_t)); >> +sum += v; >> +} >> >> /* if length is odd, keeping it byte order independent */ >> if (unlikely(len % 2)) { >> +uint8_t last; >> uint16_t left = 0; >> -*(unsigned char *)&left = *(const unsigned char *)end; >> + >> +memcpy(&last, end, 1); >> +*(unsigned char *)&left = last; > > Couldn't you just memcpy(&left, end, 1), and omit the temporary variable > "last"? > Good point. I don't like how this code is clever vis-à-vis byte order, but then I also don't have a better suggestion. >> sum += left; >> } >> >> -- >> 2.25.1 >> > > With our without my suggested changes, it looks good. > > Reviewed-by: Morten Brørup > Thanks!
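The core of the discussion is a well-known portability technique: a fixed-size memcpy() tells the compiler nothing about alignment, yet on architectures that allow unaligned access it still compiles down to a single load. A stand-alone illustration (not the rte_ip.h code itself):

#include <stdint.h>
#include <string.h>

/* Dereferencing a uint16_t * lets the compiler assume 2-byte alignment
 * (and, when vectorizing, emit alignment-restricted SIMD loads);
 * copying through memcpy() removes that assumption while typically
 * costing nothing for a small constant size. */
static inline uint16_t
load_u16_unaligned(const void *p)
{
        uint16_t v;

        memcpy(&v, p, sizeof(v));
        return v;
}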
Re: [PATCH v3] config/arm: add Phytium FT-2000+
05/07/2022 15:50, Ruifeng Wang: > > From: luzhipeng > > > > Here adds configs for Phytium server. > > > > Signed-off-by: luzhipeng > > Reviewed-by: Thomas Monjalon > > --- > > I did some changes in this v3: > > - PHYTIUM -> Phytium (as on the website) > > - ft2000plus -> FT-2000+ (as found online) > > - alphabetical ordering > > - extra commas for future extensions > > - g++ as cpp Meson command > > > > Please tell how to write your name with spaces and capitals in the order > > [first > > name] [family name], thanks. > > --- > > > > > implementer_qualcomm = { > > 'description': 'Qualcomm', > > 'flags': [ > > @@ -214,7 +228,7 @@ implementer_qualcomm = { > > '0xc00': { > > 'march': 'armv8-a', > > 'march_features': ['crc'] > > -} > > +}, > > Unintentional change? > The change is valid, but not related to this patch. > > Otherwise looks good to me. > Acked-by: Ruifeng Wang Applied (without above unintentional change), thanks.
[PATCH 1/2] test/service: add perf measurements for with stats mode
This commit improves the performance reporting of the service cores polling loop to show both with and without statistics collection modes. Collecting cycle statistics is costly, due to calls to rte_rdtsc() per service iteration. Reported-by: Mattias Rönnblom Suggested-by: Honnappa Nagarahalli Suggested-by: Morten Brørup Signed-off-by: Harry van Haaren --- This is split out as a seperate patch from the fix to allow measuring the before/after of the service stats atomic fixup. --- app/test/test_service_cores.c | 36 --- 1 file changed, 25 insertions(+), 11 deletions(-) diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c index ced6ed0081..7415b6b686 100644 --- a/app/test/test_service_cores.c +++ b/app/test/test_service_cores.c @@ -777,6 +777,22 @@ service_run_on_app_core_func(void *arg) return rte_service_run_iter_on_app_lcore(*delay_service_id, 1); } +static float +service_app_lcore_perf_measure(uint32_t id) +{ + /* Performance test: call in a loop, and measure tsc() */ + const uint32_t perf_iters = (1 << 12); + uint64_t start = rte_rdtsc(); + uint32_t i; + for (i = 0; i < perf_iters; i++) { + int err = service_run_on_app_core_func(&id); + TEST_ASSERT_EQUAL(0, err, "perf test: returned run failure"); + } + uint64_t end = rte_rdtsc(); + + return (end - start)/(float)perf_iters; +} + static int service_app_lcore_poll_impl(const int mt_safe) { @@ -828,17 +844,15 @@ service_app_lcore_poll_impl(const int mt_safe) "MT Unsafe: App core1 didn't return -EBUSY"); } - /* Performance test: call in a loop, and measure tsc() */ - const uint32_t perf_iters = (1 << 12); - uint64_t start = rte_rdtsc(); - uint32_t i; - for (i = 0; i < perf_iters; i++) { - int err = service_run_on_app_core_func(&id); - TEST_ASSERT_EQUAL(0, err, "perf test: returned run failure"); - } - uint64_t end = rte_rdtsc(); - printf("perf test for %s: %0.1f cycles per call\n", mt_safe ? - "MT Safe" : "MT Unsafe", (end - start)/(float)perf_iters); + /* Measure performance of no-stats and with-stats. */ + float cyc_no_stats = service_app_lcore_perf_measure(id); + + TEST_ASSERT_EQUAL(0, rte_service_set_stats_enable(id, 1), + "failed to enable stats for service."); + float cyc_with_stats = service_app_lcore_perf_measure(id); + + printf("perf test for %s, no stats: %0.1f, with stats %0.1f cycles/call\n", + mt_safe ? "MT Safe" : "MT Unsafe", cyc_no_stats, cyc_with_stats); unregister_all(); return TEST_SUCCESS; -- 2.32.0
[PATCH 2/2] service: fix potential stats race-condition on MT services
This commit fixes a potential racey-add that could occur if multiple service-lcores were executing the same MT-safe service at the same time, with service statistics collection enabled. Because multiple threads can run and execute the service, the stats values can have multiple writer threads, resulting in the requirement of using atomic addition for correctness. Note that when a MT unsafe service is executed, a spinlock is held, so the stats increments are protected. This fact is used to avoid executing atomic add instructions when not required. This patch causes a 1.25x increase in cycle-cost for polling a MT safe service when statistics are enabled. No change was seen for MT unsafe services, or when statistics are disabled. Reported-by: Mattias Rönnblom Suggested-by: Honnappa Nagarahalli Suggested-by: Morten Brørup Signed-off-by: Harry van Haaren --- --- lib/eal/common/rte_service.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c index ef31b1f63c..f045e74ef3 100644 --- a/lib/eal/common/rte_service.c +++ b/lib/eal/common/rte_service.c @@ -363,9 +363,15 @@ service_runner_do_callback(struct rte_service_spec_impl *s, uint64_t start = rte_rdtsc(); s->spec.callback(userdata); uint64_t end = rte_rdtsc(); - s->cycles_spent += end - start; + uint64_t cycles = end - start; cs->calls_per_service[service_idx]++; - s->calls++; + if (service_mt_safe(s)) { + __atomic_fetch_add(&s->cycles_spent, cycles, __ATOMIC_RELAXED); + __atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED); + } else { + s->cycles_spent += cycles; + s->calls++; + } } else s->spec.callback(userdata); } -- 2.32.0
RE: Service core statistics MT safety
> -Original Message- > From: Honnappa Nagarahalli > Sent: Thursday, July 7, 2022 6:26 PM > To: Van Haaren, Harry ; Morten Brørup > ; Mattias Rönnblom ; > mattias.ronnblom ; dev@dpdk.org > Cc: nd ; nd > Subject: RE: Service core statistics MT safety > > > > Yes, understood and agree. Per-core counters are preferred, as relaxed > > atomic > > ordered stores are enough to ensure non-tearing loads. Summation before > > reporting back from the stats-requesting thread. > > For backporting, per-core counters is a significant change. Is using > > atomics to fix > > the miss-behaviour a better idea? > Agree, the change will be simple. > > This will result in some perf impact which can be mitigated by disabling the > stats > where possible. Yes, agree, disabling stats is a workaround if they're not required yep. Correctness > performance. We can optimize to per-CPU stats in a future release. > > My question below is still open, is the below enough to fix the > > *functional* part > > of MT services? > > > > Code today uses normal ADD/INCQ (and hence the MT increments bug of > > tear/clobber exists) > >0x5649189d <+141>: add%rax,0x50(%rbx) > >0x564918a1 <+145>: incq 0x48(%rbx) > > > > For x86, my disassembly for RELAXED and SEQ_CST orderings are the same, > > using LOCK prefix ("proper atomics") > >0x5649189d <+141>: lock add %rax,0x50(%rbx) > >0x564918a2 <+146>: lock incq 0x48(%rbx) > > > > My understanding is that since 2x threads can write the same address, > > SEQ_CST > > is required. > > Can somebody more familiar with C11 confirm that? > We need to use RELAXED. You can think of it as, 'do we need a particular > order for > memory operations within a single thread?' In this case, we do not need a > particular > order for incrementing stats in the single thread context. Thanks for confirming; makes sense. 2 patches sent, you're on CC.
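To make the per-core counter idea above concrete, here is a minimal standalone sketch (not the rte_service implementation; the struct and function names are illustrative): each service lcore owns one slot and publishes its counters with relaxed atomic stores, and the reporting thread sums the slots with relaxed loads, so no lock-prefixed read-modify-write is needed and readers never observe torn 64-bit values.

#include <stdint.h>
#include <rte_common.h>
#include <rte_lcore.h>

struct lcore_service_stats {
	uint64_t calls;
	uint64_t cycles;
} __rte_cache_aligned;

static struct lcore_service_stats svc_stats[RTE_MAX_LCORE];

/* Writer side: runs on the service lcore, single writer per slot. */
static inline void
svc_stats_update(uint64_t cycles)
{
	struct lcore_service_stats *st = &svc_stats[rte_lcore_id()];

	/* Plain arithmetic, but publish with an atomic store so a
	 * concurrent reader cannot see a half-written counter. */
	__atomic_store_n(&st->calls, st->calls + 1, __ATOMIC_RELAXED);
	__atomic_store_n(&st->cycles, st->cycles + cycles, __ATOMIC_RELAXED);
}

/* Reader side: sum all slots when statistics are requested. */
static inline void
svc_stats_read(uint64_t *calls, uint64_t *cycles)
{
	unsigned int i;

	*calls = 0;
	*cycles = 0;
	for (i = 0; i < RTE_MAX_LCORE; i++) {
		*calls += __atomic_load_n(&svc_stats[i].calls, __ATOMIC_RELAXED);
		*cycles += __atomic_load_n(&svc_stats[i].cycles, __ATOMIC_RELAXED);
	}
}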
[PATCH v2 2/2] net: have checksum routines accept unaligned data
__rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its data through an uint16_t pointer, which allowed the compiler to assume the data was 16-bit aligned. This in turn would, with certain architectures and compiler flag combinations, result in code with SIMD load or store instructions with restrictions on data alignment. This patch keeps the old algorithm, but data is read using memcpy() instead of direct pointer access, forcing the compiler to always generate code that handles unaligned input. The __may_alias__ GCC attribute is no longer needed. The data on which the Internet checksum functions operates are almost always 16-bit aligned, but there are exceptions. In particular, the PDCP protocol header may (literally) have an odd size. Performance impact seems to range from none to a very slight regression. Bugzilla ID: 1035 Cc: sta...@dpdk.org --- v2: * Simplified the odd-length conditional (Morten Brørup). Reviewed-by: Morten Brørup Signed-off-by: Mattias Rönnblom --- lib/net/rte_ip.h | 17 ++--- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index b502481670..a0334d931e 100644 --- a/lib/net/rte_ip.h +++ b/lib/net/rte_ip.h @@ -160,18 +160,21 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr *ipv4_hdr) static inline uint32_t __rte_raw_cksum(const void *buf, size_t len, uint32_t sum) { - /* extend strict-aliasing rules */ - typedef uint16_t __attribute__((__may_alias__)) u16_p; - const u16_p *u16_buf = (const u16_p *)buf; - const u16_p *end = u16_buf + len / sizeof(*u16_buf); + const void *end; - for (; u16_buf != end; ++u16_buf) - sum += *u16_buf; + for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) * sizeof(uint16_t)); +buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) { + uint16_t v; + + memcpy(&v, buf, sizeof(uint16_t)); + sum += v; + } /* if length is odd, keeping it byte order independent */ if (unlikely(len % 2)) { uint16_t left = 0; - *(unsigned char *)&left = *(const unsigned char *)end; + + memcpy(&left, end, 1); sum += left; } -- 2.25.1
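For reference, a small caller sketch (assuming a DPDK build environment; not part of the patch) exercising the kind of misaligned, odd-length input this change makes safe on ISAs with alignment-restricted vector loads:

#include <stdint.h>
#include <stdio.h>
#include <rte_ip.h>

int
main(void)
{
	uint8_t raw[32];
	size_t i;

	for (i = 0; i < sizeof(raw); i++)
		raw[i] = (uint8_t)i;

	/* 16-bit aligned, even length: the common case. */
	uint16_t aligned_sum = rte_raw_cksum(raw, 20);

	/* Misaligned start and odd length, e.g. a PDCP-style header:
	 * with the memcpy()-based implementation the compiler can no
	 * longer assume 16-bit alignment for these loads. */
	uint16_t unaligned_sum = rte_raw_cksum(raw + 1, 21);

	printf("aligned=0x%04x unaligned=0x%04x\n", aligned_sum, unaligned_sum);
	return 0;
}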
[PATCH v2 1/2] app/test: add cksum performance test
Add performance test for the rte_raw_cksum() function, which delegates the actual work to __rte_raw_cksum(), which in turn is used by other functions in need of Internet checksum calculation. Signed-off-by: Mattias Rönnblom --- v2: * Added __rte_unused to unused volatile variable, to keep the Intel compiler happy. --- MAINTAINERS| 1 + app/test/meson.build | 1 + app/test/test_cksum_perf.c | 118 + 3 files changed, 120 insertions(+) create mode 100644 app/test/test_cksum_perf.c diff --git a/MAINTAINERS b/MAINTAINERS index c923712946..2a4c99e05a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1414,6 +1414,7 @@ Network headers M: Olivier Matz F: lib/net/ F: app/test/test_cksum.c +F: app/test/test_cksum_perf.c Packet CRC M: Jasvinder Singh diff --git a/app/test/meson.build b/app/test/meson.build index 431c5bd318..191db03d1d 100644 --- a/app/test/meson.build +++ b/app/test/meson.build @@ -18,6 +18,7 @@ test_sources = files( 'test_bpf.c', 'test_byteorder.c', 'test_cksum.c', +'test_cksum_perf.c', 'test_cmdline.c', 'test_cmdline_cirbuf.c', 'test_cmdline_etheraddr.c', diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c new file mode 100644 index 00..bff73cb3bb --- /dev/null +++ b/app/test/test_cksum_perf.c @@ -0,0 +1,118 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright(c) 2022 Ericsson AB + */ + +#include + +#include +#include +#include +#include +#include + +#include "test.h" + +#define NUM_BLOCKS (10) +#define ITERATIONS (100) + +static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 }; + +static __rte_noinline uint16_t +do_rte_raw_cksum(const void *buf, size_t len) +{ + return rte_raw_cksum(buf, len); +} + +static void +init_block(void *buf, size_t len) +{ + size_t i; + + for (i = 0; i < len; i++) + ((char *)buf)[i] = (uint8_t)rte_rand(); +} + +static int +test_cksum_perf_size_alignment(size_t block_size, bool aligned) +{ + char *data[NUM_BLOCKS]; + char *blocks[NUM_BLOCKS]; + unsigned int i; + uint64_t start; + uint64_t end; + /* Floating point to handle low (pseudo-)TSC frequencies */ + double block_latency; + double byte_latency; + volatile __rte_unused uint64_t sum = 0; + + for (i = 0; i < NUM_BLOCKS; i++) { + data[i] = rte_malloc(NULL, block_size + 1, 0); + + if (data[i] == NULL) { + printf("Failed to allocate memory for block\n"); + return TEST_FAILED; + } + + init_block(data[i], block_size + 1); + + blocks[i] = aligned ? data[i] : data[i] + 1; + } + + start = rte_rdtsc(); + + for (i = 0; i < ITERATIONS; i++) { + unsigned int j; + for (j = 0; j < NUM_BLOCKS; j++) + sum += do_rte_raw_cksum(blocks[j], block_size); + } + + end = rte_rdtsc(); + + block_latency = (end - start) / (double)(ITERATIONS * NUM_BLOCKS); + byte_latency = block_latency / block_size; + + printf("%-9s %10zd %19.1f %16.2f\n", aligned ? 
"Aligned" : "Unaligned", + block_size, block_latency, byte_latency); + + for (i = 0; i < NUM_BLOCKS; i++) + rte_free(data[i]); + + return TEST_SUCCESS; +} + +static int +test_cksum_perf_size(size_t block_size) +{ + int rc; + + rc = test_cksum_perf_size_alignment(block_size, true); + if (rc != TEST_SUCCESS) + return rc; + + rc = test_cksum_perf_size_alignment(block_size, false); + + return rc; +} + +static int +test_cksum_perf(void) +{ + uint16_t i; + + printf("### rte_raw_cksum() performance ###\n"); + printf("Alignment Block sizeTSC cycles/block TSC cycles/byte\n"); + + for (i = 0; i < RTE_DIM(data_sizes); i++) { + int rc; + + rc = test_cksum_perf_size(data_sizes[i]); + if (rc != TEST_SUCCESS) + return rc; + } + + return TEST_SUCCESS; +} + + +REGISTER_TEST_COMMAND(cksum_perf_autotest, test_cksum_perf); + -- 2.25.1
RE: [PATCH 2/2] net: have checksum routines accept unaligned data
> From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com] > Sent: Friday, 8 July 2022 14.44 > > On 2022-07-07 23:44, Morten Brørup wrote: > >> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se] > >> Sent: Thursday, 7 July 2022 20.35 > >> > >> From: Mattias Rönnblom > >> > >> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed > its > >> data through an uint16_t pointer, which allowed the compiler to > assume > >> the data was 16-bit aligned. This in turn would, with certain > >> architectures and compiler flag combinations, result in code with > SIMD > >> load or store instructions with restrictions on data alignment. > >> > >> This patch keeps the old algorithm, but data is read using memcpy() > >> instead of direct pointer access, forcing the compiler to always > >> generate code that handles unaligned input. The __may_alias__ GCC > >> attribute is no longer needed. > >> > >> The data on which the Internet checksum functions operates are > almost > >> always 16-bit aligned, but there are exceptions. In particular, the > >> PDCP protocol header may (literally) have an odd size. > >> > >> Performance impact seems to range from none to a very slight > >> regression. > >> > >> Bugzilla ID: 1035 > >> Cc: sta...@dpdk.org > >> > >> Signed-off-by: Mattias Rönnblom > >> --- > >> lib/net/rte_ip.h | 19 --- > >> 1 file changed, 12 insertions(+), 7 deletions(-) > >> > >> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h > >> index b502481670..a9e6251f14 100644 > >> --- a/lib/net/rte_ip.h > >> +++ b/lib/net/rte_ip.h > >> @@ -160,18 +160,23 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr > >> *ipv4_hdr) > >> static inline uint32_t > >> __rte_raw_cksum(const void *buf, size_t len, uint32_t sum) > >> { > >> - /* extend strict-aliasing rules */ > >> - typedef uint16_t __attribute__((__may_alias__)) u16_p; > >> - const u16_p *u16_buf = (const u16_p *)buf; > >> - const u16_p *end = u16_buf + len / sizeof(*u16_buf); > >> + const void *end; > > > > I would set "end" here instead, possibly making the pointer const > too. And add spaces around '/'. > > const void * const end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) * > sizeof(uint16_t)); > > > > I don't think that makes the code more readable. It's only a matter of taste... Your code, your decision. :-) I think the spaces are required by the coding standard; not sure, though. > > >> > >> - for (; u16_buf != end; ++u16_buf) > >> - sum += *u16_buf; > >> + for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) * > >> sizeof(uint16_t)); > >> + buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) { > >> + uint16_t v; > >> + > >> + memcpy(&v, buf, sizeof(uint16_t)); > >> + sum += v; > >> + } > >> > >>/* if length is odd, keeping it byte order independent */ > >>if (unlikely(len % 2)) { > >> + uint8_t last; > >>uint16_t left = 0; > >> - *(unsigned char *)&left = *(const unsigned char *)end; > >> + > >> + memcpy(&last, end, 1); > >> + *(unsigned char *)&left = last; > > > > Couldn't you just memcpy(&left, end, 1), and omit the temporary > variable "last"? > > > > Good point. > > I don't like how this code is clever vis-à-vis byte order, but then I > also don't have a better suggestion. The byte ordering cleverness has its roots in RFC 1071. Stephen suggested using a union, although in a slightly different context. I'm not sure it will be more readable here, because it will require #ifdef to support byte ordering. Just thought I'd mention it, for your consideration. Your patch v2 just reached my inbox, and it looks good. 
No further response to this email is expected.
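For readers tracing the RFC 1071 trick discussed above, the following is an equivalent but more explicit formulation of the odd-byte handling (an illustration only, not a proposed change); it shows the byte-order #ifdef that the memcpy-into-the-first-byte approach avoids:

#include <stdint.h>
#include <rte_byteorder.h>

/* Add the final odd byte of a buffer to a 16-bit-word checksum
 * accumulator. The byte occupies the first memory position of its
 * 16-bit word, which is the least significant byte on little-endian
 * hosts and the most significant byte on big-endian hosts. */
static inline uint32_t
cksum_add_last_byte(uint32_t sum, uint8_t last)
{
#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
	sum += last;
#else
	sum += (uint32_t)last << 8;
#endif
	return sum;
}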
[PATCH v9] sched: enable CMAN at runtime
Added changes to enable CMAN (RED or PIE) at init from profile configuration file. By default CMAN code is enable but not in use, when there is no RED or PIE profile configured. Signed-off-by: Marcin Danilewicz --- Log: v2 change in rte_sched.h to avoid ABI breakage. v3 changes from comments v4 rebase to 22.07-rc1 v5 rebase to main latest v6 commit message fixed v7 changes from comments v8 with changes from comments v9 changes from comments tmgr.c cman_params set to null qos_sched/cfg_file.c removed redundant cman_params to NULL assignement subport_params[].cman_params assigned only when CMAN enabled --- config/rte_config.h | 3 - drivers/net/softnic/rte_eth_softnic_tm.c | 12 -- examples/ip_pipeline/tmgr.c | 6 +- examples/qos_sched/cfg_file.c| 49 +--- examples/qos_sched/cfg_file.h| 5 - examples/qos_sched/init.c| 76 +--- examples/qos_sched/main.h| 2 - examples/qos_sched/profile.cfg | 135 + examples/qos_sched/profile_pie.cfg | 142 ++ examples/qos_sched/profile_red.cfg | 143 +++ lib/sched/rte_sched.c| 47 +--- 11 files changed, 298 insertions(+), 322 deletions(-) create mode 100644 examples/qos_sched/profile_pie.cfg create mode 100644 examples/qos_sched/profile_red.cfg diff --git a/config/rte_config.h b/config/rte_config.h index 46549cb062..ae56a86394 100644 --- a/config/rte_config.h +++ b/config/rte_config.h @@ -88,9 +88,6 @@ /* rte_power defines */ #define RTE_MAX_LCORE_FREQS 64 -/* rte_sched defines */ -// RTE_SCHED_CMAN is not set - /* rte_graph defines */ #define RTE_GRAPH_BURST_SIZE 256 #define RTE_LIBRTE_GRAPH_STATS 1 diff --git a/drivers/net/softnic/rte_eth_softnic_tm.c b/drivers/net/softnic/rte_eth_softnic_tm.c index 6a7766ba1c..3e4bed81e9 100644 --- a/drivers/net/softnic/rte_eth_softnic_tm.c +++ b/drivers/net/softnic/rte_eth_softnic_tm.c @@ -420,11 +420,7 @@ pmd_tm_node_type_get(struct rte_eth_dev *dev, return 0; } -#ifdef RTE_SCHED_CMAN -#define WRED_SUPPORTED 1 -#else #define WRED_SUPPORTED 0 -#endif #define STATS_MASK_DEFAULT \ (RTE_TM_STATS_N_PKTS | \ @@ -2300,8 +2296,6 @@ tm_tc_wred_profile_get(struct rte_eth_dev *dev, uint32_t tc_id) return NULL; } -#ifdef RTE_SCHED_CMAN - static void wred_profiles_set(struct rte_eth_dev *dev, uint32_t subport_id) { @@ -2325,12 +2319,6 @@ wred_profiles_set(struct rte_eth_dev *dev, uint32_t subport_id) } } -#else - -#define wred_profiles_set(dev, subport_id) - -#endif - static struct tm_shared_shaper * tm_tc_shared_shaper_get(struct rte_eth_dev *dev, struct tm_node *tc_node) { diff --git a/examples/ip_pipeline/tmgr.c b/examples/ip_pipeline/tmgr.c index b138e885cf..18bee1e0fc 100644 --- a/examples/ip_pipeline/tmgr.c +++ b/examples/ip_pipeline/tmgr.c @@ -17,7 +17,6 @@ static uint32_t n_subport_profiles; static struct rte_sched_pipe_params pipe_profile[TMGR_PIPE_PROFILE_MAX]; -#ifdef RTE_SCHED_CMAN static struct rte_sched_cman_params cman_params = { .red_params = { /* Traffic Class 0 Colors Green / Yellow / Red */ @@ -86,7 +85,6 @@ static struct rte_sched_cman_params cman_params = { [12][2] = {.min_th = 32, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9}, }, }; -#endif /* RTE_SCHED_CMAN */ static uint32_t n_pipe_profiles; @@ -96,9 +94,7 @@ static const struct rte_sched_subport_params subport_params_default = { .pipe_profiles = pipe_profile, .n_pipe_profiles = 0, /* filled at run time */ .n_max_pipe_profiles = RTE_DIM(pipe_profile), -#ifdef RTE_SCHED_CMAN - .cman_params = &cman_params, -#endif /* RTE_SCHED_CMAN */ + .cman_params = NULL, }; static struct tmgr_port_list tmgr_port_list; diff --git a/examples/qos_sched/cfg_file.c 
b/examples/qos_sched/cfg_file.c index 450482f07d..deaa28d391 100644 --- a/examples/qos_sched/cfg_file.c +++ b/examples/qos_sched/cfg_file.c @@ -23,6 +23,8 @@ uint32_t active_queues[RTE_SCHED_QUEUES_PER_PIPE]; uint32_t n_active_queues; +struct rte_sched_cman_params cman_params; + int cfg_load_port(struct rte_cfgfile *cfg, struct rte_sched_port_params *port_params) { @@ -229,40 +231,6 @@ cfg_load_subport_profile(struct rte_cfgfile *cfg, return 0; } -#ifdef RTE_SCHED_CMAN -void set_subport_cman_params(struct rte_sched_subport_params *subport_p, - struct rte_sched_cman_params cman_p) -{ - int j, k; - subport_p->cman_params->cman_mode = cman_p.cman_mode; - - for (j = 0; j < RTE_SCHED_TRAFFIC_CLASSES_PER_PIPE; j++) { - if (su
RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
> From: Harry van Haaren [mailto:harry.van.haa...@intel.com] > Sent: Friday, 8 July 2022 14.57 > > This commit fixes a potential racey-add that could occur if > multiple service-lcores were executing the same MT-safe service > at the same time, with service statistics collection enabled. > > Because multiple threads can run and execute the service, the > stats values can have multiple writer threads, resulting in the > requirement of using atomic addition for correctness. > > Note that when a MT unsafe service is executed, a spinlock is > held, so the stats increments are protected. This fact is used > to avoid executing atomic add instructions when not required. > > This patch causes a 1.25x increase in cycle-cost for polling a > MT safe service when statistics are enabled. No change was seen > for MT unsafe services, or when statistics are disabled. > > Reported-by: Mattias Rönnblom > Suggested-by: Honnappa Nagarahalli > Suggested-by: Morten Brørup > Signed-off-by: Harry van Haaren > > --- [...] > + if (service_mt_safe(s)) { > + __atomic_fetch_add(&s->cycles_spent, cycles, > __ATOMIC_RELAXED); > + __atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED); > + } else { > + s->cycles_spent += cycles; > + s->calls++; > + } Have you considered the performance cost of the __atomic_fetch_add(__ATOMIC_RELAXED) versus the performance cost of the branch to compare if the service is MT safe? It might be cheaper to just always use the atomic addition. I don't know, just mentioning that the compare-and-branch also has a cost. I'm not familiar with the DPDK services library, so perhaps MT safe and MT unsafe services are never mixed, in which case the branch will always take the same path, so branch prediction will eliminate the cost of branching.
RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
> -Original Message- > From: Morten Brørup > Sent: Friday, July 8, 2022 2:23 PM > To: Van Haaren, Harry ; dev@dpdk.org > Cc: mattias.ronnblom ; Honnappa Nagarahalli > > Subject: RE: [PATCH 2/2] service: fix potential stats race-condition on MT > services > > This patch causes a 1.25x increase in cycle-cost for polling a > > MT safe service when statistics are enabled. No change was seen > > for MT unsafe services, or when statistics are disabled. > > > > Reported-by: Mattias Rönnblom > > Suggested-by: Honnappa Nagarahalli > > Suggested-by: Morten Brørup > > Signed-off-by: Harry van Haaren > > > > --- > > [...] > > > + if (service_mt_safe(s)) { > > + __atomic_fetch_add(&s->cycles_spent, cycles, > > __ATOMIC_RELAXED); > > + __atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED); > > + } else { > > + s->cycles_spent += cycles; > > + s->calls++; > > + } > > Have you considered the performance cost of the > __atomic_fetch_add(__ATOMIC_RELAXED) versus the performance cost of the > branch to compare if the service is MT safe? It might be cheaper to just > always use > the atomic addition. I don't know, just mentioning that the > compare-and-branch also > has a cost. Great question! > I'm not familiar with the DPDK services library, so perhaps MT safe and MT > unsafe > services are never mixed, in which case the branch will always take the same > path, > so branch prediction will eliminate the cost of branching. MT safe & unsafe can be mixed yes, so you're right, there may be mis-predicts. Note that assuming a service is actually doing something useful, there's likely quite a few branches between each call.. so unknown how fresh/accurate the branch history will be. The common case is for many services to be "mt unsafe" (think polling an ethdev queue). In this case, it is nice to try reduce cost. Given this is likely the highest quantity of services, we'd like the performance here to be reduced the least. The branch method achieves that. I did benchmark the "always use atomic" case, and it caused a ~30cycle hit in the "mt unsafe" case, where the atomic is not required (and hence the performance hit can be avoided by branching). Given branch-misses are handled between 15-20 cycles (uarch dependent), attempting to avoid the atomic still makes sense from cycle-cost perspective too I think.. I did spend the morning benchmarking solutions (and hence the patch split, to allow easy benchmarking before & after), so thanks for asking! Regards, -Harry
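For anyone wanting to reproduce this kind of comparison, here is a rough standalone sketch (not the benchmark behind the numbers quoted above; results vary by microarchitecture, and the loop only measures back-to-back increments):

#include <stdint.h>
#include <stdio.h>
#include <rte_cycles.h>

#define ITERS (1u << 20)

static volatile uint64_t counter;

int
main(void)
{
	volatile int mt_safe = 0;	/* volatile keeps the branch in the loop */
	uint64_t start, atomic_cycles, branch_cycles;
	uint32_t i;

	start = rte_rdtsc();
	for (i = 0; i < ITERS; i++)
		__atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
	atomic_cycles = rte_rdtsc() - start;

	start = rte_rdtsc();
	for (i = 0; i < ITERS; i++) {
		if (mt_safe)
			__atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
		else
			counter++;	/* plain add, as in the MT-unsafe path */
	}
	branch_cycles = rte_rdtsc() - start;

	printf("atomic add: %.2f cycles/iter, branch + plain add: %.2f cycles/iter\n",
			(double)atomic_cycles / ITERS,
			(double)branch_cycles / ITERS);
	return 0;
}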
Re: [PATCH 2/2] service: fix potential stats race-condition on MT services
On 2022-07-08 14:56, Harry van Haaren wrote: > This commit fixes a potential racey-add that could occur if > multiple service-lcores were executing the same MT-safe service > at the same time, with service statistics collection enabled. > > Because multiple threads can run and execute the service, the > stats values can have multiple writer threads, resulting in the > requirement of using atomic addition for correctness. > > Note that when a MT unsafe service is executed, a spinlock is > held, so the stats increments are protected. This fact is used > to avoid executing atomic add instructions when not required. > > This patch causes a 1.25x increase in cycle-cost for polling a > MT safe service when statistics are enabled. No change was seen > for MT unsafe services, or when statistics are disabled. > > Reported-by: Mattias Rönnblom > Suggested-by: Honnappa Nagarahalli > Suggested-by: Morten Brørup > Signed-off-by: Harry van Haaren > > --- > --- > lib/eal/common/rte_service.c | 10 -- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c > index ef31b1f63c..f045e74ef3 100644 > --- a/lib/eal/common/rte_service.c > +++ b/lib/eal/common/rte_service.c > @@ -363,9 +363,15 @@ service_runner_do_callback(struct rte_service_spec_impl > *s, > uint64_t start = rte_rdtsc(); > s->spec.callback(userdata); > uint64_t end = rte_rdtsc(); > - s->cycles_spent += end - start; > + uint64_t cycles = end - start; > cs->calls_per_service[service_idx]++; > - s->calls++; > + if (service_mt_safe(s)) { > + __atomic_fetch_add(&s->cycles_spent, cycles, > __ATOMIC_RELAXED); > + __atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED); > + } else { > + s->cycles_spent += cycles; > + s->calls++; > + } > } else > s->spec.callback(userdata); > } Reviewed-by: Mattias Rönnblom Thanks for the fix Harry.
[PATCH] vhost/crypto: fix out of bound access
Coverity issue: 379211 Fixes: 4414bb67010d ("vhost/crypto: fix build with GCC 12") Cc: david.march...@redhat.com Signed-off-by: Fan Zhang --- lib/vhost/vhost_crypto.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/lib/vhost/vhost_crypto.c b/lib/vhost/vhost_crypto.c index 54946f46d9..e9c3322d20 100644 --- a/lib/vhost/vhost_crypto.c +++ b/lib/vhost/vhost_crypto.c @@ -574,12 +574,11 @@ copy_data_from_desc(void *dst, struct vhost_crypto_data_req *vc_req, remain = RTE_MIN(desc->len, size); addr = desc->addr; - do { - uint64_t len; - void *src; - len = remain; - src = IOVA_TO_VVA(void *, vc_req, addr, &len, VHOST_ACCESS_RO); +while (remain) { +uint64_t len = remain; + void *src = IOVA_TO_VVA(void *, vc_req, addr, &len, VHOST_ACCESS_RO); + if (unlikely(src == NULL || len == 0)) return 0; @@ -588,7 +587,7 @@ copy_data_from_desc(void *dst, struct vhost_crypto_data_req *vc_req, /* cast is needed for 32-bit architecture */ dst = RTE_PTR_ADD(dst, (size_t)len); addr += len; - } while (unlikely(remain != 0)); +} return RTE_MIN(desc->len, size); } -- 2.34.1
Re: [PATCH 2/2] net: have checksum routines accept unaligned data
On 2022-07-08 15:02, Morten Brørup wrote: >> From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com] >> Sent: Friday, 8 July 2022 14.44 >> >> On 2022-07-07 23:44, Morten Brørup wrote: From: Mattias Rönnblom [mailto:hof...@lysator.liu.se] Sent: Thursday, 7 July 2022 20.35 From: Mattias Rönnblom __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed >> its data through an uint16_t pointer, which allowed the compiler to >> assume the data was 16-bit aligned. This in turn would, with certain architectures and compiler flag combinations, result in code with >> SIMD load or store instructions with restrictions on data alignment. This patch keeps the old algorithm, but data is read using memcpy() instead of direct pointer access, forcing the compiler to always generate code that handles unaligned input. The __may_alias__ GCC attribute is no longer needed. The data on which the Internet checksum functions operates are >> almost always 16-bit aligned, but there are exceptions. In particular, the PDCP protocol header may (literally) have an odd size. Performance impact seems to range from none to a very slight regression. Bugzilla ID: 1035 Cc: sta...@dpdk.org Signed-off-by: Mattias Rönnblom --- lib/net/rte_ip.h | 19 --- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index b502481670..a9e6251f14 100644 --- a/lib/net/rte_ip.h +++ b/lib/net/rte_ip.h @@ -160,18 +160,23 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr *ipv4_hdr) static inline uint32_t __rte_raw_cksum(const void *buf, size_t len, uint32_t sum) { - /* extend strict-aliasing rules */ - typedef uint16_t __attribute__((__may_alias__)) u16_p; - const u16_p *u16_buf = (const u16_p *)buf; - const u16_p *end = u16_buf + len / sizeof(*u16_buf); + const void *end; >>> >>> I would set "end" here instead, possibly making the pointer const >> too. And add spaces around '/'. >>> const void * const end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) * >> sizeof(uint16_t)); >>> >> >> I don't think that makes the code more readable. > > It's only a matter of taste... Your code, your decision. :-) > > I think the spaces are required by the coding standard; not sure, though. > If it isn't in the coding standard, it should be. But if you add spaces, you have to break the line, to fit into 80 characters. A net loss, IMO. >> - for (; u16_buf != end; ++u16_buf) - sum += *u16_buf; + for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) * sizeof(uint16_t)); + buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) { + uint16_t v; + + memcpy(&v, buf, sizeof(uint16_t)); + sum += v; + } /* if length is odd, keeping it byte order independent */ if (unlikely(len % 2)) { + uint8_t last; uint16_t left = 0; - *(unsigned char *)&left = *(const unsigned char *)end; + + memcpy(&last, end, 1); + *(unsigned char *)&left = last; >>> >>> Couldn't you just memcpy(&left, end, 1), and omit the temporary >> variable "last"? >>> >> >> Good point. >> >> I don't like how this code is clever vis-à-vis byte order, but then I >> also don't have a better suggestion. > > The byte ordering cleverness has its roots in RFC 1071. > > Stephen suggested using a union, although in a slightly different context. > I'm not sure it will be more readable here, because it will require #ifdef to > support byte ordering. Just thought I'd mention it, for your consideration. > > Your patch v2 just reached my inbox, and it looks good. No further response > to this email is expected. >
[PATCH v2] vhost/crypto: fix out of bound access
Coverity issue: 379211 Fixes: 4414bb67010d ("vhost/crypto: fix build with GCC 12") Cc: david.march...@redhat.com Signed-off-by: Fan Zhang --- v2: fix format-warning lib/vhost/vhost_crypto.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/lib/vhost/vhost_crypto.c b/lib/vhost/vhost_crypto.c index 54946f46d9..bf899d2a50 100644 --- a/lib/vhost/vhost_crypto.c +++ b/lib/vhost/vhost_crypto.c @@ -574,12 +574,11 @@ copy_data_from_desc(void *dst, struct vhost_crypto_data_req *vc_req, remain = RTE_MIN(desc->len, size); addr = desc->addr; - do { - uint64_t len; - void *src; - len = remain; - src = IOVA_TO_VVA(void *, vc_req, addr, &len, VHOST_ACCESS_RO); + while (remain) { + uint64_t len = remain; + void *src = IOVA_TO_VVA(void *, vc_req, addr, &len, VHOST_ACCESS_RO); + if (unlikely(src == NULL || len == 0)) return 0; @@ -588,7 +587,7 @@ copy_data_from_desc(void *dst, struct vhost_crypto_data_req *vc_req, /* cast is needed for 32-bit architecture */ dst = RTE_PTR_ADD(dst, (size_t)len); addr += len; - } while (unlikely(remain != 0)); + } return RTE_MIN(desc->len, size); } -- 2.34.1
Re: [PATCH 2/2] net: have checksum routines accept unaligned data
On Fri, Jul 08, 2022 at 01:52:12PM +, Mattias Rönnblom wrote: > On 2022-07-08 15:02, Morten Brørup wrote: > >> From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com] > >> Sent: Friday, 8 July 2022 14.44 > >> > >> On 2022-07-07 23:44, Morten Brørup wrote: > From: Mattias Rönnblom [mailto:hof...@lysator.liu.se] > Sent: Thursday, 7 July 2022 20.35 > > From: Mattias Rönnblom > > __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed > >> its > data through an uint16_t pointer, which allowed the compiler to > >> assume > the data was 16-bit aligned. This in turn would, with certain > architectures and compiler flag combinations, result in code with > >> SIMD > load or store instructions with restrictions on data alignment. > > This patch keeps the old algorithm, but data is read using memcpy() > instead of direct pointer access, forcing the compiler to always > generate code that handles unaligned input. The __may_alias__ GCC > attribute is no longer needed. > > The data on which the Internet checksum functions operates are > >> almost > always 16-bit aligned, but there are exceptions. In particular, the > PDCP protocol header may (literally) have an odd size. > > Performance impact seems to range from none to a very slight > regression. > > Bugzilla ID: 1035 > Cc: sta...@dpdk.org > > Signed-off-by: Mattias Rönnblom > --- > lib/net/rte_ip.h | 19 --- > 1 file changed, 12 insertions(+), 7 deletions(-) > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h > index b502481670..a9e6251f14 100644 > --- a/lib/net/rte_ip.h > +++ b/lib/net/rte_ip.h > @@ -160,18 +160,23 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr > *ipv4_hdr) > static inline uint32_t > __rte_raw_cksum(const void *buf, size_t len, uint32_t sum) > { > -/* extend strict-aliasing rules */ > -typedef uint16_t __attribute__((__may_alias__)) u16_p; > -const u16_p *u16_buf = (const u16_p *)buf; > -const u16_p *end = u16_buf + len / sizeof(*u16_buf); > +const void *end; > >>> > >>> I would set "end" here instead, possibly making the pointer const > >> too. And add spaces around '/'. > >>> const void * const end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) * > >> sizeof(uint16_t)); > >>> > >> > >> I don't think that makes the code more readable. > > > > It's only a matter of taste... Your code, your decision. :-) > > > > I think the spaces are required by the coding standard; not sure, though. > > > > If it isn't in the coding standard, it should be. But if you add spaces, > you have to break the line, to fit into 80 characters. A net loss, IMO. > Just FYI, lines up to 100 chars are ok now. Automated checkpatch checks as shown in patchwork only flag lines longer than that. /Bruce
RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
> From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com] > Sent: Friday, 8 July 2022 15.45 > > > From: Morten Brørup > > Sent: Friday, July 8, 2022 2:23 PM > > > > > > This patch causes a 1.25x increase in cycle-cost for polling a > > > MT safe service when statistics are enabled. No change was seen > > > for MT unsafe services, or when statistics are disabled. > > > > > > Reported-by: Mattias Rönnblom > > > Suggested-by: Honnappa Nagarahalli > > > Suggested-by: Morten Brørup > > > Signed-off-by: Harry van Haaren > > > > > > --- > > > > [...] > > > > > + if (service_mt_safe(s)) { > > > + __atomic_fetch_add(&s->cycles_spent, cycles, > > > __ATOMIC_RELAXED); > > > + __atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED); > > > + } else { > > > + s->cycles_spent += cycles; > > > + s->calls++; > > > + } > > > > Have you considered the performance cost of the > > __atomic_fetch_add(__ATOMIC_RELAXED) versus the performance cost of > the > > branch to compare if the service is MT safe? It might be cheaper to > just always use > > the atomic addition. I don't know, just mentioning that the compare- > and-branch also > > has a cost. > > Great question! > > > I'm not familiar with the DPDK services library, so perhaps MT safe > and MT unsafe > > services are never mixed, in which case the branch will always take > the same path, > > so branch prediction will eliminate the cost of branching. > > MT safe & unsafe can be mixed yes, so you're right, there may be mis- > predicts. Note that > assuming a service is actually doing something useful, there's likely > quite a few branches > between each call.. so unknown how fresh/accurate the branch history > will be. > > The common case is for many services to be "mt unsafe" (think polling > an ethdev queue). > In this case, it is nice to try reduce cost. Given this is likely the > highest quantity of services, > we'd like the performance here to be reduced the least. The branch > method achieves that. > > I did benchmark the "always use atomic" case, and it caused a ~30cycle > hit in the "mt unsafe" case, > where the atomic is not required (and hence the performance hit can be > avoided by branching). > Given branch-misses are handled between 15-20 cycles (uarch dependent), > attempting to avoid the > atomic still makes sense from cycle-cost perspective too I think.. > > I did spend the morning benchmarking solutions (and hence the patch > split, > to allow easy benchmarking before & after), so thanks for asking! > > Regards, -Harry Thank you for elaborating, Harry. I am impressed with the considerations you have put into this, and have no further concerns. Reviewed-by: Morten Brørup
[PATCH v10] sched: enable CMAN at runtime
Added changes to enable CMAN (RED or PIE) at init from profile configuration file. By default CMAN code is enable but not in use, when there is no RED or PIE profile configured. Signed-off-by: Marcin Danilewicz --- Log: v2 change in rte_sched.h to avoid ABI breakage. v3 changes from comments v4 rebase to 22.07-rc1 v5 rebase to main latest v6 commit message fixed v7 changes from comments v8 with changes from comments v9 changes from comments tmgr.c cman_params set to null qos_sched/cfg_file.c removed redundant cman_params to NULL assignement subport_params[].cman_params assigned only when CMAN enabled v10 removed ip_pipelin app build error from change in tmgr.c --- config/rte_config.h | 3 - drivers/net/softnic/rte_eth_softnic_tm.c | 12 -- examples/ip_pipeline/tmgr.c | 75 +--- examples/qos_sched/cfg_file.c| 49 +--- examples/qos_sched/cfg_file.h| 5 - examples/qos_sched/init.c| 76 +--- examples/qos_sched/main.h| 2 - examples/qos_sched/profile.cfg | 135 + examples/qos_sched/profile_pie.cfg | 142 ++ examples/qos_sched/profile_red.cfg | 143 +++ lib/sched/rte_sched.c| 47 +--- 11 files changed, 298 insertions(+), 391 deletions(-) create mode 100644 examples/qos_sched/profile_pie.cfg create mode 100644 examples/qos_sched/profile_red.cfg diff --git a/config/rte_config.h b/config/rte_config.h index 46549cb062..ae56a86394 100644 --- a/config/rte_config.h +++ b/config/rte_config.h @@ -88,9 +88,6 @@ /* rte_power defines */ #define RTE_MAX_LCORE_FREQS 64 -/* rte_sched defines */ -// RTE_SCHED_CMAN is not set - /* rte_graph defines */ #define RTE_GRAPH_BURST_SIZE 256 #define RTE_LIBRTE_GRAPH_STATS 1 diff --git a/drivers/net/softnic/rte_eth_softnic_tm.c b/drivers/net/softnic/rte_eth_softnic_tm.c index 6a7766ba1c..3e4bed81e9 100644 --- a/drivers/net/softnic/rte_eth_softnic_tm.c +++ b/drivers/net/softnic/rte_eth_softnic_tm.c @@ -420,11 +420,7 @@ pmd_tm_node_type_get(struct rte_eth_dev *dev, return 0; } -#ifdef RTE_SCHED_CMAN -#define WRED_SUPPORTED 1 -#else #define WRED_SUPPORTED 0 -#endif #define STATS_MASK_DEFAULT \ (RTE_TM_STATS_N_PKTS | \ @@ -2300,8 +2296,6 @@ tm_tc_wred_profile_get(struct rte_eth_dev *dev, uint32_t tc_id) return NULL; } -#ifdef RTE_SCHED_CMAN - static void wred_profiles_set(struct rte_eth_dev *dev, uint32_t subport_id) { @@ -2325,12 +2319,6 @@ wred_profiles_set(struct rte_eth_dev *dev, uint32_t subport_id) } } -#else - -#define wred_profiles_set(dev, subport_id) - -#endif - static struct tm_shared_shaper * tm_tc_shared_shaper_get(struct rte_eth_dev *dev, struct tm_node *tc_node) { diff --git a/examples/ip_pipeline/tmgr.c b/examples/ip_pipeline/tmgr.c index b138e885cf..2432b56aee 100644 --- a/examples/ip_pipeline/tmgr.c +++ b/examples/ip_pipeline/tmgr.c @@ -17,77 +17,6 @@ static uint32_t n_subport_profiles; static struct rte_sched_pipe_params pipe_profile[TMGR_PIPE_PROFILE_MAX]; -#ifdef RTE_SCHED_CMAN -static struct rte_sched_cman_params cman_params = { - .red_params = { - /* Traffic Class 0 Colors Green / Yellow / Red */ - [0][0] = {.min_th = 48, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9}, - [0][1] = {.min_th = 40, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9}, - [0][2] = {.min_th = 32, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9}, - - /* Traffic Class 1 - Colors Green / Yellow / Red */ - [1][0] = {.min_th = 48, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9}, - [1][1] = {.min_th = 40, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9}, - [1][2] = {.min_th = 32, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9}, - - /* Traffic Class 2 - Colors Green / Yellow / Red */ - [2][0] = {.min_th = 48, .max_th = 64, 
.maxp_inv = 10, .wq_log2 = 9}, - [2][1] = {.min_th = 40, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9}, - [2][2] = {.min_th = 32, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9}, - - /* Traffic Class 3 - Colors Green / Yellow / Red */ - [3][0] = {.min_th = 48, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9}, - [3][1] = {.min_th = 40, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9}, - [3][2] = {.min_th = 32, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9}, - - /* Traffic Class 4 - Colors Green / Yellow / Red */ - [4][0] = {.min_th = 48, .max_th = 64, .maxp_inv = 10, .wq_log2 = 9}, - [4][1] = {.min_th
Re: [PATCH v2] common/mlx5: update DevX error logging
06/07/2022 16:02, Gregory Etelson: > Current PMD logs all DevX errors at error level. > > DevX interface can fail queue counters allocation on some hardware > types. That is a known issue. > PMD fallback to VERB API to allocate queue counters > when it detects the fault. > That DevX failure should not be logged as PMD error. > > The patch provides DevX with flexible API that selects log level. > > cc: sta...@dpdk.org Why is there this cc line? > Signed-off-by: Gregory Etelson > Acked-by: Matan Azrad
Re: [PATCH v2 0/2] DLB2: cq_weight fixes
06/07/2022 23:46, Timothy McDaniel: > This patch series contains the following: > - fix coverity issue 379234 > - improve error reporting for cq_weight feature > > Changes since V1 > - fixed a repeated word in the commit message of patch 1 > > Timothy McDaniel (2): > event/dlb2: fix cq_weight array overflow > event/dlb2: improve cq_weight error messages Should we merge those patches in 22.07-rc4? No risk?
RE: [PATCH 2/2] net: have checksum routines accept unaligned data
> From: Bruce Richardson [mailto:bruce.richard...@intel.com] > Sent: Friday, 8 July 2022 16.10 > > On Fri, Jul 08, 2022 at 01:52:12PM +, Mattias Rönnblom wrote: > > On 2022-07-08 15:02, Morten Brørup wrote: > > >> From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com] > > >> Sent: Friday, 8 July 2022 14.44 > > >> > > >> On 2022-07-07 23:44, Morten Brørup wrote: > > From: Mattias Rönnblom [mailto:hof...@lysator.liu.se] > > Sent: Thursday, 7 July 2022 20.35 > > [...] > > static inline uint32_t > > __rte_raw_cksum(const void *buf, size_t len, uint32_t sum) > > { > > - /* extend strict-aliasing rules */ > > - typedef uint16_t __attribute__((__may_alias__)) u16_p; > > - const u16_p *u16_buf = (const u16_p *)buf; > > - const u16_p *end = u16_buf + len / sizeof(*u16_buf); > > + const void *end; > > >>> > > >>> I would set "end" here instead, possibly making the pointer const > > >> too. And add spaces around '/'. > > >>> const void * const end = RTE_PTR_ADD(buf, (len / > sizeof(uint16_t)) * > > >> sizeof(uint16_t)); > > >>> > > >> > > >> I don't think that makes the code more readable. > > > > > > It's only a matter of taste... Your code, your decision. :-) > > > > > > I think the spaces are required by the coding standard; not sure, > though. > > > > > > > If it isn't in the coding standard, it should be. But if you add > spaces, > > you have to break the line, to fit into 80 characters. A net loss, > IMO. > > > > Just FYI, lines up to 100 chars are ok now. Automated checkpatch checks > as > shown in patchwork only flag lines longer than that. > > /Bruce The coding style [1] recommends max 80 characters, with an (easy to miss) note below, saying that up to 100 characters are acceptable. I think it should be simplified, replacing the 80 characters recommendation by a 100 characters limit, so the additional note can be removed. Why still recommend 80, when we really mean 100? [1] https://doc.dpdk.org/guides/contributing/coding_style.html
Re: [PATCH v2 2/2] net: have checksum routines accept unaligned data
On 7/8/2022 1:56 PM, Mattias Rönnblom wrote: __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its data through an uint16_t pointer, which allowed the compiler to assume the data was 16-bit aligned. This in turn would, with certain architectures and compiler flag combinations, result in code with SIMD load or store instructions with restrictions on data alignment. This patch keeps the old algorithm, but data is read using memcpy() instead of direct pointer access, forcing the compiler to always generate code that handles unaligned input. The __may_alias__ GCC attribute is no longer needed. The data on which the Internet checksum functions operates are almost always 16-bit aligned, but there are exceptions. In particular, the PDCP protocol header may (literally) have an odd size. Performance impact seems to range from none to a very slight regression. Bugzilla ID: 1035 Cc: sta...@dpdk.org --- v2: * Simplified the odd-length conditional (Morten Brørup). Reviewed-by: Morten Brørup Signed-off-by: Mattias Rönnblom --- lib/net/rte_ip.h | 17 ++--- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index b502481670..a0334d931e 100644 --- a/lib/net/rte_ip.h +++ b/lib/net/rte_ip.h @@ -160,18 +160,21 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr *ipv4_hdr) static inline uint32_t __rte_raw_cksum(const void *buf, size_t len, uint32_t sum) { - /* extend strict-aliasing rules */ - typedef uint16_t __attribute__((__may_alias__)) u16_p; - const u16_p *u16_buf = (const u16_p *)buf; - const u16_p *end = u16_buf + len / sizeof(*u16_buf); + const void *end; - for (; u16_buf != end; ++u16_buf) - sum += *u16_buf; + for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) * sizeof(uint16_t)); +buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) { + uint16_t v; + + memcpy(&v, buf, sizeof(uint16_t)); + sum += v; + } /* if length is odd, keeping it byte order independent */ if (unlikely(len % 2)) { uint16_t left = 0; - *(unsigned char *)&left = *(const unsigned char *)end; + + memcpy(&left, end, 1); sum += left; } Hi Mattias, I got following result [1] with patches on [2]. Can you shed light to some questions I have, 1) For 1500 why 'Unaligned' access gives better performance than 'Aligned' access? 2) Why 21/101 bytes almost doubles 20/100 bytes perf? 3) Why 1501 bytes perf better than 1500 bytes perf? Btw, I don't see any noticeable performance difference between with and without patch. [1] RTE>>cksum_perf_autotest ### rte_raw_cksum() performance ### Alignment Block sizeTSC cycles/block TSC cycles/byte Aligned 2025.1 1.25 Unaligned 2025.1 1.25 Aligned 2151.5 2.45 Unaligned 2151.5 2.45 Aligned 10028.2 0.28 Unaligned10028.2 0.28 Aligned 10154.5 0.54 Unaligned10154.5 0.54 Aligned 1500 188.9 0.13 Unaligned 1500 138.7 0.09 Aligned 1501 114.1 0.08 Unaligned 1501 110.1 0.07 Test OK RTE>> [2] AMD EPYC 7543P
Re: [PATCH v2] examples/link_status_interrupt: fix stats refresh rate
07/07/2022 10:22, Omar Awaysa: > From: Raja Zidane > > TIMER_MILLISECOND is defined as the number of cpu cycles per millisecond, > current definition is correct for cores with frequency of 2GHZ, for cores > with different frequency, it caused different periods between refresh, > (i.e. the definition is about 14ms on ARM cores). > > Use dpdk API to get CPU frequency, to define TIMER_MILLISECOND. > > Fixes: af75078fece3 ("first public release") > Cc: sta...@dpdk.org > > Signed-off-by: Omar Awaysa Applied (with indent fixed), thanks.
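A minimal sketch of deriving the period at run time instead of assuming a 2 GHz core (an illustration of the approach, not the exact code in the applied patch):

#include <stdint.h>
#include <rte_cycles.h>

static uint64_t timer_millisecond;

/* Call after rte_eal_init() so the timer frequency is known. */
static void
timer_period_init(void)
{
	timer_millisecond = rte_get_timer_hz() / 1000;
}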
Re: [PATCH v9 1/4] ethdev: introduce protocol header API
On 6/13/22 13:25, wenxuanx...@intel.com wrote: From: Wenxuan Wu This patch added new ethdev API to retrieve supported protocol header mask This patch added -> Add of a PMD, which helps to configure protocol header based buffer split. I'd like to see motivation why single mask is considered sufficient. I.e. why don't we follow ptypes approach which is move flexible, but a bit more complicated. Looking at RTE_PTYPE_* defines carefully it looks like below API simply cannot provide information that we can split after TCP or UDP. Signed-off-by: Wenxuan Wu [snip] /** * @internal * Dump private info from device to a file. @@ -1281,6 +1296,9 @@ struct eth_dev_ops { /** Set IP reassembly configuration */ eth_ip_reassembly_conf_set_t ip_reassembly_conf_set; + /** Get supported ptypes to split */ + eth_buffer_split_hdr_ptype_get_t hdrs_supported_ptypes_get; + It is better to be consistent with naming. I.e. just cut prefix "eth_" and suffix "_t". Also the type name sounds like it get current split configuration, not supported one. /** Dump private info from device */ eth_dev_priv_dump_t eth_dev_priv_dump; }; diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index 29a3d80466..e1f2a0ffe3 100644 --- a/lib/ethdev/rte_ethdev.c +++ b/lib/ethdev/rte_ethdev.c @@ -1636,9 +1636,10 @@ rte_eth_dev_is_removed(uint16_t port_id) } static int -rte_eth_rx_queue_check_split(const struct rte_eth_rxseg_split *rx_seg, -uint16_t n_seg, uint32_t *mbp_buf_size, -const struct rte_eth_dev_info *dev_info) +rte_eth_rx_queue_check_split(uint16_t port_id, + const struct rte_eth_rxseg_split *rx_seg, + int16_t n_seg, uint32_t *mbp_buf_size, + const struct rte_eth_dev_info *dev_info) { const struct rte_eth_rxseg_capa *seg_capa = &dev_info->rx_seg_capa; struct rte_mempool *mp_first; @@ -1694,13 +1695,7 @@ rte_eth_rx_queue_check_split(const struct rte_eth_rxseg_split *rx_seg, } offset += seg_idx != 0 ? 0 : RTE_PKTMBUF_HEADROOM; *mbp_buf_size = rte_pktmbuf_data_room_size(mpl); - length = length != 0 ? length : *mbp_buf_size; - if (*mbp_buf_size < length + offset) { I don't understand why the check goes away completely. 
- RTE_ETHDEV_LOG(ERR, - "%s mbuf_data_room_size %u < %u (segment length=%u + segment offset=%u)\n", - mpl->name, *mbp_buf_size, - length + offset, length, offset); - return -EINVAL; + Unnecessary empty line } Shouldn't the curly bracket go away as well together with its 'if' } return 0; @@ -1779,7 +1774,7 @@ rte_eth_rx_queue_setup(uint16_t port_id, uint16_t rx_queue_id, n_seg = rx_conf->rx_nseg; if (rx_conf->offloads & RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT) { - ret = rte_eth_rx_queue_check_split(rx_seg, n_seg, + ret = rte_eth_rx_queue_check_split(port_id, rx_seg, n_seg, &mbp_buf_size, &dev_info); if (ret != 0) @@ -5844,6 +5839,20 @@ rte_eth_ip_reassembly_conf_set(uint16_t port_id, (*dev->dev_ops->ip_reassembly_conf_set)(dev, conf)); } +int +rte_eth_supported_hdrs_get(uint16_t port_id, uint32_t *ptypes) +{ + struct rte_eth_dev *dev; + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV); + dev = &rte_eth_devices[port_id]; ptypes must be checked vs NULL + + RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->hdrs_supported_ptypes_get, + -ENOTSUP); + + return eth_err(port_id, + (*dev->dev_ops->hdrs_supported_ptypes_get)(dev, ptypes)); +} + int rte_eth_dev_priv_dump(uint16_t port_id, FILE *file) { diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index 04cff8ee10..72cac1518e 100644 --- a/lib/ethdev/rte_ethdev.h +++ b/lib/ethdev/rte_ethdev.h @@ -6152,6 +6152,28 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id, return rte_eth_tx_buffer_flush(port_id, queue_id, buffer); } + +/** + * @warning + * @b EXPERIMENTAL: this API may change without prior notice + * + * Get supported header protocols to split supported by PMD. "supported" twice above. Get supported header protocols to split on Rx. + * The API will return error if the device is not valid. Above sentence is obvious and does not add any value. Please, remove. + * + * @param port_id + * The port identifier of the device. + * @param ptype Why do you use out annotation for the callback description and does not use it her
Re: [PATCH v9 2/4] ethdev: introduce protocol hdr based buffer split
On 6/13/22 13:25, wenxuanx...@intel.com wrote: From: Wenxuan Wu Currently, Rx buffer split supports length based split. With Rx queue offload RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT enabled and Rx packet segment configured, PMD will be able to split the received packets into multiple segments. However, length based buffer split is not suitable for NICs that do split based on protocol headers. Given an arbitrarily variable length in Rx packet segment, it is almost impossible to pass a fixed protocol header to driver. Besides, the existence of tunneling results in the composition of a packet is various, which makes the situation even worse. This patch extends current buffer split to support protocol header based buffer split. A new proto_hdr field is introduced in the reserved field of rte_eth_rxseg_split structure to specify protocol header. The proto_hdr field defines the split position of packet, splitting will always happens after the protocol header defined in the Rx packet segment. When Rx queue offload RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT is enabled and corresponding protocol header is configured, driver will split the ingress packets into multiple segments. struct rte_eth_rxseg_split { struct rte_mempool *mp; /* memory pools to allocate segment from */ uint16_t length; /* segment maximal data length, configures "split point" */ uint16_t offset; /* data offset from beginning of mbuf data buffer */ uint32_t proto_hdr; /* inner/outer L2/L3/L4 protocol header, configures "split point" */ There is a big problem here that using RTE_PTYPE_* defines I can't request split after either TCP or UDP header. }; If both inner and outer L2/L3/L4 level protocol header split can be supported by a PMD. Corresponding protocol header capability is RTE_PTYPE_L2_ETHER, RTE_PTYPE_L3_IPV4, RTE_PTYPE_L3_IPV6, RTE_PTYPE_L4_TCP, RTE_PTYPE_L4_UDP, RTE_PTYPE_L4_SCTP, RTE_PTYPE_INNER_L2_ETHER, RTE_PTYPE_INNER_L3_IPV4, RTE_PTYPE_INNER_L3_IPV6, RTE_PTYPE_INNER_L4_TCP, RTE_PTYPE_INNER_L4_UDP, RTE_PTYPE_INNER_L4_SCTP. I think there is no point to list above defines here if it is not the only supported defines. For example, let's suppose we configured the Rx queue with the following segments: seg0 - pool0, proto_hdr0=RTE_PTYPE_L3_IPV4, off0=2B seg1 - pool1, proto_hdr1=RTE_PTYPE_L4_UDP, off1=128B seg2 - pool2, off1=0B The packet consists of MAC_IPV4_UDP_PAYLOAD will be split like following: seg0 - ipv4 header @ RTE_PKTMBUF_HEADROOM + 2 in mbuf from pool0 seg1 - udp header @ 128 in mbuf from pool1 seg2 - payload @ 0 in mbuf from pool2 Sorry, but I still see no definition what should happen with, for example, ARP packet with above config. Now buffer split can be configured in two modes. For length based buffer split, the mp, length, offset field in Rx packet segment should be configured, while the proto_hdr field should not be configured. For protocol header based buffer split, the mp, offset, proto_hdr field in Rx packet segment should be configured, while the length field should not be configured. The split limitations imposed by underlying driver is reported in the rte_eth_dev_info->rx_seg_capa field. The memory attributes for the split parts may differ either, dpdk memory and external memory, respectively. Signed-off-by: Xuan Ding Signed-off-by: Wenxuan Wu Signed-off-by: Yuan Wang Reviewed-by: Qi Zhang Acked-by: Ray Kinsella --- lib/ethdev/rte_ethdev.c | 32 +++- lib/ethdev/rte_ethdev.h | 14 +- 2 files changed, 44 insertions(+), 2 deletions(-) Do we need a dedicated feature in doc/guides/nics/features.rst? 
Or should we just update buffer split to refer to a new supported header split API and callback? Also the feature definitely deserves an entry in the release notes. [snip]
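To make the proposal easier to follow, here is a hedged configuration sketch combining the capability query from patch 1/4 with the proto_hdr-based segments from patch 2/4, following the MAC/IPv4/UDP example quoted in the review above. The series is still under review, so rte_eth_supported_hdrs_get() and the proto_hdr field may change before merge; the mempools and descriptor count are placeholders.

#include <errno.h>
#include <rte_common.h>
#include <rte_ethdev.h>
#include <rte_mbuf_ptype.h>

static int
setup_proto_hdr_split(uint16_t port_id, uint16_t queue_id, unsigned int socket_id,
		struct rte_mempool *pool0, struct rte_mempool *pool1,
		struct rte_mempool *pool2)
{
	struct rte_eth_rxseg_split segs[3] = {
		/* IPv4 header segment, 2 B offset as in the quoted example */
		{ .mp = pool0, .offset = 2, .proto_hdr = RTE_PTYPE_L3_IPV4 },
		/* UDP header segment */
		{ .mp = pool1, .offset = 128, .proto_hdr = RTE_PTYPE_L4_UDP },
		/* remaining payload, no proto_hdr and no length */
		{ .mp = pool2, .offset = 0 },
	};
	struct rte_eth_dev_info dev_info;
	struct rte_eth_rxconf rxconf;
	uint32_t ptypes = 0;
	int ret;

	/* Capability query proposed in patch 1/4. */
	ret = rte_eth_supported_hdrs_get(port_id, &ptypes);
	if (ret != 0)
		return ret;
	if ((ptypes & RTE_PTYPE_L3_IPV4) == 0 || (ptypes & RTE_PTYPE_L4_UDP) == 0)
		return -ENOTSUP;

	ret = rte_eth_dev_info_get(port_id, &dev_info);
	if (ret != 0)
		return ret;

	rxconf = dev_info.default_rxconf;
	rxconf.rx_seg = (union rte_eth_rxseg *)segs;
	rxconf.rx_nseg = RTE_DIM(segs);
	rxconf.offloads |= RTE_ETH_RX_OFFLOAD_BUFFER_SPLIT;

	/* Mempool argument is NULL: buffers come from the rx_seg pools. */
	return rte_eth_rx_queue_setup(port_id, queue_id, 512, socket_id,
			&rxconf, NULL);
}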
RE: [PATCH v2 0/2] DLB2: cq_weight fixes
> -Original Message- > From: Thomas Monjalon > Sent: Friday, July 8, 2022 9:26 AM > To: McDaniel, Timothy > Cc: dev@dpdk.org; jer...@marvell.com; jerinjac...@gmail.com; > sta...@dpdk.org > Subject: Re: [PATCH v2 0/2] DLB2: cq_weight fixes > > 06/07/2022 23:46, Timothy McDaniel: > > This patch series contains the following: > > - fix coverity issue 379234 > > - improve error reporting for cq_weight feature > > > > Changes since V1 > > - fixed a repeated word in the commit message of patch 1 > > > > Timothy McDaniel (2): > > event/dlb2: fix cq_weight array overflow > > event/dlb2: improve cq_weight error messages > > Should we merge those patches in 22.07-rc4? No risk? I see no risk in merging these changes. I retested with them, and one fixes a coverity issue.
RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
> > This commit fixes a potential racey-add that could occur if multiple service- > lcores were executing the same MT-safe service at the same time, with > service statistics collection enabled. > > Because multiple threads can run and execute the service, the stats values > can have multiple writer threads, resulting in the requirement of using > atomic addition for correctness. > > Note that when a MT unsafe service is executed, a spinlock is held, so the > stats increments are protected. This fact is used to avoid executing atomic > add instructions when not required. > > This patch causes a 1.25x increase in cycle-cost for polling a MT safe service > when statistics are enabled. No change was seen for MT unsafe services, or > when statistics are disabled. > > Reported-by: Mattias Rönnblom > Suggested-by: Honnappa Nagarahalli > Suggested-by: Morten Brørup > Signed-off-by: Harry van Haaren > > --- > --- > lib/eal/common/rte_service.c | 10 -- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c > index ef31b1f63c..f045e74ef3 100644 > --- a/lib/eal/common/rte_service.c > +++ b/lib/eal/common/rte_service.c > @@ -363,9 +363,15 @@ service_runner_do_callback(struct > rte_service_spec_impl *s, > uint64_t start = rte_rdtsc(); > s->spec.callback(userdata); > uint64_t end = rte_rdtsc(); > - s->cycles_spent += end - start; > + uint64_t cycles = end - start; > cs->calls_per_service[service_idx]++; > - s->calls++; > + if (service_mt_safe(s)) { > + __atomic_fetch_add(&s->cycles_spent, cycles, > __ATOMIC_RELAXED); > + __atomic_fetch_add(&s->calls, 1, > __ATOMIC_RELAXED); > + } else { > + s->cycles_spent += cycles; > + s->calls++; This is still a problem from a reader perspective. It is possible that the writes could be split while a reader is reading the stats. These need to be atomic adds. > + } > } else > s->spec.callback(userdata); > } > -- > 2.32.0
RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
> -Original Message- > From: Honnappa Nagarahalli > Sent: Friday, July 8, 2022 4:16 PM > To: Van Haaren, Harry ; dev@dpdk.org > Cc: mattias.ronnblom ; Morten Brørup > ; nd ; nd > Subject: RE: [PATCH 2/2] service: fix potential stats race-condition on MT > services > > diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c > > index ef31b1f63c..f045e74ef3 100644 > > --- a/lib/eal/common/rte_service.c > > +++ b/lib/eal/common/rte_service.c > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct > > rte_service_spec_impl *s, > > uint64_t start = rte_rdtsc(); > > s->spec.callback(userdata); > > uint64_t end = rte_rdtsc(); > > - s->cycles_spent += end - start; > > + uint64_t cycles = end - start; > > cs->calls_per_service[service_idx]++; > > - s->calls++; > > + if (service_mt_safe(s)) { > > + __atomic_fetch_add(&s->cycles_spent, cycles, > > __ATOMIC_RELAXED); > > + __atomic_fetch_add(&s->calls, 1, > > __ATOMIC_RELAXED); > > + } else { > > + s->cycles_spent += cycles; > > + s->calls++; > This is still a problem from a reader perspective. It is possible that the > writes could be > split while a reader is reading the stats. These need to be atomic adds. Thanks for pointing out; I do "think" in x86 in terms of load/store tearing; and on x86 naturally aligned load/stores will not tear. Apologies for missing the ARM angle here. I'm not sure how to best encode the difference between tearing & "locked instructions" to make things multi-writer safe. But they're not the same thing, and I'd prefer not pay the penalty for LOCK instructions (multi-writer) only to satisfy the non-tearing requirements. Is there an rte_atomic-* type that is guaranteed as non-tearing? In that case, changing the type of the calls/cycles_spent variables to such a type to ensure "non-tearing" single-reader, single-writer behaviour is enough, instead of forcing __atomic_fetch_add() everywhere?
Re: [PATCH 2/2] service: fix potential stats race-condition on MT services
On Fri, Jul 08, 2022 at 03:31:01PM +, Van Haaren, Harry wrote: > > -Original Message- > > From: Honnappa Nagarahalli > > Sent: Friday, July 8, 2022 4:16 PM > > To: Van Haaren, Harry ; dev@dpdk.org > > Cc: mattias.ronnblom ; Morten Brørup > > ; nd ; nd > > Subject: RE: [PATCH 2/2] service: fix potential stats race-condition on MT > > services > > > > > > diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c > > > index ef31b1f63c..f045e74ef3 100644 > > > --- a/lib/eal/common/rte_service.c > > > +++ b/lib/eal/common/rte_service.c > > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct > > > rte_service_spec_impl *s, > > > uint64_t start = rte_rdtsc(); > > > s->spec.callback(userdata); > > > uint64_t end = rte_rdtsc(); > > > - s->cycles_spent += end - start; > > > + uint64_t cycles = end - start; > > > cs->calls_per_service[service_idx]++; > > > - s->calls++; > > > + if (service_mt_safe(s)) { > > > + __atomic_fetch_add(&s->cycles_spent, cycles, > > > __ATOMIC_RELAXED); > > > + __atomic_fetch_add(&s->calls, 1, > > > __ATOMIC_RELAXED); > > > + } else { > > > + s->cycles_spent += cycles; > > > + s->calls++; > > This is still a problem from a reader perspective. It is possible that the > > writes could be > > split while a reader is reading the stats. These need to be atomic adds. > > Thanks for pointing out; I do "think" in x86 in terms of load/store tearing; > and on x86 > naturally aligned load/stores will not tear. Apologies for missing the ARM > angle here. > > I'm not sure how to best encode the difference between tearing & "locked > instructions" > to make things multi-writer safe. But they're not the same thing, and I'd > prefer not pay > the penalty for LOCK instructions (multi-writer) only to satisfy the > non-tearing requirements. > > Is there an rte_atomic-* type that is guaranteed as non-tearing? > > In that case, changing the type of the calls/cycles_spent variables to such a > type to ensure "non-tearing" > single-reader, single-writer behaviour is enough, instead of forcing > __atomic_fetch_add() everywhere? Regular read, increment and then atomic store should work without locks where alignment is correct on most architectures, and doing the store atomically should prevent any tearing.
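A minimal sketch of the single-writer pattern Bruce describes (hypothetical helper, not the rte_service code itself): the load and the addition stay plain, and only the store is atomic, which rules out a torn 64-bit store without paying for a locked read-modify-write:

#include <stdint.h>

/* Single-writer statistics update: plain load + add, atomic store. */
static inline void
stats_update(uint64_t *calls, uint64_t *cycles_spent, uint64_t cycles)
{
	uint64_t c = __atomic_load_n(calls, __ATOMIC_RELAXED);
	uint64_t s = __atomic_load_n(cycles_spent, __ATOMIC_RELAXED);

	__atomic_store_n(calls, c + 1, __ATOMIC_RELAXED);
	__atomic_store_n(cycles_spent, s + cycles, __ATOMIC_RELAXED);
}

Note that this only helps the MT-unsafe (single writer, spinlock-protected) branch of the patch; with multiple concurrent writers the read-modify-write itself races, so the MT-safe branch still needs __atomic_fetch_add.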
RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
> From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com] > Sent: Friday, 8 July 2022 17.16 > > > > > > This commit fixes a potential racey-add that could occur if multiple > service- > > lcores were executing the same MT-safe service at the same time, with > > service statistics collection enabled. > > > > Because multiple threads can run and execute the service, the stats > values > > can have multiple writer threads, resulting in the requirement of > using > > atomic addition for correctness. > > > > Note that when a MT unsafe service is executed, a spinlock is held, > so the > > stats increments are protected. This fact is used to avoid executing > atomic > > add instructions when not required. > > > > This patch causes a 1.25x increase in cycle-cost for polling a MT > safe service > > when statistics are enabled. No change was seen for MT unsafe > services, or > > when statistics are disabled. > > > > Reported-by: Mattias Rönnblom > > Suggested-by: Honnappa Nagarahalli > > Suggested-by: Morten Brørup > > Signed-off-by: Harry van Haaren > > > > --- > > --- > > lib/eal/common/rte_service.c | 10 -- > > 1 file changed, 8 insertions(+), 2 deletions(-) > > > > diff --git a/lib/eal/common/rte_service.c > b/lib/eal/common/rte_service.c > > index ef31b1f63c..f045e74ef3 100644 > > --- a/lib/eal/common/rte_service.c > > +++ b/lib/eal/common/rte_service.c > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct > > rte_service_spec_impl *s, > > uint64_t start = rte_rdtsc(); > > s->spec.callback(userdata); > > uint64_t end = rte_rdtsc(); > > - s->cycles_spent += end - start; > > + uint64_t cycles = end - start; > > cs->calls_per_service[service_idx]++; > > - s->calls++; > > + if (service_mt_safe(s)) { > > + __atomic_fetch_add(&s->cycles_spent, cycles, > > __ATOMIC_RELAXED); > > + __atomic_fetch_add(&s->calls, 1, > > __ATOMIC_RELAXED); > > + } else { > > + s->cycles_spent += cycles; > > + s->calls++; > This is still a problem from a reader perspective. It is possible that > the writes could be split while a reader is reading the stats. These > need to be atomic adds. I don't understand what you suggest can go wrong here, Honnappa. If you talking about 64 bit counters on 32 bit architectures, then I understand the problem (and have many years of direct experience with it myself). Otherwise, I hope you can elaborate or direct me to educational material about the issue, considering this a learning opportunity. :-) > > > + } > > } else > > s->spec.callback(userdata); > > } > > -- > > 2.32.0
RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
> > > > > > > > diff --git a/lib/eal/common/rte_service.c > > > > b/lib/eal/common/rte_service.c index ef31b1f63c..f045e74ef3 100644 > > > > --- a/lib/eal/common/rte_service.c > > > > +++ b/lib/eal/common/rte_service.c > > > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct > > > > rte_service_spec_impl *s, > > > > uint64_t start = rte_rdtsc(); > > > > s->spec.callback(userdata); > > > > uint64_t end = rte_rdtsc(); > > > > - s->cycles_spent += end - start; > > > > + uint64_t cycles = end - start; > > > > cs->calls_per_service[service_idx]++; > > > > - s->calls++; > > > > + if (service_mt_safe(s)) { > > > > + __atomic_fetch_add(&s->cycles_spent, cycles, > > > > __ATOMIC_RELAXED); > > > > + __atomic_fetch_add(&s->calls, 1, > > > > __ATOMIC_RELAXED); > > > > + } else { > > > > + s->cycles_spent += cycles; > > > > + s->calls++; > > > This is still a problem from a reader perspective. It is possible > > > that the writes could be split while a reader is reading the stats. These > need to be atomic adds. > > > > Thanks for pointing out; I do "think" in x86 in terms of load/store > > tearing; and on x86 naturally aligned load/stores will not tear. Apologies > > for > missing the ARM angle here. Arm architecture has similar things as well. I believe compiler does not provide any guarantees that it will only generate non-tearing instructions. Refer to a discussion in the past [1] [2] where it was thought that the compiler would generate a non-tearing store (this is a slightly different scenario). [1] http://inbox.dpdk.org/dev/d5d563ab-0411-3faf-39ec-4994f2bc9...@intel.com/ [2] Refer to commit '316095eb41ed22da808b5826404c398917c83c89' > > > > I'm not sure how to best encode the difference between tearing & "locked > instructions" > > to make things multi-writer safe. But they're not the same thing, and > > I'd prefer not pay the penalty for LOCK instructions (multi-writer) only to > satisfy the non-tearing requirements. > > > > Is there an rte_atomic-* type that is guaranteed as non-tearing? Nothing that I know of. > > > > In that case, changing the type of the calls/cycles_spent variables to such > > a > type to ensure "non-tearing" > > single-reader, single-writer behaviour is enough, instead of forcing > __atomic_fetch_add() everywhere? > > Regular read, increment and then atomic store should work without locks > where alignment is correct on most architectures, and doing the store > atomically should prevent any tearing. Agree, we could do this, will provide more flexibility for the micro-architecture to work with. Good to understand the perf benefits vs complexity and the branch cost.
RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
> > > > > > This commit fixes a potential racey-add that could occur if multiple > > service- > > > lcores were executing the same MT-safe service at the same time, > > > with service statistics collection enabled. > > > > > > Because multiple threads can run and execute the service, the stats > > values > > > can have multiple writer threads, resulting in the requirement of > > using > > > atomic addition for correctness. > > > > > > Note that when a MT unsafe service is executed, a spinlock is held, > > so the > > > stats increments are protected. This fact is used to avoid executing > > atomic > > > add instructions when not required. > > > > > > This patch causes a 1.25x increase in cycle-cost for polling a MT > > safe service > > > when statistics are enabled. No change was seen for MT unsafe > > services, or > > > when statistics are disabled. > > > > > > Reported-by: Mattias Rönnblom > > > Suggested-by: Honnappa Nagarahalli > > > Suggested-by: Morten Brørup > > > Signed-off-by: Harry van Haaren > > > > > > --- > > > --- > > > lib/eal/common/rte_service.c | 10 -- > > > 1 file changed, 8 insertions(+), 2 deletions(-) > > > > > > diff --git a/lib/eal/common/rte_service.c > > b/lib/eal/common/rte_service.c > > > index ef31b1f63c..f045e74ef3 100644 > > > --- a/lib/eal/common/rte_service.c > > > +++ b/lib/eal/common/rte_service.c > > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct > > > rte_service_spec_impl *s, > > > uint64_t start = rte_rdtsc(); > > > s->spec.callback(userdata); > > > uint64_t end = rte_rdtsc(); > > > - s->cycles_spent += end - start; > > > + uint64_t cycles = end - start; > > > cs->calls_per_service[service_idx]++; > > > - s->calls++; > > > + if (service_mt_safe(s)) { > > > + __atomic_fetch_add(&s->cycles_spent, cycles, > > > __ATOMIC_RELAXED); > > > + __atomic_fetch_add(&s->calls, 1, > > > __ATOMIC_RELAXED); > > > + } else { > > > + s->cycles_spent += cycles; > > > + s->calls++; > > This is still a problem from a reader perspective. It is possible that > > the writes could be split while a reader is reading the stats. These > > need to be atomic adds. > > I don't understand what you suggest can go wrong here, Honnappa. If you > talking about 64 bit counters on 32 bit architectures, then I understand the > problem (and have many years of direct experience with it myself). > Otherwise, I hope you can elaborate or direct me to educational material > about the issue, considering this a learning opportunity. :-) I am thinking of the case where the 64b write is split into two 32b (or more) write operations either by the compiler or the micro-architecture. If this were to happen, it causes race conditions with the reader. As far as I understand, the compiler does not provide any guarantees on generating non-tearing stores unless an atomic builtin/function is used. If we have to ensure the micro-architecture does not generate split writes, we need to be careful that future code additions do not change the alignment of the stats. > > > > > > + } > > > } else > > > s->spec.callback(userdata); > > > } > > > -- > > > 2.32.0
RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
> From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com] > Sent: Friday, 8 July 2022 18.46 > > > > > > > > > > This commit fixes a potential racey-add that could occur if > multiple > > > service- > > > > lcores were executing the same MT-safe service at the same time, > > > > with service statistics collection enabled. > > > > > > > > Because multiple threads can run and execute the service, the > stats > > > values > > > > can have multiple writer threads, resulting in the requirement of > > > using > > > > atomic addition for correctness. > > > > > > > > Note that when a MT unsafe service is executed, a spinlock is > held, > > > so the > > > > stats increments are protected. This fact is used to avoid > executing > > > atomic > > > > add instructions when not required. > > > > > > > > This patch causes a 1.25x increase in cycle-cost for polling a MT > > > safe service > > > > when statistics are enabled. No change was seen for MT unsafe > > > services, or > > > > when statistics are disabled. > > > > > > > > Reported-by: Mattias Rönnblom > > > > Suggested-by: Honnappa Nagarahalli > > > > Suggested-by: Morten Brørup > > > > Signed-off-by: Harry van Haaren > > > > > > > > --- > > > > --- > > > > lib/eal/common/rte_service.c | 10 -- > > > > 1 file changed, 8 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/lib/eal/common/rte_service.c > > > b/lib/eal/common/rte_service.c > > > > index ef31b1f63c..f045e74ef3 100644 > > > > --- a/lib/eal/common/rte_service.c > > > > +++ b/lib/eal/common/rte_service.c > > > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct > > > > rte_service_spec_impl *s, > > > > uint64_t start = rte_rdtsc(); > > > > s->spec.callback(userdata); > > > > uint64_t end = rte_rdtsc(); > > > > - s->cycles_spent += end - start; > > > > + uint64_t cycles = end - start; > > > > cs->calls_per_service[service_idx]++; > > > > - s->calls++; > > > > + if (service_mt_safe(s)) { > > > > + __atomic_fetch_add(&s->cycles_spent, cycles, > > > > __ATOMIC_RELAXED); > > > > + __atomic_fetch_add(&s->calls, 1, > > > > __ATOMIC_RELAXED); > > > > + } else { > > > > + s->cycles_spent += cycles; > > > > + s->calls++; > > > This is still a problem from a reader perspective. It is possible > that > > > the writes could be split while a reader is reading the stats. > These > > > need to be atomic adds. > > > > I don't understand what you suggest can go wrong here, Honnappa. If > you > > talking about 64 bit counters on 32 bit architectures, then I > understand the > > problem (and have many years of direct experience with it myself). > > Otherwise, I hope you can elaborate or direct me to educational > material > > about the issue, considering this a learning opportunity. :-) > I am thinking of the case where the 64b write is split into two 32b (or > more) write operations either by the compiler or the micro- > architecture. If this were to happen, it causes race conditions with > the reader. > > As far as I understand, the compiler does not provide any guarantees on > generating non-tearing stores unless an atomic builtin/function is > used. This seems like a generic problem for all 64b statistics counters in DPDK, and any other C code using 64 bit counters. Being a generic C problem, there is probably a generic solution to it. Is any compiler going to do something that stupid (i.e. tearing a store into multiple write operations) to a simple 64b counter on any 64 bit architecture (assuming the counter is 64b aligned)? 
Otherwise, we might only need to take special precautions for 32 bit architectures. > If we have to ensure the micro-architecture does not generate > split writes, we need to be careful that future code additions do not > change the alignment of the stats. Unless the structure containing the stats counters is packed, the contained 64b counters will be 64b aligned (on 64 bit architecture). So we should not worry about alignment, except perhaps on 32 bit architectures.
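One way to address the concern that future code additions silently change the alignment of the stats is a compile-time check; below is a sketch with an invented stand-in structure (the real one is rte_service_spec_impl, whose layout is not reproduced here) using static_assert, for which DPDK code would typically use RTE_BUILD_BUG_ON:

#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Invented stand-in for the stats-carrying structure. */
struct stats_holder {
	uint64_t calls;
	uint64_t cycles_spent;
};

/* Fail the build if packing or a later field insertion breaks the natural
 * 8-byte alignment that the non-tearing argument relies on. */
static_assert(offsetof(struct stats_holder, calls) % 8 == 0,
	      "calls must stay 8-byte aligned");
static_assert(offsetof(struct stats_holder, cycles_spent) % 8 == 0,
	      "cycles_spent must stay 8-byte aligned");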
RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
> > > > > > > > > > This commit fixes a potential racey-add that could occur if > > multiple > > > > service- > > > > > lcores were executing the same MT-safe service at the same time, > > > > > with service statistics collection enabled. > > > > > > > > > > Because multiple threads can run and execute the service, the > > stats > > > > values > > > > > can have multiple writer threads, resulting in the requirement > > > > > of > > > > using > > > > > atomic addition for correctness. > > > > > > > > > > Note that when a MT unsafe service is executed, a spinlock is > > held, > > > > so the > > > > > stats increments are protected. This fact is used to avoid > > executing > > > > atomic > > > > > add instructions when not required. > > > > > > > > > > This patch causes a 1.25x increase in cycle-cost for polling a > > > > > MT > > > > safe service > > > > > when statistics are enabled. No change was seen for MT unsafe > > > > services, or > > > > > when statistics are disabled. > > > > > > > > > > Reported-by: Mattias Rönnblom > > > > > Suggested-by: Honnappa Nagarahalli > > > > > > > > > > Suggested-by: Morten Brørup > > > > > Signed-off-by: Harry van Haaren > > > > > > > > > > --- > > > > > --- > > > > > lib/eal/common/rte_service.c | 10 -- > > > > > 1 file changed, 8 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/lib/eal/common/rte_service.c > > > > b/lib/eal/common/rte_service.c > > > > > index ef31b1f63c..f045e74ef3 100644 > > > > > --- a/lib/eal/common/rte_service.c > > > > > +++ b/lib/eal/common/rte_service.c > > > > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct > > > > > rte_service_spec_impl *s, > > > > > uint64_t start = rte_rdtsc(); > > > > > s->spec.callback(userdata); > > > > > uint64_t end = rte_rdtsc(); > > > > > - s->cycles_spent += end - start; > > > > > + uint64_t cycles = end - start; > > > > > cs->calls_per_service[service_idx]++; > > > > > - s->calls++; > > > > > + if (service_mt_safe(s)) { > > > > > + __atomic_fetch_add(&s->cycles_spent, > cycles, > > > > > __ATOMIC_RELAXED); > > > > > + __atomic_fetch_add(&s->calls, 1, > > > > > __ATOMIC_RELAXED); > > > > > + } else { > > > > > + s->cycles_spent += cycles; > > > > > + s->calls++; > > > > This is still a problem from a reader perspective. It is possible > > that > > > > the writes could be split while a reader is reading the stats. > > These > > > > need to be atomic adds. > > > > > > I don't understand what you suggest can go wrong here, Honnappa. If > > you > > > talking about 64 bit counters on 32 bit architectures, then I > > understand the > > > problem (and have many years of direct experience with it myself). > > > Otherwise, I hope you can elaborate or direct me to educational > > material > > > about the issue, considering this a learning opportunity. :-) > > I am thinking of the case where the 64b write is split into two 32b > > (or > > more) write operations either by the compiler or the micro- > > architecture. If this were to happen, it causes race conditions with > > the reader. > > > > As far as I understand, the compiler does not provide any guarantees > > on generating non-tearing stores unless an atomic builtin/function is > > used. > > This seems like a generic problem for all 64b statistics counters in DPDK, and > any other C code using 64 bit counters. Being a generic C problem, there is > probably a generic solution to it. Browsing through the code, I see similar problems elsewhere. > > Is any compiler going to do something that stupid (i.e. 
tearing a store into > multiple write operations) to a simple 64b counter on any 64 bit architecture > (assuming the counter is 64b aligned)? Otherwise, we might only need to > take special precautions for 32 bit architectures. It is always a debate on who is stupid, compiler or programmer 😊 Though not the same case, you can look at this discussion where compiler generated torn stores [1] when we all thought it has been generating a 64b store. [1] http://inbox.dpdk.org/dev/d5d563ab-0411-3faf-39ec-4994f2bc9...@intel.com/ > > > If we have to ensure the micro-architecture does not generate split > > writes, we need to be careful that future code additions do not change > > the alignment of the stats. > > Unless the structure containing the stats counters is packed, the contained > 64b counters will be 64b aligned (on 64 bit architecture). So we should not > worry about alignment, except perhaps on 32 bit architectures. Agree, future code changes need to be aware of these issues and DPDK supports 32b architectures.
RE: clarification on RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF flag
> On Wed, 6 Jul 2022 19:07:54 +0530 > venkatesh bs wrote: > > > Hi All, > > > > In multithreaded/Multicore environment can we use > > RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF > > independently, This flag is about reader-writer concurrency (not writer-writer concurrency). Reader-writer concurrency will use lock-free algorithm allowing for the data plane to scale well. > > or this flag should always be used with > > RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD. If you have multiple writers, you need to enable this flag. This is unrelated to RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF. > > > > We are trying to create and access the hash table with > > RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF > > only. > > We are getting crashes in multi core environments , we debugged > > nothing wrong in the application , everything looks good. If the crash is happening while adding/deleting keys and there are multiple writers, enabling RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD might solve the problem. > > > > We call rte_hash_del_key() first and from the returned position we are > > calling rte_hash_free_key_with_position(). > > > > Please let me know if we missed something. > > > > Thanks, > > Venkatesh. > > Repeating same question doesn't get answer faster. > > Read the code, it is fairly straightforward. > > The multi-writer add means that writers take a lock. > If doing lock free support then: > 1. It is up to your application to use a single writer and/or > wrap writer calls in a lock. > > 2. You need to use RCU mechanism to guarantee that no reader > will access a deleted entry. Something like: > > rte_hash_del_key() > synchronize_rcu() > rte_hash_free_key_with_position() > > You can use either the DPDK RCU library or the userspace RCU library. > Read that documentation, RCU is non-trivial change. RCU is integrated in the hash library. You could use that as well. Look at ' rte_hash_rcu_qsbr_add' API for more details. >
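A minimal sketch of the delete/synchronize/free ordering described above, using the DPDK QSBR RCU API; reader-thread registration, error handling and the table setup are omitted, and the function name is invented for the example:

#include <rte_hash.h>
#include <rte_rcu_qsbr.h>

/* Writer-side removal for a table created with
 * RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF (plus
 * RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD if there are multiple writers):
 * unlink the key, wait for all registered readers to go quiescent, then
 * return the key slot to the table. */
static int
remove_key(struct rte_hash *h, struct rte_rcu_qsbr *qsv, const void *key)
{
	int32_t pos = rte_hash_del_key(h, key);

	if (pos < 0)
		return pos;	/* key not found or error */

	/* Wait until no reader can still hold a reference to the entry. */
	rte_rcu_qsbr_synchronize(qsv, RTE_QSBR_THRID_INVALID);

	return rte_hash_free_key_with_position(h, pos);
}

Alternatively, as noted above, attaching the QSBR variable with rte_hash_rcu_qsbr_add() lets the hash library defer the reclamation internally instead of calling rte_hash_free_key_with_position() by hand.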
RE: [PATCH 2/2] service: fix potential stats race-condition on MT services
> From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com] > Sent: Friday, 8 July 2022 19.40 > > > > > > > > > > > > > > This commit fixes a potential racey-add that could occur if > > > multiple > > > > > service- > > > > > > lcores were executing the same MT-safe service at the same > time, > > > > > > with service statistics collection enabled. > > > > > > > > > > > > Because multiple threads can run and execute the service, the > > > stats > > > > > values > > > > > > can have multiple writer threads, resulting in the > requirement > > > > > > of > > > > > using > > > > > > atomic addition for correctness. > > > > > > > > > > > > Note that when a MT unsafe service is executed, a spinlock is > > > held, > > > > > so the > > > > > > stats increments are protected. This fact is used to avoid > > > executing > > > > > atomic > > > > > > add instructions when not required. > > > > > > > > > > > > This patch causes a 1.25x increase in cycle-cost for polling > a > > > > > > MT > > > > > safe service > > > > > > when statistics are enabled. No change was seen for MT unsafe > > > > > services, or > > > > > > when statistics are disabled. > > > > > > > > > > > > Reported-by: Mattias Rönnblom > > > > > > Suggested-by: Honnappa Nagarahalli > > > > > > > > > > > > Suggested-by: Morten Brørup > > > > > > Signed-off-by: Harry van Haaren > > > > > > > > > > > > --- > > > > > > --- > > > > > > lib/eal/common/rte_service.c | 10 -- > > > > > > 1 file changed, 8 insertions(+), 2 deletions(-) > > > > > > > > > > > > diff --git a/lib/eal/common/rte_service.c > > > > > b/lib/eal/common/rte_service.c > > > > > > index ef31b1f63c..f045e74ef3 100644 > > > > > > --- a/lib/eal/common/rte_service.c > > > > > > +++ b/lib/eal/common/rte_service.c > > > > > > @@ -363,9 +363,15 @@ service_runner_do_callback(struct > > > > > > rte_service_spec_impl *s, > > > > > > uint64_t start = rte_rdtsc(); > > > > > > s->spec.callback(userdata); > > > > > > uint64_t end = rte_rdtsc(); > > > > > > - s->cycles_spent += end - start; > > > > > > + uint64_t cycles = end - start; > > > > > > cs->calls_per_service[service_idx]++; > > > > > > - s->calls++; > > > > > > + if (service_mt_safe(s)) { > > > > > > + __atomic_fetch_add(&s->cycles_spent, > > cycles, > > > > > > __ATOMIC_RELAXED); > > > > > > + __atomic_fetch_add(&s->calls, 1, > > > > > > __ATOMIC_RELAXED); > > > > > > + } else { > > > > > > + s->cycles_spent += cycles; > > > > > > + s->calls++; > > > > > This is still a problem from a reader perspective. It is > possible > > > that > > > > > the writes could be split while a reader is reading the stats. > > > These > > > > > need to be atomic adds. > > > > > > > > I don't understand what you suggest can go wrong here, Honnappa. > If > > > you > > > > talking about 64 bit counters on 32 bit architectures, then I > > > understand the > > > > problem (and have many years of direct experience with it > myself). > > > > Otherwise, I hope you can elaborate or direct me to educational > > > material > > > > about the issue, considering this a learning opportunity. :-) > > > I am thinking of the case where the 64b write is split into two 32b > > > (or > > > more) write operations either by the compiler or the micro- > > > architecture. If this were to happen, it causes race conditions > with > > > the reader. > > > > > > As far as I understand, the compiler does not provide any > guarantees > > > on generating non-tearing stores unless an atomic builtin/function > is > > > used. 
> > > > This seems like a generic problem for all 64b statistics counters in > DPDK, and > > any other C code using 64 bit counters. Being a generic C problem, > there is > > probably a generic solution to it. > Browsing through the code, I see similar problems elsewhere. > > > > > Is any compiler going to do something that stupid (i.e. tearing a > store into > > multiple write operations) to a simple 64b counter on any 64 bit > architecture > > (assuming the counter is 64b aligned)? Otherwise, we might only need > to > > take special precautions for 32 bit architectures. > It is always a debate on who is stupid, compiler or programmer 😊 Compilers will never stop surprising me. Thankfully, they are not so unreliable and full of bugs as they were 25 years ago. :-) > > Though not the same case, you can look at this discussion where > compiler generated torn stores [1] when we all thought it has been > generating a 64b store. > > [1] http://inbox.dpdk.org/dev/d5d563ab-0411-3faf-39ec- > 4994f2bc9...@intel.com/ Good reference. Technically, this sets a bunch of fields in the rte_lpm_tbl_entry structure (which happens to be 32b in total size), so it is not completely unreasonable for the compiler to store those fields individually. I wonder if using a union to cast the rte_lpm_tbl_entry
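For reference, a sketch of the whole-word publication that resolved the torn-store issue referenced above; the entry layout is a made-up 32-bit example, not the real rte_lpm_tbl_entry, and the union mirrors the "union to cast" idea the mail trails off into:

#include <stdint.h>

/* Made-up 32-bit table entry with several sub-fields. */
struct tbl_entry {
	uint32_t next_hop : 24;
	uint32_t valid    :  1;
	uint32_t depth    :  7;
};

union tbl_slot {
	struct tbl_entry e;
	uint32_t word;
};

/* Assigning the fields one by one allows the compiler to emit several
 * narrow stores, which a lock-free reader can observe half-updated.
 * Building the new value locally and publishing it with a single atomic
 * 32-bit store avoids that. */
static inline void
tbl_slot_update(union tbl_slot *slot, struct tbl_entry new_val)
{
	union tbl_slot tmp = { .e = new_val };

	__atomic_store_n(&slot->word, tmp.word, __ATOMIC_RELEASE);
}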
RE: [RFC] rwlock: prevent readers from starving writers
> > The original reader/writer lock in DPDK can cause a stream of readers to > starve writers. > > The new version uses an additional bit to indicate that a writer is waiting > and > which keeps readers from starving the writer. This addition makes sense. I am wondering if we should create a new lock. Is it possible that some applications are dependent on the current behavior? > > Signed-off-by: Stephen Hemminger > --- > Would like this to be in 22.11, but needs some more review > > lib/eal/include/generic/rte_rwlock.h | 93 ++-- > 1 file changed, 61 insertions(+), 32 deletions(-) > > diff --git a/lib/eal/include/generic/rte_rwlock.h > b/lib/eal/include/generic/rte_rwlock.h > index da9bc3e9c0e2..725cd19ffb27 100644 > --- a/lib/eal/include/generic/rte_rwlock.h > +++ b/lib/eal/include/generic/rte_rwlock.h > @@ -13,7 +13,7 @@ > * This file defines an API for read-write locks. The lock is used to > * protect data that allows multiple readers in parallel, but only > * one writer. All readers are blocked until the writer is finished > - * writing. > + * writing. This version will not starve writers. > * > */ > > @@ -28,10 +28,17 @@ extern "C" { > /** > * The rte_rwlock_t type. > * > - * cnt is -1 when write lock is held, and > 0 when read locks are held. > + * Readers increment the counter by RW_READ (4) > + * Writers set the RWLOCK_WRITE bit when lock is held > + * and set the RWLOCK_WAIT bit while waiting. > */ > + > +#define RTE_RWLOCK_WAIT 0x1/* Writer is waiting */ > +#define RTE_RWLOCK_WRITE 0x2 /* Writer has the lock */ > +#define RTE_RWLOCK_READ 0x4/* Reader increment */ > + > typedef struct { > - volatile int32_t cnt; /**< -1 when W lock held, > 0 when R locks held. > */ > + volatile int32_t cnt; > } rte_rwlock_t; > > /** > @@ -61,17 +68,24 @@ static inline void > rte_rwlock_read_lock(rte_rwlock_t *rwl) { > int32_t x; > - int success = 0; > > - while (success == 0) { > + while (1) { > x = __atomic_load_n(&rwl->cnt, __ATOMIC_RELAXED); > /* write lock is held */ > - if (x < 0) { > + if (x & (RTE_RWLOCK_WAIT | RTE_RWLOCK_WRITE)) { > rte_pause(); > continue; > } > - success = __atomic_compare_exchange_n(&rwl->cnt, &x, x > + 1, 1, > - __ATOMIC_ACQUIRE, > __ATOMIC_RELAXED); > + > + /* Try to get read lock */ > + x = __atomic_add_fetch(&rwl->cnt, RTE_RWLOCK_READ, > +__ATOMIC_ACQUIRE); > + if (!(x & (RTE_RWLOCK_WAIT | RTE_RWLOCK_WRITE))) > + return; > + > + /* Undo */ > + __atomic_fetch_sub(&rwl->cnt, RTE_RWLOCK_READ, > +__ATOMIC_RELEASE); > } > } > > @@ -93,17 +107,23 @@ static inline int > rte_rwlock_read_trylock(rte_rwlock_t *rwl) { > int32_t x; > - int success = 0; > > - while (success == 0) { > - x = __atomic_load_n(&rwl->cnt, __ATOMIC_RELAXED); > - /* write lock is held */ > - if (x < 0) > - return -EBUSY; > - success = __atomic_compare_exchange_n(&rwl->cnt, &x, x > + 1, 1, > - __ATOMIC_ACQUIRE, > __ATOMIC_RELAXED); > - } > + x = __atomic_load_n(&rwl->cnt, __ATOMIC_RELAXED); > + > + /* write lock is held */ > + if (x & (RTE_RWLOCK_WAIT | RTE_RWLOCK_WRITE)) > + return -EBUSY; > + > + /* Try to get read lock */ > + x = __atomic_add_fetch(&rwl->cnt, RTE_RWLOCK_READ, > +__ATOMIC_ACQUIRE); > + > + if (x & (RTE_RWLOCK_WAIT | RTE_RWLOCK_WRITE)) { > + __atomic_fetch_sub(&rwl->cnt, RTE_RWLOCK_READ, > +__ATOMIC_RELEASE); > > + return -EBUSY; > + } > return 0; > } > > @@ -116,7 +136,7 @@ rte_rwlock_read_trylock(rte_rwlock_t *rwl) static > inline void rte_rwlock_read_unlock(rte_rwlock_t *rwl) { > - __atomic_fetch_sub(&rwl->cnt, 1, __ATOMIC_RELEASE); > + __atomic_fetch_sub(&rwl->cnt, RTE_RWLOCK_READ, > 
__ATOMIC_RELEASE); > } > > /** > @@ -139,11 +159,12 @@ rte_rwlock_write_trylock(rte_rwlock_t *rwl) > int32_t x; > > x = __atomic_load_n(&rwl->cnt, __ATOMIC_RELAXED); > - if (x != 0 || __atomic_compare_exchange_n(&rwl->cnt, &x, -1, 1, > - __ATOMIC_ACQUIRE, __ATOMIC_RELAXED) == 0) > + if (x < RTE_RWLOCK_WRITE && > + __atomic_compare_exchange_n(&rwl->cnt, &x, x + > RTE_RWLOCK_WRITE, > + 1, __ATOMIC_ACQUIRE, > __ATOMIC_RELAXED)) > + return 0; > + else > return -EBUSY; > - > - return 0; > } > > /** > @@ -156,18 +177,26 @@ static inline void > rte_rwlock_write_lock(rte_rwlock_t *rwl) { > int32_t x; > - int
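To make the proposed encoding easier to follow, a small illustration of how the single lock word is interpreted under this RFC; the helper functions are invented for the example and are not part of the patch:

#include <stdint.h>

#define RTE_RWLOCK_WAIT  0x1	/* a writer is waiting */
#define RTE_RWLOCK_WRITE 0x2	/* a writer holds the lock */
#define RTE_RWLOCK_READ  0x4	/* each reader adds this */

/* cnt == 0                        -> lock is free
 * cnt == RTE_RWLOCK_WRITE         -> writer holds it, no readers
 * cnt == 3 * RTE_RWLOCK_READ      -> three readers, no writer
 * cnt == 3 * RTE_RWLOCK_READ
 *        | RTE_RWLOCK_WAIT        -> three readers draining, writer queued,
 *                                    new readers must spin                 */
static inline int
rwlock_writer_active(int32_t cnt)
{
	return cnt & RTE_RWLOCK_WRITE;
}

static inline int
rwlock_writer_waiting(int32_t cnt)
{
	return cnt & RTE_RWLOCK_WAIT;
}

static inline int32_t
rwlock_reader_count(int32_t cnt)
{
	return cnt / RTE_RWLOCK_READ;	/* readers counted in units of 4 */
}

The behavioural change is that rte_rwlock_read_lock() backs off whenever either of the two low bits is set, so a queued writer can no longer be overtaken indefinitely by a stream of new readers.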
Re: [PATCH 2/2] service: fix potential stats race-condition on MT services
On 2022-07-08 18:21, Bruce Richardson wrote: On Fri, Jul 08, 2022 at 03:31:01PM +, Van Haaren, Harry wrote: -Original Message- From: Honnappa Nagarahalli Sent: Friday, July 8, 2022 4:16 PM To: Van Haaren, Harry ; dev@dpdk.org Cc: mattias.ronnblom ; Morten Brørup ; nd ; nd Subject: RE: [PATCH 2/2] service: fix potential stats race-condition on MT services diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c index ef31b1f63c..f045e74ef3 100644 --- a/lib/eal/common/rte_service.c +++ b/lib/eal/common/rte_service.c @@ -363,9 +363,15 @@ service_runner_do_callback(struct rte_service_spec_impl *s, uint64_t start = rte_rdtsc(); s->spec.callback(userdata); uint64_t end = rte_rdtsc(); - s->cycles_spent += end - start; + uint64_t cycles = end - start; cs->calls_per_service[service_idx]++; - s->calls++; + if (service_mt_safe(s)) { + __atomic_fetch_add(&s->cycles_spent, cycles, __ATOMIC_RELAXED); + __atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED); + } else { + s->cycles_spent += cycles; + s->calls++; This is still a problem from a reader perspective. It is possible that the writes could be split while a reader is reading the stats. These need to be atomic adds. Thanks for pointing out; I do "think" in x86 in terms of load/store tearing; and on x86 naturally aligned load/stores will not tear. Apologies for missing the ARM angle here. I'm not sure how to best encode the difference between tearing & "locked instructions" to make things multi-writer safe. But they're not the same thing, and I'd prefer not pay the penalty for LOCK instructions (multi-writer) only to satisfy the non-tearing requirements. Is there an rte_atomic-* type that is guaranteed as non-tearing? In that case, changing the type of the calls/cycles_spent variables to such a type to ensure "non-tearing" single-reader, single-writer behaviour is enough, instead of forcing __atomic_fetch_add() everywhere? Regular read, increment and then atomic store should work without locks where alignment is correct on most architectures, and doing the store atomically should prevent any tearing. This is a good pattern, and provides a noticeable performance benefit compared to using an atomic add (which is an overkill in single-writer scenarios), in my experience. The DPDK seqcount implementation increments the count in this manner.
RE: [RFC] rwlock: prevent readers from starving writers
> From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com] > Sent: Friday, 8 July 2022 21.22 > > > > > > The original reader/writer lock in DPDK can cause a stream of readers > to > > starve writers. > > > > The new version uses an additional bit to indicate that a writer is > waiting and > > which keeps readers from starving the writer. > This addition makes sense. > I am wondering if we should create a new lock. Is it possible that some > applications are dependent on the current behavior? Any reader risks having to wait a while for a writer to finish its work. In my opinion, this implementation only increases the probability of that risk occurring, but it doesn't change the writer's impact on the readers. Therefore, I think this improved implementation can replace the old rwlock. > > > > > Signed-off-by: Stephen Hemminger > > --- > > Would like this to be in 22.11, but needs some more review > > > > lib/eal/include/generic/rte_rwlock.h | 93 ++ > -- > > 1 file changed, 61 insertions(+), 32 deletions(-) > > > > diff --git a/lib/eal/include/generic/rte_rwlock.h > > b/lib/eal/include/generic/rte_rwlock.h > > index da9bc3e9c0e2..725cd19ffb27 100644 > > --- a/lib/eal/include/generic/rte_rwlock.h > > +++ b/lib/eal/include/generic/rte_rwlock.h > > @@ -13,7 +13,7 @@ > > * This file defines an API for read-write locks. The lock is used > to > > * protect data that allows multiple readers in parallel, but only > > * one writer. All readers are blocked until the writer is finished > > - * writing. > > + * writing. This version will not starve writers. > > * > > */ > > > > @@ -28,10 +28,17 @@ extern "C" { > > /** > > * The rte_rwlock_t type. > > * > > - * cnt is -1 when write lock is held, and > 0 when read locks are > held. > > + * Readers increment the counter by RW_READ (4) > > + * Writers set the RWLOCK_WRITE bit when lock is held > > + * and set the RWLOCK_WAIT bit while waiting. > > */ > > + > > +#define RTE_RWLOCK_WAIT 0x1/* Writer is waiting */ > > +#define RTE_RWLOCK_WRITE 0x2 /* Writer has the lock */ > > +#define RTE_RWLOCK_READ 0x4/* Reader increment */ > > + > > typedef struct { > > - volatile int32_t cnt; /**< -1 when W lock held, > 0 when R locks > held. > > */ > > + volatile int32_t cnt; Not signed anymore, so consider uint32_t. Suggest also rename to cnt_state or similar, since it is not just a counter anymore. > > } rte_rwlock_t; > > > > /** > > @@ -61,17 +68,24 @@ static inline void > > rte_rwlock_read_lock(rte_rwlock_t *rwl) { > > int32_t x; > > - int success = 0; > > > > - while (success == 0) { > > + while (1) { > > x = __atomic_load_n(&rwl->cnt, __ATOMIC_RELAXED); > > /* write lock is held */ Held -> Held or pending, not just held. Add question mark, or move inside the if block. > > - if (x < 0) { > > + if (x & (RTE_RWLOCK_WAIT | RTE_RWLOCK_WRITE)) { > > rte_pause(); > > continue; > > } > > - success = __atomic_compare_exchange_n(&rwl->cnt, &x, x > > + 1, 1, > > - __ATOMIC_ACQUIRE, > > __ATOMIC_RELAXED); > > + > > + /* Try to get read lock */ > > + x = __atomic_add_fetch(&rwl->cnt, RTE_RWLOCK_READ, > > + __ATOMIC_ACQUIRE); > > + if (!(x & (RTE_RWLOCK_WAIT | RTE_RWLOCK_WRITE))) > > + return; > > + > > + /* Undo */ Undo -> Unable, so release the read lock. 
> > + __atomic_fetch_sub(&rwl->cnt, RTE_RWLOCK_READ, > > + __ATOMIC_RELEASE); > > } > > } > > > > @@ -93,17 +107,23 @@ static inline int > > rte_rwlock_read_trylock(rte_rwlock_t *rwl) { > > int32_t x; > > - int success = 0; > > > > - while (success == 0) { > > - x = __atomic_load_n(&rwl->cnt, __ATOMIC_RELAXED); > > - /* write lock is held */ > > - if (x < 0) > > - return -EBUSY; > > - success = __atomic_compare_exchange_n(&rwl->cnt, &x, x > > + 1, 1, > > - __ATOMIC_ACQUIRE, > > __ATOMIC_RELAXED); > > - } > > + x = __atomic_load_n(&rwl->cnt, __ATOMIC_RELAXED); > > + > > + /* write lock is held */ Same comment as above. > > + if (x & (RTE_RWLOCK_WAIT | RTE_RWLOCK_WRITE)) > > + return -EBUSY; > > + > > + /* Try to get read lock */ > > + x = __atomic_add_fetch(&rwl->cnt, RTE_RWLOCK_READ, > > + __ATOMIC_ACQUIRE); > > + > > + if (x & (RTE_RWLOCK_WAIT | RTE_RWLOCK_WRITE)) { Add a comment, e.g.: Unable, so release the read lock. > > + __atomic_fetch_sub(&rwl->cnt, RTE_RWLOCK_READ, > > + __ATOMIC_RELEASE); > > > > + return -EBUSY; > > + } > > return 0; > > } > > > > @@ -116,7 +136,7 @@ rte_rwlock_read_trylock(rte_rwlock_t *rwl) > static > > inline void rte_rwlock_
[Patch v4 00/17] Introduce Microsoft Azure Network Adapter (MANA) PMD
From: Long Li MANA is a network interface card to be used in the Azure cloud environment. MANA provides safe access to user memory through memory registration. It has IOMMU built into the hardware. MANA uses IB verbs and RDMA layer to configure hardware resources. It requires the corresponding RDMA kernel-mode and user-mode drivers. The MANA RDMA kernel-mode driver is being reviewed at: https://patchwork.kernel.org/project/netdevbpf/cover/1655345240-26411-1-git-send-email-lon...@linuxonhyperv.com/ The MANA RDMA user-mode driver is being reviewed at: https://github.com/linux-rdma/rdma-core/pull/1177 Long Li (17): net/mana: add basic driver, build environment and doc net/mana: add device configuration and stop net/mana: add function to report support ptypes net/mana: add link update net/mana: add function for device removal interrupts net/mana: add device info net/mana: add function to configure RSS net/mana: add function to configure RX queues net/mana: add function to configure TX queues net/mana: implement memory registration net/mana: implement the hardware layer operations net/mana: add function to start/stop TX queues net/mana: add function to start/stop RX queues net/mana: add function to receive packets net/mana: add function to send packets net/mana: add function to start/stop device net/mana: add function to report queue stats MAINTAINERS |6 + doc/guides/nics/features/mana.ini | 20 + doc/guides/nics/index.rst |1 + doc/guides/nics/mana.rst | 66 ++ drivers/net/mana/gdma.c | 284 ++ drivers/net/mana/mana.c | 1351 + drivers/net/mana/mana.h | 544 drivers/net/mana/meson.build | 48 + drivers/net/mana/mp.c | 323 +++ drivers/net/mana/mr.c | 324 +++ drivers/net/mana/rx.c | 450 ++ drivers/net/mana/tx.c | 398 + drivers/net/mana/version.map |3 + drivers/net/meson.build |1 + 14 files changed, 3819 insertions(+) create mode 100644 doc/guides/nics/features/mana.ini create mode 100644 doc/guides/nics/mana.rst create mode 100644 drivers/net/mana/gdma.c create mode 100644 drivers/net/mana/mana.c create mode 100644 drivers/net/mana/mana.h create mode 100644 drivers/net/mana/meson.build create mode 100644 drivers/net/mana/mp.c create mode 100644 drivers/net/mana/mr.c create mode 100644 drivers/net/mana/rx.c create mode 100644 drivers/net/mana/tx.c create mode 100644 drivers/net/mana/version.map -- 2.17.1
[Patch v4 01/17] net/mana: add basic driver, build environment and doc
From: Long Li MANA is a PCI device. It uses IB verbs to access hardware through the kernel RDMA layer. This patch introduces build environment and basic device probe functions. Signed-off-by: Long Li --- Change log: v2: Fix typos. Make the driver build only on x86-64 and Linux. Remove unused header files. Change port definition to uint16_t or uint8_t (for IB). Use getline() in place of fgets() to read and truncate a line. v3: Add meson build check for required functions from RDMA direct verb header file v4: Remove extra "\n" in logging code. Use "r" in place of "rb" in fopen() to read text files. MAINTAINERS | 6 + doc/guides/nics/features/mana.ini | 10 + doc/guides/nics/index.rst | 1 + doc/guides/nics/mana.rst | 66 +++ drivers/net/mana/mana.c | 704 ++ drivers/net/mana/mana.h | 210 + drivers/net/mana/meson.build | 44 ++ drivers/net/mana/mp.c | 235 ++ drivers/net/mana/version.map | 3 + drivers/net/meson.build | 1 + 10 files changed, 1280 insertions(+) create mode 100644 doc/guides/nics/features/mana.ini create mode 100644 doc/guides/nics/mana.rst create mode 100644 drivers/net/mana/mana.c create mode 100644 drivers/net/mana/mana.h create mode 100644 drivers/net/mana/meson.build create mode 100644 drivers/net/mana/mp.c create mode 100644 drivers/net/mana/version.map diff --git a/MAINTAINERS b/MAINTAINERS index 18d9edaf88..b8bda48a33 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -837,6 +837,12 @@ F: buildtools/options-ibverbs-static.sh F: doc/guides/nics/mlx5.rst F: doc/guides/nics/features/mlx5.ini +Microsoft mana +M: Long Li +F: drivers/net/mana +F: doc/guides/nics/mana.rst +F: doc/guides/nics/features/mana.ini + Microsoft vdev_netvsc - EXPERIMENTAL M: Matan Azrad F: drivers/net/vdev_netvsc/ diff --git a/doc/guides/nics/features/mana.ini b/doc/guides/nics/features/mana.ini new file mode 100644 index 00..b92a27374c --- /dev/null +++ b/doc/guides/nics/features/mana.ini @@ -0,0 +1,10 @@ +; +; Supported features of the 'mana' network poll mode driver. +; +; Refer to default.ini for the full list of available PMD features. +; +[Features] +Linux= Y +Multiprocess aware = Y +Usage doc= Y +x86-64 = Y diff --git a/doc/guides/nics/index.rst b/doc/guides/nics/index.rst index 1c94caccea..2725d1d9f0 100644 --- a/doc/guides/nics/index.rst +++ b/doc/guides/nics/index.rst @@ -41,6 +41,7 @@ Network Interface Controller Drivers intel_vf kni liquidio +mana memif mlx4 mlx5 diff --git a/doc/guides/nics/mana.rst b/doc/guides/nics/mana.rst new file mode 100644 index 00..40e18fe810 --- /dev/null +++ b/doc/guides/nics/mana.rst @@ -0,0 +1,66 @@ +.. SPDX-License-Identifier: BSD-3-Clause +Copyright 2022 Microsoft Corporation + +MANA poll mode driver library += + +The MANA poll mode driver library (**librte_net_mana**) implements support +for Microsoft Azure Network Adapter VF in SR-IOV context. + +Features + + +Features of the MANA Ethdev PMD are: + +Prerequisites +- + +This driver relies on external libraries and kernel drivers for resources +allocations and initialization. The following dependencies are not part of +DPDK and must be installed separately: + +- **libibverbs** (provided by rdma-core package) + + User space verbs framework used by librte_net_mana. This library provides + a generic interface between the kernel and low-level user space drivers + such as libmana. + + It allows slow and privileged operations (context initialization, hardware + resources allocations) to be managed by the kernel and fast operations to + never leave user space. 
+ +- **libmana** (provided by rdma-core package) + + Low-level user space driver library for Microsoft Azure Network Adapter + devices, it is automatically loaded by libibverbs. + +- **Kernel modules** + + They provide the kernel-side verbs API and low level device drivers that + manage actual hardware initialization and resources sharing with user + space processes. + + Unlike most other PMDs, these modules must remain loaded and bound to + their devices: + + - mana: Ethernet device driver that provides kernel network interfaces. + - mana_ib: InifiniBand device driver. + - ib_uverbs: user space driver for verbs (entry point for libibverbs). + +Driver compilation and testing +-- + +Refer to the document :ref:`compiling and testing a PMD for a NIC ` +for details. + +Netvsc PMD arguments + + +The user can specify below argument in devargs. + +#. ``mac``: + +Specify the MAC address for this device. If it is set, the driver +probes and loads the NIC with a matching mac address. If it is not +set, the driver probes on all the NICs on the PCI device. The default +value is not set, meaning all
[Patch v4 02/17] net/mana: add device configuration and stop
From: Long Li MANA defines its memory allocation functions to override IB layer default functions to allocate device queues. This patch adds the code for device configuration and stop. Signed-off-by: Long Li --- Change log: v2: Removed validation for offload settings in mana_dev_configure(). drivers/net/mana/mana.c | 75 +++-- drivers/net/mana/mana.h | 3 ++ 2 files changed, 76 insertions(+), 2 deletions(-) diff --git a/drivers/net/mana/mana.c b/drivers/net/mana/mana.c index cb59eb6882..147ab144d5 100644 --- a/drivers/net/mana/mana.c +++ b/drivers/net/mana/mana.c @@ -40,7 +40,79 @@ static rte_spinlock_t mana_shared_data_lock = RTE_SPINLOCK_INITIALIZER; int mana_logtype_driver; int mana_logtype_init; +void *mana_alloc_verbs_buf(size_t size, void *data) +{ + void *ret; + size_t alignment = rte_mem_page_size(); + int socket = (int)(uintptr_t)data; + + DRV_LOG(DEBUG, "size=%zu socket=%d", size, socket); + + if (alignment == (size_t)-1) { + DRV_LOG(ERR, "Failed to get mem page size"); + rte_errno = ENOMEM; + return NULL; + } + + ret = rte_zmalloc_socket("mana_verb_buf", size, alignment, socket); + if (!ret && size) + rte_errno = ENOMEM; + return ret; +} + +void mana_free_verbs_buf(void *ptr, void *data __rte_unused) +{ + rte_free(ptr); +} + +static int mana_dev_configure(struct rte_eth_dev *dev) +{ + struct mana_priv *priv = dev->data->dev_private; + struct rte_eth_conf *dev_conf = &dev->data->dev_conf; + + if (dev_conf->rxmode.mq_mode & ETH_MQ_RX_RSS_FLAG) + dev_conf->rxmode.offloads |= DEV_RX_OFFLOAD_RSS_HASH; + + if (dev->data->nb_rx_queues != dev->data->nb_tx_queues) { + DRV_LOG(ERR, "Only support equal number of rx/tx queues"); + return -EINVAL; + } + + if (!rte_is_power_of_2(dev->data->nb_rx_queues)) { + DRV_LOG(ERR, "number of TX/RX queues must be power of 2"); + return -EINVAL; + } + + priv->num_queues = dev->data->nb_rx_queues; + + manadv_set_context_attr(priv->ib_ctx, MANADV_CTX_ATTR_BUF_ALLOCATORS, + (void *)((uintptr_t)&(struct manadv_ctx_allocators){ + .alloc = &mana_alloc_verbs_buf, + .free = &mana_free_verbs_buf, + .data = 0, + })); + + return 0; +} + +static int +mana_dev_close(struct rte_eth_dev *dev) +{ + struct mana_priv *priv = dev->data->dev_private; + int ret; + + ret = ibv_close_device(priv->ib_ctx); + if (ret) { + ret = errno; + return ret; + } + + return 0; +} + const struct eth_dev_ops mana_dev_ops = { + .dev_configure = mana_dev_configure, + .dev_close = mana_dev_close, }; const struct eth_dev_ops mana_dev_sec_ops = { @@ -627,8 +699,7 @@ static int mana_pci_probe(struct rte_pci_driver *pci_drv __rte_unused, static int mana_dev_uninit(struct rte_eth_dev *dev) { - RTE_SET_USED(dev); - return 0; + return mana_dev_close(dev); } static int mana_pci_remove(struct rte_pci_device *pci_dev) diff --git a/drivers/net/mana/mana.h b/drivers/net/mana/mana.h index e30c030b4e..66873394b9 100644 --- a/drivers/net/mana/mana.h +++ b/drivers/net/mana/mana.h @@ -207,4 +207,7 @@ int mana_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev); void mana_mp_req_on_rxtx(struct rte_eth_dev *dev, enum mana_mp_req_type type); +void *mana_alloc_verbs_buf(size_t size, void *data); +void mana_free_verbs_buf(void *ptr, void *data __rte_unused); + #endif -- 2.17.1
[Patch v4 03/17] net/mana: add function to report supported ptypes
From: Long Li Report supported protocol types. Signed-off-by: Long Li --- drivers/net/mana/mana.c | 16 drivers/net/mana/mana.h | 2 -- 2 files changed, 16 insertions(+), 2 deletions(-) diff --git a/drivers/net/mana/mana.c b/drivers/net/mana/mana.c index 147ab144d5..4559632056 100644 --- a/drivers/net/mana/mana.c +++ b/drivers/net/mana/mana.c @@ -110,9 +110,25 @@ mana_dev_close(struct rte_eth_dev *dev) return 0; } +static const uint32_t *mana_supported_ptypes(struct rte_eth_dev *dev __rte_unused) +{ + static const uint32_t ptypes[] = { + RTE_PTYPE_L2_ETHER, + RTE_PTYPE_L3_IPV4_EXT_UNKNOWN, + RTE_PTYPE_L3_IPV6_EXT_UNKNOWN, + RTE_PTYPE_L4_FRAG, + RTE_PTYPE_L4_TCP, + RTE_PTYPE_L4_UDP, + RTE_PTYPE_UNKNOWN + }; + + return ptypes; +} + const struct eth_dev_ops mana_dev_ops = { .dev_configure = mana_dev_configure, .dev_close = mana_dev_close, + .dev_supported_ptypes_get = mana_supported_ptypes, }; const struct eth_dev_ops mana_dev_sec_ops = { diff --git a/drivers/net/mana/mana.h b/drivers/net/mana/mana.h index 66873394b9..c433940022 100644 --- a/drivers/net/mana/mana.h +++ b/drivers/net/mana/mana.h @@ -168,8 +168,6 @@ extern int mana_logtype_init; #define PMD_INIT_FUNC_TRACE() PMD_INIT_LOG(DEBUG, " >>") -const uint32_t *mana_supported_ptypes(struct rte_eth_dev *dev); - uint16_t mana_rx_burst_removed(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n); -- 2.17.1
[Patch v4 04/17] net/mana: add link update
From: Long Li The carrier state is managed by the Azure host. MANA runs as a VF and always reports "up". Signed-off-by: Long Li --- doc/guides/nics/features/mana.ini | 1 + drivers/net/mana/mana.c | 17 + 2 files changed, 18 insertions(+) diff --git a/doc/guides/nics/features/mana.ini b/doc/guides/nics/features/mana.ini index b92a27374c..62554b0a0a 100644 --- a/doc/guides/nics/features/mana.ini +++ b/doc/guides/nics/features/mana.ini @@ -4,6 +4,7 @@ ; Refer to default.ini for the full list of available PMD features. ; [Features] +Link status = P Linux= Y Multiprocess aware = Y Usage doc= Y diff --git a/drivers/net/mana/mana.c b/drivers/net/mana/mana.c index 4559632056..b77d0c29b0 100644 --- a/drivers/net/mana/mana.c +++ b/drivers/net/mana/mana.c @@ -125,10 +125,27 @@ static const uint32_t *mana_supported_ptypes(struct rte_eth_dev *dev __rte_unuse return ptypes; } +static int mana_dev_link_update(struct rte_eth_dev *dev, + int wait_to_complete __rte_unused) +{ + struct rte_eth_link link; + + /* MANA has no concept of carrier state, always reporting UP */ + link = (struct rte_eth_link) { + .link_duplex = RTE_ETH_LINK_FULL_DUPLEX, + .link_autoneg = RTE_ETH_LINK_SPEED_FIXED, + .link_speed = RTE_ETH_SPEED_NUM_200G, + .link_status = RTE_ETH_LINK_UP, + }; + + return rte_eth_linkstatus_set(dev, &link); +} + const struct eth_dev_ops mana_dev_ops = { .dev_configure = mana_dev_configure, .dev_close = mana_dev_close, .dev_supported_ptypes_get = mana_supported_ptypes, + .link_update= mana_dev_link_update, }; const struct eth_dev_ops mana_dev_sec_ops = { -- 2.17.1
[Patch v4 05/17] net/mana: add function for device removal interrupts
From: Long Li MANA supports PCI hot plug events. Add this interrupt to DPDK core so its parent PMD can detect device removal during Azure servicing or live migration. Signed-off-by: Long Li --- doc/guides/nics/features/mana.ini | 1 + drivers/net/mana/mana.c | 97 +++ drivers/net/mana/mana.h | 1 + 3 files changed, 99 insertions(+) diff --git a/doc/guides/nics/features/mana.ini b/doc/guides/nics/features/mana.ini index 62554b0a0a..8043e11f99 100644 --- a/doc/guides/nics/features/mana.ini +++ b/doc/guides/nics/features/mana.ini @@ -7,5 +7,6 @@ Link status = P Linux= Y Multiprocess aware = Y +Removal event= Y Usage doc= Y x86-64 = Y diff --git a/drivers/net/mana/mana.c b/drivers/net/mana/mana.c index b77d0c29b0..c9591035ac 100644 --- a/drivers/net/mana/mana.c +++ b/drivers/net/mana/mana.c @@ -95,12 +95,18 @@ static int mana_dev_configure(struct rte_eth_dev *dev) return 0; } +static int mana_intr_uninstall(struct mana_priv *priv); + static int mana_dev_close(struct rte_eth_dev *dev) { struct mana_priv *priv = dev->data->dev_private; int ret; + ret = mana_intr_uninstall(priv); + if (ret) + return ret; + ret = ibv_close_device(priv->ib_ctx); if (ret) { ret = errno; @@ -327,6 +333,90 @@ static int mana_ibv_device_to_pci_addr(const struct ibv_device *device, return 0; } +static void mana_intr_handler(void *arg) +{ + struct mana_priv *priv = arg; + struct ibv_context *ctx = priv->ib_ctx; + struct ibv_async_event event; + + /* Read and ack all messages from IB device */ + while (true) { + if (ibv_get_async_event(ctx, &event)) + break; + + if (event.event_type == IBV_EVENT_DEVICE_FATAL) { + struct rte_eth_dev *dev; + + dev = &rte_eth_devices[priv->port_id]; + if (dev->data->dev_conf.intr_conf.rmv) + rte_eth_dev_callback_process(dev, + RTE_ETH_EVENT_INTR_RMV, NULL); + } + + ibv_ack_async_event(&event); + } +} + +static int mana_intr_uninstall(struct mana_priv *priv) +{ + int ret; + + ret = rte_intr_callback_unregister(priv->intr_handle, + mana_intr_handler, priv); + if (ret <= 0) { + DRV_LOG(ERR, "Failed to unregister intr callback ret %d", ret); + return ret; + } + + rte_intr_instance_free(priv->intr_handle); + + return 0; +} + +static int mana_intr_install(struct mana_priv *priv) +{ + int ret, flags; + struct ibv_context *ctx = priv->ib_ctx; + + priv->intr_handle = rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED); + if (!priv->intr_handle) { + DRV_LOG(ERR, "Failed to allocate intr_handle"); + rte_errno = ENOMEM; + return -ENOMEM; + } + + rte_intr_fd_set(priv->intr_handle, -1); + + flags = fcntl(ctx->async_fd, F_GETFL); + ret = fcntl(ctx->async_fd, F_SETFL, flags | O_NONBLOCK); + if (ret) { + DRV_LOG(ERR, "Failed to change async_fd to NONBLOCK"); + goto free_intr; + } + + rte_intr_fd_set(priv->intr_handle, ctx->async_fd); + rte_intr_type_set(priv->intr_handle, RTE_INTR_HANDLE_EXT); + + ret = rte_intr_callback_register(priv->intr_handle, +mana_intr_handler, priv); + if (ret) { + DRV_LOG(ERR, "Failed to register intr callback"); + rte_intr_fd_set(priv->intr_handle, -1); + goto restore_fd; + } + + return 0; + +restore_fd: + fcntl(ctx->async_fd, F_SETFL, flags); + +free_intr: + rte_intr_instance_free(priv->intr_handle); + priv->intr_handle = NULL; + + return ret; +} + static int mana_proc_priv_init(struct rte_eth_dev *dev) { struct mana_process_priv *priv; @@ -640,6 +730,13 @@ static int mana_pci_probe_mac(struct rte_pci_driver *pci_drv __rte_unused, name, priv->max_rx_queues, priv->max_rx_desc, priv->max_send_sge); + /* Create async interrupt handler */ + ret = mana_intr_install(priv); + if (ret) { + 
DRV_LOG(ERR, "Failed to install intr handler"); + goto failed; + } + rte_spinlock_lock(&mana_shared_data->lock); mana_shared_data->primary_cnt++; rte_spinlock_unlock(&mana_shared_data->lock); diff --git a/drivers/net/mana/mana.h b/drivers/net/mana/mana.h index c433940022..f97eed2e81 100644 --- a
[Patch v4 06/17] net/mana: add device info
From: Long Li Add the function to get device info. Signed-off-by: Long Li --- doc/guides/nics/features/mana.ini | 1 + drivers/net/mana/mana.c | 82 +++ 2 files changed, 83 insertions(+) diff --git a/doc/guides/nics/features/mana.ini b/doc/guides/nics/features/mana.ini index 8043e11f99..566b3e8770 100644 --- a/doc/guides/nics/features/mana.ini +++ b/doc/guides/nics/features/mana.ini @@ -8,5 +8,6 @@ Link status = P Linux= Y Multiprocess aware = Y Removal event= Y +Speed capabilities = P Usage doc= Y x86-64 = Y diff --git a/drivers/net/mana/mana.c b/drivers/net/mana/mana.c index c9591035ac..e1550b3c08 100644 --- a/drivers/net/mana/mana.c +++ b/drivers/net/mana/mana.c @@ -116,6 +116,86 @@ mana_dev_close(struct rte_eth_dev *dev) return 0; } +static int mana_dev_info_get(struct rte_eth_dev *dev, +struct rte_eth_dev_info *dev_info) +{ + struct mana_priv *priv = dev->data->dev_private; + + dev_info->max_mtu = RTE_ETHER_MTU; + + /* RX params */ + dev_info->min_rx_bufsize = MIN_RX_BUF_SIZE; + dev_info->max_rx_pktlen = MAX_FRAME_SIZE; + + dev_info->max_rx_queues = priv->max_rx_queues; + dev_info->max_tx_queues = priv->max_tx_queues; + + dev_info->max_mac_addrs = BNIC_MAX_MAC_ADDR; + dev_info->max_hash_mac_addrs = 0; + + dev_info->max_vfs = 1; + + /* Offload params */ + dev_info->rx_offload_capa = BNIC_DEV_RX_OFFLOAD_SUPPORT; + + dev_info->tx_offload_capa = BNIC_DEV_TX_OFFLOAD_SUPPORT; + + /* RSS */ + dev_info->reta_size = INDIRECTION_TABLE_NUM_ELEMENTS; + dev_info->hash_key_size = TOEPLITZ_HASH_KEY_SIZE_IN_BYTES; + dev_info->flow_type_rss_offloads = BNIC_ETH_RSS_SUPPORT; + + /* Thresholds */ + dev_info->default_rxconf = (struct rte_eth_rxconf){ + .rx_thresh = { + .pthresh = 8, + .hthresh = 8, + .wthresh = 0, + }, + .rx_free_thresh = 32, + /* If no descriptors available, pkts are dropped by default */ + .rx_drop_en = 1, + }; + + dev_info->default_txconf = (struct rte_eth_txconf){ + .tx_thresh = { + .pthresh = 32, + .hthresh = 0, + .wthresh = 0, + }, + .tx_rs_thresh = 32, + .tx_free_thresh = 32, + }; + + /* Buffer limits */ + dev_info->rx_desc_lim.nb_min = MIN_BUFFERS_PER_QUEUE; + dev_info->rx_desc_lim.nb_max = priv->max_rx_desc; + dev_info->rx_desc_lim.nb_align = MIN_BUFFERS_PER_QUEUE; + dev_info->rx_desc_lim.nb_seg_max = priv->max_recv_sge; + dev_info->rx_desc_lim.nb_mtu_seg_max = priv->max_recv_sge; + + dev_info->tx_desc_lim.nb_min = MIN_BUFFERS_PER_QUEUE; + dev_info->tx_desc_lim.nb_max = priv->max_tx_desc; + dev_info->tx_desc_lim.nb_align = MIN_BUFFERS_PER_QUEUE; + dev_info->tx_desc_lim.nb_seg_max = priv->max_send_sge; + dev_info->rx_desc_lim.nb_mtu_seg_max = priv->max_recv_sge; + + /* Speed */ + dev_info->speed_capa = ETH_LINK_SPEED_100G; + + /* RX params */ + dev_info->default_rxportconf.burst_size = 1; + dev_info->default_rxportconf.ring_size = MAX_RECEIVE_BUFFERS_PER_QUEUE; + dev_info->default_rxportconf.nb_queues = 1; + + /* TX params */ + dev_info->default_txportconf.burst_size = 1; + dev_info->default_txportconf.ring_size = MAX_SEND_BUFFERS_PER_QUEUE; + dev_info->default_txportconf.nb_queues = 1; + + return 0; +} + static const uint32_t *mana_supported_ptypes(struct rte_eth_dev *dev __rte_unused) { static const uint32_t ptypes[] = { @@ -150,11 +230,13 @@ static int mana_dev_link_update(struct rte_eth_dev *dev, const struct eth_dev_ops mana_dev_ops = { .dev_configure = mana_dev_configure, .dev_close = mana_dev_close, + .dev_infos_get = mana_dev_info_get, .dev_supported_ptypes_get = mana_supported_ptypes, .link_update= mana_dev_link_update, }; const struct eth_dev_ops mana_dev_sec_ops = { + 
.dev_infos_get = mana_dev_info_get, }; uint16_t -- 2.17.1
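[Not part of the patch: the limits filled in by mana_dev_info_get() are what applications see through rte_eth_dev_info_get(). A minimal application-side sketch, assuming an already-probed port; the helper name is hypothetical.]

#include <inttypes.h>
#include <stdio.h>
#include <rte_ethdev.h>

/* Hypothetical helper: clamp requested descriptor counts to the limits the
 * PMD reports via dev_infos_get (rx_desc_lim/tx_desc_lim above).
 */
static int
clamp_ring_sizes(uint16_t port_id, uint16_t *nb_rxd, uint16_t *nb_txd)
{
	struct rte_eth_dev_info info;
	int ret = rte_eth_dev_info_get(port_id, &info);

	if (ret != 0)
		return ret;

	if (*nb_rxd > info.rx_desc_lim.nb_max)
		*nb_rxd = info.rx_desc_lim.nb_max;
	if (*nb_rxd < info.rx_desc_lim.nb_min)
		*nb_rxd = info.rx_desc_lim.nb_min;
	if (*nb_txd > info.tx_desc_lim.nb_max)
		*nb_txd = info.tx_desc_lim.nb_max;
	if (*nb_txd < info.tx_desc_lim.nb_min)
		*nb_txd = info.tx_desc_lim.nb_min;

	printf("port %u: up to %u RX / %u TX queues, speed capa 0x%" PRIx32 "\n",
	       port_id, info.max_rx_queues, info.max_tx_queues,
	       info.speed_capa);
	return 0;
}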
[Patch v4 07/17] net/mana: add function to configure RSS
From: Long Li Currently this PMD supports RSS configuration when the device is stopped. Configuring RSS in running state will be supported in the future. Signed-off-by: Long Li --- doc/guides/nics/features/mana.ini | 1 + drivers/net/mana/mana.c | 61 +++ drivers/net/mana/mana.h | 1 + 3 files changed, 63 insertions(+) diff --git a/doc/guides/nics/features/mana.ini b/doc/guides/nics/features/mana.ini index 566b3e8770..a59c21cc10 100644 --- a/doc/guides/nics/features/mana.ini +++ b/doc/guides/nics/features/mana.ini @@ -8,6 +8,7 @@ Link status = P Linux= Y Multiprocess aware = Y Removal event= Y +RSS hash = Y Speed capabilities = P Usage doc= Y x86-64 = Y diff --git a/drivers/net/mana/mana.c b/drivers/net/mana/mana.c index e1550b3c08..cb136f24c1 100644 --- a/drivers/net/mana/mana.c +++ b/drivers/net/mana/mana.c @@ -211,6 +211,65 @@ static const uint32_t *mana_supported_ptypes(struct rte_eth_dev *dev __rte_unuse return ptypes; } +static int mana_rss_hash_update(struct rte_eth_dev *dev, + struct rte_eth_rss_conf *rss_conf) +{ + struct mana_priv *priv = dev->data->dev_private; + + /* Currently can only update RSS hash when device is stopped */ + if (dev->data->dev_started) { + DRV_LOG(ERR, "Can't update RSS after device has started"); + return -ENODEV; + } + + if (rss_conf->rss_hf & ~BNIC_ETH_RSS_SUPPORT) { + DRV_LOG(ERR, "Port %u invalid RSS HF 0x%" PRIx64, + dev->data->port_id, rss_conf->rss_hf); + return -EINVAL; + } + + if (rss_conf->rss_key && rss_conf->rss_key_len) { + if (rss_conf->rss_key_len != TOEPLITZ_HASH_KEY_SIZE_IN_BYTES) { + DRV_LOG(ERR, "Port %u key len must be %u long", + dev->data->port_id, + TOEPLITZ_HASH_KEY_SIZE_IN_BYTES); + return -EINVAL; + } + + priv->rss_conf.rss_key_len = rss_conf->rss_key_len; + priv->rss_conf.rss_key = + rte_zmalloc("mana_rss", rss_conf->rss_key_len, + RTE_CACHE_LINE_SIZE); + if (!priv->rss_conf.rss_key) + return -ENOMEM; + memcpy(priv->rss_conf.rss_key, rss_conf->rss_key, + rss_conf->rss_key_len); + } + priv->rss_conf.rss_hf = rss_conf->rss_hf; + + return 0; +} + +static int mana_rss_hash_conf_get(struct rte_eth_dev *dev, + struct rte_eth_rss_conf *rss_conf) +{ + struct mana_priv *priv = dev->data->dev_private; + + if (!rss_conf) + return -EINVAL; + + if (rss_conf->rss_key && + rss_conf->rss_key_len >= priv->rss_conf.rss_key_len) { + memcpy(rss_conf->rss_key, priv->rss_conf.rss_key, + priv->rss_conf.rss_key_len); + } + + rss_conf->rss_key_len = priv->rss_conf.rss_key_len; + rss_conf->rss_hf = priv->rss_conf.rss_hf; + + return 0; +} + static int mana_dev_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused) { @@ -232,6 +291,8 @@ const struct eth_dev_ops mana_dev_ops = { .dev_close = mana_dev_close, .dev_infos_get = mana_dev_info_get, .dev_supported_ptypes_get = mana_supported_ptypes, + .rss_hash_update= mana_rss_hash_update, + .rss_hash_conf_get = mana_rss_hash_conf_get, .link_update= mana_dev_link_update, }; diff --git a/drivers/net/mana/mana.h b/drivers/net/mana/mana.h index f97eed2e81..33f68b3d1b 100644 --- a/drivers/net/mana/mana.h +++ b/drivers/net/mana/mana.h @@ -72,6 +72,7 @@ struct mana_priv { uint8_t ind_table_key[40]; struct ibv_qp *rwq_qp; void *db_page; + struct rte_eth_rss_conf rss_conf; struct rte_intr_handle *intr_handle; int max_rx_queues; int max_tx_queues; -- 2.17.1
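[Not part of the patch: since mana_rss_hash_update() only accepts updates while the port is stopped, an application programs the hash before rte_eth_dev_start(). A hedged sketch; the key bytes are placeholders, and the 40-byte length matches the Toeplitz key size this PMD expects.]

#include <rte_ethdev.h>

/* Hypothetical setup: program the RSS key and hash types on a stopped port.
 * The requested hash types are masked to what the port reports as supported,
 * so the PMD's rss_hf validation above does not reject the call.
 */
static int
set_rss_before_start(uint16_t port_id)
{
	static uint8_t key[40] = { 0x6d, 0x5a, 0x56, 0xda }; /* placeholder key */
	struct rte_eth_dev_info info;
	struct rte_eth_rss_conf conf = {
		.rss_key = key,
		.rss_key_len = sizeof(key),
	};
	int ret;

	ret = rte_eth_dev_info_get(port_id, &info);
	if (ret != 0)
		return ret;

	/* Request IP/TCP/UDP hashing, limited to the supported offloads. */
	conf.rss_hf = (RTE_ETH_RSS_IP | RTE_ETH_RSS_TCP | RTE_ETH_RSS_UDP) &
		      info.flow_type_rss_offloads;

	return rte_eth_dev_rss_hash_update(port_id, &conf);
}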
[Patch v4 08/17] net/mana: add function to configure RX queues
From: Long Li RX hardware queue is allocated when starting the queue. This function is for queue configuration pre starting. Signed-off-by: Long Li --- drivers/net/mana/mana.c | 68 + 1 file changed, 68 insertions(+) diff --git a/drivers/net/mana/mana.c b/drivers/net/mana/mana.c index cb136f24c1..d03adab041 100644 --- a/drivers/net/mana/mana.c +++ b/drivers/net/mana/mana.c @@ -196,6 +196,16 @@ static int mana_dev_info_get(struct rte_eth_dev *dev, return 0; } +static void mana_dev_rx_queue_info(struct rte_eth_dev *dev, uint16_t queue_id, + struct rte_eth_rxq_info *qinfo) +{ + struct mana_rxq *rxq = dev->data->rx_queues[queue_id]; + + qinfo->mp = rxq->mp; + qinfo->nb_desc = rxq->num_desc; + qinfo->conf.offloads = dev->data->dev_conf.rxmode.offloads; +} + static const uint32_t *mana_supported_ptypes(struct rte_eth_dev *dev __rte_unused) { static const uint32_t ptypes[] = { @@ -270,6 +280,61 @@ static int mana_rss_hash_conf_get(struct rte_eth_dev *dev, return 0; } +static int mana_dev_rx_queue_setup(struct rte_eth_dev *dev, + uint16_t queue_idx, uint16_t nb_desc, + unsigned int socket_id, + const struct rte_eth_rxconf *rx_conf __rte_unused, + struct rte_mempool *mp) +{ + struct mana_priv *priv = dev->data->dev_private; + struct mana_rxq *rxq; + int ret; + + rxq = rte_zmalloc_socket("mana_rxq", sizeof(*rxq), 0, socket_id); + if (!rxq) { + DRV_LOG(ERR, "failed to allocate rxq"); + return -ENOMEM; + } + + DRV_LOG(DEBUG, "idx %u nb_desc %u socket %u", + queue_idx, nb_desc, socket_id); + + rxq->socket = socket_id; + + rxq->desc_ring = rte_zmalloc_socket("mana_rx_mbuf_ring", + sizeof(struct mana_rxq_desc) * + nb_desc, + RTE_CACHE_LINE_SIZE, socket_id); + + if (!rxq->desc_ring) { + DRV_LOG(ERR, "failed to allocate rxq desc_ring"); + ret = -ENOMEM; + goto fail; + } + + rxq->num_desc = nb_desc; + + rxq->priv = priv; + rxq->num_desc = nb_desc; + rxq->mp = mp; + dev->data->rx_queues[queue_idx] = rxq; + + return 0; + +fail: + rte_free(rxq->desc_ring); + rte_free(rxq); + return ret; +} + +static void mana_dev_rx_queue_release(struct rte_eth_dev *dev, uint16_t qid) +{ + struct mana_rxq *rxq = dev->data->rx_queues[qid]; + + rte_free(rxq->desc_ring); + rte_free(rxq); +} + static int mana_dev_link_update(struct rte_eth_dev *dev, int wait_to_complete __rte_unused) { @@ -290,9 +355,12 @@ const struct eth_dev_ops mana_dev_ops = { .dev_configure = mana_dev_configure, .dev_close = mana_dev_close, .dev_infos_get = mana_dev_info_get, + .rxq_info_get = mana_dev_rx_queue_info, .dev_supported_ptypes_get = mana_supported_ptypes, .rss_hash_update= mana_rss_hash_update, .rss_hash_conf_get = mana_rss_hash_conf_get, + .rx_queue_setup = mana_dev_rx_queue_setup, + .rx_queue_release = mana_dev_rx_queue_release, .link_update= mana_dev_link_update, }; -- 2.17.1
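[Not part of the patch: from the application side the setup above is reached through rte_eth_rx_queue_setup(); at this point only the shadow descriptor ring is allocated, the hardware queue is created later at start time. A minimal sketch with hypothetical pool parameters.]

#include <errno.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hypothetical usage: create a pool on the port's NUMA node and configure
 * RX queue 0 with the PMD's default rxconf (NULL rx_conf).
 */
static int
setup_rxq0(uint16_t port_id, uint16_t nb_desc)
{
	int socket = rte_eth_dev_socket_id(port_id);
	struct rte_mempool *mp;

	mp = rte_pktmbuf_pool_create("rxq0_pool", 8192, 256, 0,
				     RTE_MBUF_DEFAULT_BUF_SIZE, socket);
	if (mp == NULL)
		return -ENOMEM;

	return rte_eth_rx_queue_setup(port_id, 0, nb_desc, socket, NULL, mp);
}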
[Patch v4 09/17] net/mana: add function to configure TX queues
From: Long Li TX hardware queue is allocated when starting the queue, this is for pre configuration. Signed-off-by: Long Li --- drivers/net/mana/mana.c | 65 + 1 file changed, 65 insertions(+) diff --git a/drivers/net/mana/mana.c b/drivers/net/mana/mana.c index d03adab041..490686f404 100644 --- a/drivers/net/mana/mana.c +++ b/drivers/net/mana/mana.c @@ -196,6 +196,15 @@ static int mana_dev_info_get(struct rte_eth_dev *dev, return 0; } +static void mana_dev_tx_queue_info(struct rte_eth_dev *dev, uint16_t queue_id, + struct rte_eth_txq_info *qinfo) +{ + struct mana_txq *txq = dev->data->tx_queues[queue_id]; + + qinfo->conf.offloads = dev->data->dev_conf.txmode.offloads; + qinfo->nb_desc = txq->num_desc; +} + static void mana_dev_rx_queue_info(struct rte_eth_dev *dev, uint16_t queue_id, struct rte_eth_rxq_info *qinfo) { @@ -280,6 +289,59 @@ static int mana_rss_hash_conf_get(struct rte_eth_dev *dev, return 0; } +static int mana_dev_tx_queue_setup(struct rte_eth_dev *dev, + uint16_t queue_idx, uint16_t nb_desc, + unsigned int socket_id, + const struct rte_eth_txconf *tx_conf __rte_unused) + +{ + struct mana_priv *priv = dev->data->dev_private; + struct mana_txq *txq; + int ret; + + txq = rte_zmalloc_socket("mana_txq", sizeof(*txq), 0, socket_id); + if (!txq) { + DRV_LOG(ERR, "failed to allocate txq"); + return -ENOMEM; + } + + txq->socket = socket_id; + + txq->desc_ring = rte_malloc_socket("mana_tx_desc_ring", + sizeof(struct mana_txq_desc) * + nb_desc, + RTE_CACHE_LINE_SIZE, socket_id); + if (!txq->desc_ring) { + DRV_LOG(ERR, "failed to allocate txq desc_ring"); + ret = -ENOMEM; + goto fail; + } + + DRV_LOG(DEBUG, "idx %u nb_desc %u socket %u txq->desc_ring %p", + queue_idx, nb_desc, socket_id, txq->desc_ring); + + txq->desc_ring_head = 0; + txq->desc_ring_tail = 0; + txq->priv = priv; + txq->num_desc = nb_desc; + dev->data->tx_queues[queue_idx] = txq; + + return 0; + +fail: + rte_free(txq->desc_ring); + rte_free(txq); + return ret; +} + +static void mana_dev_tx_queue_release(struct rte_eth_dev *dev, uint16_t qid) +{ + struct mana_txq *txq = dev->data->tx_queues[qid]; + + rte_free(txq->desc_ring); + rte_free(txq); +} + static int mana_dev_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_idx, uint16_t nb_desc, unsigned int socket_id, @@ -355,10 +417,13 @@ const struct eth_dev_ops mana_dev_ops = { .dev_configure = mana_dev_configure, .dev_close = mana_dev_close, .dev_infos_get = mana_dev_info_get, + .txq_info_get = mana_dev_tx_queue_info, .rxq_info_get = mana_dev_rx_queue_info, .dev_supported_ptypes_get = mana_supported_ptypes, .rss_hash_update= mana_rss_hash_update, .rss_hash_conf_get = mana_rss_hash_conf_get, + .tx_queue_setup = mana_dev_tx_queue_setup, + .tx_queue_release = mana_dev_tx_queue_release, .rx_queue_setup = mana_dev_rx_queue_setup, .rx_queue_release = mana_dev_rx_queue_release, .link_update= mana_dev_link_update, -- 2.17.1
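[Not part of the patch: the TX counterpart of the RX sketch above; as with RX, only the mbuf shadow ring and its head/tail indexes are prepared here. Helper name hypothetical.]

#include <rte_ethdev.h>

/* Hypothetical usage: configure TX queue 0 with the default txconf the PMD
 * reports in dev_info (NULL tx_conf).
 */
static int
setup_txq0(uint16_t port_id, uint16_t nb_desc)
{
	int socket = rte_eth_dev_socket_id(port_id);

	return rte_eth_tx_queue_setup(port_id, 0, nb_desc, socket, NULL);
}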
[Patch v4 10/17] net/mana: implement memory registration
From: Long Li MANA hardware has iommu built-in, that provides hardware safe access to user memory through memory registration. Since memory registration is an expensive operation, this patch implements a two level memory registration cache mechanisum for each queue and for each port. Signed-off-by: Long Li --- Change log: v2: Change all header file functions to start with mana_. Use spinlock in place of rwlock to memory cache access. Remove unused header files. v4: Remove extra "\n" in logging function. drivers/net/mana/mana.c | 20 +++ drivers/net/mana/mana.h | 39 + drivers/net/mana/meson.build | 1 + drivers/net/mana/mp.c| 85 + drivers/net/mana/mr.c| 324 +++ 5 files changed, 469 insertions(+) create mode 100644 drivers/net/mana/mr.c diff --git a/drivers/net/mana/mana.c b/drivers/net/mana/mana.c index 490686f404..d18cc4ab0e 100644 --- a/drivers/net/mana/mana.c +++ b/drivers/net/mana/mana.c @@ -103,6 +103,8 @@ mana_dev_close(struct rte_eth_dev *dev) struct mana_priv *priv = dev->data->dev_private; int ret; + mana_remove_all_mr(priv); + ret = mana_intr_uninstall(priv); if (ret) return ret; @@ -317,6 +319,13 @@ static int mana_dev_tx_queue_setup(struct rte_eth_dev *dev, goto fail; } + ret = mana_mr_btree_init(&txq->mr_btree, +MANA_MR_BTREE_PER_QUEUE_N, socket_id); + if (ret) { + DRV_LOG(ERR, "Failed to init TXQ MR btree"); + goto fail; + } + DRV_LOG(DEBUG, "idx %u nb_desc %u socket %u txq->desc_ring %p", queue_idx, nb_desc, socket_id, txq->desc_ring); @@ -338,6 +347,8 @@ static void mana_dev_tx_queue_release(struct rte_eth_dev *dev, uint16_t qid) { struct mana_txq *txq = dev->data->tx_queues[qid]; + mana_mr_btree_free(&txq->mr_btree); + rte_free(txq->desc_ring); rte_free(txq); } @@ -374,6 +385,13 @@ static int mana_dev_rx_queue_setup(struct rte_eth_dev *dev, goto fail; } + ret = mana_mr_btree_init(&rxq->mr_btree, +MANA_MR_BTREE_PER_QUEUE_N, socket_id); + if (ret) { + DRV_LOG(ERR, "Failed to init RXQ MR btree"); + goto fail; + } + rxq->num_desc = nb_desc; rxq->priv = priv; @@ -393,6 +411,8 @@ static void mana_dev_rx_queue_release(struct rte_eth_dev *dev, uint16_t qid) { struct mana_rxq *rxq = dev->data->rx_queues[qid]; + mana_mr_btree_free(&rxq->mr_btree); + rte_free(rxq->desc_ring); rte_free(rxq); } diff --git a/drivers/net/mana/mana.h b/drivers/net/mana/mana.h index 33f68b3d1b..9e15b43275 100644 --- a/drivers/net/mana/mana.h +++ b/drivers/net/mana/mana.h @@ -50,6 +50,22 @@ struct mana_shared_data { #define MAX_RECEIVE_BUFFERS_PER_QUEUE 256 #define MAX_SEND_BUFFERS_PER_QUEUE 256 +struct mana_mr_cache { + uint32_tlkey; + uintptr_t addr; + size_t len; + void*verb_obj; +}; + +#define MANA_MR_BTREE_CACHE_N 512 +struct mana_mr_btree { + uint16_tlen;/* Used entries */ + uint16_tsize; /* Total entries */ + int overflow; + int socket; + struct mana_mr_cache *table; +}; + struct mana_process_priv { void *db_page; }; @@ -82,6 +98,8 @@ struct mana_priv { int max_recv_sge; int max_mr; uint64_t max_mr_size; + struct mana_mr_btree mr_btree; + rte_spinlock_t mr_btree_lock; }; struct mana_txq_desc { @@ -131,6 +149,7 @@ struct mana_txq { uint32_t desc_ring_head, desc_ring_tail; struct mana_stats stats; + struct mana_mr_btree mr_btree; unsigned int socket; }; @@ -153,6 +172,7 @@ struct mana_rxq { struct mana_gdma_queue gdma_cq; struct mana_stats stats; + struct mana_mr_btree mr_btree; unsigned int socket; }; @@ -176,6 +196,24 @@ uint16_t mana_rx_burst_removed(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t mana_tx_burst_removed(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n); +struct mana_mr_cache 
*mana_find_pmd_mr(struct mana_mr_btree *local_tree, + struct mana_priv *priv, + struct rte_mbuf *mbuf); +int mana_new_pmd_mr(struct mana_mr_btree *local_tree, struct mana_priv *priv, + struct rte_mempool *pool); +void mana_remove_all_mr(struct mana_priv *priv); +void mana_del_pmd_mr(struct mana_mr_cache *mr); + +void mana_mempool_chunk_cb(struct rte_mempool *mp, void *opaque, + struct rte_mempool_memhdr *memhdr, unsigned int idx); + +struct mana_mr_cache *mana_mr_btree_lookup(struct mana_mr_btree *bt, + uint16_t *idx,
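[Not part of the patch: the cache described above keeps a sorted table of registered memory ranges per queue, plus a spinlock-protected table per port, and registers a new MR on a miss. The self-contained sketch below shows one plausible lookup over such a table — non-overlapping [addr, addr+len) entries searched by binary search. Names, types, and the exact search strategy are illustrative, not taken from the driver.]

#include <stddef.h>
#include <stdint.h>

struct mr_entry {
	uintptr_t addr;		/* start of the registered range */
	size_t    len;		/* length of the registered range */
	uint32_t  lkey;		/* key handed to the hardware */
};

/* Binary search over a table sorted by addr; returns the entry covering
 * addr, or NULL on a cache miss (caller would register a new MR).
 */
static const struct mr_entry *
mr_cache_lookup(const struct mr_entry *tbl, size_t n, uintptr_t addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < tbl[mid].addr)
			hi = mid;
		else if (addr >= tbl[mid].addr + tbl[mid].len)
			lo = mid + 1;
		else
			return &tbl[mid];	/* addr falls inside this MR */
	}
	return NULL;
}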
[Patch v4 11/17] net/mana: implement the hardware layer operations
From: Long Li The hardware layer of MANA understands the device queue and doorbell formats. Those functions are implemented for use by packet RX/TX code. Signed-off-by: Long Li --- Change log: v2: Remove unused header files. Rename a camel case. drivers/net/mana/gdma.c | 284 +++ drivers/net/mana/mana.h | 183 ++ drivers/net/mana/meson.build | 1 + 3 files changed, 468 insertions(+) create mode 100644 drivers/net/mana/gdma.c diff --git a/drivers/net/mana/gdma.c b/drivers/net/mana/gdma.c new file mode 100644 index 00..077ac7744b --- /dev/null +++ b/drivers/net/mana/gdma.c @@ -0,0 +1,284 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright 2022 Microsoft Corporation + */ + +#include +#include + +#include "mana.h" + +uint8_t *gdma_get_wqe_pointer(struct mana_gdma_queue *queue) +{ + uint32_t offset_in_bytes = + (queue->head * GDMA_WQE_ALIGNMENT_UNIT_SIZE) & + (queue->size - 1); + + DRV_LOG(DEBUG, "txq sq_head %u sq_size %u offset_in_bytes %u", + queue->head, queue->size, offset_in_bytes); + + if (offset_in_bytes + GDMA_WQE_ALIGNMENT_UNIT_SIZE > queue->size) + DRV_LOG(ERR, "fatal error: offset_in_bytes %u too big", + offset_in_bytes); + + return ((uint8_t *)queue->buffer) + offset_in_bytes; +} + +static uint32_t +write_dma_client_oob(uint8_t *work_queue_buffer_pointer, +const struct gdma_work_request *work_request, +uint32_t client_oob_size) +{ + uint8_t *p = work_queue_buffer_pointer; + + struct gdma_wqe_dma_oob *header = (struct gdma_wqe_dma_oob *)p; + + memset(header, 0, sizeof(struct gdma_wqe_dma_oob)); + header->num_sgl_entries = work_request->num_sgl_elements; + header->inline_client_oob_size_in_dwords = + client_oob_size / sizeof(uint32_t); + header->client_data_unit = work_request->client_data_unit; + + DRV_LOG(DEBUG, "queue buf %p sgl %u oob_h %u du %u oob_buf %p oob_b %u", + work_queue_buffer_pointer, header->num_sgl_entries, + header->inline_client_oob_size_in_dwords, + header->client_data_unit, work_request->inline_oob_data, + work_request->inline_oob_size_in_bytes); + + p += sizeof(struct gdma_wqe_dma_oob); + if (work_request->inline_oob_data && + work_request->inline_oob_size_in_bytes > 0) { + memcpy(p, work_request->inline_oob_data, + work_request->inline_oob_size_in_bytes); + if (client_oob_size > work_request->inline_oob_size_in_bytes) + memset(p + work_request->inline_oob_size_in_bytes, 0, + client_oob_size - + work_request->inline_oob_size_in_bytes); + } + + return sizeof(struct gdma_wqe_dma_oob) + client_oob_size; +} + +static uint32_t +write_scatter_gather_list(uint8_t *work_queue_head_pointer, + uint8_t *work_queue_end_pointer, + uint8_t *work_queue_cur_pointer, + struct gdma_work_request *work_request) +{ + struct gdma_sgl_element *sge_list; + struct gdma_sgl_element dummy_sgl[1]; + uint8_t *address; + uint32_t size; + uint32_t num_sge; + uint32_t size_to_queue_end; + uint32_t sge_list_size; + + DRV_LOG(DEBUG, "work_queue_cur_pointer %p work_request->flags %x", + work_queue_cur_pointer, work_request->flags); + + num_sge = work_request->num_sgl_elements; + sge_list = work_request->sgl; + size_to_queue_end = (uint32_t)(work_queue_end_pointer - + work_queue_cur_pointer); + + if (num_sge == 0) { + /* Per spec, the case of an empty SGL should be handled as +* follows to avoid corrupted WQE errors: +* Write one dummy SGL entry +* Set the address to 1, leave the rest as 0 +*/ + dummy_sgl[num_sge].address = 1; + dummy_sgl[num_sge].size = 0; + dummy_sgl[num_sge].memory_key = 0; + num_sge++; + sge_list = dummy_sgl; + } + + sge_list_size = 0; + { + address = (uint8_t *)sge_list; 
+ size = sizeof(struct gdma_sgl_element) * num_sge; + if (size_to_queue_end < size) { + memcpy(work_queue_cur_pointer, address, + size_to_queue_end); + work_queue_cur_pointer = work_queue_head_pointer; + address += size_to_queue_end; + size -= size_to_queue_end; + } + + memcpy(work_queue_cur_pointer, address, size); + sge_list_size = size; + } + + DRV_LOG(DEBUG, "sge
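[Not part of the patch: one detail worth calling out in write_scatter_gather_list() above is that the work queue is a circular buffer, so an SGL that does not fit before the end of the buffer is split and the remainder continues at the head. A self-contained sketch of that wrap-around copy, assuming a power-of-two queue size as implied by the offset masking in gdma_get_wqe_pointer(); names are illustrative.]

#include <stdint.h>
#include <string.h>

/* Copy len bytes into a circular buffer starting at offset, wrapping to the
 * head of the ring when the end of the buffer is reached.
 */
static void
ring_copy(uint8_t *ring, uint32_t queue_size, uint32_t offset,
	  const uint8_t *src, uint32_t len)
{
	uint32_t to_end = queue_size - offset;

	if (len > to_end) {
		memcpy(ring + offset, src, to_end);
		src += to_end;
		len -= to_end;
		offset = 0;	/* continue from the start of the ring */
	}
	memcpy(ring + offset, src, len);
}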
[Patch v4 12/17] net/mana: add function to start/stop TX queues
From: Long Li MANA allocate device queues through the IB layer when starting TX queues. When device is stopped all the queues are unmapped and freed. Signed-off-by: Long Li --- Change log: v2: Add prefix mana_ to all function names. Remove unused header files. doc/guides/nics/features/mana.ini | 1 + drivers/net/mana/mana.h | 4 + drivers/net/mana/meson.build | 1 + drivers/net/mana/tx.c | 157 ++ 4 files changed, 163 insertions(+) create mode 100644 drivers/net/mana/tx.c diff --git a/doc/guides/nics/features/mana.ini b/doc/guides/nics/features/mana.ini index a59c21cc10..821443b292 100644 --- a/doc/guides/nics/features/mana.ini +++ b/doc/guides/nics/features/mana.ini @@ -7,6 +7,7 @@ Link status = P Linux= Y Multiprocess aware = Y +Queue start/stop = Y Removal event= Y RSS hash = Y Speed capabilities = P diff --git a/drivers/net/mana/mana.h b/drivers/net/mana/mana.h index d87358ab15..3613ba7ca2 100644 --- a/drivers/net/mana/mana.h +++ b/drivers/net/mana/mana.h @@ -379,6 +379,10 @@ uint16_t mana_tx_burst_removed(void *dpdk_rxq, struct rte_mbuf **pkts, int gdma_poll_completion_queue(struct mana_gdma_queue *cq, struct gdma_comp *comp); +int mana_start_tx_queues(struct rte_eth_dev *dev); + +int mana_stop_tx_queues(struct rte_eth_dev *dev); + struct mana_mr_cache *mana_find_pmd_mr(struct mana_mr_btree *local_tree, struct mana_priv *priv, struct rte_mbuf *mbuf); diff --git a/drivers/net/mana/meson.build b/drivers/net/mana/meson.build index 364d57a619..031f443d16 100644 --- a/drivers/net/mana/meson.build +++ b/drivers/net/mana/meson.build @@ -11,6 +11,7 @@ deps += ['pci', 'bus_pci', 'net', 'eal', 'kvargs'] sources += files( 'mana.c', + 'tx.c', 'mr.c', 'gdma.c', 'mp.c', diff --git a/drivers/net/mana/tx.c b/drivers/net/mana/tx.c new file mode 100644 index 00..db7859c8c4 --- /dev/null +++ b/drivers/net/mana/tx.c @@ -0,0 +1,157 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright 2022 Microsoft Corporation + */ + +#include + +#include +#include + +#include "mana.h" + +int mana_stop_tx_queues(struct rte_eth_dev *dev) +{ + struct mana_priv *priv = dev->data->dev_private; + int i; + + for (i = 0; i < priv->num_queues; i++) { + struct mana_txq *txq = dev->data->tx_queues[i]; + + if (txq->qp) { + ibv_destroy_qp(txq->qp); + txq->qp = NULL; + } + + if (txq->cq) { + ibv_destroy_cq(txq->cq); + txq->cq = NULL; + } + + /* Drain and free posted WQEs */ + while (txq->desc_ring_tail != txq->desc_ring_head) { + struct mana_txq_desc *desc = + &txq->desc_ring[txq->desc_ring_tail]; + + rte_pktmbuf_free(desc->pkt); + + txq->desc_ring_tail = + (txq->desc_ring_tail + 1) % txq->num_desc; + } + txq->desc_ring_head = 0; + txq->desc_ring_tail = 0; + + memset(&txq->gdma_sq, 0, sizeof(txq->gdma_sq)); + memset(&txq->gdma_cq, 0, sizeof(txq->gdma_cq)); + } + + return 0; +} + +int mana_start_tx_queues(struct rte_eth_dev *dev) +{ + struct mana_priv *priv = dev->data->dev_private; + int ret, i; + + /* start TX queues */ + for (i = 0; i < priv->num_queues; i++) { + struct mana_txq *txq; + struct ibv_qp_init_attr qp_attr = { 0 }; + struct manadv_obj obj = {}; + struct manadv_qp dv_qp; + struct manadv_cq dv_cq; + + txq = dev->data->tx_queues[i]; + + manadv_set_context_attr(priv->ib_ctx, + MANADV_CTX_ATTR_BUF_ALLOCATORS, + (void *)((uintptr_t)&(struct manadv_ctx_allocators){ + .alloc = &mana_alloc_verbs_buf, + .free = &mana_free_verbs_buf, + .data = (void *)(uintptr_t)txq->socket, + })); + + txq->cq = ibv_create_cq(priv->ib_ctx, txq->num_desc, + NULL, NULL, 0); + if (!txq->cq) { + DRV_LOG(ERR, "failed to create cq queue index %d", i); 
+ ret = -errno; + goto fail; + } + + qp_attr.send_cq = txq->cq; + qp_attr.recv_cq = txq->cq; + qp_attr.cap.max_send_wr = txq->num_desc; + qp_attr.cap.max_send_sge = priv->max_send_sge; + + /* Skip setting qp_attr.cap.max_inline_data */ + +
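[Not part of the patch: the stop path above walks the shadow ring from tail (oldest posted descriptor) to head (next free slot) and releases every mbuf still held by the queue before the state is reset. A minimal self-contained illustration of that drain; the struct and field names are illustrative, not the driver's.]

#include <stdint.h>
#include <rte_mbuf.h>

struct txq_sketch {
	struct rte_mbuf **pkts;		/* shadow ring of posted mbufs */
	uint32_t head, tail, num_desc;
};

/* Free every mbuf still pending between tail and head, then reset indexes. */
static void
drain_pending(struct txq_sketch *q)
{
	while (q->tail != q->head) {
		rte_pktmbuf_free(q->pkts[q->tail]);
		q->tail = (q->tail + 1) % q->num_desc;
	}
	q->head = 0;
	q->tail = 0;
}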
[Patch v4 13/17] net/mana: add function to start/stop RX queues
From: Long Li MANA allocates device queues through the IB layer when starting RX queues. When device is stopped all the queues are unmapped and freed. Signed-off-by: Long Li --- Change log: v2: Add prefix mana_ to all function names. Remove unused header files. v4: Move defition "uint32_t i" from inside "for ()" to outside drivers/net/mana/mana.h | 3 + drivers/net/mana/meson.build | 1 + drivers/net/mana/rx.c| 346 +++ 3 files changed, 350 insertions(+) create mode 100644 drivers/net/mana/rx.c diff --git a/drivers/net/mana/mana.h b/drivers/net/mana/mana.h index 3613ba7ca2..dc808d363f 100644 --- a/drivers/net/mana/mana.h +++ b/drivers/net/mana/mana.h @@ -364,6 +364,7 @@ extern int mana_logtype_init; int mana_ring_doorbell(void *db_page, enum gdma_queue_types queue_type, uint32_t queue_id, uint32_t tail); +int mana_rq_ring_doorbell(struct mana_rxq *rxq); int gdma_post_work_request(struct mana_gdma_queue *queue, struct gdma_work_request *work_req, @@ -379,8 +380,10 @@ uint16_t mana_tx_burst_removed(void *dpdk_rxq, struct rte_mbuf **pkts, int gdma_poll_completion_queue(struct mana_gdma_queue *cq, struct gdma_comp *comp); +int mana_start_rx_queues(struct rte_eth_dev *dev); int mana_start_tx_queues(struct rte_eth_dev *dev); +int mana_stop_rx_queues(struct rte_eth_dev *dev); int mana_stop_tx_queues(struct rte_eth_dev *dev); struct mana_mr_cache *mana_find_pmd_mr(struct mana_mr_btree *local_tree, diff --git a/drivers/net/mana/meson.build b/drivers/net/mana/meson.build index 031f443d16..62e103a510 100644 --- a/drivers/net/mana/meson.build +++ b/drivers/net/mana/meson.build @@ -11,6 +11,7 @@ deps += ['pci', 'bus_pci', 'net', 'eal', 'kvargs'] sources += files( 'mana.c', + 'rx.c', 'tx.c', 'mr.c', 'gdma.c', diff --git a/drivers/net/mana/rx.c b/drivers/net/mana/rx.c new file mode 100644 index 00..41d0fc9f11 --- /dev/null +++ b/drivers/net/mana/rx.c @@ -0,0 +1,346 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright 2022 Microsoft Corporation + */ +#include + +#include +#include + +#include "mana.h" + +static uint8_t mana_rss_hash_key_default[TOEPLITZ_HASH_KEY_SIZE_IN_BYTES] = { + 0x2c, 0xc6, 0x81, 0xd1, + 0x5b, 0xdb, 0xf4, 0xf7, + 0xfc, 0xa2, 0x83, 0x19, + 0xdb, 0x1a, 0x3e, 0x94, + 0x6b, 0x9e, 0x38, 0xd9, + 0x2c, 0x9c, 0x03, 0xd1, + 0xad, 0x99, 0x44, 0xa7, + 0xd9, 0x56, 0x3d, 0x59, + 0x06, 0x3c, 0x25, 0xf3, + 0xfc, 0x1f, 0xdc, 0x2a, +}; + +int mana_rq_ring_doorbell(struct mana_rxq *rxq) +{ + struct mana_priv *priv = rxq->priv; + int ret; + void *db_page = priv->db_page; + + if (rte_eal_process_type() == RTE_PROC_SECONDARY) { + struct rte_eth_dev *dev = + &rte_eth_devices[priv->dev_data->port_id]; + struct mana_process_priv *process_priv = dev->process_private; + + db_page = process_priv->db_page; + } + + ret = mana_ring_doorbell(db_page, gdma_queue_receive, +rxq->gdma_rq.id, +rxq->gdma_rq.head * + GDMA_WQE_ALIGNMENT_UNIT_SIZE); + + if (ret) + DRV_LOG(ERR, "failed to ring RX doorbell ret %d", ret); + + return ret; +} + +static int mana_alloc_and_post_rx_wqe(struct mana_rxq *rxq) +{ + struct rte_mbuf *mbuf = NULL; + struct gdma_sgl_element sgl[1]; + struct gdma_work_request request = {0}; + struct gdma_posted_wqe_info wqe_info = {0}; + struct mana_priv *priv = rxq->priv; + int ret; + struct mana_mr_cache *mr; + + mbuf = rte_pktmbuf_alloc(rxq->mp); + if (!mbuf) { + rxq->stats.nombuf++; + return -ENOMEM; + } + + mr = mana_find_pmd_mr(&rxq->mr_btree, priv, mbuf); + if (!mr) { + DRV_LOG(ERR, "failed to register RX MR"); + rte_pktmbuf_free(mbuf); + return -ENOMEM; + } + + request.gdma_header.struct_size = 
sizeof(request); + wqe_info.gdma_header.struct_size = sizeof(wqe_info); + + sgl[0].address = rte_cpu_to_le_64(rte_pktmbuf_mtod(mbuf, uint64_t)); + sgl[0].memory_key = mr->lkey; + sgl[0].size = + rte_pktmbuf_data_room_size(rxq->mp) - + RTE_PKTMBUF_HEADROOM; + + request.sgl = sgl; + request.num_sgl_elements = 1; + request.inline_oob_data = NULL; + request.inline_oob_size_in_bytes = 0; + request.flags = 0; + request.client_data_unit = NOT_USING_CLIENT_DATA_UNIT; + + ret = gdma_post_work_request(&rxq->gdma_rq, &request, &wqe_info); + if (!ret) { + struct mana_rxq_desc *desc = + &rxq->desc_ring[rxq->desc_ring_head]; + + /* update queue for
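[Not part of the patch: each posted receive WQE above describes the writable area of one mbuf (data room minus headroom) as a single scatter-gather element whose memory key comes from the MR cache lookup. A simplified sketch of that per-buffer setup; the sge struct and helper name are illustrative, not the driver's types.]

#include <errno.h>
#include <stdint.h>
#include <rte_mbuf.h>

struct sge_sketch {
	uint64_t address;	/* buffer address the hardware writes into */
	uint32_t length;	/* usable bytes: data room minus headroom */
	uint32_t lkey;		/* memory key for this mempool's registration */
};

/* Hypothetical helper: allocate one mbuf and describe it as a single SGE. */
static int
build_rx_sge(struct rte_mempool *mp, uint32_t lkey,
	     struct sge_sketch *sge, struct rte_mbuf **out)
{
	struct rte_mbuf *m = rte_pktmbuf_alloc(mp);

	if (m == NULL)
		return -ENOMEM;

	sge->address = (uint64_t)(uintptr_t)rte_pktmbuf_mtod(m, void *);
	sge->length = rte_pktmbuf_data_room_size(mp) - RTE_PKTMBUF_HEADROOM;
	sge->lkey = lkey;
	*out = m;
	return 0;
}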
[Patch v4 14/17] net/mana: add function to receive packets
From: Long Li With all the RX queues created, MANA can use those queues to receive packets. Signed-off-by: Long Li --- Change log: v2: Add mana_ to all function names. Rename a camel case. doc/guides/nics/features/mana.ini | 2 + drivers/net/mana/mana.c | 2 + drivers/net/mana/mana.h | 37 +++ drivers/net/mana/mp.c | 2 + drivers/net/mana/rx.c | 104 ++ 5 files changed, 147 insertions(+) diff --git a/doc/guides/nics/features/mana.ini b/doc/guides/nics/features/mana.ini index 821443b292..fdbf22d335 100644 --- a/doc/guides/nics/features/mana.ini +++ b/doc/guides/nics/features/mana.ini @@ -6,6 +6,8 @@ [Features] Link status = P Linux= Y +L3 checksum offload = Y +L4 checksum offload = Y Multiprocess aware = Y Queue start/stop = Y Removal event= Y diff --git a/drivers/net/mana/mana.c b/drivers/net/mana/mana.c index d18cc4ab0e..c349822991 100644 --- a/drivers/net/mana/mana.c +++ b/drivers/net/mana/mana.c @@ -950,6 +950,8 @@ static int mana_pci_probe_mac(struct rte_pci_driver *pci_drv __rte_unused, /* fd is no not used after mapping doorbell */ close(fd); + eth_dev->rx_pkt_burst = mana_rx_burst; + rte_spinlock_lock(&mana_shared_data->lock); mana_shared_data->secondary_cnt++; mana_local_data.secondary_cnt++; diff --git a/drivers/net/mana/mana.h b/drivers/net/mana/mana.h index dc808d363f..bafc4d6082 100644 --- a/drivers/net/mana/mana.h +++ b/drivers/net/mana/mana.h @@ -178,6 +178,11 @@ struct gdma_work_request { enum mana_cqe_type { CQE_INVALID = 0, + + CQE_RX_OKAY = 1, + CQE_RX_COALESCED_4 = 2, + CQE_RX_OBJECT_FENCE = 3, + CQE_RX_TRUNCATED= 4, }; struct mana_cqe_header { @@ -203,6 +208,35 @@ struct mana_cqe_header { (NDIS_HASH_TCP_IPV4 | NDIS_HASH_UDP_IPV4 | NDIS_HASH_TCP_IPV6 | \ NDIS_HASH_UDP_IPV6 | NDIS_HASH_TCP_IPV6_EX | NDIS_HASH_UDP_IPV6_EX) +struct mana_rx_comp_per_packet_info { + uint32_t packet_length : 16; + uint32_t reserved0 : 16; + uint32_t reserved1; + uint32_t packet_hash; +}; /* HW DATA */ +#define RX_COM_OOB_NUM_PACKETINFO_SEGMENTS 4 + +struct mana_rx_comp_oob { + struct mana_cqe_header cqe_hdr; + + uint32_t rx_vlan_id : 12; + uint32_t rx_vlan_tag_present: 1; + uint32_t rx_outer_ip_header_checksum_succeeded : 1; + uint32_t rx_outer_ip_header_checksum_failed : 1; + uint32_t reserved : 1; + uint32_t rx_hash_type : 9; + uint32_t rx_ip_header_checksum_succeeded: 1; + uint32_t rx_ip_header_checksum_failed : 1; + uint32_t rx_tcp_checksum_succeeded : 1; + uint32_t rx_tcp_checksum_failed : 1; + uint32_t rx_udp_checksum_succeeded : 1; + uint32_t rx_udp_checksum_failed : 1; + uint32_t reserved1 : 1; + struct mana_rx_comp_per_packet_info + packet_info[RX_COM_OOB_NUM_PACKETINFO_SEGMENTS]; + uint32_t received_wqe_offset; +}; /* HW DATA */ + struct gdma_wqe_dma_oob { uint32_t reserved:24; uint32_t last_v_bytes:8; @@ -371,6 +405,9 @@ int gdma_post_work_request(struct mana_gdma_queue *queue, struct gdma_posted_wqe_info *wqe_info); uint8_t *gdma_get_wqe_pointer(struct mana_gdma_queue *queue); +uint16_t mana_rx_burst(void *dpdk_rxq, struct rte_mbuf **rx_pkts, + uint16_t pkts_n); + uint16_t mana_rx_burst_removed(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n); diff --git a/drivers/net/mana/mp.c b/drivers/net/mana/mp.c index f4f78d2787..36a88c561a 100644 --- a/drivers/net/mana/mp.c +++ b/drivers/net/mana/mp.c @@ -138,6 +138,8 @@ static int mana_mp_secondary_handle(const struct rte_mp_msg *mp_msg, case MANA_MP_REQ_START_RXTX: DRV_LOG(INFO, "Port %u starting datapath", dev->data->port_id); + dev->rx_pkt_burst = mana_rx_burst; + rte_mb(); res->result = 0; diff --git a/drivers/net/mana/rx.c 
b/drivers/net/mana/rx.c index 41d0fc9f11..f2573a6d06 100644 --- a/drivers/net/mana/rx.c +++ b/drivers/net/mana/rx.c @@ -344,3 +344,107 @@ int mana_start_rx_queues(struct rte_eth_dev *dev) mana_stop_rx_queues(dev); return ret; } + +uint16_t mana_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n) +{ + uint16_t pkt_received = 0, cqe_processed = 0; + struct mana_rxq *rxq = dpdk_rxq; + struc
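[Not part of the patch: once the port is started, the burst function registered above is reached through the generic rte_eth_rx_burst() API, and the checksum results carried in the RX completion OOB are expected to surface as mbuf ol_flags. A hedged application-side sketch.]

#include <rte_common.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hypothetical poll: drain up to one burst from queue 0 and drop the
 * packets. A real application would inspect ol_flags (e.g.
 * RTE_MBUF_F_RX_IP_CKSUM_BAD / RTE_MBUF_F_RX_L4_CKSUM_BAD) before freeing.
 */
static uint16_t
poll_once(uint16_t port_id)
{
	struct rte_mbuf *pkts[32];
	uint16_t i, n;

	n = rte_eth_rx_burst(port_id, 0, pkts, RTE_DIM(pkts));
	for (i = 0; i < n; i++)
		rte_pktmbuf_free(pkts[i]);

	return n;
}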
[Patch v4 15/17] net/mana: add function to send packets
From: Long Li With all the TX queues created, MANA can send packets over those queues. Signed-off-by: Long Li --- Change log: v2: Rename all camel cases. doc/guides/nics/features/mana.ini | 1 + drivers/net/mana/mana.c | 1 + drivers/net/mana/mana.h | 65 drivers/net/mana/mp.c | 1 + drivers/net/mana/tx.c | 241 ++ 5 files changed, 309 insertions(+) diff --git a/doc/guides/nics/features/mana.ini b/doc/guides/nics/features/mana.ini index fdbf22d335..7922816d66 100644 --- a/doc/guides/nics/features/mana.ini +++ b/doc/guides/nics/features/mana.ini @@ -4,6 +4,7 @@ ; Refer to default.ini for the full list of available PMD features. ; [Features] +Free Tx mbuf on demand = Y Link status = P Linux= Y L3 checksum offload = Y diff --git a/drivers/net/mana/mana.c b/drivers/net/mana/mana.c index c349822991..0dcd3f3124 100644 --- a/drivers/net/mana/mana.c +++ b/drivers/net/mana/mana.c @@ -950,6 +950,7 @@ static int mana_pci_probe_mac(struct rte_pci_driver *pci_drv __rte_unused, /* fd is no not used after mapping doorbell */ close(fd); + eth_dev->tx_pkt_burst = mana_tx_burst; eth_dev->rx_pkt_burst = mana_rx_burst; rte_spinlock_lock(&mana_shared_data->lock); diff --git a/drivers/net/mana/mana.h b/drivers/net/mana/mana.h index bafc4d6082..b4056bd50b 100644 --- a/drivers/net/mana/mana.h +++ b/drivers/net/mana/mana.h @@ -62,6 +62,47 @@ struct mana_shared_data { #define NOT_USING_CLIENT_DATA_UNIT 0 +enum tx_packet_format_v2 { + short_packet_format = 0, + long_packet_format = 1 +}; + +struct transmit_short_oob_v2 { + enum tx_packet_format_v2 packet_format : 2; + uint32_t tx_is_outer_ipv4 : 1; + uint32_t tx_is_outer_ipv6 : 1; + uint32_t tx_compute_IP_header_checksum : 1; + uint32_t tx_compute_TCP_checksum : 1; + uint32_t tx_compute_UDP_checksum : 1; + uint32_t suppress_tx_CQE_generation : 1; + uint32_t VCQ_number : 24; + uint32_t tx_transport_header_offset : 10; + uint32_t VSQ_frame_num : 14; + uint32_t short_vport_offset : 8; +}; + +struct transmit_long_oob_v2 { + uint32_t tx_is_encapsulated_packet : 1; + uint32_t tx_inner_is_ipv6 : 1; + uint32_t tx_inner_TCP_options_present : 1; + uint32_t inject_vlan_prior_tag : 1; + uint32_t reserved1 : 12; + uint32_t priority_code_point : 3; + uint32_t drop_eligible_indicator : 1; + uint32_t vlan_identifier : 12; + uint32_t tx_inner_frame_offset : 10; + uint32_t tx_inner_IP_header_relative_offset : 6; + uint32_t long_vport_offset : 12; + uint32_t reserved3 : 4; + uint32_t reserved4 : 32; + uint32_t reserved5 : 32; +}; + +struct transmit_oob_v2 { + struct transmit_short_oob_v2 short_oob; + struct transmit_long_oob_v2 long_oob; +}; + enum gdma_queue_types { gdma_queue_type_invalid = 0, gdma_queue_send, @@ -183,6 +224,17 @@ enum mana_cqe_type { CQE_RX_COALESCED_4 = 2, CQE_RX_OBJECT_FENCE = 3, CQE_RX_TRUNCATED= 4, + + CQE_TX_OKAY = 32, + CQE_TX_SA_DROP = 33, + CQE_TX_MTU_DROP = 34, + CQE_TX_INVALID_OOB = 35, + CQE_TX_INVALID_ETH_TYPE = 36, + CQE_TX_HDR_PROCESSING_ERROR = 37, + CQE_TX_VF_DISABLED = 38, + CQE_TX_VPORT_IDX_OUT_OF_RANGE = 39, + CQE_TX_VPORT_DISABLED = 40, + CQE_TX_VLAN_TAGGING_VIOLATION = 41, }; struct mana_cqe_header { @@ -191,6 +243,17 @@ struct mana_cqe_header { uint32_t vendor_err : 24; }; /* HW DATA */ +struct mana_tx_comp_oob { + struct mana_cqe_header cqe_hdr; + + uint32_t tx_data_offset; + + uint32_t tx_sgl_offset : 5; + uint32_t tx_wqe_offset : 27; + + uint32_t reserved[12]; +}; /* HW DATA */ + /* NDIS HASH Types */ #define BIT(nr)(1 << (nr)) #define NDIS_HASH_IPV4 BIT(0) @@ -407,6 +470,8 @@ uint8_t *gdma_get_wqe_pointer(struct mana_gdma_queue *queue); 
uint16_t mana_rx_burst(void *dpdk_rxq, struct rte_mbuf **rx_pkts, uint16_t pkts_n); +uint16_t mana_tx_burst(void *dpdk_txq, struct rte_mbuf **tx_pkts, + uint16_t pkts_n); uint16_t mana_rx_burst_removed(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n); diff --git a/drivers/net/mana/mp.c b/drivers/net/mana/mp.c index 36a88c561a..da9c0f36a1 100644 --- a/drivers/net/mana/mp.c +++ b/drivers/net/mana/mp.c @@ -138,6 +138,7 @@ static int mana_mp_secondary_handle(const struct rte_mp_msg *mp_msg, case MANA_MP_REQ_START_RXTX: DRV_LOG(
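[Not part of the patch: the transmit side mirrors the RX sketch — the application hands a burst to the PMD through rte_eth_tx_burst() and stays responsible for any mbufs the driver could not queue. Helper name hypothetical.]

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hypothetical helper: send a burst on queue 0 and free whatever was not
 * accepted (for example because the send queue was full).
 */
static uint16_t
xmit_burst(uint16_t port_id, struct rte_mbuf **pkts, uint16_t n)
{
	uint16_t sent = rte_eth_tx_burst(port_id, 0, pkts, n);
	uint16_t i;

	for (i = sent; i < n; i++)
		rte_pktmbuf_free(pkts[i]);

	return sent;
}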
[Patch v4 16/17] net/mana: add function to start/stop device
From: Long Li Add support for starting/stopping the device. Signed-off-by: Long Li --- Change log: v2: Use spinlock for memory registration cache. Add prefix mana_ to all function names. drivers/net/mana/mana.c | 70 + 1 file changed, 70 insertions(+) diff --git a/drivers/net/mana/mana.c b/drivers/net/mana/mana.c index 0dcd3f3124..119fdfd8bc 100644 --- a/drivers/net/mana/mana.c +++ b/drivers/net/mana/mana.c @@ -97,6 +97,74 @@ static int mana_dev_configure(struct rte_eth_dev *dev) static int mana_intr_uninstall(struct mana_priv *priv); +static int +mana_dev_start(struct rte_eth_dev *dev) +{ + int ret; + struct mana_priv *priv = dev->data->dev_private; + + rte_spinlock_init(&priv->mr_btree_lock); + ret = mana_mr_btree_init(&priv->mr_btree, MANA_MR_BTREE_CACHE_N, +dev->device->numa_node); + if (ret) { + DRV_LOG(ERR, "Failed to init device MR btree %d", ret); + return ret; + } + + ret = mana_start_tx_queues(dev); + if (ret) { + DRV_LOG(ERR, "failed to start tx queues %d", ret); + return ret; + } + + ret = mana_start_rx_queues(dev); + if (ret) { + DRV_LOG(ERR, "failed to start rx queues %d", ret); + mana_stop_tx_queues(dev); + return ret; + } + + rte_wmb(); + + dev->tx_pkt_burst = mana_tx_burst; + dev->rx_pkt_burst = mana_rx_burst; + + DRV_LOG(INFO, "TX/RX queues have started"); + + /* Enable datapath for secondary processes */ + mana_mp_req_on_rxtx(dev, MANA_MP_REQ_START_RXTX); + + return 0; +} + +static int +mana_dev_stop(struct rte_eth_dev *dev __rte_unused) +{ + int ret; + + dev->tx_pkt_burst = mana_tx_burst_removed; + dev->rx_pkt_burst = mana_rx_burst_removed; + + /* Stop datapath on secondary processes */ + mana_mp_req_on_rxtx(dev, MANA_MP_REQ_STOP_RXTX); + + rte_wmb(); + + ret = mana_stop_tx_queues(dev); + if (ret) { + DRV_LOG(ERR, "failed to stop tx queues"); + return ret; + } + + ret = mana_stop_rx_queues(dev); + if (ret) { + DRV_LOG(ERR, "failed to stop tx queues"); + return ret; + } + + return 0; +} + static int mana_dev_close(struct rte_eth_dev *dev) { @@ -435,6 +503,8 @@ static int mana_dev_link_update(struct rte_eth_dev *dev, const struct eth_dev_ops mana_dev_ops = { .dev_configure = mana_dev_configure, + .dev_start = mana_dev_start, + .dev_stop = mana_dev_stop, .dev_close = mana_dev_close, .dev_infos_get = mana_dev_info_get, .txq_info_get = mana_dev_tx_queue_info, -- 2.17.1
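[Not part of the patch: taken together with the earlier queue-setup patches, the usual bring-up order from an application looks like the sketch below; dev_start is where this PMD creates the hardware queues and swaps in the real burst functions. Pool creation and error cleanup are omitted for brevity, and the helper name is hypothetical.]

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hypothetical bring-up: configure, set up nb_q RX/TX queue pairs, start. */
static int
bring_up(uint16_t port_id, uint16_t nb_q, uint16_t nb_desc,
	 struct rte_mempool *mp)
{
	struct rte_eth_conf conf = { 0 };
	int socket = rte_eth_dev_socket_id(port_id);
	uint16_t q;
	int ret;

	ret = rte_eth_dev_configure(port_id, nb_q, nb_q, &conf);
	if (ret != 0)
		return ret;

	for (q = 0; q < nb_q; q++) {
		ret = rte_eth_rx_queue_setup(port_id, q, nb_desc, socket,
					     NULL, mp);
		if (ret != 0)
			return ret;
		ret = rte_eth_tx_queue_setup(port_id, q, nb_desc, socket,
					     NULL);
		if (ret != 0)
			return ret;
	}

	return rte_eth_dev_start(port_id);
}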
[Patch v4 17/17] net/mana: add function to report queue stats
From: Long Li Report packet statistics. Signed-off-by: Long Li --- doc/guides/nics/features/mana.ini | 2 + drivers/net/mana/mana.c | 77 +++ 2 files changed, 79 insertions(+) diff --git a/doc/guides/nics/features/mana.ini b/doc/guides/nics/features/mana.ini index 7922816d66..b2729aba3a 100644 --- a/doc/guides/nics/features/mana.ini +++ b/doc/guides/nics/features/mana.ini @@ -4,6 +4,7 @@ ; Refer to default.ini for the full list of available PMD features. ; [Features] +Basic stats = Y Free Tx mbuf on demand = Y Link status = P Linux= Y @@ -14,5 +15,6 @@ Queue start/stop = Y Removal event= Y RSS hash = Y Speed capabilities = P +Stats per queue = Y Usage doc= Y x86-64 = Y diff --git a/drivers/net/mana/mana.c b/drivers/net/mana/mana.c index 119fdfd8bc..b86f79de0f 100644 --- a/drivers/net/mana/mana.c +++ b/drivers/net/mana/mana.c @@ -501,6 +501,79 @@ static int mana_dev_link_update(struct rte_eth_dev *dev, return rte_eth_linkstatus_set(dev, &link); } +static int mana_dev_stats_get(struct rte_eth_dev *dev, + struct rte_eth_stats *stats) +{ + unsigned int i; + + for (i = 0; i < dev->data->nb_tx_queues; i++) { + struct mana_txq *txq = dev->data->tx_queues[i]; + + if (!txq) + continue; + + stats->opackets = txq->stats.packets; + stats->obytes = txq->stats.bytes; + stats->oerrors = txq->stats.errors; + + if (i < RTE_ETHDEV_QUEUE_STAT_CNTRS) { + stats->q_opackets[i] = txq->stats.packets; + stats->q_obytes[i] = txq->stats.bytes; + } + } + + stats->rx_nombuf = 0; + for (i = 0; i < dev->data->nb_rx_queues; i++) { + struct mana_rxq *rxq = dev->data->rx_queues[i]; + + if (!rxq) + continue; + + stats->ipackets = rxq->stats.packets; + stats->ibytes = rxq->stats.bytes; + stats->ierrors = rxq->stats.errors; + + /* There is no good way to get stats->imissed, not setting it */ + + if (i < RTE_ETHDEV_QUEUE_STAT_CNTRS) { + stats->q_ipackets[i] = rxq->stats.packets; + stats->q_ibytes[i] = rxq->stats.bytes; + } + + stats->rx_nombuf += rxq->stats.nombuf; + } + + return 0; +} + +static int +mana_dev_stats_reset(struct rte_eth_dev *dev __rte_unused) +{ + unsigned int i; + + PMD_INIT_FUNC_TRACE(); + + for (i = 0; i < dev->data->nb_tx_queues; i++) { + struct mana_txq *txq = dev->data->tx_queues[i]; + + if (!txq) + continue; + + memset(&txq->stats, 0, sizeof(txq->stats)); + } + + for (i = 0; i < dev->data->nb_rx_queues; i++) { + struct mana_rxq *rxq = dev->data->rx_queues[i]; + + if (!rxq) + continue; + + memset(&rxq->stats, 0, sizeof(rxq->stats)); + } + + return 0; +} + const struct eth_dev_ops mana_dev_ops = { .dev_configure = mana_dev_configure, .dev_start = mana_dev_start, @@ -517,9 +590,13 @@ const struct eth_dev_ops mana_dev_ops = { .rx_queue_setup = mana_dev_rx_queue_setup, .rx_queue_release = mana_dev_rx_queue_release, .link_update= mana_dev_link_update, + .stats_get = mana_dev_stats_get, + .stats_reset= mana_dev_stats_reset, }; const struct eth_dev_ops mana_dev_sec_ops = { + .stats_get = mana_dev_stats_get, + .stats_reset = mana_dev_stats_reset, .dev_infos_get = mana_dev_info_get, }; -- 2.17.1
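[Not part of the patch: the counters filled in above surface through rte_eth_stats_get(); per-queue entries are only populated for the first RTE_ETHDEV_QUEUE_STAT_CNTRS queues, which is the same bound the handler honours. A short application-side sketch.]

#include <inttypes.h>
#include <stdio.h>
#include <rte_ethdev.h>

/* Hypothetical stats dump using the generic ethdev API. */
static void
dump_stats(uint16_t port_id)
{
	struct rte_eth_stats st;

	if (rte_eth_stats_get(port_id, &st) != 0)
		return;

	printf("port %u: rx %" PRIu64 " pkts (%" PRIu64 " no-mbuf), tx %" PRIu64 " pkts\n",
	       port_id, st.ipackets, st.rx_nombuf, st.opackets);
}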