Hi, Feifei Sorry, I do not follow why we should get rid of the last (after dev_gen update) wmb. We've rebuilt the global cache, we should notify other agents it's happened and they should flush local caches. So, dev_gen change should be made visible to other agents to trigger this activity and the second wmb is here to ensure this.
One more point, due to registering new/destroying existing MR involves FW (via kernel) calls, it takes so many CPU cycles that we could neglect wmb overhead at all. Also, regarding this: > > Another question suddenly occurred to me, in order to keep the > > order that rebuilding global cache before updating ”dev_gen“, the > > wmb should be before updating "dev_gen" rather than after it. > > Otherwise, in the out-of-order platforms, current order cannot be kept. it is not clear why ordering is important - global cache update and dev_gen change happen under spinlock protection, so only the last wmb is meaningful. To summarize, in my opinion: - if you see some issue with ordering of global cache update/dev_gen signalling, could you, please, elaborate? I'm not sure we should maintain an order (due to spinlock protection) - the last rte_smp_wmb() after dev_gen incrementing should be kept intact With best regards, Slava > -----Original Message----- > From: Feifei Wang <feifei.wa...@arm.com> > Sent: Thursday, May 6, 2021 5:52 > To: Slava Ovsiienko <viachesl...@nvidia.com>; Matan Azrad > <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com> > Cc: dev@dpdk.org; nd <n...@arm.com>; sta...@dpdk.org; Ruifeng Wang > <ruifeng.w...@arm.com>; nd <n...@arm.com> > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > cache > > Hi, Slava > > Would you have more comments about this patch? > For my sight, only one wmb before "dev_gen" updating is enough to > synchronize. > > Thanks very much for your attention. > > > Best Regards > Feifei > > > -----邮件原件----- > > 发件人: Feifei Wang > > 发送时间: 2021年4月20日 16:42 > > 收件人: Slava Ovsiienko <viachesl...@nvidia.com>; Matan Azrad > > <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com> > > 抄送: dev@dpdk.org; nd <n...@arm.com>; sta...@dpdk.org; Ruifeng Wang > > <ruifeng.w...@arm.com>; nd <n...@arm.com> > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > cache > > > > Hi, Slava > > > > I think the second wmb can be removed. > > As I know, wmb is just a barrier to keep the order between write and write. > > and it cannot tell the CPU when it should commit the changes. > > > > It is usually used before guard variable to keep the order that > > updating guard variable after some changes, which you want to release, > have been done. > > > > For example, for the wmb after global cache update/before altering > > dev_gen, it can ensure the order that updating global cache before > > altering > > dev_gen: > > 1)If other agent load the changed "dev_gen", it can know the global > > cache has been updated. > > 2)If other agents load the unchanged, "dev_gen", it means the global > > cache has not been updated, and the local cache will not be flushed. > > > > As a result, we use wmb and guard variable "dev_gen" to ensure the > > global cache updating is "visible". > > The "visible" means when updating guard variable "dev_gen" is known by > > other agents, they also can confirm global cache has been updated in > > the meanwhile. Thus, just one wmb before altering dev_gen can ensure > this. > > > > Best Regards > > Feifei > > > > > -----邮件原件----- > > > 发件人: Slava Ovsiienko <viachesl...@nvidia.com> > > > 发送时间: 2021年4月20日 15:54 > > > 收件人: Feifei Wang <feifei.wa...@arm.com>; Matan Azrad > > > <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com> > > > 抄送: dev@dpdk.org; nd <n...@arm.com>; sta...@dpdk.org; Ruifeng > Wang > > > <ruifeng.w...@arm.com>; nd <n...@arm.com>; nd <n...@arm.com>; nd > > > <n...@arm.com> > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > > cache > > > > > > Hi, Feifei > > > > > > In my opinion, there should be 2 barriers: > > > - after global cache update/before altering dev_gen, to ensure the > > > correct order > > > - after altering dev_gen to make this change visible for other > > > agents and to trigger local cache update > > > > > > With best regards, > > > Slava > > > > > > > -----Original Message----- > > > > From: Feifei Wang <feifei.wa...@arm.com> > > > > Sent: Tuesday, April 20, 2021 10:30 > > > > To: Slava Ovsiienko <viachesl...@nvidia.com>; Matan Azrad > > > > <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com> > > > > Cc: dev@dpdk.org; nd <n...@arm.com>; sta...@dpdk.org; Ruifeng > Wang > > > > <ruifeng.w...@arm.com>; nd <n...@arm.com>; nd <n...@arm.com>; > nd > > > > <n...@arm.com> > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > Region cache > > > > > > > > Hi, Slava > > > > > > > > Another question suddenly occurred to me, in order to keep the > > > > order that rebuilding global cache before updating ”dev_gen“, the > > > > wmb should be before updating "dev_gen" rather than after it. > > > > Otherwise, in the out-of-order platforms, current order cannot be kept. > > > > > > > > Thus, we should change the code as: > > > > a) rebuild global cache; > > > > b) rte_smp_wmb(); > > > > c) updating dev_gen > > > > > > > > Best Regards > > > > Feifei > > > > > -----邮件原件----- > > > > > 发件人: Feifei Wang > > > > > 发送时间: 2021年4月20日 13:54 > > > > > 收件人: Slava Ovsiienko <viachesl...@nvidia.com>; Matan Azrad > > > > > <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com> > > > > > 抄送: dev@dpdk.org; nd <n...@arm.com>; sta...@dpdk.org; Ruifeng > > > Wang > > > > > <ruifeng.w...@arm.com>; nd <n...@arm.com>; nd <n...@arm.com> > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > Region > > > > > cache > > > > > > > > > > Hi, Slava > > > > > > > > > > Thanks very much for your explanation. > > > > > > > > > > I can understand the app can wait all mbufs are returned to the > > > > > memory pool, and then it can free this mbufs, I agree with this. > > > > > > > > > > As a result, I will remove the bug fix patch from this series > > > > > and just replace the smp barrier with C11 thread fence. Thanks > > > > > very much for your patient explanation again. > > > > > > > > > > Best Regards > > > > > Feifei > > > > > > > > > > > -----邮件原件----- > > > > > > 发件人: Slava Ovsiienko <viachesl...@nvidia.com> > > > > > > 发送时间: 2021年4月20日 2:51 > > > > > > 收件人: Feifei Wang <feifei.wa...@arm.com>; Matan Azrad > > > > > > <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com> > > > > > > 抄送: dev@dpdk.org; nd <n...@arm.com>; sta...@dpdk.org; > Ruifeng > > > > Wang > > > > > > <ruifeng.w...@arm.com>; nd <n...@arm.com> > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > > Region cache > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > Please, see below > > > > > > > > > > > > .... > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > Sorry, I do not follow what this patch fixes. Do we have > > > > > > > > some issue/bug with MR cache in practice? > > > > > > > > > > > > > > This patch fixes the bug which is based on logical > > > > > > > deduction, and it doesn't actually happen. > > > > > > > > > > > > > > > > > > > > > > > Each Tx queue has its own dedicated "local" cache for MRs > > > > > > > > to convert buffer address in mbufs being transmitted to > > > > > > > > LKeys (HW-related entity > > > > > > > > handle) and the "global" cache for all MR registered on the > device. > > > > > > > > > > > > > > > > AFAIK, how conversion happens in datapath: > > > > > > > > - check the local queue cache flush request > > > > > > > > - lookup in local cache > > > > > > > > - if not found: > > > > > > > > - acquire lock for global cache read access > > > > > > > > - lookup in global cache > > > > > > > > - release lock for global cache > > > > > > > > > > > > > > > > How cache update on memory freeing/unregistering happens: > > > > > > > > - acquire lock for global cache write access > > > > > > > > - [a] remove relevant MRs from the global cache > > > > > > > > - [b] set local caches flush request > > > > > > > > - free global cache lock > > > > > > > > > > > > > > > > If I understand correctly, your patch swaps [a] and [b], > > > > > > > > and local caches flush is requested earlier. What problem does > > > > > > > > it > solve? > > > > > > > > It is not supposed there are in datapath some mbufs > > > > > > > > referencing to the memory being freed. Application must > > > > > > > > ensure this and must not allocate new mbufs from this > > > > > > > > memory regions > > > being freed. > > > > > > > > Hence, the lookups for these MRs in caches should not occur. > > > > > > > > > > > > > > For your first point that, application can take charge of > > > > > > > preventing MR freed memory being allocated to data path. > > > > > > > > > > > > > > Does it means that If there is an emergency of MR fragment, > > > > > > > such as hotplug, the application must inform thedata path in > > > > > > > advance, and this memory will not be allocated, and then the > > > > > > > control path will free this memory? If application can do > > > > > > > like this, I agree that this bug > > > > > cannot happen. > > > > > > > > > > > > Actually, this is the only correct way for application to operate. > > > > > > Let's suppose we have some memory area that application wants > > > > > > to > > > free. > > > > > > ALL references to this area must be removed. If we have some > > > > > > mbufs allocated from this area, it means that we have memory > > > > > > pool created > > > > there. > > > > > > > > > > > > What application should do: > > > > > > - notify all its components/agents the memory area is going to > > > > > > be freed > > > > > > - all components/agents free the mbufs they might own > > > > > > - PMD might not support freeing for some mbufs (for example > > > > > > being sent and awaiting for completion), so app should just > > > > > > wait > > > > > > - wait till all mbufs are returned to the memory pool (by > > > > > > monitoring available obj == pool size) > > > > > > > > > > > > Otherwise - it is dangerous to free the memory. There are just > > > > > > some mbufs still allocated, it is regardless to buf address to > > > > > > MR translation. We just can't free the memory - the mapping > > > > > > will be destroyed and might cause the segmentation fault by SW > > > > > > or some HW issues on DMA access to unmapped memory. It is > > > > > > very generic safety approach - do not free the memory that is still > > > > > > in > use. > > > > > > Hence, at the moment of freeing and unregistering the MR, > > > > > > there MUST BE NO any > > > > > mbufs in flight referencing to the addresses being freed. > > > > > > No translation to MR being invalidated can happen. > > > > > > > > > > > > > > > > > > > > > For other side, the cache flush has negative effect - the > > > > > > > > local cache is getting empty and can't provide translation > > > > > > > > for other valid (not being removed) MRs, and the > > > > > > > > translation has to look up in the global cache, that is > > > > > > > > locked now for rebuilding, this causes the delays in > > > > > > > > datapatch > > > > > > > on acquiring global cache lock. > > > > > > > > So, I see some potential performance impact. > > > > > > > > > > > > > > If above assumption is true, we can go to your second point. > > > > > > > I think this is a problem of the tradeoff between cache > > > > > > > coherence and > > > > > > performance. > > > > > > > > > > > > > > I can understand your meaning that though global cache has > > > > > > > been changed, we should keep the valid MR in local cache as > > > > > > > long as possible to ensure the fast searching speed. > > > > > > > In the meanwhile, the local cache can be rebuilt later to > > > > > > > reduce its waiting time for acquiring the global cache lock. > > > > > > > > > > > > > > However, this mechanism just ensures the performance > > > > > > > unchanged for the first few mbufs. > > > > > > > During the next mbufs lkey searching after 'dev_gen' > > > > > > > updated, it is still necessary to update the local cache. > > > > > > > And the performance can firstly reduce and then returns. > > > > > > > Thus, no matter whether there is this patch or not, the > > > > > > > performance will jitter in a certain period of > > > > time. > > > > > > > > > > > > Local cache should be updated to remove MRs no longer valid. > > > > > > But we just flush the entire cache. > > > > > > Let's suppose we have valid MR0, MR1, and not valid MRX in > > > > > > local > > cache. > > > > > > And there are traffic in the datapath for MR0 and MR1, and no > > > > > > traffic for MRX anymore. > > > > > > > > > > > > 1) If we do as you propose: > > > > > > a) take a lock > > > > > > b) request flush local cache first - all MR0, MR1, MRX will be > > > > > > removed on translation in datapath > > > > > > c) update global cache, > > > > > > d) free lock > > > > > > All the traffic for valid MR0, MR1 ALWAYS will be blocked on > > > > > > lock taken for cache update since point b) till point d). > > > > > > > > > > > > 2) If we do as it is implemented now: > > > > > > a) take a lock > > > > > > b) update global cache > > > > > > c) request flush local cache > > > > > > d) free lock > > > > > > The traffic MIGHT be locked ONLY for MRs non-existing in local > > > > > > cache (not happens for MR0 and MR1, must not happen for MRX), > > > > > > and probability should be minor. And lock might happen since > > > > > > c) till > > > > > > d) > > > > > > - quite short period of time > > > > > > > > > > > > Summary, the difference between 1) and 2) > > > > > > > > > > > > Lock probability: > > > > > > - 1) lock ALWAYS happen for ANY MR translation after b), > > > > > > 2) lock MIGHT happen, for cache miss ONLY, after c) > > > > > > > > > > > > Lock duration: > > > > > > - 1) lock since b) till d), > > > > > > 2) lock since c) till d), that seems to be much shorter. > > > > > > > > > > > > > > > > > > > > Finally, in conclusion, I tend to think that the bottom > > > > > > > layer can do more things to ensure the correct execution of > > > > > > > the program, which may have a negative impact on the > > > > > > > performance in a short time, but in the long run, the > > > > > > > performance will eventually > > > come back. > > > > > > > Furthermore, maybe we should pay attention to the > > > > > > > performance in the stable period, and try our best to ensure > > > > > > > the correctness of the program in case of > > > > > > emergencies. > > > > > > > > > > > > If we have some mbufs still allocated in memory being freed - > > > > > > there is nothing to say about correctness, it is totally > > > > > > incorrect. In my opinion, we should not think how to mitigate > > > > > > this incorrect behavior, we should not encourage application > > > > > > developers to follow the wrong > > > > > approaches. > > > > > > > > > > > > With best regards, > > > > > > Slava > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > Feifei > > > > > > > > > > > > > > > With best regards, > > > > > > > > Slava > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > From: Feifei Wang <feifei.wa...@arm.com> > > > > > > > > > Sent: Thursday, March 18, 2021 9:19 > > > > > > > > > To: Matan Azrad <ma...@nvidia.com>; Shahaf Shuler > > > > > > > > > <shah...@nvidia.com>; Slava Ovsiienko > > > > > > > > > <viachesl...@nvidia.com>; Yongseok Koh > > > > > > > > > <ys...@mellanox.com> > > > > > > > > > Cc: dev@dpdk.org; n...@arm.com; Feifei Wang > > > > > > <feifei.wa...@arm.com>; > > > > > > > > > sta...@dpdk.org; Ruifeng Wang <ruifeng.w...@arm.com> > > > > > > > > > Subject: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > > Memory Region cache > > > > > > > > > > > > > > > > > > 'dev_gen' is a variable to inform other cores to flush > > > > > > > > > their local cache when global cache is rebuilt. > > > > > > > > > > > > > > > > > > However, if 'dev_gen' is updated after global cache is > > > > > > > > > rebuilt, other cores may load a wrong memory region lkey > > > > > > > > > value from old local > > > > > > > cache. > > > > > > > > > > > > > > > > > > Timeslot main core worker core > > > > > > > > > 1 rebuild global cache > > > > > > > > > 2 load unchanged dev_gen > > > > > > > > > 3 update dev_gen > > > > > > > > > 4 look up old local cache > > > > > > > > > > > > > > > > > > From the example above, we can see that though global > > > > > > > > > cache is rebuilt, due to that dev_gen is not updated, > > > > > > > > > the worker core may look up old cache table and receive > > > > > > > > > a wrong memory region > > > > lkey value. > > > > > > > > > > > > > > > > > > To fix this, updating 'dev_gen' should be moved before > > > > > > > > > rebuilding global cache to inform worker cores to flush > > > > > > > > > their local cache when global cache start rebuilding. > > > > > > > > > And wmb can ensure the sequence of this > > > > > > > > process. > > > > > > > > > > > > > > > > > > Fixes: 974f1e7ef146 ("net/mlx5: add new memory region > > > > > > > > > support") > > > > > > > > > Cc: sta...@dpdk.org > > > > > > > > > > > > > > > > > > Suggested-by: Ruifeng Wang <ruifeng.w...@arm.com> > > > > > > > > > Signed-off-by: Feifei Wang <feifei.wa...@arm.com> > > > > > > > > > Reviewed-by: Ruifeng Wang <ruifeng.w...@arm.com> > > > > > > > > > --- > > > > > > > > > drivers/net/mlx5/mlx5_mr.c | 37 > > > > > > > > > +++++++++++++++++-------------------- > > > > > > > > > 1 file changed, 17 insertions(+), 20 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/drivers/net/mlx5/mlx5_mr.c > > > > > > > > > b/drivers/net/mlx5/mlx5_mr.c index > > > > > > > > > da4e91fc2..7ce1d3e64 100644 > > > > > > > > > --- a/drivers/net/mlx5/mlx5_mr.c > > > > > > > > > +++ b/drivers/net/mlx5/mlx5_mr.c > > > > > > > > > @@ -103,20 +103,18 @@ > mlx5_mr_mem_event_free_cb(struct > > > > > > > > > mlx5_dev_ctx_shared *sh, > > > > > > > > > rebuild = 1; > > > > > > > > > } > > > > > > > > > if (rebuild) { > > > > > > > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > > > + ++sh->share_cache.dev_gen; > > > > > > > > > + DEBUG("broadcasting local cache flush, > gen=%d", > > > > > > > > > + sh->share_cache.dev_gen); > > > > > > > > > + > > > > > > > > > /* > > > > > > > > > * Flush local caches by propagating > > > > > > > > > invalidation > > > > > across cores. > > > > > > > > > - * rte_smp_wmb() is enough to synchronize this > > > > > event. If > > > > > > > > > one of > > > > > > > > > - * freed memsegs is seen by other core, that > > > > > > > > > means > > > > > the > > > > > > > > > memseg > > > > > > > > > - * has been allocated by allocator, which will > > > > > > > > > come > > > > > after this > > > > > > > > > - * free call. Therefore, this store instruction > > > > > (incrementing > > > > > > > > > - * generation below) will be guaranteed to be > > > > > > > > > seen > > > > > by other > > > > > > > > > core > > > > > > > > > - * before the core sees the newly allocated > > > > > > > > > memory. > > > > > > > > > + * rte_smp_wmb() is to keep the order that > dev_gen > > > > > > > > > updated before > > > > > > > > > + * rebuilding global cache. Therefore, other > core can > > > > > flush > > > > > > > > > their > > > > > > > > > + * local cache on time. > > > > > > > > > */ > > > > > > > > > - ++sh->share_cache.dev_gen; > > > > > > > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > > > > > > > - sh->share_cache.dev_gen); > > > > > > > > > rte_smp_wmb(); > > > > > > > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > > > } > > > > > > > > > rte_rwlock_write_unlock(&sh->share_cache.rwlock); > > > > > > > > > } > > > > > > > > > @@ -407,20 +405,19 @@ mlx5_dma_unmap(struct > > rte_pci_device > > > > > > > *pdev, > > > > > > > > void > > > > > > > > > *addr, > > > > > > > > > mlx5_mr_free(mr, sh->share_cache.dereg_mr_cb); > > > > > > > > > DEBUG("port %u remove MR(%p) from list", dev->data- > > > > > >port_id, > > > > > > > > > (void *)mr); > > > > > > > > > - mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > > > + > > > > > > > > > + ++sh->share_cache.dev_gen; > > > > > > > > > + DEBUG("broadcasting local cache flush, gen=%d", > > > > > > > > > + sh->share_cache.dev_gen); > > > > > > > > > + > > > > > > > > > /* > > > > > > > > > * Flush local caches by propagating invalidation > > > > > > > > > across cores. > > > > > > > > > - * rte_smp_wmb() is enough to synchronize this event. If > > > > > one of > > > > > > > > > - * freed memsegs is seen by other core, that means the > > > > > memseg > > > > > > > > > - * has been allocated by allocator, which will come > > > > > > > > > after this > > > > > > > > > - * free call. Therefore, this store instruction > > > > > > > > > (incrementing > > > > > > > > > - * generation below) will be guaranteed to be seen by > > > > > > > > > other > > > > > core > > > > > > > > > - * before the core sees the newly allocated memory. > > > > > > > > > + * rte_smp_wmb() is to keep the order that dev_gen > > > > > updated > > > > > > > > > before > > > > > > > > > + * rebuilding global cache. Therefore, other core can > > > > > > > > > +flush > > > > > their > > > > > > > > > + * local cache on time. > > > > > > > > > */ > > > > > > > > > - ++sh->share_cache.dev_gen; > > > > > > > > > - DEBUG("broadcasting local cache flush, gen=%d", > > > > > > > > > - sh->share_cache.dev_gen); > > > > > > > > > rte_smp_wmb(); > > > > > > > > > + mlx5_mr_rebuild_cache(&sh->share_cache); > > > > > > > > > rte_rwlock_read_unlock(&sh->share_cache.rwlock); > > > > > > > > > return 0; > > > > > > > > > } > > > > > > > > > -- > > > > > > > > > 2.25.1