Hi, Feifei We should consider the locks in your scenario - it is crucial for the complete model description:
How agent_1 (in your terms) rebuilds global cache: 1a) lock() 1b) rebuild(global cache) 1c) update(dev_gen) 1d) wmb() 1e) unlock() How agent_2 checks: 2a) check(dev_gen) (assume positive - changed) 2b) clear(local_cache) 2c) miss(on empty local_cache) - > eventually it goes to mr_lookup_caches() 2d) lock() 2e) get(new MR) 2f) unlock() 2g) update(local cache with obtained new MR) Hence, even if 1c) becomes visible in 2a) before 1b) committed (say, due to out-of-order Arch) - the agent 2 would be blocked on 2d) and scenario depicted on your Fig2 would not happen (agent_2 will wait before step 3 till agent 1 unlocks after its step 5). With best regards, Slava > -----Original Message----- > From: Feifei Wang <feifei.wa...@arm.com> > Sent: Friday, May 7, 2021 9:36> To: Slava Ovsiienko <viachesl...@nvidia.com>; > Matan Azrad > <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com> > Cc: dev@dpdk.org; nd <n...@arm.com>; sta...@dpdk.org; Ruifeng Wang > <ruifeng.w...@arm.com>; nd <n...@arm.com> > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > cache > > Hi, Slava > > Thanks very much for your reply. > > > -----邮件原件----- > > 发件人: Slava Ovsiienko <viachesl...@nvidia.com> > > 发送时间: 2021年5月6日 19:22 > > 收件人: Feifei Wang <feifei.wa...@arm.com>; Matan Azrad > > <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com> > > 抄送: dev@dpdk.org; nd <n...@arm.com>; sta...@dpdk.org; Ruifeng Wang > > <ruifeng.w...@arm.com>; nd <n...@arm.com> > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region > > cache > > > > Hi, Feifei > > > > Sorry, I do not follow why we should get rid of the last (after > > dev_gen update) wmb. > > We've rebuilt the global cache, we should notify other agents it's > > happened and they should flush local caches. So, dev_gen change should > > be made visible to other agents to trigger this activity and the > > second wmb is here to ensure this. > > 1. For the first problem why we should get rid of the last wmb and move it > before dev_gen updated, I think our attention is how the wmb implements > the synchronization between multiple agents. > Fig1 > ---------------------------------------------------------------------------------------------- > ------- > Timeslot agent_1 agent_2 > 1 rebuild global cache > 2 wmb > 3 update dev_gen ----------------------- load changed > dev_gen > 4 > rebuild local cache > ---------------------------------------------------------------------------------------------- > ------- > > First, wmb is only for local thread to keep the order between local write- > write : > Based on the picture above, for agent_1, wmb keeps the order that > rebuilding global cache is always before updating dev_gen. > > Second, agent_1 communicates with agent_2 by the global variable > "dev_gen" : > If agent_1 updates dev_gen, agent_2 will load it and then it knows it should > rebuild local cache > > Finally, agent_2 rebuilds local cache according to whether agent_1 has rebuilt > global cache, and agent_2 knows this information by the variable "dev_gen". > Fig2 > ---------------------------------------------------------------------------------------------- > ------- > Timeslot agent_1 agent_2 > 1 update dev_gen > 2 load changed dev_gen > 3 rebuild local cache > 4 rebuild global cache > 5 wmb > ---------------------------------------------------------------------------------------------- > ------- > > However, in arm platform, if wmb is after dev_gen updated, "dev_gen" may > be updated before agent_1 rebuilding global cache, then agent_2 maybe > receive error message and rebuild its local cache in advance. > > To summarize, it is not important which time other agents can see the > changed global variable "dev_gen". > (Actually, wmb after "dev_gen" cannot ensure changed "dev_gen" is > committed to the global). > It is more important that if other agents see the changed "dev_gen", they > also can know global cache has been updated. > > > One more point, due to registering new/destroying existing MR involves > > FW (via kernel) calls, it takes so many CPU cycles that we could > > neglect wmb overhead at all. > > We just move the last wmb into the right place, and not delete it for > performance. > > > > > Also, regarding this: > > > > > > Another question suddenly occurred to me, in order to keep the > > > > order that rebuilding global cache before updating ”dev_gen“, the > > > > wmb should be before updating "dev_gen" rather than after it. > > > > Otherwise, in the out-of-order platforms, current order cannot be > kept. > > > > it is not clear why ordering is important - global cache update and > > dev_gen change happen under spinlock protection, so only the last wmb > > is meaningful. > > > > 2. The second function of wmb before "dev_gen" updated is for > performance according to our previous discussion. > According to Fig2, if there is no wmb between "global cache updated" and > "dev_gen updated", "dev_gen" may update before global cache updated. > > Then agent_2 may see the changed "dev_gen" and flush entire local cache in > advance. > > This entire flush can degrade the performance: > "the local cache is getting empty and can't provide translation for other > valid > (not being removed) MRs, and the translation has to look up in the global > cache, that is locked now for rebuilding, this causes the delays in data path > on > acquiring global cache lock." > > Furthermore, spinlock is just for global cache, not for dev_gen and local > cache. > > > To summarize, in my opinion: > > - if you see some issue with ordering of global cache update/dev_gen > > signalling, > > could you, please, elaborate? I'm not sure we should maintain an > > order (due to spinlock protection) > > - the last rte_smp_wmb() after dev_gen incrementing should be kept > > intact > > > > At last, for my view, there are two functions that moving wmb before > "dev_gen" > for the write-write order: > -------------------------------- > a) rebuild global cache; > b) rte_smp_wmb(); > c) updating dev_gen > -------------------------------- > 1. Achieve synchronization between multiple threads in the right way 2. > Prevent other agents from flushing local cache early to ensure performance > > Best Regards > Feifei > > > With best regards, > > Slava > > > > > -----Original Message----- > > > From: Feifei Wang <feifei.wa...@arm.com> > > > Sent: Thursday, May 6, 2021 5:52 > > > To: Slava Ovsiienko <viachesl...@nvidia.com>; Matan Azrad > > > <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com> > > > Cc: dev@dpdk.org; nd <n...@arm.com>; sta...@dpdk.org; Ruifeng Wang > > > <ruifeng.w...@arm.com>; nd <n...@arm.com> > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > Region cache > > > > > > Hi, Slava > > > > > > Would you have more comments about this patch? > > > For my sight, only one wmb before "dev_gen" updating is enough to > > > synchronize. > > > > > > Thanks very much for your attention. > > > > > > > > > Best Regards > > > Feifei > > > > > > > -----邮件原件----- > > > > 发件人: Feifei Wang > > > > 发送时间: 2021年4月20日 16:42 > > > > 收件人: Slava Ovsiienko <viachesl...@nvidia.com>; Matan Azrad > > > > <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com> > > > > 抄送: dev@dpdk.org; nd <n...@arm.com>; sta...@dpdk.org; Ruifeng > > Wang > > > > <ruifeng.w...@arm.com>; nd <n...@arm.com> > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > Region > > > > cache > > > > > > > > Hi, Slava > > > > > > > > I think the second wmb can be removed. > > > > As I know, wmb is just a barrier to keep the order between write > > > > and > > write. > > > > and it cannot tell the CPU when it should commit the changes. > > > > > > > > It is usually used before guard variable to keep the order that > > > > updating guard variable after some changes, which you want to > > > > release, > > > have been done. > > > > > > > > For example, for the wmb after global cache update/before > > > > altering dev_gen, it can ensure the order that updating global > > > > cache before altering > > > > dev_gen: > > > > 1)If other agent load the changed "dev_gen", it can know the > > > > global cache has been updated. > > > > 2)If other agents load the unchanged, "dev_gen", it means the > > > > global cache has not been updated, and the local cache will not be > flushed. > > > > > > > > As a result, we use wmb and guard variable "dev_gen" to ensure > > > > the global cache updating is "visible". > > > > The "visible" means when updating guard variable "dev_gen" is > > > > known by other agents, they also can confirm global cache has been > > > > updated in the meanwhile. Thus, just one wmb before altering > > > > dev_gen can ensure > > > this. > > > > > > > > Best Regards > > > > Feifei > > > > > > > > > -----邮件原件----- > > > > > 发件人: Slava Ovsiienko <viachesl...@nvidia.com> > > > > > 发送时间: 2021年4月20日 15:54 > > > > > 收件人: Feifei Wang <feifei.wa...@arm.com>; Matan Azrad > > > > > <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com> > > > > > 抄送: dev@dpdk.org; nd <n...@arm.com>; sta...@dpdk.org; Ruifeng > > > Wang > > > > > <ruifeng.w...@arm.com>; nd <n...@arm.com>; nd <n...@arm.com>; > > nd > > > > > <n...@arm.com> > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > > Region cache > > > > > > > > > > Hi, Feifei > > > > > > > > > > In my opinion, there should be 2 barriers: > > > > > - after global cache update/before altering dev_gen, to ensure > > > > > the correct order > > > > > - after altering dev_gen to make this change visible for other > > > > > agents and to trigger local cache update > > > > > > > > > > With best regards, > > > > > Slava > > > > > > > > > > > -----Original Message----- > > > > > > From: Feifei Wang <feifei.wa...@arm.com> > > > > > > Sent: Tuesday, April 20, 2021 10:30 > > > > > > To: Slava Ovsiienko <viachesl...@nvidia.com>; Matan Azrad > > > > > > <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com> > > > > > > Cc: dev@dpdk.org; nd <n...@arm.com>; sta...@dpdk.org; Ruifeng > > > Wang > > > > > > <ruifeng.w...@arm.com>; nd <n...@arm.com>; nd > <n...@arm.com>; > > > nd > > > > > > <n...@arm.com> > > > > > > Subject: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > Memory Region cache > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > Another question suddenly occurred to me, in order to keep the > > > > > > order that rebuilding global cache before updating ”dev_gen“, > > > > > > the wmb should be before updating "dev_gen" rather than after it. > > > > > > Otherwise, in the out-of-order platforms, current order cannot > > > > > > be > > kept. > > > > > > > > > > > > Thus, we should change the code as: > > > > > > a) rebuild global cache; > > > > > > b) rte_smp_wmb(); > > > > > > c) updating dev_gen > > > > > > > > > > > > Best Regards > > > > > > Feifei > > > > > > > -----邮件原件----- > > > > > > > 发件人: Feifei Wang > > > > > > > 发送时间: 2021年4月20日 13:54 > > > > > > > 收件人: Slava Ovsiienko <viachesl...@nvidia.com>; Matan Azrad > > > > > > > <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com> > > > > > > > 抄送: dev@dpdk.org; nd <n...@arm.com>; sta...@dpdk.org; > > Ruifeng > > > > > Wang > > > > > > > <ruifeng.w...@arm.com>; nd <n...@arm.com>; nd > <n...@arm.com> > > > > > > > 主题: 回复: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory > > > > Region > > > > > > > cache > > > > > > > > > > > > > > Hi, Slava > > > > > > > > > > > > > > Thanks very much for your explanation. > > > > > > > > > > > > > > I can understand the app can wait all mbufs are returned to > > > > > > > the memory pool, and then it can free this mbufs, I agree with > this. > > > > > > > > > > > > > > As a result, I will remove the bug fix patch from this > > > > > > > series and just replace the smp barrier with C11 thread > > > > > > > fence. Thanks very much for your patient explanation again. > > > > > > > > > > > > > > Best Regards > > > > > > > Feifei > > > > > > > > > > > > > > > -----邮件原件----- > > > > > > > > 发件人: Slava Ovsiienko <viachesl...@nvidia.com> > > > > > > > > 发送时间: 2021年4月20日 2:51 > > > > > > > > 收件人: Feifei Wang <feifei.wa...@arm.com>; Matan Azrad > > > > > > > > <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com> > > > > > > > > 抄送: dev@dpdk.org; nd <n...@arm.com>; sta...@dpdk.org; > > > Ruifeng > > > > > > Wang > > > > > > > > <ruifeng.w...@arm.com>; nd <n...@arm.com> > > > > > > > > 主题: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for > > > > > > > > Memory Region cache > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > Please, see below > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > > > > > Hi, Feifei > > > > > > > > > > > > > > > > > > > > Sorry, I do not follow what this patch fixes. Do we > > > > > > > > > > have some issue/bug with MR cache in practice? > > > > > > > > > > > > > > > > > > This patch fixes the bug which is based on logical > > > > > > > > > deduction, and it doesn't actually happen. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Each Tx queue has its own dedicated "local" cache for > > > > > > > > > > MRs to convert buffer address in mbufs being > > > > > > > > > > transmitted to LKeys (HW-related entity > > > > > > > > > > handle) and the "global" cache for all MR registered > > > > > > > > > > on the > > > device. > > > > > > > > > > > > > > > > > > > > AFAIK, how conversion happens in datapath: > > > > > > > > > > - check the local queue cache flush request > > > > > > > > > > - lookup in local cache > > > > > > > > > > - if not found: > > > > > > > > > > - acquire lock for global cache read access > > > > > > > > > > - lookup in global cache > > > > > > > > > > - release lock for global cache > > > > > > > > > > > > > > > > > > > > How cache update on memory freeing/unregistering > happens: > > > > > > > > > > - acquire lock for global cache write access > > > > > > > > > > - [a] remove relevant MRs from the global cache > > > > > > > > > > - [b] set local caches flush request > > > > > > > > > > - free global cache lock > > > > > > > > > > > > > > > > > > > > If I understand correctly, your patch swaps [a] and > > > > > > > > > > [b], and local caches flush is requested earlier. What > > > > > > > > > > problem does it > > > solve? > > > > > > > > > > It is not supposed there are in datapath some mbufs > > > > > > > > > > referencing to the memory being freed. Application > > > > > > > > > > must ensure this and must not allocate new mbufs from > > > > > > > > > > this memory regions > > > > > being freed. > > > > > > > > > > Hence, the lookups for these MRs in caches should not > occur. > > > > > > > > > > > > > > > > > > For your first point that, application can take charge > > > > > > > > > of preventing MR freed memory being allocated to data path. > > > > > > > > > > > > > > > > > > Does it means that If there is an emergency of MR > > > > > > > > > fragment, such as hotplug, the application must inform > > > > > > > > > thedata path in advance, and this memory will not be > > > > > > > > > allocated, and then the control path will free this > > > > > > > > > memory? If application can do like this, I agree that > > > > > > > > > this bug > > > > > > > cannot happen. > > > > > > > > > > > > > > > > Actually, this is the only correct way for application to > > > > > > > > operate. > > > > > > > > Let's suppose we have some memory area that application > > > > > > > > wants to > > > > > free. > > > > > > > > ALL references to this area must be removed. If we have > > > > > > > > some mbufs allocated from this area, it means that we have > > > > > > > > memory pool created > > > > > > there. > > > > > > > > > > > > > > > > What application should do: > > > > > > > > - notify all its components/agents the memory area is > > > > > > > > going to be freed > > > > > > > > - all components/agents free the mbufs they might own > > > > > > > > - PMD might not support freeing for some mbufs (for > > > > > > > > example being sent and awaiting for completion), so app > > > > > > > > should just wait > > > > > > > > - wait till all mbufs are returned to the memory pool (by > > > > > > > > monitoring available obj == pool size) > > > > > > > > > > > > > > > > Otherwise - it is dangerous to free the memory. There are > > > > > > > > just some mbufs still allocated, it is regardless to buf > > > > > > > > address to MR translation. We just can't free the memory - > > > > > > > > the mapping will be destroyed and might cause the > > > > > > > > segmentation fault by SW or some HW issues on DMA access > > > > > > > > to unmapped memory. It is very generic safety approach - > > > > > > > > do not free the memory that is still in > > > use. > > > > > > > > Hence, at the moment of freeing and unregistering the MR, > > > > > > > > there MUST BE NO any > > > > > > > mbufs in flight referencing to the addresses being freed. > > > > > > > > No translation to MR being invalidated can happen. > > > > > > > > > > > > > > > > > > > > > > > > > > > For other side, the cache flush has negative effect - > > > > > > > > > > the local cache is getting empty and can't provide > > > > > > > > > > translation for other valid (not being removed) MRs, > > > > > > > > > > and the translation has to look up in the global > > > > > > > > > > cache, that is locked now for rebuilding, this causes > > > > > > > > > > the delays in datapatch > > > > > > > > > on acquiring global cache lock. > > > > > > > > > > So, I see some potential performance impact. > > > > > > > > > > > > > > > > > > If above assumption is true, we can go to your second point. > > > > > > > > > I think this is a problem of the tradeoff between cache > > > > > > > > > coherence and > > > > > > > > performance. > > > > > > > > > > > > > > > > > > I can understand your meaning that though global cache > > > > > > > > > has been changed, we should keep the valid MR in local > > > > > > > > > cache as long as possible to ensure the fast searching speed. > > > > > > > > > In the meanwhile, the local cache can be rebuilt later > > > > > > > > > to reduce its waiting time for acquiring the global cache > > > > > > > > > lock. > > > > > > > > > > > > > > > > > > However, this mechanism just ensures the performance > > > > > > > > > unchanged for the first few mbufs. > > > > > > > > > During the next mbufs lkey searching after 'dev_gen' > > > > > > > > > updated, it is still necessary to update the local cache. > > > > > > > > > And the performance can firstly reduce and then returns. > > > > > > > > > Thus, no matter whether there is this patch or not, the > > > > > > > > > performance will jitter in a certain period of > > > > > > time. > > > > > > > > > > > > > > > > Local cache should be updated to remove MRs no longer valid. > > > > > > > > But we just flush the entire cache. > > > > > > > > Let's suppose we have valid MR0, MR1, and not valid MRX in > > > > > > > > local > > > > cache. > > > > > > > > And there are traffic in the datapath for MR0 and MR1, and > > > > > > > > no traffic for MRX anymore. > > > > > > > > > > > > > > > > 1) If we do as you propose: > > > > > > > > a) take a lock > > > > > > > > b) request flush local cache first - all MR0, MR1, MRX > > > > > > > > will be removed on translation in datapath > > > > > > > > c) update global cache, > > > > > > > > d) free lock > > > > > > > > All the traffic for valid MR0, MR1 ALWAYS will be blocked > > > > > > > > on lock taken for cache update since point b) till point d). > > > > > > > > > > > > > > > > 2) If we do as it is implemented now: > > > > > > > > a) take a lock > > > > > > > > b) update global cache > > > > > > > > c) request flush local cache > > > > > > > > d) free lock > > > > > > > > The traffic MIGHT be locked ONLY for MRs non-existing in > > > > > > > > local cache (not happens for MR0 and MR1, must not happen > > > > > > > > for MRX), and probability should be minor. And lock might > > > > > > > > happen since > > > > > > > > c) till > > > > > > > > d) > > > > > > > > - quite short period of time > > > > > > > > > > > > > > > > Summary, the difference between 1) and 2) > > > > > > > > > > > > > > > > Lock probability: > > > > > > > > - 1) lock ALWAYS happen for ANY MR translation after b), > > > > > > > > 2) lock MIGHT happen, for cache miss ONLY, after c) > > > > > > > > > > > > > > > > Lock duration: > > > > > > > > - 1) lock since b) till d), > > > > > > > > 2) lock since c) till d), that seems to be much shorter. > > > > > > > > > > > > > > > > > > > > > > > > > > Finally, in conclusion, I tend to think that the bottom > > > > > > > > > layer can do more things to ensure the correct execution > > > > > > > > > of the program, which may have a negative impact on the > > > > > > > > > performance in a short time, but in the long run, the > > > > > > > > > performance will eventually > > > > > come back. > > > > > > > > > Furthermore, maybe we should pay attention to the > > > > > > > > > performance in the stable period, and try our best to > > > > > > > > > ensure the correctness of the program in case of > > > > > > > > emergencies. > > > > > > > > > > > > > > > > If we have some mbufs still allocated in memory being > > > > > > > > freed > > > > > > > > - there is nothing to say about correctness, it is totally > > > > > > > > incorrect. In my opinion, we should not think how to > > > > > > > > mitigate this incorrect behavior, we should not encourage > > > > > > > > application developers to follow the wrong > > > > > > > approaches. > > > > > > > > > > > > > > > > With best regards, > > > > > > > > Slava > > > > > > > > > > > > > > > > > > > > > > > > > > Best Regards > > > > > > > > > Feifei > > > > > > > > > > > > > > > > > > > With best regards, > > > > > > > > > > Slava