Yes +1 Let the application handle it once there is an issue.
BR. Bing > -----Original Message----- > From: Thomas Monjalon <tho...@monjalon.net> > Sent: Wednesday, November 10, 2021 10:58 PM > To: Yigit, Ferruh <ferruh.yi...@intel.com>; Bing Zhao > <bi...@nvidia.com>; Ananyev, Konstantin > <konstantin.anan...@intel.com> > Cc: andrew.rybche...@oktetlabs.ru; dev@dpdk.org > Subject: Re: [PATCH 2/2] ethdev: fix the race condition for fp ops > reset > > External email: Use caution opening links or attachments > > > 10/11/2021 15:37, Ananyev, Konstantin: > > > > Hi Ferruh, > > > > > >> 22/10/2021 23:14, Bing Zhao: > > > >>> In the function "eth_dev_fp_ops_reset", a structure > assignment > > > >>> operation is used to reset one queue's callback functions, > etc., > > > >>> but it is not thread safe. > > > >>> > > > >>> The structure assignment is not atomic, a lot of > instructions > > > >>> will be generated. Right now, since not all the fields are > > > >>> needed, the fields in the "dummy_ops" which is not set > > > >>> explicitly will be 0s based on the specification and > compiler > > > >>> behavior. In order to make "fpo" has the same content with > > > >>> "dummy_ops", some clearing to 0 operation is needed. > > > >>> > > > >>> By checking the object instructions (e.g. with GCC 4.8.5) > > > >>> 0x0000000000a58317 <+35>: mov %rsi,%rdi > > > >>> 0x0000000000a5831a <+38>: mov %rdx,%rcx > > > >>> => 0x0000000000a5831d <+41>: rep stos %rax,%es:(%rdi) > > > >>> 0x0000000000a58320 <+44>: mov -0x38(%rsp),%rax > > > >>> 0x0000000000a58325 <+49>: lea -0xe0(%rip),%rdx > > > >>> // # 0xa5824c <dummy_eth_rx_burst> > > > >>> > > > >>> It shows that "rep stos" will clear the "fpo" structure > before > > > >>> assigning new values. > > > >>> > > > >>> In the other thread, if some data path Tx / Rx functions are > > > >>> still running, there is a risk to get 0 instead of the > correct > > > >>> dummy content. > > > >>> 1. qd = p->rxq.data[queue_id] > > > >>> 2. (void **)&p->rxq.clbk[queue_id] "data" and "clbk" may > be > > > >>> observed with NULL (0) in other threads. > > > >>> Even it is temporary, the accessing to a NULL pointer will > cause > > > >>> a crash. Using "memcpy" could get rid of this. > > > >>> > > > >>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into > separate > > > >>> structure") > > > >>> Cc: konstantin.anan...@intel.com > > > >>> > > > >>> Signed-off-by: Bing Zhao <bi...@nvidia.com> > > > >>> --- > > > >>> --- a/lib/ethdev/ethdev_private.c > > > >>> +++ b/lib/ethdev/ethdev_private.c > > > >>> @@ -206,7 +206,7 @@ eth_dev_fp_ops_reset(struct > rte_eth_fp_ops *fpo) > > > >>> .txq = {.data = dummy_data, .clbk = > dummy_data,}, > > > >>> }; > > > >>> > > > >>> - *fpo = dummy_ops; > > > >>> + rte_memcpy(fpo, &dummy_ops, sizeof(struct > > > >>> + rte_eth_fp_ops)); > > > >> > > > >> That's not trivial. > > > >> Please add a comment to briefly explain that memcpy avoids > zeroing of a simple assignment. > > > >> > > > > > > > > I think that patch is based on two totally wrong assumptions: > > > > 1) ethdev data-path and control-path API is MT-safe. > > > > With current design it is not. > > > > When calling rx/tx_burst it is caller responsibility to > make sure that given port is > > > > already properly configured and started. Also it is user > responsibility to guarantee > > > > that none other thread doing dev_stop for the same port > simultaneously. > > > > And visa-versa when calling dev_stop(), it is user > responsibility to ensure that > > > > none other thread doing rx/tx_burst for given port > simultaneously. > > > > If your app doesn't follow these principles, then it is a > bug that needs to be fixed. > > > > 2) rte_memcpy() provides some sort of atomicity and it is safe > to use it on its own > > > > in MT environment. That's totally wrong. > > > > In both cases compiler has total freedom to perform copy > in any order it likes > > > > (let say it can first read whole source data in some > temporary buffer (SIMD register), > > > > and then right it in one go, or it can do the same trick > with 'rep stos' as above). > > > > Moreover CPU itself can reorder instructions. > > > > So if you need this copy to be atomic you need to use > some sort of > > > > sync primitives along with it (mutex, rwlock, rcu, etc.). > > > > But as I said above right now ethdev API is not MT-safe, > so it is not required. > > > > > > > > To summarise - there is no point to mae these changes, and > patch > > > > comment is wrong and misleading. > > > > > > Can we mark this patch as rejected now? > > > > I believe so. > > > > > Patch seems trying to cover a wrong application usage, and it > should > > > be addressed in the application level. > > Yes >