> > <snip>
> > > > > > > > +/**
> > > > > > > > + * RTE thread Quiescent State structure.
> > > > > > > > + */
> > > > > > > > +struct rte_rcu_qsbr {
> > > > > > > > +	uint64_t reg_thread_id[RTE_QSBR_BIT_MAP_ELEMS] __rte_cache_aligned;
> > > > > > > > +	/**< Registered reader thread IDs - reader threads reporting
> > > > > > > > +	 * on this QS variable represented in a bit map.
> > > > > > > > +	 */
> > > > > > > > +
> > > > > > > > +	uint64_t token __rte_cache_aligned;
> > > > > > > > +	/**< Counter to allow for multiple simultaneous QS queries */
> > > > > > > > +
> > > > > > > > +	struct rte_rcu_qsbr_cnt w[RTE_RCU_MAX_THREADS] __rte_cache_aligned;
> > > > > > > > +	/**< QS counter for each reader thread, counts up to
> > > > > > > > +	 * current value of token.
> > > > > > >
> > > > > > > As I understand, you decided to stick with a neutral thread_id
> > > > > > > and let the user define what exactly thread_id is (lcore, system
> > > > > > > thread id, something else)?
> > > > > > Yes, that is correct. I will reply to the other thread to continue
> > > > > > the discussion.
> > > > > >
> > > > > > > If so, can you probably get rid of the RTE_RCU_MAX_THREADS limitation?
> > > > > > I am not seeing this as a limitation. The user can change this if
> > > > > > required. Maybe I should change it as follows:
> > > > > > #ifndef RTE_RCU_MAX_THREADS
> > > > > > #define RTE_RCU_MAX_THREADS 128
> > > > > > #endif
> > > > >
> > > > > Yep, that's better, though it would still require the user to rebuild
> > > > > the code if he would like to increase the total number of threads
> > > > > supported.
> > > > Agree
> > > >
> > > > > Though it seems relatively simple to extend the current code to
> > > > > support a dynamic max thread num here (2 variable arrays plus a shift
> > > > > value plus a mask).
> > > > Agree, supporting a dynamic 'max thread num' is simple.
> > > > But this means memory needs to be allocated for the arrays. The API
> > > > 'rte_rcu_qsbr_init' has to take max thread num as a parameter. We also
> > > > have to introduce another API to free this memory. This will become
> > > > very similar to the alloc/free APIs I had in v1.
> > > > I hope I am following you well, please correct me if not.
> > >
> > > I think we can still leave the alloc/free tasks to the user.
> > > We probably just need an extra function rte_rcu_qsbr_size(uint32_t
> > > max_threads) to help the user calculate the required size.
> > > rte_rcu_qsbr_init() might take 'size' as an additional parameter to
> > > make checks.
> > The size is returned by an API provided by the library. Why does it need
> > to be validated again? If 'size' is required for rte_rcu_qsbr_init, it
> > could calculate it again.
>
> Just as an extra safety check.
> I don't have a strong opinion here - if you think it is overkill, let's drop it.
>
> > >
> > > Thought about something like that:
> > >
> > > size_t sz = rte_rcu_qsbr_size(max_threads);
> > > struct rte_rcu_qsbr *qsbr = alloc_aligned(CACHE_LINE, sz);
> > > rte_rcu_qsbr_init(qsbr, max_threads, sz);
> > > ...
> > >
> > Do you see any advantage in allowing the user to allocate the memory?
>
> So the user can choose where to allocate the memory (eal malloc, normal
> malloc, stack, something else).
> Also, the user might decide to make rcu part of some complex data structure -
> in that case he probably would like to allocate one big chunk of memory at
> once and then provide part of it for rcu.
> Or some other usage scenario that I can't predict.
>
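The size/init split discussed above can be sketched in plain C. This is a hedged illustration, not the actual DPDK implementation: the names `qsbr_size`, `qsbr_init`, and the struct layouts are hypothetical stand-ins showing the pattern where the library computes the footprint, the caller allocates it anywhere, and init takes the size back as the extra safety check Konstantin mentioned.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define CACHE_LINE_SIZE 64

/* Hypothetical per-thread counter record. One cache line per reader
 * thread avoids false sharing between readers. */
struct qsbr_cnt {
	uint64_t cnt;
	uint8_t pad[CACHE_LINE_SIZE - sizeof(uint64_t)];
};

/* Hypothetical QSBR header: token plus a flexible array of per-thread
 * counters, sized at init time from max_threads. */
struct qsbr {
	uint32_t max_threads;
	uint64_t token;
	struct qsbr_cnt cnt[];	/* max_threads entries follow the header */
};

static size_t align_up(size_t v, size_t a)
{
	return (v + a - 1) & ~(a - 1);
}

/* Sketch of rte_rcu_qsbr_size(): total footprint for max_threads readers,
 * rounded to a cache line so cache-aligned allocators are happy. */
size_t qsbr_size(uint32_t max_threads)
{
	size_t sz = sizeof(struct qsbr) +
		    (size_t)max_threads * sizeof(struct qsbr_cnt);
	return align_up(sz, CACHE_LINE_SIZE);
}

/* Sketch of rte_rcu_qsbr_init(): 'size' is passed back purely as the
 * extra safety check discussed above. Returns 0 on success. */
int qsbr_init(struct qsbr *v, uint32_t max_threads, size_t size)
{
	if (v == NULL || size < qsbr_size(max_threads))
		return -1;
	memset(v, 0, size);
	v->max_threads = max_threads;
	return 0;
}
```

With this split the caller is free to place the structure anywhere - `aligned_alloc`, rte_malloc, or embedded inside a larger object - which is exactly the flexibility argued for above.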
I made this change and added performance tests similar to liburcu. With the
dynamic memory allocation change, the performance of rte_rcu_qsbr_update comes
down by 42% - 45%, and that of rte_rcu_qsbr_check comes down by 133% on the
Arm platform. On x86 (E5-2660 v4 @ 2.00GHz), the results are mixed:
rte_rcu_qsbr_update comes down by 15%, but rte_rcu_qsbr_check improves.

On the Arm platform, the issue seems to be the address calculation that needs
to happen at run time. If I fix the size of the reg_thread_id array, I get
back/improve the performance on both Arm and x86. What this means is that we
will still have a max thread limitation, but it will be high - 512 (1 cache
line). We could make this 1024 (2 cache lines). However, the per-thread
counter data size will depend on the 'max threads' provided by the user. I
think this solution serves your requirement (though with an acceptable
constraint that should not affect the near future); please let me know what
you think.

These changes and the 3 variants of the implementation are present in RFC v3
[1], in case you want to run these tests.
1/5, 2/5 - same as RFC v2 + 1 bug fixed
3/5 - Addition of rte_rcu_qsbr_get_memsize. Memory size for the register
      thread bitmap array as well as the per-thread counter data is
      calculated based on the max_threads parameter.
4/5 - Test cases are modified to use the new API.
5/5 - Size of the register thread bitmap array is fixed to hold 512 thread
      IDs. However, the per-thread counter data is calculated based on the
      max_threads parameter.
If you do not want to run the tests, you can just look at 3/5 and 5/5.

[1] http://patchwork.dpdk.org/cover/50431/

> > This approach requires the user to call 3 APIs (including memory
> > allocation). These 3 can be abstracted in a rte_rcu_qsbr_alloc API; the
> > user then has to call just 1 API.
> >
> > > Konstantin
> > <snip>
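The hybrid scheme in patch 5/5 can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual patch: all names and layouts here are hypothetical. The key point is that the registration bitmap has a compile-time constant size (512 bits = 8 x uint64_t = one 64-byte cache line), so the reader-side address arithmetic involves no run-time shift/mask, while only the per-thread counter array scales with the user-supplied max_threads.

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE_SIZE   64
/* Fixed upper bound from patch 5/5: 512 thread IDs fit in one cache
 * line of bitmap (512 bits = 8 x 64-bit words). */
#define QSBR_MAX_THREADS  512
#define QSBR_BITMAP_ELEMS (QSBR_MAX_THREADS / 64)

/* Hypothetical per-reader counter record, one cache line each. */
struct qsbr_cnt {
	uint64_t cnt;
	uint8_t pad[CACHE_LINE_SIZE - sizeof(uint64_t)];
};

/* Hypothetical layout mirroring patch 5/5: the bitmap size is a
 * compile-time constant; only the counter array depends on max_threads. */
struct qsbr {
	uint64_t reg_thread_id[QSBR_BITMAP_ELEMS];	/* fixed: 1 cache line */
	uint64_t token;
	struct qsbr_cnt w[];				/* max_threads entries */
};

/* Sketch of rte_rcu_qsbr_get_memsize() under this scheme: the bitmap is
 * part of the fixed header, so only the counters are sized dynamically. */
size_t qsbr_get_memsize(uint32_t max_threads)
{
	if (max_threads == 0 || max_threads > QSBR_MAX_THREADS)
		return 0;	/* out of range for the fixed bitmap */
	return sizeof(struct qsbr) +
	       (size_t)max_threads * sizeof(struct qsbr_cnt);
}

/* Registering thread id t is a bit set at a statically computable
 * offset - no dependence on max_threads in the address calculation. */
void qsbr_thread_register(struct qsbr *v, uint32_t t)
{
	v->reg_thread_id[t / 64] |= UINT64_C(1) << (t % 64);
}
```

This is what trades the hard 512-thread cap for the restored update/check performance described above: the compiler can fold the bitmap offsets into constants, which a fully dynamic layout cannot offer.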