> > <snip>
> >
> > > > > > > > +/**
> > > > > > > > + * RTE thread Quiescent State structure.
> > > > > > > > + */
> > > > > > > > +struct rte_rcu_qsbr {
> > > > > > > > +       uint64_t reg_thread_id[RTE_QSBR_BIT_MAP_ELEMS] __rte_cache_aligned;
> > > > > > > > +       /**< Registered reader thread IDs - reader threads reporting
> > > > > > > > +        * on this QS variable represented in a bit map.
> > > > > > > > +        */
> > > > > > > > +
> > > > > > > > +       uint64_t token __rte_cache_aligned;
> > > > > > > > +       /**< Counter to allow for multiple simultaneous QS queries */
> > > > > > > > +
> > > > > > > > +       struct rte_rcu_qsbr_cnt w[RTE_RCU_MAX_THREADS] __rte_cache_aligned;
> > > > > > > > +       /**< QS counter for each reader thread, counts up to
> > > > > > > > +        * current value of token.
> > > > > > >
> > > > > > > As I understand you decided to stick with a neutral thread_id
> > > > > > > and let the user define what exactly thread_id is (lcore id,
> > > > > > > system thread id, something else)?
> > > > > > Yes, that is correct. I will reply to the other thread to
> > > > > > continue the discussion.
> > > > > >
> > > > > > > If so, can you probably get rid of the RTE_RCU_MAX_THREADS
> > > > > > > limitation?
> > > > > > I am not seeing this as a limitation. The user can change this
> > > > > > if required. Maybe I should change it as follows:
> > > > > > #ifndef RTE_RCU_MAX_THREADS
> > > > > > #define RTE_RCU_MAX_THREADS 128
> > > > > > #endif
> > > > >
> > > > > Yep, that's better, though it would still require the user to
> > > > > rebuild the code if they would like to increase the total number
> > > > > of threads supported.
> > > > Agree
> > > >
> > > > > Though it seems relatively simple to extend the current code to
> > > > > support a dynamic max thread num here (2 variable arrays plus a
> > > > > shift value plus a mask).
> > > > Agree, supporting a dynamic 'max thread num' is simple. But this
> > > > means memory needs to be allocated for the arrays. The API
> > > > 'rte_rcu_qsbr_init' has to take max thread num as a parameter. We
> > > > also have to introduce another API to free this memory. This will
> > > > become very similar to the alloc/free APIs I had in v1.
> > > > I hope I am following you well, please correct me if not.
> > >
> > > I think we can still leave alloc/free tasks to the user.
> > > We probably just need an extra function rte_rcu_qsbr_size(uint32_t
> > > max_threads) to help the user calculate the required size.
> > > rte_rcu_qsbr_init() might take 'size' as an additional parameter to
> > > perform checks.
> > The size is returned by an API provided by the library. Why does it
> > need to be validated again? If 'size' is required for
> > rte_rcu_qsbr_init, it could calculate it again.
> 
> Just as an extra safety check.
> I don't have a strong opinion here - if you think it is overkill, let's
> drop it.
> 
> 
> >
> > > Thought about something like that:
> > >
> > > size_t sz = rte_rcu_qsbr_size(max_threads);
> > > struct rte_rcu_qsbr *qsbr = alloc_aligned(CACHE_LINE, sz);
> > > rte_rcu_qsbr_init(qsbr, max_threads, sz);
> > > ...
> > >
> > Do you see any advantage in allowing the user to allocate the memory?
> So the user can choose where to allocate the memory (eal malloc, normal
> malloc, stack, something else).
> Again, the user might decide to make rcu part of some complex data
> structure - in that case they would probably want to allocate one big
> chunk of memory at once and then provide part of it for rcu.
> Or some other usage scenario that I can't predict.
> 

I made this change and added performance tests similar to liburcu. With the
dynamic memory allocation change, the performance of rte_rcu_qsbr_update
comes down by 42% - 45%, and that of rte_rcu_qsbr_check comes down by 133%,
on the Arm platform. On x86 (E5-2660 v4 @ 2.00GHz), the results are mixed:
rte_rcu_qsbr_update comes down by 15%, but rte_rcu_qsbr_check improves.

On the Arm platform, the issue seems to be the address calculation that
needs to happen at run time. If I make the size of the reg_thread_id array
fixed, I get the performance back (or better) on both Arm and x86. What
this means is that we will still have a max thread limitation, but it will
be high - 512 thread IDs (1 cache line of bits). We could make this 1024
(2 cache lines). However, the per-thread counter data size will still
depend on the 'max threads' value provided by the user. I think this
solution serves your requirement (though with an acceptable constraint
that should not matter in the near future); please let me know what you
think.
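
To illustrate, here is a rough sketch of the fixed-bitmap layout described
above. The reg_thread_id, token and w field names come from the snippet
quoted earlier; the macro names, the max_threads field, the
rte_rcu_qsbr_cnt definition and the use of a zero-length array are all
illustrative assumptions, not the actual patch:

    #include <stdint.h>
    #include <rte_common.h> /* __rte_cache_aligned */

    /* Assumed definition - one counter per reader thread. */
    struct rte_rcu_qsbr_cnt {
            uint64_t cnt; /* QS counter, compared against token */
    } __rte_cache_aligned;

    /* 512 thread IDs = 8 x 64 bits = 1 cache line worth of bit map. */
    #define RTE_QSBR_MAX_THREADS   512
    #define RTE_QSBR_BIT_MAP_ELEMS (RTE_QSBR_MAX_THREADS / 64)

    struct rte_rcu_qsbr {
            /* Fixed-size bit map - its offset is a compile-time
             * constant, avoiding the run-time address calculation that
             * hurt performance on Arm.
             */
            uint64_t reg_thread_id[RTE_QSBR_BIT_MAP_ELEMS]
                    __rte_cache_aligned;

            uint64_t token __rte_cache_aligned;

            uint32_t max_threads; /* value provided by the user at init */

            /* Per-thread counters follow the structure in the same
             * allocation; only their count depends on max_threads.
             */
            struct rte_rcu_qsbr_cnt w[0] __rte_cache_aligned;
    };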

These changes and the 3 variants of the implementation are present in RFC
v3 [1], in case you want to run these tests:
1/5, 2/5 - same as RFC v2 + 1 bug fixed
3/5 - Addition of rte_rcu_qsbr_get_memsize. The memory size for the
registered thread bitmap array as well as the per-thread counter data is
calculated based on the max_threads parameter (a sketch of this
calculation follows the list).
4/5 - Test cases are modified to use the new API
5/5 - The size of the registered thread bitmap array is fixed to hold 512
thread IDs. However, the per-thread counter data is still sized based on
the max_threads parameter.
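
As a sketch of that calculation (for the 5/5 layout above; the real
function would likely also round up to a cache line, which is ignored
here):

    size_t
    rte_rcu_qsbr_get_memsize(uint32_t max_threads)
    {
            /* Fixed-size part of the structure plus one counter entry
             * per reader thread.
             */
            return sizeof(struct rte_rcu_qsbr) +
                    sizeof(struct rte_rcu_qsbr_cnt) * max_threads;
    }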

If you do not want to run the tests, you can just look at 3/5 and 5/5.

[1] http://patchwork.dpdk.org/cover/50431/
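
For reference, the resulting calling sequence for the user would look
roughly like the following. rte_zmalloc is just one possible allocator
(any cache-line-aligned memory would do), and whether rte_rcu_qsbr_init
should also take the size for validation is still open per the discussion
above:

    #include <rte_malloc.h>

    uint32_t max_threads = 128;
    size_t sz = rte_rcu_qsbr_get_memsize(max_threads);

    /* The user picks the allocator - eal malloc, plain malloc, stack,
     * or a slice of a larger structure.
     */
    struct rte_rcu_qsbr *v = rte_zmalloc(NULL, sz, RTE_CACHE_LINE_SIZE);

    rte_rcu_qsbr_init(v, max_threads);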

> > This approach requires the user to call 3 APIs (including memory
> > allocation). These 3 can be abstracted into a rte_rcu_qsbr_alloc API;
> > then the user has to call just 1 API.
> >
> > > Konstantin
> > >

<snip>
