On 2022-10-03 15:33, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
>> Sent: Tuesday, September 6, 2022 5:14 PM
>> To: Van Haaren, Harry <harry.van.haa...@intel.com>
>> Cc: dev@dpdk.org; Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>;
>> Morten Brørup <m...@smartsharesystems.com>; nd <n...@arm.com>;
>> mattias.ronnblom <mattias.ronnb...@ericsson.com>
>> Subject: [PATCH 3/6] service: reduce average case service core overhead
>>
>> Optimize service loop so that the starting point is the lowest-indexed
>> service mapped to the lcore in question, and terminate the loop at the
>> highest-indexed service.
>>
>> While the worst case latency remains the same, this patch
>> significantly reduces the service framework overhead for the average
>> case. In particular, scenarios where an lcore only runs a single
>> service, or multiple services whose id values are close (e.g., three
>> services with ids 17, 18 and 22), show significant improvements.
>>
>> The worst case is where the lcore has two services mapped to it; one
>> with service id 0 and the other with id 63.
>
> I like the optimization - nice work. There is one caveat, that with the
> builtin_ctz() call, RTE_SERVICE_NUM_MAX *must* be 64 or lower.
> Today it is defined as 64, but we must ensure that this value cannot
> be changed "by accident" without explicit compilation failures and a
> comment explaining that fact.
>
> There are likely options around making it runtime-dynamic, but I don't
> think the complexity is justified: suggest we use compile-time check
> BUILD_BUG_ON() and error if it's > 64?
>

Sounds like a good idea. The limitation is not new though; the use of
a uint64_t-based bitmask limits the services to 64 already.

> Note in rte_service_component_register(), we *re-use* IDs when they
> become available, so we can have up to 64 active services at a time, but
> they can register/unregister more times than that. This is a very unlikely
> usage of the services API to continually register-unregister services.
>
> With the BUILD_BUG_ON() around the 64 MAX value with a comment:
> Acked-by: Harry van Haaren <harry.van.haa...@intel.com>
>

Thanks for your reviews Harry.

>
>> On a service lcore serving a single service, the service loop overhead
>> is reduced from ~190 core clock cycles to ~46. (On an Intel Cascade
>> Lake generation Xeon.) On weakly ordered CPUs, the gain is larger,
>> since the loop included load-acquire atomic operations.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
>> ---
>>  lib/eal/common/rte_service.c | 14 ++++++++++----
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
>> index 87df04e3ac..4cac866792 100644
>> --- a/lib/eal/common/rte_service.c
>> +++ b/lib/eal/common/rte_service.c
>> @@ -464,7 +464,6 @@ static int32_t
>>  service_runner_func(void *arg)
>>  {
>>  	RTE_SET_USED(arg);
>> -	uint32_t i;
>>  	const int lcore = rte_lcore_id();
>>  	struct core_state *cs = &lcore_states[lcore];
>>
>> @@ -478,10 +477,17 @@ service_runner_func(void *arg)
>>  			RUNSTATE_RUNNING) {
>>
>>  		const uint64_t service_mask = cs->service_mask;
>> +		uint8_t start_id;
>> +		uint8_t end_id;
>> +		uint8_t i;
>>
>> -		for (i = 0; i < RTE_SERVICE_NUM_MAX; i++) {
>> -			if (!service_registered(i))
>> -				continue;
>> +		if (service_mask == 0)
>> +			continue;
>> +
>> +		start_id = __builtin_ctzl(service_mask);
>> +		end_id = 64 - __builtin_clzl(service_mask);
>> +
>> +		for (i = start_id; i < end_id; i++) {
>>  			/* return value ignored as no change to code flow */
>>  			service_run(i, cs, service_mask, service_get(i), 1);
>>  		}
>
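
For illustration only, here is a minimal standalone sketch of the two ideas
discussed in this thread: the compile-time guard on the 64-service limit and
the derivation of the loop bounds from the mask. It is not the code that was
merged. SERVICE_NUM_MAX and service_mask_bounds() are hypothetical names made
up for this sketch, and C11 _Static_assert stands in for the BUILD_BUG_ON()
suggested above (inside DPDK one would presumably reach for RTE_BUILD_BUG_ON()
instead). The sketch assumes a GCC- or Clang-compatible compiler, and it uses
the "ll" variants of the builtins because unsigned long is only 32 bits wide
on some ABIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for RTE_SERVICE_NUM_MAX (64 today, per the thread). */
#define SERVICE_NUM_MAX 64

/*
 * The per-lcore service mask is a uint64_t, so the bit-scan based bounds
 * below only make sense for up to 64 services. Fail the build otherwise.
 */
_Static_assert(SERVICE_NUM_MAX <= 64,
	"service mask is a uint64_t; at most 64 services are supported");

/*
 * Hypothetical helper: compute the half-open id range [start_id, end_id)
 * covering every bit set in mask. Returns 0 if the mask is empty (the
 * builtins are undefined for a zero argument), 1 otherwise.
 */
static int
service_mask_bounds(uint64_t mask, uint8_t *start_id, uint8_t *end_id)
{
	if (mask == 0)
		return 0;

	/* Index of the lowest set bit. */
	*start_id = (uint8_t)__builtin_ctzll(mask);
	/* One past the index of the highest set bit. */
	*end_id = (uint8_t)(64 - __builtin_clzll(mask));

	return 1;
}
```

With the ids 17, 18 and 22 from the commit message, this gives start_id = 17
and end_id = 23, so the loop visits 6 ids instead of 64. With ids 0 and 63
the range is [0, 64), which is the unchanged worst case.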