From: Pavan Nikhilesh <pbhagavat...@marvell.com>

Each workslot is always bound to a specific lcore, so there is no
multi-core contention to cause cache thrashing. As a result, it is safe
to remove the WFE.
Also, in dual workslot dequeue, work will most likely be available on
the paired workslot, making WFE impractical.

Signed-off-by: Pavan Nikhilesh <pbhagavat...@marvell.com>
---
 Also, this in turn reduces the branch misses.

 Before:
   0 arm_spe_0/ts_enable=1,pct_enable=1,pa_enable=1,branch_filter=1,jitter=1,min_latency=0/
   0 dummy:u
   0 llc-miss
   0 tlb-miss
 853 branch-miss
   0 remote-access
   0 l1d-miss

 After:
   0 arm_spe_0/ts_enable=1,pct_enable=1,pa_enable=1,branch_filter=1,jitter=1,min_latency=0/
   0 dummy:u
   0 llc-miss
   0 tlb-miss
 250 branch-miss
   0 remote-access
   0 l1d-miss

 WFE Data:
 0x4C40 - WFI_WFE_WAIT_CYCLES - Number of cycles waiting at a WFI or WFE
 instruction.

 - WFE Cycles before the patch for Dual workslot

 #perf stat -C 20 -e r4C40 sleep 1

 Performance counter stats for 'CPU(s) 20':

         264 r4C40

 1.002494168 seconds time elapsed

 - WFE Cycles for single workslot

 #perf stat -C 20 -e r4C40 sleep 1

 Performance counter stats for 'CPU(s) 20':

 908,778,351 r4C40

 1.002598253 seconds time elapsed

 drivers/event/octeontx2/otx2_worker_dual.h | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/event/octeontx2/otx2_worker_dual.h b/drivers/event/octeontx2/otx2_worker_dual.h
index 5134e3d52..c88420eb4 100644
--- a/drivers/event/octeontx2/otx2_worker_dual.h
+++ b/drivers/event/octeontx2/otx2_worker_dual.h
@@ -29,11 +29,7 @@ otx2_ssogws_dual_get_work(struct otx2_ssogws_state *ws,
 	rte_prefetch_non_temporal(lookup_mem);
 #ifdef RTE_ARCH_ARM64
 	asm volatile(
-			"        ldr %[tag], [%[tag_loc]]    \n"
-			"        ldr %[wqp], [%[wqp_loc]]    \n"
-			"        tbz %[tag], 63, done%=      \n"
-			"        sevl                        \n"
-			"rty%=:  wfe                         \n"
+			"rty%=:                              \n"
 			"        ldr %[tag], [%[tag_loc]]    \n"
 			"        ldr %[wqp], [%[wqp_loc]]    \n"
 			"        tbnz %[tag], 63, rty%=      \n"
-- 
2.17.1