Hi lads,

Looking at the rte_ring move_head functions, I noticed that each of them uses a slightly different approach to guarantee the desired order of memory accesses:

1. rte_ring_generic_pvt.h
=========================

pseudo-c-code                        // related armv8 instructions
--------------------                 --------------------------------------
head.load()                          // ldr [head]
rte_smp_rmb()                        // dmb ishld
opposite_tail.load()                 // ldr [opposite_tail]
...
rte_atomic32_cmpset(head, ...)       // ldrex [head]; ... stlex [head]

2. rte_ring_c11_pvt.h
=====================

pseudo-c-code                        // related armv8 instructions
--------------------                 --------------------------------------
head.atomic_load(relaxed)            // ldr [head]
atomic_thread_fence(acquire)         // dmb ish
opposite_tail.atomic_load(acquire)   // lda [opposite_tail]
...
head.atomic_cas(..., relaxed)        // ldrex [head]; ... strex [head]

3. rte_ring_hts_elem_pvt.h
==========================

pseudo-c-code                        // related armv8 instructions
--------------------                 --------------------------------------
head.atomic_load(acquire)            // lda [head]
opposite_tail.load()                 // ldr [opposite_tail]
...
head.atomic_cas(..., acquire)        // ldaex [head]; ... strex [head]

The questions that arose from these observations:

a) Are all 3 approaches equivalent in terms of functionality?

b) If yes, is there any difference in terms of performance between:
   "ldr; dmb; ldr;" vs "lda; ldr;" ?

c) Comparing 1) and 2) above, the combination of
   ldr [head]; dmb; lda [opposite_tail];
   looks like overkill to me. Wouldn't just
   ldr [head]; dmb; ldr [opposite_tail];
   be sufficient here?
   I.e., for reading the tail value we don't need to use load(acquire).
   Or am I missing something obvious here?

Thanks
Konstantin

For convenience, I created a godbolt page with all these variants:
https://godbolt.org/z/Yjj73b8xa

#1 - __rte_ring_headtail_move_head()
#2 - __rte_ring_headtail_move_head_c11_v1
#3 - __rte_ring_headtail_move_head_c11_v2
#2 with c) - __rte_ring_headtail_move_head_c11_v3
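
In case the link goes stale, here is a minimal C11 sketch of just the two loads being compared (the CAS loop is left out). It is not the DPDK code: struct ht, struct snapshot and the load_* function names are made up for illustration, and I am assuming that the plain opposite_tail.load() in 3) can be modelled as a relaxed atomic load. Variant 1) is not shown because rte_smp_rmb() has no direct C11 equivalent.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one side of a ring
 * (NOT the real struct rte_ring_headtail). */
struct ht {
	_Atomic uint32_t head;
	_Atomic uint32_t tail;
};

/* Values observed by the two loads under discussion. */
struct snapshot {
	uint32_t head;
	uint32_t opposite_tail;
};

/* 2) relaxed head load, acquire fence, acquire tail load. */
static struct snapshot
load_c11(struct ht *own, struct ht *opposite)
{
	struct snapshot s;

	assert(own != NULL && opposite != NULL);
	s.head = atomic_load_explicit(&own->head, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	s.opposite_tail = atomic_load_explicit(&opposite->tail,
			memory_order_acquire);
	return s;
}

/* 2) with c): same fence, but the tail load is relaxed. */
static struct snapshot
load_c11_relaxed_tail(struct ht *own, struct ht *opposite)
{
	struct snapshot s;

	assert(own != NULL && opposite != NULL);
	s.head = atomic_load_explicit(&own->head, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	s.opposite_tail = atomic_load_explicit(&opposite->tail,
			memory_order_relaxed);
	return s;
}

/* 3) acquire head load, no fence, tail load modelled as relaxed
 * (assumption: stands in for the plain load in the HTS code). */
static struct snapshot
load_hts_like(struct ht *own, struct ht *opposite)
{
	struct snapshot s;

	assert(own != NULL && opposite != NULL);
	s.head = atomic_load_explicit(&own->head, memory_order_acquire);
	s.opposite_tail = atomic_load_explicit(&opposite->tail,
			memory_order_relaxed);
	return s;
}
```

Single-threaded, all three return the same values; they differ only in which reorderings the compiler and CPU are allowed to perform, which is exactly what questions a) to c) are about.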