<snip> > > > > > > > > > + > > > > > > > > > +static __rte_always_inline void > > > > > > > > > +enqueue_elems_128(struct rte_ring *r, uint32_t > > > > > > > > > +prod_head, const void *obj_table, uint32_t n) { > > > > > > > > > +unsigned int i; const uint32_t size = > > > > > > > > > +r->size; uint32_t idx = prod_head & r->mask; > > > > > > > > > +r->__uint128_t > > > > > > > > > +*ring = (__uint128_t *)&r[1]; const __uint128_t *obj = > > > > > > > > > +(const __uint128_t *)obj_table; if (likely(idx + n < > > > > > > > > > +size)) { for (i = 0; i < (n & ~0x1); i += 2, idx += 2) > > > > > > > > > +{ ring[idx] = obj[i]; ring[idx + 1] = obj[i + 1]; > > > > > > > > > > > > > > > > > > > > > > > > AFAIK, that implies 16B aligned obj_table... > > > > > > > > Would it always be the case? > > > > > > > I am not sure from the compiler perspective. > > > > > > > At least on Arm architecture, unaligned access (address that > > > > > > > is accessed is not aligned to the size of the data element > > > > > > > being > > > > > > > accessed) will result in faults or require additional cycles. > > > > > > > So, aligning on > > > > > 16B should be fine. > > > > > > Further, I would be changing this to use 'rte_int128_t' as > > > > > > '__uint128_t' is > > > > > not defined on 32b systems. > > > > > > > > > > What I am trying to say: with this code we imply new requirement > > > > > for elems > > > > The only existing use case in DPDK for 16B is the event ring. The > > > > event ring > > > already does similar kind of copy (using 'struct rte_event'). > > > > So, there is no change in expectations for event ring. > > > > For future code, I think this expectation should be fine since it > > > > allows for > > > optimal code. > > > > > > > > > in the ring: when sizeof(elem)==16 it's alignment also has to be > > > > > at least > > > 16. > > > > > Which from my perspective is not ideal. > > > > Any reasoning? > > > > > > New implicit requirement and inconsistency. > > > Code like that: > > > > > > struct ring_elem {uint64_t a, b;}; > > > .... > > > struct ring_elem elem; > > > rte_ring_dequeue_elem(ring, &elem, sizeof(elem)); > > > > > > might cause a crash. > > The alignment here is 8B. Assuming that instructions generated will > > require 16B alignment, it will result in a crash, if configured to generate > exception. > > But, these instructions are not atomic instructions. At least on > > aarch64, unaligned access will not result in an exception for non-atomic > loads/stores. I believe it is the same behavior for x86 as well. > > On IA, there are 2 types of 16B load/store instructions: aligned and > unaligned. > Aligned are a bit faster, but will cause an exception if used on non 16B > aligned > address. > As you using uint128_t * compiler will assume that both src and dst are 16B > aligned and might generate code with aligned instructions. Ok, looking at few articles, I read that if the address is aligned, the unaligned instructions do not incur the penalty. Is this understanding correct?
I see 2 solutions here: 1) We can switch this copy to use uint32_t pointer. It would still allow the compiler to generate (unaligned) instructions for up to 256b load/store. The 2 multiplications (to normalize the index and the size of copy) can use shifts. This should make it safer. If one wants performance, they can align the obj table to 16B (the ring itself is already aligned on the cache line boundary). 2) Considering that performance is paramount, we could document that the obj table needs to be aligned on 16B boundary. This would affect event dev (if we go ahead with replacing the event ring implementation) significantly. Note that we have to do the same thing for 64b elements as well. > > > > > > While exactly the same code with: > > > > > > struct ring_elem {uint64_t a, b, c;}; OR struct ring_elem {uint64_t > > > a, b, c, d;}; > > > > > > will work ok. > > The alignment for these structures is still 8B. Are you saying this > > will work because these will be copied using pointer to uint32_t (whose > alignment is 4B)? > > Yes, as we doing uint32_t copies, compiler can't assume the data will be 16B > aligned and will use unaligned instructions. > > > > > > > > > > > > > > > Note that for elem sizes > 16 (24, 32), there is no such constraint. > > > > The rest of them need to be aligned on 4B boundary. However, this > > > > should > > > not affect the existing code. > > > > The code for 8B and 16B is kept as is to ensure the performance is > > > > not > > > affected for the existing code. > > <snip>