<snip>
> > > > > > > > > > > +
> > > > > > > > > > > +static __rte_always_inline void
> > > > > > > > > > > +enqueue_elems_128(struct rte_ring *r, uint32_t prod_head,
> > > > > > > > > > > +		const void *obj_table, uint32_t n)
> > > > > > > > > > > +{
> > > > > > > > > > > +	unsigned int i;
> > > > > > > > > > > +	const uint32_t size = r->size;
> > > > > > > > > > > +	uint32_t idx = prod_head & r->mask;
> > > > > > > > > > > +	__uint128_t *ring = (__uint128_t *)&r[1];
> > > > > > > > > > > +	const __uint128_t *obj = (const __uint128_t *)obj_table;
> > > > > > > > > > > +	if (likely(idx + n < size)) {
> > > > > > > > > > > +		for (i = 0; i < (n & ~0x1); i += 2, idx += 2) {
> > > > > > > > > > > +			ring[idx] = obj[i];
> > > > > > > > > > > +			ring[idx + 1] = obj[i + 1];
> > > > > > > > > >
> > > > > > > > > > AFAIK, that implies 16B aligned obj_table...
> > > > > > > > > > Would it always be the case?
> > > > > > > > > I am not sure from the compiler perspective.
> > > > > > > > > At least on the Arm architecture, an unaligned access (one
> > > > > > > > > whose address is not aligned to the size of the data element
> > > > > > > > > being accessed) will result in faults or require additional
> > > > > > > > > cycles. So, aligning on 16B should be fine.
> > > > > > > > Further, I would be changing this to use 'rte_int128_t', as
> > > > > > > > '__uint128_t' is not defined on 32b systems.
> > > > > > >
> > > > > > > What I am trying to say: with this code we imply a new
> > > > > > > requirement for elems
> > > > > > The only existing use case in DPDK for 16B is the event ring, and
> > > > > > the event ring already does a similar kind of copy (using 'struct
> > > > > > rte_event'). So, there is no change in expectations for the event
> > > > > > ring.
> > > > > > For future code, I think this expectation should be fine since it
> > > > > > allows for optimal code.
> > > > > >
> > > > > > > in the ring: when sizeof(elem) == 16, its alignment also has to
> > > > > > > be at least 16.
> > > > > > > Which from my perspective is not ideal.
> > > > > > Any reasoning?
> > > > >
> > > > > New implicit requirement and inconsistency.
> > > > > Code like that:
> > > > >
> > > > > struct ring_elem {uint64_t a, b;};
> > > > > ....
> > > > > struct ring_elem elem;
> > > > > rte_ring_dequeue_elem(ring, &elem, sizeof(elem));
> > > > >
> > > > > might cause a crash.
> > > > The alignment here is 8B. Assuming the generated instructions require
> > > > 16B alignment, it will result in a crash if configured to generate an
> > > > exception.
> > > > But these are not atomic instructions. At least on aarch64, an
> > > > unaligned access will not result in an exception for non-atomic
> > > > loads/stores. I believe it is the same behavior for x86 as well.
> > >
> > > On IA, there are 2 types of 16B load/store instructions: aligned and
> > > unaligned. The aligned ones are a bit faster, but will cause an
> > > exception if used on a non-16B-aligned address.
> > > As you are using uint128_t *, the compiler will assume that both src
> > > and dst are 16B aligned and might generate code with aligned
> > > instructions.
> > Ok, looking at a few articles, I read that if the address is aligned,
> > the unaligned instructions do not incur a penalty. Is this
> > understanding correct?
>
> Yes, from my experience the difference is negligible.
>
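To make the alignment hazard concrete, here is a minimal standalone
example (the struct is taken from the snippet quoted above; the printed
values assume a typical LP64 ABI):

#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

struct ring_elem { uint64_t a, b; };	/* 16 bytes of payload */

int main(void)
{
	/* sizeof is 16 but alignof is only 8: the ABI does not promise
	 * 16B alignment for this struct. Casting its address to
	 * __uint128_t * therefore over-promises alignment, and the
	 * compiler is then allowed to emit 16B-aligned loads/stores
	 * (e.g. movaps/movdqa on IA) that fault on an 8B-aligned
	 * address.
	 */
	printf("sizeof = %zu, alignof = %zu\n",
	       sizeof(struct ring_elem), alignof(struct ring_elem));
	return 0;
}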
> > I see 2 solutions here:
> > 1) We can switch this copy to use a uint32_t pointer. It would still
> > allow the compiler to generate (unaligned) instructions for up to 256b
> > load/store. The 2 multiplications (to normalize the index and the size
> > of the copy) can use shifts. This should make it safer. If one wants
> > performance, they can align the obj table to 16B (the ring itself is
> > already aligned on the cache line boundary).
>
> Sounds good to me.
>
> > 2) Considering that performance is paramount, we could document that
> > the obj table needs to be aligned on a 16B boundary. This would affect
> > eventdev significantly (if we go ahead with replacing the event ring
> > implementation).
>
> I don't think the perf difference would be significant enough to
> justify such a constraint.
> I am in favor of #1.

Ok, will go with this.
Is it ok if I squash the intermediate commits for the test cases? I can
keep one commit for the functional tests and another for the
performance tests.
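To make #1 concrete, here is a minimal sketch of the 16B path going
through a uint32_t pointer (a sketch only: the helper name, the elided
wrap-around handling and the exact loop shape are my illustration, not
the final patch):

static __rte_always_inline void
enqueue_elems_128_u32(struct rte_ring *r, uint32_t prod_head,
		const void *obj_table, uint32_t n)
{
	unsigned int i;
	/* Walking the data as uint32_t means the compiler may only
	 * assume 4B alignment, so a vectorized loop has to use
	 * unaligned loads/stores.
	 */
	uint32_t *ring = (uint32_t *)&r[1];
	const uint32_t *obj = (const uint32_t *)obj_table;
	/* Normalize to 32-bit words with shifts instead of multiplies:
	 * one 16B element == 4 words.
	 */
	uint32_t nr_idx = (prod_head & r->mask) << 2;
	const uint32_t nr_num = n << 2;
	const uint32_t nr_size = r->size << 2;

	if (likely(nr_idx + nr_num < nr_size)) {
		for (i = 0; i < nr_num; i++, nr_idx++)
			ring[nr_idx] = obj[i];
	} else {
		/* wrap-around: copy in two segments (elided) */
	}
}

Per the discussion above, the unaligned instructions cost about the
same as aligned ones whenever the addresses happen to be aligned, so a
caller that does align obj_table to 16B loses little or nothing.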
> > Note that we have to do the same thing for the 64b elements as well.
>
> I don't mind having one unified copy procedure, which would always use
> 32-bit elems, but AFAIK, on IA there is no such limitation for 64-bit
> load/stores.
>
> > > > > While exactly the same code with:
> > > > >
> > > > > struct ring_elem {uint64_t a, b, c;};
> > > > > OR
> > > > > struct ring_elem {uint64_t a, b, c, d;};
> > > > >
> > > > > will work ok.
> > > > The alignment for these structures is still 8B. Are you saying this
> > > > will work because these will be copied using a pointer to uint32_t
> > > > (whose alignment is 4B)?
> > >
> > > Yes, as we are doing uint32_t copies, the compiler can't assume the
> > > data will be 16B aligned and will use unaligned instructions.
> > >
> > > > > > > Note that for elem sizes > 16 (24, 32), there is no such
> > > > > > > constraint.
> > > > > > The rest of them need to be aligned on a 4B boundary. However,
> > > > > > this should not affect the existing code.
> > > > > > The code for 8B and 16B is kept as is to ensure the performance
> > > > > > of the existing code is not affected.
<snip>
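For the unified procedure mentioned above, a size-generic variant could
look roughly like this (names and shape are assumptions; esize is the
element size in bytes, a multiple of 4 per the 4B-boundary constraint
discussed above; wrap-around is again elided):

static __rte_always_inline void
enqueue_elems_u32(struct rte_ring *r, uint32_t prod_head,
		const void *obj_table, uint32_t esize, uint32_t num)
{
	unsigned int i;
	/* Number of 32-bit words per element; for power-of-two esize
	 * these multiplications reduce to the shifts mentioned earlier.
	 */
	const uint32_t scale = esize / sizeof(uint32_t);
	const uint32_t nr_num = num * scale;
	const uint32_t nr_size = r->size * scale;
	uint32_t nr_idx = (prod_head & r->mask) * scale;
	uint32_t *ring = (uint32_t *)&r[1];
	const uint32_t *obj = (const uint32_t *)obj_table;

	if (likely(nr_idx + nr_num < nr_size)) {
		for (i = 0; i < nr_num; i++, nr_idx++)
			ring[nr_idx] = obj[i];
	} else {
		/* wrap-around: split the copy in two (elided) */
	}
}

This keeps the obj_table requirement at 4B alignment for every element
size, which matches the expectations stated above for the 24B and 32B
cases.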