<snip> + David Christensen for Power architecture
> > > > It would mean extra work for the users.
> > > > >
> > > > > 2. A lot of code duplication with these 3 copies of
> > > > > ENQUEUE/DEQUEUE macros.
> > > > >
> > > > > Looking at ENQUEUE/DEQUEUE macros, I can see that main loop
> > > > > always does 32B copy per iteration.
> > > > Yes, I tried to keep it the same as the existing one (originally,
> > > > I guess the intention was to allow for 256b vector instructions to be
> > > > generated)
> > > > >
> > > > > So wonder can we make a generic function that would do 32B copy
> > > > > per iteration in a main loop, and copy tail by 4B chunks?
> > > > > That would avoid copy duplication and will allow user to have
> > > > > any elem size (multiple of 4B) he wants.
> > > > > Something like that (note didn't test it, just a rough idea):
> > > > >
> > > > > static inline void
> > > > > copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num,
> > > > >         uint32_t esize)
> > > > > {
> > > > >     uint32_t i, sz;
> > > > >
> > > > >     sz = (num * esize) / sizeof(uint32_t);
> > > > If 'num' is a compile time constant, 'sz' will be a compile time constant.
> > > > Otherwise, this will result in a multiplication operation.
> > >
> > > Not always.
> > > If esize is compile time constant, then for esize as power of 2
> > > (4, 8, 16, ...), it would be just one shift.
> > > For other constant values it could be a 'mul' or in many cases just
> > > 2 shifts plus 'add' (if compiler is smart enough).
> > > I.e., for a 24B elem it would be either num * 6 or (num << 2) + (num << 1).
> > With num * 15 it has to be (num << 3) + (num << 2) + (num << 1) + num
> > Not sure if the compiler will do this.
>
> For 15, it can be just (num << 4) - num
>
> > > I suppose for non-power of 2 elems it might be ok to get such small perf hit.
> > Agree, should be ok not to focus on right now.
> >
> > > > I have tried to avoid the multiplication operation and try to use
> > > > shift and mask operations (just like how the rest of the ring code does).
> > > > >
> > > > >     for (i = 0; i < (sz & ~7); i += 8)
> > > > >         memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));
> > > > I had used memcpy to start with (for the entire copy operation);
> > > > performance is not the same for 64b elements when compared with the
> > > > existing ring APIs (some cases more and some cases less).
> > >
> > > I remember that from one of your previous mails, that's why here I
> > > suggest to use in a loop memcpy() with fixed size.
> > > That way for each iteration the compiler will replace memcpy() with
> > > instructions to copy 32B in a way it thinks is optimal (same as for the
> > > original macro, I think).
> > I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the results are as
> > follows. The numbers in brackets are with the code on master.
> > gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
> >
> > RTE>>ring_perf_elem_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 5
> > MP/MC single enq/dequeue: 40 (35)
> > SP/SC burst enq/dequeue (size: 8): 2
> > MP/MC burst enq/dequeue (size: 8): 6
> > SP/SC burst enq/dequeue (size: 32): 1 (2)
> > MP/MC burst enq/dequeue (size: 32): 2
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 2.11
> > MC empty dequeue: 1.41 (2.11)
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
> > MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
> > SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
> > MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
> > MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
> > SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
> > MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)
> >
> > ### Testing using two NUMA nodes ###
> > SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
> > MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
> > SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
> > MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)
> >
> > On one of the Arm platforms:
> > MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the rest are ok)
>
> So it shows better numbers for one core, but worse on 2, right?
>
> > On another Arm platform, all numbers are the same or slightly better.
> >
> > I can post the patch with this change if you want to run some benchmarks on
> > your platform.
>
> Sure, please do.
> I'll try to run on my boxes.

Sent v5, please check. Other platform owners should run this as well.

> > I have not used the same code you have suggested, instead I have used the
> > same logic in a single macro with memcpy.
> >
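
For reference, a self-contained version of the generic copy being discussed could look roughly like the untested sketch below: a main loop doing 32B per iteration via fixed-size memcpy(), followed by a 4B tail loop. The tail loop is an assumption (only the main loop was shown in the quoted snippet), and the function name simply follows that snippet; v5 uses the same logic inside a macro rather than this exact function.

#include <stdint.h>
#include <string.h>

/*
 * Untested sketch: copy 'num' elements of 'esize' bytes each,
 * where esize is assumed to be a multiple of 4B.
 */
static inline void
copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num,
        uint32_t esize)
{
    uint32_t i, sz;

    /* total copy size expressed in 32-bit words */
    sz = (num * esize) / sizeof(uint32_t);

    /*
     * Main loop: 32B (8 x 4B) per iteration. The fixed-size memcpy()
     * lets the compiler emit whatever wide loads/stores it prefers.
     */
    for (i = 0; i < (sz & ~7); i += 8)
        memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));

    /* Tail: copy the remaining words in 4B chunks. */
    for (; i < sz; i++)
        du32[i] = su32[i];
}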