<snip>

+ David Christensen for Power architecture

> > >
> > > > It
> > > > would mean extra work for the users.
> > > >
> > > > > 2. A lot of code duplication with these 3 copies of
> > > > > ENQUEUE/DEQUEUE macros.
> > > > >
> > > > > Looking at ENQUEUE/DEQUEUE macros, I can see that main loop
> > > > > always does 32B copy per iteration.
> > > > Yes, I tried to keep it the same as the existing one (originally, I
> > > > guess the intention was to allow 256b vector instructions to be
> > > > generated).
> > > > > So wonder can we make a generic function that would do 32B copy
> > > > > per iteration in a main loop, and copy tail  by 4B chunks?
> > > > > That would avoid copy duplication and will allow user to have
> > > > > any elem size (multiple of 4B) he wants.
> > > > > Something like that (note didn't test it, just a rough idea):
> > > > >
> > > > > static inline void
> > > > > copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num,
> > > > >         uint32_t esize)
> > > > > {
> > > > >         uint32_t i, sz;
> > > > >
> > > > >         sz = (num * esize) / sizeof(uint32_t);
> > > > If 'num' is a compile time constant, 'sz' will be a compile time
> > > > constant. Otherwise, this will result in a multiplication operation.
> > >
> > > Not always.
> > > If esize is compile time constant, then for esize as power of 2
> > > (4,8,16,...), it would be just one shift.
> > > For other constant values it could be a 'mul' or in many cases just
> > > 2 shifts plus 'add' (if compiler is smart enough).
> > > I.e., let's say for a 24B elem it would be either num * 6 or
> > > (num << 2) + (num << 1).
> > With num * 15 it has to be (num << 3) + (num << 2) + (num << 1) + num
> > Not sure if the compiler will do this.
> 
> For 15, it can be just (num << 4) - num
> 
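The shift identities discussed above can be sanity-checked with a small C snippet (editor's sketch; the function names are illustrative, not from the patch):

```c
#include <stdint.h>

/* Shift-and-add equivalents of the multiplications discussed above.
 * A compiler doing strength reduction may emit these instead of 'mul'
 * when the multiplier is a compile-time constant. */
static inline uint32_t
mul6_shifts(uint32_t num)
{
	/* num * 6 == (num << 2) + (num << 1) */
	return (num << 2) + (num << 1);
}

static inline uint32_t
mul15_shifts(uint32_t num)
{
	/* num * 15 == (num << 4) - num */
	return (num << 4) - num;
}
```

Whether the compiler actually chooses shifts over `mul` depends on the target's multiplier latency, but the identities themselves always hold.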
> >
> > > I suppose for non-power-of-2 elems it might be ok to take such a small
> > > perf hit.
> > Agree, should be ok not to focus on right now.
> >
> > >
> > > > I have tried to avoid the multiplication operation and tried to use
> > > > shift and mask operations (just like the rest of the ring code does).
> > > >
> > > > >
> > > > >         for (i = 0; i < (sz & ~7); i += 8)
> > > > >                 memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));
> > > > I had used memcpy to start with (for the entire copy operation);
> > > > performance is not the same for 64b elements when compared with the
> > > > existing ring APIs (some cases more and some cases less).
> > >
> > > I remember that from one of your previous mails; that's why here I
> > > suggest using memcpy() with a fixed size in a loop.
> > > That way, for each iteration the compiler will replace memcpy() with
> > > instructions to copy 32B in a way it thinks is optimal (same as for the
> > > original macro, I think).
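Putting the quoted fragments together, the generic copy under discussion looks roughly like this (editor's sketch; the tail loop is an assumption filled in from the "copy tail by 4B chunks" suggestion above, since the original mail only showed the main loop):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the generic element copy discussed in this thread:
 * 32B per iteration in the main loop (8 x uint32_t), with any
 * remaining 4B words copied one at a time in the tail.
 * The fixed-size memcpy() lets the compiler expand each iteration
 * into whatever 32B copy sequence it considers optimal. */
static inline void
copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num,
		uint32_t esize)
{
	uint32_t i, sz;

	/* total copy size in 4B words; esize is assumed to be a
	 * multiple of 4B, as in the original proposal */
	sz = (num * esize) / sizeof(uint32_t);

	/* main loop: 32B chunks */
	for (i = 0; i < (sz & ~7); i += 8)
		memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));

	/* tail: leftover 4B words (assumed, per the suggestion above) */
	for (; i < sz; i++)
		du32[i] = su32[i];
}
```

This avoids duplicating the ENQUEUE/DEQUEUE copy loops per element size while still letting the compiler vectorize the 32B main loop.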
> > I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the results are
> > as follows. The numbers in brackets are with the code on master.
> > gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
> >
> > RTE>>ring_perf_elem_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 5
> > MP/MC single enq/dequeue: 40 (35)
> > SP/SC burst enq/dequeue (size: 8): 2
> > MP/MC burst enq/dequeue (size: 8): 6
> > SP/SC burst enq/dequeue (size: 32): 1 (2)
> > MP/MC burst enq/dequeue (size: 32): 2
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 2.11
> > MC empty dequeue: 1.41 (2.11)
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
> > MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
> > SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
> > MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
> > MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
> > SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
> > MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)
> >
> > ### Testing using two NUMA nodes ###
> > SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
> > MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
> > SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
> > MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)
> >
> > On one of the Arm platforms:
> > MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the rest are ok)
> 
> So it shows better numbers for one core, but worse on 2, right?
> 
> 
> > On another Arm platform, all numbers are same or slightly better.
> >
> > I can post the patch with this change if you want to run some benchmarks
> > on your platform.
> 
> Sure, please do.
> I'll try to run on my boxes.
Sent v5, please check. Other platform owners should run this as well.

> 
> > I have not used the exact code you suggested; instead, I have used the
> > same logic in a single macro with memcpy.
> >
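For illustration, keeping that logic in one macro could look something like the sketch below (editor's sketch; the macro name and parameters are hypothetical and not the actual v5 patch code):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical single copy macro combining the 32B main loop with a 4B
 * tail loop, usable for both enqueue and dequeue directions by swapping
 * the source/destination arguments. Not the actual patch code. */
#define COPY_ELEMS(du32, su32, num, esize) do {                        \
	uint32_t _i, _sz = ((num) * (esize)) / sizeof(uint32_t);       \
	/* main loop: fixed-size memcpy expands to a 32B copy */       \
	for (_i = 0; _i < (_sz & ~7u); _i += 8)                        \
		memcpy((du32) + _i, (su32) + _i,                       \
			8 * sizeof(uint32_t));                         \
	/* tail: leftover 4B words */                                  \
	for (; _i < _sz; _i++)                                         \
		(du32)[_i] = (su32)[_i];                               \
} while (0)
```

A single macro avoids the three near-identical ENQUEUE/DEQUEUE copies complained about earlier in the thread, while supporting any element size that is a multiple of 4B.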
