On Fri, May 06, 2022 at 08:41:12AM -0700, Stephen Hemminger wrote:
> On Fri, 6 May 2022 00:24:34 -0700
> Tyler Retzlaff <roret...@linux.microsoft.com> wrote:
> 
> > On Thu, May 05, 2022 at 10:59:32PM +0000, Honnappa Nagarahalli wrote:
> > > Thanks Stephen. Do you see any performance difference with this change?  
> > 
> > as a matter of due diligence i think a comparison should be made just
> > to be confident nothing is regressing.
> > 
> > i support this change in principal since it is generally accepted best
> > practice to not force inlining since it can remove more valuable
> > optimizations that the compiler may make that the human can't see.
> > the optimizations may vary depending on compiler implementation.
> > 
> > force inlining should be used as a targeted measure rather than blanket
> > on every function and when in use probably needs to be periodically
> > reviewed and potentially removed as the code / compiler evolves.
> > 
> > also one other consideration is the impact of a particular compiler's
> > force inlining intrinsic/builtin is that it may permit inlining of
> > functions when not declared in a header. i.e. a function from one
> > library may be able to be inlined to another binary as a link time
> > optimization. although everything here is in a header so it's a bit
> > moot.
> > 
> > i'd like to see this change go in if possible.
> > 
> > thanks
> > 
> 
> 
> Some quick numbers from Gcc 10.3 and 2.7G AMD and ring_perf_autotest
> 
> Looks like always inline is faster on second run but just inline is
> slightly faster on first run. Maybe the icache gets loaded for second run,
> but on first pass the smaller code size helps.
>
Interesting, I haven't observed that effect.

The main trouble with the unit tests is that there are so many possible
numbers to compare. We probably need to focus on a few to make sense of it
all. Running on an "Intel(R) Xeon(R) Gold 6330N CPU @ 2.20GHz" with Ubuntu
20.04 (GCC 9.4), I scanned through the numbers looking for signicant
percentage differences. This one caught my eye due to the %age difference:

Basline value:
sudo ./build-baseline/app/test/dpdk-test -- ring_perf_autotest | grep "SP/SC: 
single:"
...
legacy APIs: SP/SC: single: 9.08
elem APIs: element size 16B: SP/SC: single: 11.13

With patch:
sudo ./build/app/test/dpdk-test -- ring_perf_autotest | grep "SP/SC: single:"
...
legacy APIs: SP/SC: single: 15.81
elem APIs: element size 16B: SP/SC: single: 21.14

So the SP/SC element enqueue cost went from 9-11 cycles to 15-21 cycles.
Percentage-wise, this seems a lot, though in absolute terms it may not be.
Therefore, I think we'll need a decent amount of varied testing before
taking this patch.

/Bruce

Reply via email to