On Fri, May 06, 2022 at 08:41:12AM -0700, Stephen Hemminger wrote: > On Fri, 6 May 2022 00:24:34 -0700 > Tyler Retzlaff <roret...@linux.microsoft.com> wrote: > > > On Thu, May 05, 2022 at 10:59:32PM +0000, Honnappa Nagarahalli wrote: > > > Thanks Stephen. Do you see any performance difference with this change? > > > > as a matter of due diligence i think a comparison should be made just > > to be confident nothing is regressing. > > > > i support this change in principal since it is generally accepted best > > practice to not force inlining since it can remove more valuable > > optimizations that the compiler may make that the human can't see. > > the optimizations may vary depending on compiler implementation. > > > > force inlining should be used as a targeted measure rather than blanket > > on every function and when in use probably needs to be periodically > > reviewed and potentially removed as the code / compiler evolves. > > > > also one other consideration is the impact of a particular compiler's > > force inlining intrinsic/builtin is that it may permit inlining of > > functions when not declared in a header. i.e. a function from one > > library may be able to be inlined to another binary as a link time > > optimization. although everything here is in a header so it's a bit > > moot. > > > > i'd like to see this change go in if possible. > > > > thanks > > > > > Some quick numbers from Gcc 10.3 and 2.7G AMD and ring_perf_autotest > > Looks like always inline is faster on second run but just inline is > slightly faster on first run. Maybe the icache gets loaded for second run, > but on first pass the smaller code size helps. > Interesting, I haven't observed that effect.
The main trouble with the unit tests is that there are so many possible numbers to compare. We probably need to focus on a few to make sense of it all. Running on an "Intel(R) Xeon(R) Gold 6330N CPU @ 2.20GHz" with Ubuntu 20.04 (GCC 9.4), I scanned through the numbers looking for signicant percentage differences. This one caught my eye due to the %age difference: Basline value: sudo ./build-baseline/app/test/dpdk-test -- ring_perf_autotest | grep "SP/SC: single:" ... legacy APIs: SP/SC: single: 9.08 elem APIs: element size 16B: SP/SC: single: 11.13 With patch: sudo ./build/app/test/dpdk-test -- ring_perf_autotest | grep "SP/SC: single:" ... legacy APIs: SP/SC: single: 15.81 elem APIs: element size 16B: SP/SC: single: 21.14 So the SP/SC element enqueue cost went from 9-11 cycles to 15-21 cycles. Percentage-wise, this seems a lot, though in absolute terms it may not be. Therefore, I think we'll need a decent amount of varied testing before taking this patch. /Bruce