On 06/05/2022 17:39, Bruce Richardson wrote:
On Fri, May 06, 2022 at 09:33:41AM -0700, Stephen Hemminger wrote:
On Fri, 6 May 2022 16:28:41 +0100
Bruce Richardson <bruce.richard...@intel.com> wrote:

On Fri, May 06, 2022 at 03:12:32PM +0000, Honnappa Nagarahalli wrote:
<snip>

On Thu, May 05, 2022 at 10:59:32PM +0000, Honnappa Nagarahalli wrote:
Thanks Stephen. Do you see any performance difference with this change?

As a matter of due diligence, I think a comparison should be made, just to be
confident nothing is regressing.

I support this change in principle, since it is generally accepted best
practice not to force inlining: forcing it can suppress more valuable
optimizations that the compiler could make but the human can't see, and
those optimizations may vary depending on the compiler implementation.

Force inlining should be used as a targeted measure rather than applied
blanket to every function, and where it is used it probably needs to be
periodically reviewed, and potentially removed, as the code and compilers
evolve.
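
To make the distinction concrete, here is a minimal sketch of the two forms
under discussion, assuming DPDK's __rte_always_inline macro from
rte_common.h (which expands to the compiler's always_inline attribute); the
function names are invented for illustration:

#include <rte_common.h> /* provides __rte_always_inline */

/* Forced: the compiler must inline calls to this, even when its own
 * cost model would choose otherwise (e.g. at -O0 or under -Os). */
static __rte_always_inline unsigned int
forced_add(unsigned int a, unsigned int b)
{
	return a + b;
}

/* Hint only: the compiler is free to keep this out of line if
 * inlining would hurt (register pressure, code size, etc.). */
static inline unsigned int
hinted_add(unsigned int a, unsigned int b)
{
	return a + b;
}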

One other consideration is that a particular compiler's force-inlining
attribute/builtin may permit inlining of functions that are not declared in
a header, i.e. a function from one library may be inlined into another
binary as a link-time optimization. Although everything here is in a header,
so it's a bit moot.
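
As a rough illustration of that point (file names, function names and build
commands here are hypothetical, and whether a force-inline attribute
additionally compels the link-time inliner is compiler-specific; this shows
the plain-LTO version of the idea):

/* mylib.c, built with: gcc -O2 -flto -c mylib.c
 * The function is defined here, not in any header. */
int
mylib_add(int a, int b)
{
	return a + b;
}

/* app.c, built with: gcc -O2 -flto app.c mylib.o
 * With LTO the linker can see mylib_add's body and may inline this
 * call, even though no header ever exposed the definition. */
int mylib_add(int a, int b);

int
main(void)
{
	return mylib_add(2, 3);
}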

I'd like to see this change go in if possible.
Like Stephen mentions below, I am sure we will have a for-and-against
discussion here.
As a DPDK community we have put performance front and center; I would prefer
to go down that route first.

I ran some initial numbers with this patch, and the very quick summary of
what I've seen so far:

* Unit tests show no major differences, and while it depends on what
   specific number you are interested in, most seem within margin of error.
* Within unit tests, the one number I mostly look at when considering
   inlining is the "empty poll" cost, since I believe we should look to keep
   that as close to zero as possible. In the past I've seen that number jump
   from 3 cycles to 12 cycles due to missed inlining. In this case, it seems
   fine. (A sketch of such a measurement is included after this list.)
* Ran a quick test with the eventdev_pipeline example app using SW eventdev,
   as a test of an actual app which is fairly ring-heavy [used 8 workers
   with 1000 cycles per packet hop]. (Thanks to Harry vH for this suggestion
   of a workload)
   * GCC 8 build - no difference observed
   * GCC 11 build - approx 2% perf reduction observed
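
For reference, a minimal sketch of the kind of "empty poll" measurement
referred to above; the function name and burst size are illustrative, and it
assumes EAL is initialized and the ring was created empty with
rte_ring_create():

#include <rte_cycles.h>
#include <rte_ring.h>

/* Average cycle cost of a dequeue attempt on an empty ring, i.e. the
 * number we'd like to keep as close to zero as possible. */
static uint64_t
empty_poll_cycles(struct rte_ring *r, unsigned int iters)
{
	void *objs[32];
	uint64_t start = rte_rdtsc();
	unsigned int i;

	for (i = 0; i < iters; i++)
		/* Ring is empty, so this returns 0 and dequeues nothing. */
		rte_ring_dequeue_burst(r, objs, 32, NULL);

	return (rte_rdtsc() - start) / iters;
}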

Just to note that, apart from ring_perf_autotest, there is also
ring_stress_autotest, which tries to do some stress-testing in hopefully
more realistic usage scenarios.
It might be worth considering when benchmarking.


As I said, these are just some quick rough numbers, and I'll try and get
some more numbers on a couple of different platforms, see if the small
reduction seen is consistent or not. I may also test a few different
combinations/options in the eventdev test.  It would be good if others also
tested on a few platforms available to them.

/Bruce

I wonder if a mixed approach might help, where some key bits were marked
as more important to inline? Or setting compiler flags in the build infra?
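
A hypothetical sketch of what such a mixed approach could look like at the
application/wrapper level; the wrapper names are invented, and which
functions deserve the forced attribute is exactly the open question:

#include <rte_common.h>
#include <rte_ring.h>

/* Keep the forced inline only on the hottest per-packet path... */
static __rte_always_inline int
app_dequeue_one(struct rte_ring *r, void **obj)
{
	return rte_ring_dequeue(r, obj);
}

/* ...and let the compiler's heuristics decide for colder bulk paths. */
static inline unsigned int
app_dequeue_bulk(struct rte_ring *r, void **objs, unsigned int n)
{
	return rte_ring_dequeue_bulk(r, objs, n, NULL);
}

The build-infra alternative would presumably be along the lines of passing
extra inlining parameters through the project cflags, e.g. GCC's
-finline-functions or a larger --param max-inline-insns-single.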

Yep, could be a number of options.
