On Fri, 20 Mar 2020 16:41:38 +0000
Konstantin Ananyev <konstantin.anan...@intel.com> wrote:

> As was discussed here:
> http://mails.dpdk.org/archives/dev/2020-February/158586.html
> this RFC aims to hide the ring internals in a .c file and make all
> ring functions non-inlined. In theory that might help to
> maintain ABI stability in the future.
> This is just a POC to measure the impact of the proposed idea;
> a proper implementation would definitely need some extra effort.
> On an IA box (SKX), ring_perf_autotest shows ~20-30 extra cycles per
> enqueue+dequeue pair. On more realistic code, I suspect
> the impact might be a bit higher.
> For MP/MC bulk transfers the degradation seems quite small,
> though for SP/SC and/or small transfers it is more than noticeable
> (see exact numbers below).
> From my perspective we'd probably keep it inlined for now
> to avoid any unanticipated performance degradations,
> though I'm interested to see perf results and opinions from
> other interested parties.
> 
> Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> ring_perf_autotest (without patch/with patch)
> 
> ### Testing single element enq/deq ###
> legacy APIs: SP/SC: single: 8.75/43.23
> legacy APIs: MP/MC: single: 56.18/80.44
> 
> ### Testing burst enq/deq ###
> legacy APIs: SP/SC: burst (size: 8): 37.36/53.37
> legacy APIs: SP/SC: burst (size: 32): 93.97/117.30
> legacy APIs: MP/MC: burst (size: 8): 78.23/91.45
> legacy APIs: MP/MC: burst (size: 32): 131.59/152.49
> 
> ### Testing bulk enq/deq ###
> legacy APIs: SP/SC: bulk (size: 8): 37.29/54.48
> legacy APIs: SP/SC: bulk (size: 32): 92.68/113.01
> legacy APIs: MP/MC: bulk (size: 8): 78.40/93.50
> legacy APIs: MP/MC: bulk (size: 32): 131.49/154.25
> 
> ### Testing empty bulk deq ###
> legacy APIs: SP/SC: bulk (size: 8): 4.00/16.86
> legacy APIs: MP/MC: bulk (size: 8): 7.01/15.55
> 
> ### Testing using two hyperthreads ###
> legacy APIs: SP/SC: bulk (size: 8): 10.64/17.56
> legacy APIs: MP/MC: bulk (size: 8): 15.30/16.69
> legacy APIs: SP/SC: bulk (size: 32): 5.84/7.09
> legacy APIs: MP/MC: bulk (size: 32): 6.34/7.54
> 
> ### Testing using two physical cores ###
> legacy APIs: SP/SC: bulk (size: 8): 24.34/42.40
> legacy APIs: MP/MC: bulk (size: 8): 70.34/71.82
> legacy APIs: SP/SC: bulk (size: 32): 12.67/14.68
> legacy APIs: MP/MC: bulk (size: 32): 22.41/17.93
> 
> ### Testing single element enq/deq ###
> elem APIs: element size 16B: SP/SC: single: 10.65/41.96
> elem APIs: element size 16B: MP/MC: single: 44.33/81.36
> 
> ### Testing burst enq/deq ###
> elem APIs: element size 16B: SP/SC: burst (size: 8): 39.20/58.52
> elem APIs: element size 16B: SP/SC: burst (size: 32): 123.19/142.79
> elem APIs: element size 16B: MP/MC: burst (size: 8): 80.72/101.36
> elem APIs: element size 16B: MP/MC: burst (size: 32): 169.21/185.38
> 
> ### Testing bulk enq/deq ###
> elem APIs: element size 16B: SP/SC: bulk (size: 8): 41.64/58.46
> elem APIs: element size 16B: SP/SC: bulk (size: 32): 122.74/142.52
> elem APIs: element size 16B: MP/MC: bulk (size: 8): 80.60/103.14
> elem APIs: element size 16B: MP/MC: bulk (size: 32): 169.39/186.67
> 
> ### Testing empty bulk deq ###
> elem APIs: element size 16B: SP/SC: bulk (size: 8): 5.01/17.17
> elem APIs: element size 16B: MP/MC: bulk (size: 8): 6.01/14.80
> 
> ### Testing using two hyperthreads ###
> elem APIs: element size 16B: SP/SC: bulk (size: 8): 12.02/17.18
> elem APIs: element size 16B: MP/MC: bulk (size: 8): 16.81/21.14
> elem APIs: element size 16B: SP/SC: bulk (size: 32): 7.87/9.01
> elem APIs: element size 16B: MP/MC: bulk (size: 32): 8.22/10.57
> 
> ### Testing using two physical cores ###
> elem APIs: element size 16B: SP/SC: bulk (size: 8): 27.00/51.94
> elem APIs: element size 16B: MP/MC: bulk (size: 8): 78.24/74.48
> elem APIs: element size 16B: SP/SC: bulk (size: 32): 15.41/16.14
> elem APIs: element size 16B: MP/MC: bulk (size: 32): 18.72/21.64
> 
> Signed-off-by: Konstantin Ananyev <konstantin.anan...@intel.com>

What is the impact with LTO? I suspect the compiler might have a
chance to get the speed back with LTO.
