On Fri, 20 Mar 2020 16:41:38 +0000 Konstantin Ananyev <konstantin.anan...@intel.com> wrote:
> As was discussed here:
> http://mails.dpdk.org/archives/dev/2020-February/158586.html
> this RFC aimed to hide ring internals into .c and make all
> ring functions non-inlined. In theory that might help to
> maintain ABI stability in future.
> This is just a POC to measure the impact of the proposed idea;
> a proper implementation would definitely need some extra effort.
> On an IA box (SKX) ring_perf_autotest shows ~20-30 cycles extra for
> an enqueue+dequeue pair. On more realistic code, I suspect
> the impact might be a bit higher.
> For MP/MC bulk transfers the degradation seems quite small,
> though for SP/SC and/or small transfers it is more than noticeable
> (see exact numbers below).
> From my perspective we'd probably keep it inlined for now
> to avoid any unanticipated performance degradation,
> though I am interested to see perf results and opinions from
> other interested parties.
>
> Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> ring_perf_autotest (without patch/with patch)
>
> ### Testing single element enq/deq ###
> legacy APIs: SP/SC: single: 8.75/43.23
> legacy APIs: MP/MC: single: 56.18/80.44
>
> ### Testing burst enq/deq ###
> legacy APIs: SP/SC: burst (size: 8): 37.36/53.37
> legacy APIs: SP/SC: burst (size: 32): 93.97/117.30
> legacy APIs: MP/MC: burst (size: 8): 78.23/91.45
> legacy APIs: MP/MC: burst (size: 32): 131.59/152.49
>
> ### Testing bulk enq/deq ###
> legacy APIs: SP/SC: bulk (size: 8): 37.29/54.48
> legacy APIs: SP/SC: bulk (size: 32): 92.68/113.01
> legacy APIs: MP/MC: bulk (size: 8): 78.40/93.50
> legacy APIs: MP/MC: bulk (size: 32): 131.49/154.25
>
> ### Testing empty bulk deq ###
> legacy APIs: SP/SC: bulk (size: 8): 4.00/16.86
> legacy APIs: MP/MC: bulk (size: 8): 7.01/15.55
>
> ### Testing using two hyperthreads ###
> legacy APIs: SP/SC: bulk (size: 8): 10.64/17.56
> legacy APIs: MP/MC: bulk (size: 8): 15.30/16.69
> legacy APIs: SP/SC: bulk (size: 32): 5.84/7.09
> legacy APIs: MP/MC: bulk (size: 32): 6.34/7.54
>
> ### Testing using two physical cores ###
> legacy APIs: SP/SC: bulk (size: 8): 24.34/42.40
> legacy APIs: MP/MC: bulk (size: 8): 70.34/71.82
> legacy APIs: SP/SC: bulk (size: 32): 12.67/14.68
> legacy APIs: MP/MC: bulk (size: 32): 22.41/17.93
>
> ### Testing single element enq/deq ###
> elem APIs: element size 16B: SP/SC: single: 10.65/41.96
> elem APIs: element size 16B: MP/MC: single: 44.33/81.36
>
> ### Testing burst enq/deq ###
> elem APIs: element size 16B: SP/SC: burst (size: 8): 39.20/58.52
> elem APIs: element size 16B: SP/SC: burst (size: 32): 123.19/142.79
> elem APIs: element size 16B: MP/MC: burst (size: 8): 80.72/101.36
> elem APIs: element size 16B: MP/MC: burst (size: 32): 169.21/185.38
>
> ### Testing bulk enq/deq ###
> elem APIs: element size 16B: SP/SC: bulk (size: 8): 41.64/58.46
> elem APIs: element size 16B: SP/SC: bulk (size: 32): 122.74/142.52
> elem APIs: element size 16B: MP/MC: bulk (size: 8): 80.60/103.14
> elem APIs: element size 16B: MP/MC: bulk (size: 32): 169.39/186.67
>
> ### Testing empty bulk deq ###
> elem APIs: element size 16B: SP/SC: bulk (size: 8): 5.01/17.17
> elem APIs: element size 16B: MP/MC: bulk (size: 8): 6.01/14.80
>
> ### Testing using two hyperthreads ###
> elem APIs: element size 16B: SP/SC: bulk (size: 8): 12.02/17.18
> elem APIs: element size 16B: MP/MC: bulk (size: 8): 16.81/21.14
> elem APIs: element size 16B: SP/SC: bulk (size: 32): 7.87/9.01
> elem APIs: element size 16B: MP/MC: bulk (size: 32): 8.22/10.57
>
> ### Testing using two physical cores ###
> elem APIs: element size 16B: SP/SC: bulk (size: 8): 27.00/51.94
> elem APIs: element size 16B: MP/MC: bulk (size: 8): 78.24/74.48
> elem APIs: element size 16B: SP/SC: bulk (size: 32): 15.41/16.14
> elem APIs: element size 16B: MP/MC: bulk (size: 32): 18.72/21.64
>
> Signed-off-by: Konstantin Ananyev <konstantin.anan...@intel.com>

What is the impact with LTO? I suspect the compiler might have a chance to get the speed back with LTO.