On Mon, Apr 20, 2020 at 2:28 PM Konstantin Ananyev
<konstantin.anan...@intel.com> wrote:
> These days more and more customers use (or try to use) DPDK-based apps
> within overcommitted systems (multiple active threads over the same
> physical cores): VM, container deployments, etc.
> One quite common problem they hit:
> Lock-Holder-Preemption/Lock-Waiter-Preemption with rte_ring.
> LHP is quite a common problem for spin-based sync primitives
> (spin-locks, etc.) on overcommitted systems.
> The situation gets much worse when some sort of
> fair-locking technique is used (ticket-lock, etc.),
> as now not only the lock-owner's but also the lock-waiters'
> scheduling order matters a lot (LWP).
> These two problems are well known for kernels within VMs:
> http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> The problem with rte_ring is that while head acquisition is a sort of
> unfair locking, waiting on the tail is very similar to a ticket-lock
> scheme - the tail has to be updated in a particular order.
> That makes the current rte_ring implementation perform
> really poorly in some overcommitted scenarios.
> It is probably not possible to completely resolve the LHP problem in
> userspace only (without some kernel communication/intervention),
> but removing fairness at the tail update helps to avoid LWP and
> can mitigate the situation significantly.
> This patch proposes two new optional ring synchronization modes:
> 1) Head/Tail Sync (HTS) mode
> In that mode the enqueue/dequeue operation is fully serialized:
> only one thread at a time is allowed to perform a given op.
> As another enhancement it provides the ability to split an
> enqueue/dequeue operation into two phases:
> - enqueue/dequeue start
> - enqueue/dequeue finish
> That allows the user to inspect objects in the ring without removing
> them from it (aka MT-safe peek).
> 2) Relaxed Tail Sync (RTS)
> The main difference from the original MP/MC algorithm is that the
> tail value is increased not by every thread that finished enqueue/dequeue,
> but only by the last one.
> That allows threads to avoid spinning on the ring tail value,
> leaving the actual tail value change to the last thread in the update queue.
>
> Note that these new sync modes are optional.
> For current rte_ring users nothing should change
> (both in terms of API/ABI and performance).
> The existing sync modes MP/MC and SP/SC are kept untouched, set up in the
> same way (via flags and _init_), and MP/MC remains the default one.
> The only thing that changed:
> the format of prod/cons can now differ depending on the mode selected at
> _init_, so the user has to stick with one sync model through the whole
> ring lifetime.
> In other words, the user can't create a ring for, let's say, SP mode and
> then in the middle of the data path change his mind and start using
> MP_RTS mode.
> For the existing modes (SP/MP, SC/MC) the format remains the same and
> the user can still use them interchangeably, though of course it is an
> error-prone practice.
>
> Test results on IA (see below) show significant improvements
> in average enqueue/dequeue op times on overcommitted systems.
> For 'classic' DPDK deployments (one thread per core) the original MP/MC
> algorithm still shows the best numbers, though for the 64-bit target
> the RTS numbers are not that far away.
> Numbers were produced by the new UT test-case: ring_stress_autotest, i.e.:
> echo ring_stress_autotest | ./dpdk-test -n 4 --lcores='...'
>
> X86_64 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> DEQ+ENQ average cycles/obj
>                                                      MP/MC        HTS       RTS
> 1thread@1core(--lcores=6-7)                           8.00       8.15      8.99
> 2thread@2core(--lcores=6-8)                          19.14      19.61     20.35
> 4thread@4core(--lcores=6-10)                         29.43      29.79     31.82
> 8thread@8core(--lcores=6-14)                        110.59     192.81    119.50
> 16thread@16core(--lcores=6-22)                      461.03     813.12    495.59
> 32thread@32core(--lcores='6-22,55-70')              982.90    1972.38   1160.51
>
> 2thread@1core(--lcores='6,(10-11)@7'              20140.50      23.58     25.14
> 4thread@2core(--lcores='6,(10-11)@7,(20-21)@8'   153680.60      76.88     80.05
> 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'   280314.32     294.72    318.79
> 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8'  643176.59    1144.02   1175.14
> 32thread@2core(--lcores='6,(10-25)@7,(30-45)@8' 4264238.80    4627.48   4892.68
>
> 8thread@2core(--lcores='6,(10-17)@(7,8))'         321085.98     298.59    307.47
> 16thread@4core(--lcores='6,(20-35)@(7-10))'      1900705.61     575.35    678.29
> 32thread@4core(--lcores='6,(20-51)@(7-10))'      5510445.85    2164.36   2714.12
>
> i686 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> DEQ+ENQ average cycles/obj
>                                                      MP/MC        HTS       RTS
> 1thread@1core(--lcores=6-7)                           7.85      12.13     11.31
> 2thread@2core(--lcores=6-8)                          17.89      24.52     21.86
> 8thread@8core(--lcores=6-14)                         32.58     354.20     54.58
> 32thread@32core(--lcores='6-22,55-70')              813.77    6072.41   2169.91
>
> 2thread@1core(--lcores='6,(10-11)@7'              16095.00      36.06     34.74
> 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8'  1140354.54     346.61    361.57
> 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8' 1920417.86    1314.90   1416.65
>
> 8thread@2core(--lcores='6,(10-17)@(7,8))'         594358.61     332.70    357.74
> 32thread@4core(--lcores='6,(20-51)@(7-10))'      5319896.86    2836.44   3028.87
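For readers wondering how the new modes are selected, below is a minimal
usage sketch (my own illustration, not code from the series). It assumes the
sync-mode flags introduced here (RING_F_MP_RTS_ENQ, RING_F_MC_RTS_DEQ,
RING_F_MP_HTS_ENQ, RING_F_MC_HTS_DEQ) and the two-phase peek helpers in
rte_ring_peek.h (rte_ring_dequeue_bulk_start()/rte_ring_dequeue_finish());
exact names are best checked against the merged headers.

/* Sketch only: create rings in RTS and HTS mode and use the two-phase
 * dequeue ("peek") on the HTS ring. */
#include <rte_ring.h>
#include <rte_ring_peek.h>   /* assumed location of the start/finish API */

static void
ring_sync_modes_example(void)
{
	void *objs[32];
	unsigned int n, avail;

	/* RTS ring: multi-producer/multi-consumer with relaxed tail update. */
	struct rte_ring *rts = rte_ring_create("rts_ring", 1024, SOCKET_ID_ANY,
			RING_F_MP_RTS_ENQ | RING_F_MC_RTS_DEQ);

	/* HTS ring: enqueue/dequeue fully serialized; this is the mode that
	 * allows the MT-safe peek below. */
	struct rte_ring *hts = rte_ring_create("hts_ring", 1024, SOCKET_ID_ANY,
			RING_F_MP_HTS_ENQ | RING_F_MC_HTS_DEQ);

	if (rts == NULL || hts == NULL)
		return;

	/* Data-path calls are unchanged; the sync mode chosen at create time
	 * decides which algorithm runs underneath. */
	n = rte_ring_enqueue_burst(rts, objs, 32, NULL);
	n = rte_ring_dequeue_burst(rts, objs, n, NULL);

	/* Two-phase dequeue on the HTS ring: look at the objects while they
	 * are still in the ring, then complete the operation. */
	n = rte_ring_dequeue_bulk_start(hts, objs, 32, &avail);
	if (n != 0) {
		/* ... objs[0..n-1] can be inspected here ("peek") ... */
		rte_ring_dequeue_finish(hts, n);  /* remove them from the ring */
	}
}

The choice between the two is the trade-off described in the cover letter:
HTS serializes each operation but enables the start/finish (peek) API, while
RTS keeps concurrent enqueue/dequeue and only relaxes the tail update, which
is what helps in the overcommitted cases measured above.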
I fixed a couple of typos and split the doc updates. Series applied with the patch from Pavan. Thanks for the work Konstantin, Honnappa. -- David Marchand