<snip>
> > > > > Subject: [dpdk-dev] [RFC 0/6] New sync modes for ring
> > >
> > > Upfront note - that RFC is not a complete patch.
> > > It introduces an ABI breakage, plus it doesn't update ring_elem code
> > > properly,
> >
> > As per the current rules, these changes (in the current form) will be
> > accepted only for the 20.11 release. How do we address this for
> > immediate requirements like the RCU defer APIs?
>
> I think I found a way to introduce these new modes without an API/ABI
> breakage. Working on v1 right now. Plan to submit it by the end of this
> week/start of the next one.
Ok.

> > I suggest that we move forward with my RFC (taking into consideration
> > your feedback) to make progress on the RCU APIs.
>
> > > etc.
> > > I plan to deal with all these things in later versions.
> > > Right now I seek initial feedback about the proposed ideas.
> > > I would also ask people to repeat the performance tests (see below)
> > > on their platforms to confirm the impact.
> > >
> > > More and more customers use (or try to use) DPDK-based apps within
> > > overcommitted systems (multiple active threads over the same
> > > physical cores): VM and container deployments, etc.
> > > One quite common problem they hit: Lock-Holder-Preemption (LHP) with
> > > rte_ring.
> > > LHP is quite a common problem for spin-based sync primitives
> > > (spin-locks, etc.) on overcommitted systems.
> > > The situation gets much worse when some sort of fair-locking
> > > technique is used (ticket-lock, etc.), as then not only the lock
> > > owner's but also the lock waiters' scheduling order matters a lot.
> > > This is a well-known problem for kernels running within VMs:
> > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > > The problem with rte_ring is that while head acquisition is a sort
> > > of unfair locking, waiting on the tail is very similar to a
> > > ticket-lock scheme - the tail has to be updated in a particular
> > > order.
> > > That makes the current rte_ring implementation perform really poorly
> > > in some overcommitted scenarios.
> > > While it is probably not possible to completely resolve this problem
> > > in userspace only (without some kernel communication/intervention),
> > > removing fairness in the tail update can mitigate it significantly.
> > > So this RFC proposes two new optional ring synchronization modes:
> > >
> > > 1) Head/Tail Sync (HTS) mode
> > > In this mode the enqueue/dequeue operation is fully serialized:
> > > only one thread at a time is allowed to perform a given op.
> > > As another enhancement, it provides the ability to split the
> > > enqueue/dequeue operation into two phases:
> > > - enqueue/dequeue start
> > > - enqueue/dequeue finish
> > > That allows the user to inspect objects in the ring without removing
> > > them from it (aka MT-safe peek).
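
To make the split-operation idea above concrete, here is a small sketch of
an MT-safe peek built on such a two-phase (start/finish) dequeue. The
function names and the exact semantics of the finish call are hypothetical
placeholders modeled on the description above, not the RFC's actual API:

/*
 * Illustration only: hypothetical names for the two-phase (start/finish)
 * dequeue described above; not the RFC's actual API.
 */
#include <stdbool.h>

struct rte_ring;	/* opaque here, provided by librte_ring */

/* hypothetical prototypes for the split HTS dequeue */
unsigned int ring_hts_dequeue_burst_start(struct rte_ring *r, void **objs,
		unsigned int n, unsigned int *available);
void ring_hts_dequeue_finish(struct rte_ring *r, unsigned int n);

/* application-provided helpers, also just for illustration */
bool objects_are_ready(void **objs, unsigned int n);
void consume(void **objs, unsigned int n);

static unsigned int
peek_and_maybe_consume(struct rte_ring *r)
{
	void *objs[8];
	unsigned int n;

	/* Phase 1: reserve up to 8 objects. Nothing is removed yet, and
	 * because HTS serializes consumers, no other consumer can race
	 * with us while the operation is "open". */
	n = ring_hts_dequeue_burst_start(r, objs, 8, NULL);
	if (n == 0)
		return 0;

	if (!objects_are_ready(objs, n)) {
		/* Phase 2a: close the operation without taking anything -
		 * the pure "peek" case, objects stay in the ring. */
		ring_hts_dequeue_finish(r, 0);
		return 0;
	}

	/* Phase 2b: actually remove the n reserved objects. */
	consume(objs, n);
	ring_hts_dequeue_finish(r, n);
	return n;
}
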
> > IMO, this will not address the problem described above.
>
> It does, please see the results produced by ring_stress_*autotest below.
> Let's say for the test-case 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8')
I had not looked at these tests. Please see the numbers below.
> it shows:
> avg number of cycles per object for enqueue/dequeue:
> MP/MC: 280314.32
> HTS:   294.72
> RTS:   318.79
>
> Customer who tried it reported a similar level of improvement.
Is this tested with the VM/Container setup described in the slides you
referred to?
> Actually, if you have time, it would be very interesting to see what the
> numbers will be on ARM boxes.
> To reproduce, just:
> $cat ring_tests_u4
> ring_stress_autotest
> ring_stress_hts_autotest
> ring_stress_rts_autotest
>
> /app/test/dpdk-test --lcores='6,(10-13)@7,(20-23)@8' -n 4 < ring_tests_u4
> 2>&1 | tee res1
>
> Then look at the 'AGGREGATE' stats.
> Right now it is a bit too verbose, so probably the easiest way to extract
> the same numbers quickly:
> grep 'cycles/obj' res1 | cat -n | awk '{if ($(1)%9==0) print $(NF);}'
> 280314.32
> 1057833.55
> 294.72
> 480.10
> 318.79
> 461.52
>
> First 2 numbers will be for MP/MC, next 2 for HTS, last 2 for RTS.
12305.05 12027.09 3.59 7.37 4.41 7.98

> > For ex: when a producer updates the head and gets scheduled out, other
> > producers have to spin.
>
> Sure, as I wrote in the original cover letter:
> "While it is probably not possible to completely resolve this problem in
> userspace only (without some kernel communication/intervention), removing
> fairness in tail update can mitigate it significantly."
> Results from the ring_stress_*_autotest confirm that.
>
> > The problem is probably worse, as in the non-HTS case moving of the
> > head and copying of the ring elements can happen in parallel between
> > the producers (similarly for consumers).
>
> Yes, as we serialize the ring, we remove the possibility of simultaneous
> copy. That's why for 'normal' cases (one thread per core) the original
> MP/MC is usually faster.
> Though on overcommitted scenarios current MP/MC performance degrades
> dramatically.
> The main problem with the current MP/MC implementation is that the tail
> update has to be done in strict order (a sort of fair locking scheme),
> which means we get a much worse LHP manifestation than when we use unfair
> schemes.
> With a serialized ring (HTS) we remove that ordering completely (same
> idea as the switch from fair to unfair locking for PV spin-locks).
>
> > IMO, HTS should not be a configurable flag.
>
> Why?
>
> > In RCU requirement, a MP enqueue and HTS dequeue are required.
>
> This is supported, the user can specify different modes for consumer and
> producer: (0 | RING_F_MC_HTS_DEQ).
> Then it is up to the user either to call the generic
> rte_ring_enqueue/rte_ring_dequeue, or to specify the mode manually by
> function name: rte_ring_mp_enqueue_bulk/rte_ring_hts_dequeue_bulk.
Ok, that should be good.
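
For reference, a minimal sketch of that usage model (default MP enqueue
combined with an HTS dequeue). The flag and function names are the ones
mentioned above; the exact signatures are an assumption, taken to mirror
the existing rte_ring create/bulk API:

#include <rte_ring.h>

/*
 * Sketch of the RCU-defer style usage discussed above: multi-producer
 * enqueue combined with a serialized (HTS) dequeue.  Flag/function names
 * come from the discussion; signatures are assumed to follow the existing
 * rte_ring bulk API.
 */
static struct rte_ring *
create_defer_queue(void)
{
	/* default MP enqueue, HTS (fully serialized) dequeue */
	return rte_ring_create("rcu_defer", 1024, SOCKET_ID_ANY,
			0 | RING_F_MC_HTS_DEQ);
}

static int
producer_enqueue(struct rte_ring *r, void *obj)
{
	/* the generic call resolves to the MP enqueue path for this ring */
	return rte_ring_enqueue(r, obj);
}

static unsigned int
reclaimer_dequeue(struct rte_ring *r, void **objs, unsigned int n)
{
	/* while the dequeue side can also be called explicitly in HTS mode */
	return rte_ring_hts_dequeue_bulk(r, objs, n, NULL);
}
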
> > > 2) Relaxed Tail Sync (RTS)
> > > The main difference from the original MP/MC algorithm is that the
> > > tail value is increased not by every thread that finished
> > > enqueue/dequeue, but only by the last one.
> > > That allows threads to avoid spinning on the ring tail value, leaving
> > > the actual tail value change to the last thread in the update queue.
> >
> > This can be a configurable flag on the ring.
> > I am not sure how this solves the problem you have stated above
> > completely. Updating the count from all intermediate threads is still
> > required to update the value of the head. But yes, it reduces the
> > severity of the problem by not enforcing the order in which the tail
> > is updated.
>
> As I said above, the main source of slowdown here is that we have to
> update the tail in a particular order.
> So the main objective (same as for HTS) is to remove that ordering.
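
To make the "only the last thread moves the tail" idea a bit more
concrete, below is a small self-contained sketch using C11 atomics. It is
only an illustration of the relaxed-tail concept described above, not the
code from the RFC: head and tail each pack a position together with an
update counter, every finishing thread bumps the tail counter, and the
thread whose completion count catches up with the claim count publishes
the new tail position in the same CAS.

#include <stdint.h>
#include <stdatomic.h>

/* Ring position plus update counter packed into one 64-bit word. */
union poscnt {
	uint64_t raw;
	struct {
		uint32_t cnt;	/* claims (head) / completions (tail) */
		uint32_t pos;	/* ring index */
	} val;
};

struct rts_headtail {
	_Atomic uint64_t head;	/* claim side (not shown) advances pos and cnt together */
	_Atomic uint64_t tail;	/* what the other side of the ring is allowed to see */
};

/*
 * Called by every producer/consumer after it has finished copying its
 * objects.  Unlike the classic MP/MC scheme, a thread never waits for
 * earlier claimants to update the tail first: it only records one more
 * completed update, and the thread that turns out to be the last one in
 * flight drags the visible tail position forward.
 */
static void
rts_update_tail(struct rts_headtail *ht)
{
	union poscnt h, ot, nt;

	ot.raw = atomic_load_explicit(&ht->tail, memory_order_acquire);
	do {
		h.raw = atomic_load_explicit(&ht->head, memory_order_relaxed);

		nt.raw = ot.raw;
		nt.val.cnt++;			/* one more update finished */
		if (nt.val.cnt == h.val.cnt)	/* we were the last one in flight */
			nt.val.pos = h.val.pos;	/* so publish the new tail position */
	} while (!atomic_compare_exchange_weak_explicit(&ht->tail,
			&ot.raw, nt.raw,
			memory_order_acq_rel, memory_order_acquire));
}
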
> > I also think it introduces a problem on the other side of the ring,
> > because the tail is not updated soon enough (the other side has to
> > wait longer for the elements to become available).
>
> Yes, producer/consumer starvation.
> That's why we need a max allowed Head-Tail-Distance (htd_max) - to limit
> how far the head can go away from the tail.
>
> > It also introduces another configuration parameter (HTD_MAX_DEF) which
> > they have to deal with.
>
> If the user doesn't provide any value, it will be set by default to
> ring.capacity / 8. From my measurements that works quite well.
> Though there is the possibility for the user to set another value, if
> needed.
>
> > Users have to still implement the current hypervisor related solutions.
>
> Didn't get what you are trying to say with that phrase.
The references you provided talked about resolving LHP by doing
co-scheduling of vCPUs (which I think could be applied to DPDK
applications). I am saying that we still need such mechanisms along with
these solutions.

> > IMO, we should run the benchmark for this on an overcommitted setup to
> > understand the benefits.
>
> That's why I created the ring_stress_*autotest test-cases and collected
> the numbers provided below.
> I suppose they clearly show the problem on overcommitted scenarios, and
> how RTS/HTS improve that situation.
> I would appreciate it if you could repeat these tests on your machines.
>
> > > Test results on IA (see below) show significant improvements for
> > > average enqueue/dequeue op times on overcommitted systems.
> > > For 'classic' DPDK deployments (one thread per core) the original
> > > MP/MC algorithm still shows the best numbers, though for the 64-bit
> > > target the RTS numbers are not that far away.
> > > Numbers were produced by ring_stress_*autotest (first patch in this
> > > series).
> > >
> > > X86_64 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> > > DEQ+ENQ average cycles/obj
> > >
> > >                                                MP/MC       HTS       RTS
> > > 1thread@1core(--lcores=6-7)                     8.00       8.15      8.99
> > > 2thread@2core(--lcores=6-8)                    19.14      19.61     20.35
> > > 4thread@4core(--lcores=6-10)                   29.43      29.79     31.82
> > > 8thread@8core(--lcores=6-14)                  110.59     192.81    119.50
> > > 16thread@16core(--lcores=6-22)                461.03     813.12    495.59
> > > 32thread@32core(--lcores='6-22,55-70')        982.90    1972.38   1160.51
> > >
> > > 2thread@1core(--lcores='6,(10-11)@7')               20140.50      23.58     25.14
> > > 4thread@2core(--lcores='6,(10-11)@7,(20-21)@8')    153680.60      76.88     80.05
> > > 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8')    280314.32     294.72    318.79
> > > 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8')   643176.59    1144.02   1175.14
> > > 32thread@2core(--lcores='6,(10-25)@7,(30-45)@8')  4264238.80    4627.48   4892.68
> > >
> > > 8thread@2core(--lcores='6,(10-17)@(7,8)')           321085.98     298.59    307.47
> > > 16thread@4core(--lcores='6,(20-35)@(7-10)')        1900705.61     575.35    678.29
> > > 32thread@4core(--lcores='6,(20-51)@(7-10)')        5510445.85    2164.36   2714.12
> > >
> > > i686 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> > > DEQ+ENQ average cycles/obj
> > >
> > >                                                MP/MC       HTS       RTS
> > > 1thread@1core(--lcores=6-7)                     7.85      12.13     11.31
> > > 2thread@2core(--lcores=6-8)                    17.89      24.52     21.86
> > > 8thread@8core(--lcores=6-14)                   32.58     354.20     54.58
> > > 32thread@32core(--lcores='6-22,55-70')        813.77    6072.41   2169.91
> > >
> > > 2thread@1core(--lcores='6,(10-11)@7')               16095.00      36.06     34.74
> > > 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8')   1140354.54     346.61    361.57
> > > 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8')  1920417.86    1314.90   1416.65
> > >
> > > 8thread@2core(--lcores='6,(10-17)@(7,8)')           594358.61     332.70    357.74
> > > 32thread@4core(--lcores='6,(20-51)@(7-10)')        5319896.86    2836.44   3028.87
> > >
> > > Konstantin Ananyev (6):
> > >   test/ring: add contention stress test
> > >   ring: rework ring layout to allow new sync schemes
> > >   ring: introduce RTS ring mode
> > >   test/ring: add contention stress test for RTS ring
> > >   ring: introduce HTS ring mode
> > >   test/ring: add contention stress test for HTS ring
> > >
> > >  app/test/Makefile                      |   3 +
> > >  app/test/meson.build                   |   3 +
> > >  app/test/test_pdump.c                  |   6 +-
> > >  app/test/test_ring_hts_stress.c        |  28 ++
> > >  app/test/test_ring_rts_stress.c        |  28 ++
> > >  app/test/test_ring_stress.c            |  27 ++
> > >  app/test/test_ring_stress.h            | 477 +++++++++++++++++++
> > >  lib/librte_pdump/rte_pdump.c           |   2 +-
> > >  lib/librte_port/rte_port_ring.c        |  12 +-
> > >  lib/librte_ring/Makefile               |   4 +-
> > >  lib/librte_ring/meson.build            |   4 +-
> > >  lib/librte_ring/rte_ring.c             |  84 +++-
> > >  lib/librte_ring/rte_ring.h             | 619 +++++++++++++++++++++++--
> > >  lib/librte_ring/rte_ring_elem.h        |   8 +-
> > >  lib/librte_ring/rte_ring_hts_generic.h | 228 +++++++++
> > >  lib/librte_ring/rte_ring_rts_generic.h | 240 ++++++++++
> > >  16 files changed, 1721 insertions(+), 52 deletions(-)
> > >  create mode 100644 app/test/test_ring_hts_stress.c
> > >  create mode 100644 app/test/test_ring_rts_stress.c
> > >  create mode 100644 app/test/test_ring_stress.c
> > >  create mode 100644 app/test/test_ring_stress.h
> > >  create mode 100644 lib/librte_ring/rte_ring_hts_generic.h
> > >  create mode 100644 lib/librte_ring/rte_ring_rts_generic.h
> > >
> > > --
> > > 2.17.1