<snip>
> > > > > Subject: [dpdk-dev] [RFC 0/6] New sync modes for ring
> > >
> > > Upfront note - that RFC is not a complete patch.
> > > It introduces an ABI breakage, plus it doesn't update ring_elem code
> > > properly,
> >
> > As per the current rules, these changes (in the current form) will be
> > accepted only for the 20.11 release. How do we address this for
> > immediate requirements like the RCU defer APIs?
>
> I think I found a way to introduce these new modes without an API/ABI
> breakage. Working on v1 right now. Plan to submit it by the end of this
> week/start of the next one.
Ok.

> > I suggest that we move forward with my RFC (taking into consideration
> > your feedback) to make progress on the RCU APIs.
>
> > > etc.
> > > I plan to deal with all these things in later versions.
> > > Right now I seek initial feedback about the proposed ideas.
> > > I would also ask people to repeat the performance tests (see below)
> > > on their platforms to confirm the impact.
> > >
> > > More and more customers use (or try to use) DPDK-based apps within
> > > overcommitted systems (multiple active threads over the same
> > > physical cores): VM and container deployments, etc.
> > > One quite common problem they hit: Lock-Holder-Preemption (LHP) with
> > > rte_ring.
> > > LHP is quite a common problem for spin-based sync primitives
> > > (spin-locks, etc.) on overcommitted systems.
> > > The situation gets much worse when some sort of fair-locking
> > > technique is used (ticket-lock, etc.), as then not only the lock
> > > owner's but also the lock waiters' scheduling order matters a lot.
> > > This is a well-known problem for kernels running within VMs:
> > > http://www-archive.xenproject.org/files/xensummitboston08/LHP.pdf
> > > https://www.cs.hs-rm.de/~kaiser/events/wamos2017/Slides/selcuk.pdf
> > > The problem with rte_ring is that while head acquisition is a sort
> > > of unfair locking, waiting on the tail is very similar to a
> > > ticket-lock scheme - the tail has to be updated in a particular
> > > order.
> > > That makes the current rte_ring implementation perform really poorly
> > > in some overcommitted scenarios.
> > > While it is probably not possible to completely resolve this problem
> > > in userspace only (without some kernel communication/intervention),
> > > removing fairness in the tail update can mitigate it significantly.
> > > So this RFC proposes two new optional ring synchronization modes:
> > >
> > > 1) Head/Tail Sync (HTS) mode
> > > In this mode the enqueue/dequeue operation is fully serialized:
> > > only one thread at a time is allowed to perform a given op.
> > > As another enhancement, it provides the ability to split the
> > > enqueue/dequeue operation into two phases:
> > > - enqueue/dequeue start
> > > - enqueue/dequeue finish
> > > That allows the user to inspect objects in the ring without removing
> > > them from it (aka MT-safe peek).
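
To make the split-operation idea above concrete, here is a small sketch of
an MT-safe peek built on such a two-phase (start/finish) dequeue. The
function names and the exact semantics of the finish call are hypothetical
placeholders modeled on the description above, not the RFC's actual API:

/*
 * Illustration only: hypothetical names for the two-phase (start/finish)
 * dequeue described above; not the RFC's actual API.
 */
#include <stdbool.h>

struct rte_ring;	/* opaque here, provided by librte_ring */

/* hypothetical prototypes for the split HTS dequeue */
unsigned int ring_hts_dequeue_burst_start(struct rte_ring *r, void **objs,
		unsigned int n, unsigned int *available);
void ring_hts_dequeue_finish(struct rte_ring *r, unsigned int n);

/* application-provided helpers, also just for illustration */
bool objects_are_ready(void **objs, unsigned int n);
void consume(void **objs, unsigned int n);

static unsigned int
peek_and_maybe_consume(struct rte_ring *r)
{
	void *objs[8];
	unsigned int n;

	/* Phase 1: reserve up to 8 objects. Nothing is removed yet, and
	 * because HTS serializes consumers, no other consumer can race
	 * with us while the operation is "open". */
	n = ring_hts_dequeue_burst_start(r, objs, 8, NULL);
	if (n == 0)
		return 0;

	if (!objects_are_ready(objs, n)) {
		/* Phase 2a: close the operation without taking anything -
		 * the pure "peek" case, objects stay in the ring. */
		ring_hts_dequeue_finish(r, 0);
		return 0;
	}

	/* Phase 2b: actually remove the n reserved objects. */
	consume(objs, n);
	ring_hts_dequeue_finish(r, n);
	return n;
}
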
> > IMO, this will not address the problem described above.
>
> It does, please see the results produced by ring_stress_*autotest below.
> Let's say for the test-case 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8')
I had not looked at these tests. Please see the numbers below.
> it shows:
> avg number of cycles per object for enqueue/dequeue:
> MP/MC: 280314.32
> HTS:   294.72
> RTS:   318.79
>
> Customer who tried it reported a similar level of improvement.
Is this tested with the VM/Container setup described in the slides you
referred to?
> Actually, if you have time, it would be very interesting to see what the
> numbers will be on ARM boxes.
> To reproduce, just:
> $cat ring_tests_u4
> ring_stress_autotest
> ring_stress_hts_autotest
> ring_stress_rts_autotest
>
> /app/test/dpdk-test --lcores='6,(10-13)@7,(20-23)@8' -n 4 < ring_tests_u4
> 2>&1 | tee res1
>
> Then look at the 'AGGREGATE' stats.
> Right now it is a bit too verbose, so probably the easiest way to extract
> the same numbers quickly:
> grep 'cycles/obj' res1 | cat -n | awk '{if ($(1)%9==0) print $(NF);}'
> 280314.32
> 1057833.55
> 294.72
> 480.10
> 318.79
> 461.52
>
> First 2 numbers will be for MP/MC, next 2 for HTS, last 2 for RTS.
12305.05 12027.09 3.59 7.37 4.41 7.98

> > For ex: when a producer updates the head and gets scheduled out, other
> > producers have to spin.
>
> Sure, as I wrote in the original cover letter:
> "While it is probably not possible to completely resolve this problem in
> userspace only (without some kernel communication/intervention), removing
> fairness in tail update can mitigate it significantly."
> Results from the ring_stress_*_autotest confirm that.
>
> > The problem is probably worse, as in the non-HTS case moving of the
> > head and copying of the ring elements can happen in parallel between
> > the producers (similarly for consumers).
>
> Yes, as we serialize the ring, we remove the possibility of simultaneous
> copy. That's why for 'normal' cases (one thread per core) the original
> MP/MC is usually faster.
> Though on overcommitted scenarios current MP/MC performance degrades
> dramatically.
> The main problem with the current MP/MC implementation is that the tail
> update has to be done in strict order (a sort of fair locking scheme),
> which means we get a much worse LHP manifestation than when we use unfair
> schemes.
> With a serialized ring (HTS) we remove that ordering completely (same
> idea as the switch from fair to unfair locking for PV spin-locks).
>
> > IMO, HTS should not be a configurable flag.
>
> Why?
>
> > In RCU requirement, a MP enqueue and HTS dequeue are required.
>
> This is supported, the user can specify different modes for consumer and
> producer: (0 | RING_F_MC_HTS_DEQ).
> Then it is up to the user either to call the generic
> rte_ring_enqueue/rte_ring_dequeue, or to specify the mode manually by
> function name: rte_ring_mp_enqueue_bulk/rte_ring_hts_dequeue_bulk.
Ok, that should be good.
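
For reference, a minimal sketch of that usage model (default MP enqueue
combined with an HTS dequeue). The flag and function names are the ones
mentioned above; the exact signatures are an assumption, taken to mirror
the existing rte_ring create/bulk API:

#include <rte_ring.h>

/*
 * Sketch of the RCU-defer style usage discussed above: multi-producer
 * enqueue combined with a serialized (HTS) dequeue.  Flag/function names
 * come from the discussion; signatures are assumed to follow the existing
 * rte_ring bulk API.
 */
static struct rte_ring *
create_defer_queue(void)
{
	/* default MP enqueue, HTS (fully serialized) dequeue */
	return rte_ring_create("rcu_defer", 1024, SOCKET_ID_ANY,
			0 | RING_F_MC_HTS_DEQ);
}

static int
producer_enqueue(struct rte_ring *r, void *obj)
{
	/* the generic call resolves to the MP enqueue path for this ring */
	return rte_ring_enqueue(r, obj);
}

static unsigned int
reclaimer_dequeue(struct rte_ring *r, void **objs, unsigned int n)
{
	/* while the dequeue side can also be called explicitly in HTS mode */
	return rte_ring_hts_dequeue_bulk(r, objs, n, NULL);
}
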
> > > 2) Relaxed Tail Sync (RTS)
> > > The main difference from the original MP/MC algorithm is that the
> > > tail value is increased not by every thread that finished
> > > enqueue/dequeue, but only by the last one.
> > > That allows threads to avoid spinning on the ring tail value, leaving
> > > the actual tail value change to the last thread in the update queue.
> >
> > This can be a configurable flag on the ring.
> > I am not sure how this solves the problem you have stated above
> > completely. Updating the count from all intermediate threads is still
> > required to update the value of the head. But yes, it reduces the
> > severity of the problem by not enforcing the order in which the tail
> > is updated.
>
> As I said above, the main source of slowdown here is that we have to
> update the tail in a particular order.
> So the main objective (same as for HTS) is to remove that ordering.
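
To make the "only the last thread moves the tail" idea a bit more
concrete, below is a small self-contained sketch using C11 atomics. It is
only an illustration of the relaxed-tail concept described above, not the
code from the RFC: head and tail each pack a position together with an
update counter, every finishing thread bumps the tail counter, and the
thread whose completion count catches up with the claim count publishes
the new tail position in the same CAS.

#include <stdint.h>
#include <stdatomic.h>

/* Ring position plus update counter packed into one 64-bit word. */
union poscnt {
	uint64_t raw;
	struct {
		uint32_t cnt;	/* claims (head) / completions (tail) */
		uint32_t pos;	/* ring index */
	} val;
};

struct rts_headtail {
	_Atomic uint64_t head;	/* claim side (not shown) advances pos and cnt together */
	_Atomic uint64_t tail;	/* what the other side of the ring is allowed to see */
};

/*
 * Called by every producer/consumer after it has finished copying its
 * objects.  Unlike the classic MP/MC scheme, a thread never waits for
 * earlier claimants to update the tail first: it only records one more
 * completed update, and the thread that turns out to be the last one in
 * flight drags the visible tail position forward.
 */
static void
rts_update_tail(struct rts_headtail *ht)
{
	union poscnt h, ot, nt;

	ot.raw = atomic_load_explicit(&ht->tail, memory_order_acquire);
	do {
		h.raw = atomic_load_explicit(&ht->head, memory_order_relaxed);

		nt.raw = ot.raw;
		nt.val.cnt++;			/* one more update finished */
		if (nt.val.cnt == h.val.cnt)	/* we were the last one in flight */
			nt.val.pos = h.val.pos;	/* so publish the new tail position */
	} while (!atomic_compare_exchange_weak_explicit(&ht->tail,
			&ot.raw, nt.raw,
			memory_order_acq_rel, memory_order_acquire));
}
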
> > I also think it introduces a problem on the other side of the ring,
> > because the tail is not updated soon enough (the other side has to
> > wait longer for the elements to become available).
>
> Yes, producer/consumer starvation.
> That's why we need a max allowed Head-Tail-Distance (htd_max) - to limit
> how far the head can go away from the tail.
>
> > It also introduces another configuration parameter (HTD_MAX_DEF) which
> > they have to deal with.
>
> If the user doesn't provide any value, it will be set by default to
> ring.capacity / 8. From my measurements that works quite well.
> Though there is the possibility for the user to set another value, if
> needed.
>
> > Users have to still implement the current hypervisor related solutions.
>
> Didn't get what you are trying to say with that phrase.
The references you provided talked about resolving LHP by doing
co-scheduling of vCPUs (which I think could be applied to DPDK
applications). I am saying that we still need such mechanisms along with
these solutions.

> > IMO, we should run the benchmark for this on an overcommitted setup to
> > understand the benefits.
>
> That's why I created the ring_stress_*autotest test-cases and collected
> the numbers provided below.
> I suppose they clearly show the problem on overcommitted scenarios, and
> how RTS/HTS improve that situation.
> I would appreciate it if you could repeat these tests on your machines.
>
> > > Test results on IA (see below) show significant improvements for
> > > average enqueue/dequeue op times on overcommitted systems.
> > > For 'classic' DPDK deployments (one thread per core) the original
> > > MP/MC algorithm still shows the best numbers, though for the 64-bit
> > > target the RTS numbers are not that far away.
> > > Numbers were produced by ring_stress_*autotest (first patch in this
> > > series).
> > >
> > > X86_64 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> > > DEQ+ENQ average cycles/obj
> > >
> > >                                                MP/MC       HTS       RTS
> > > 1thread@1core(--lcores=6-7)                     8.00       8.15      8.99
> > > 2thread@2core(--lcores=6-8)                    19.14      19.61     20.35
> > > 4thread@4core(--lcores=6-10)                   29.43      29.79     31.82
> > > 8thread@8core(--lcores=6-14)                  110.59     192.81    119.50
> > > 16thread@16core(--lcores=6-22)                461.03     813.12    495.59
> > > 32thread@32core(--lcores='6-22,55-70')        982.90    1972.38   1160.51
> > >
> > > 2thread@1core(--lcores='6,(10-11)@7')               20140.50      23.58     25.14
> > > 4thread@2core(--lcores='6,(10-11)@7,(20-21)@8')    153680.60      76.88     80.05
> > > 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8')    280314.32     294.72    318.79
> > > 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8')   643176.59    1144.02   1175.14
> > > 32thread@2core(--lcores='6,(10-25)@7,(30-45)@8')  4264238.80    4627.48   4892.68
> > >
> > > 8thread@2core(--lcores='6,(10-17)@(7,8)')           321085.98     298.59    307.47
> > > 16thread@4core(--lcores='6,(20-35)@(7-10)')        1900705.61     575.35    678.29
> > > 32thread@4core(--lcores='6,(20-51)@(7-10)')        5510445.85    2164.36   2714.12
> > >
> > > i686 @ Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> > > DEQ+ENQ average cycles/obj
> > >
> > >                                                MP/MC       HTS       RTS
> > > 1thread@1core(--lcores=6-7)                     7.85      12.13     11.31
> > > 2thread@2core(--lcores=6-8)                    17.89      24.52     21.86
> > > 8thread@8core(--lcores=6-14)                   32.58     354.20     54.58
> > > 32thread@32core(--lcores='6-22,55-70')        813.77    6072.41   2169.91
> > >
> > > 2thread@1core(--lcores='6,(10-11)@7')               16095.00      36.06     34.74
> > > 8thread@2core(--lcores='6,(10-13)@7,(20-23)@8')   1140354.54     346.61    361.57
> > > 16thread@2core(--lcores='6,(10-17)@7,(20-27)@8')  1920417.86    1314.90   1416.65
> > >
> > > 8thread@2core(--lcores='6,(10-17)@(7,8)')           594358.61     332.70    357.74
> > > 32thread@4core(--lcores='6,(20-51)@(7-10)')        5319896.86    2836.44   3028.87
> > >
> > > Konstantin Ananyev (6):
> > >   test/ring: add contention stress test
> > >   ring: rework ring layout to allow new sync schemes
> > >   ring: introduce RTS ring mode
> > >   test/ring: add contention stress test for RTS ring
> > >   ring: introduce HTS ring mode
> > >   test/ring: add contention stress test for HTS ring
> > >
> > >  app/test/Makefile                      |   3 +
> > >  app/test/meson.build                   |   3 +
> > >  app/test/test_pdump.c                  |   6 +-
> > >  app/test/test_ring_hts_stress.c        |  28 ++
> > >  app/test/test_ring_rts_stress.c        |  28 ++
> > >  app/test/test_ring_stress.c            |  27 ++
> > >  app/test/test_ring_stress.h            | 477 +++++++++++++++++++
> > >  lib/librte_pdump/rte_pdump.c           |   2 +-
> > >  lib/librte_port/rte_port_ring.c        |  12 +-
> > >  lib/librte_ring/Makefile               |   4 +-
> > >  lib/librte_ring/meson.build            |   4 +-
> > >  lib/librte_ring/rte_ring.c             |  84 +++-
> > >  lib/librte_ring/rte_ring.h             | 619 +++++++++++++++++++++++--
> > >  lib/librte_ring/rte_ring_elem.h        |   8 +-
> > >  lib/librte_ring/rte_ring_hts_generic.h | 228 +++++++++
> > >  lib/librte_ring/rte_ring_rts_generic.h | 240 ++++++++++
> > >  16 files changed, 1721 insertions(+), 52 deletions(-)
> > >  create mode 100644 app/test/test_ring_hts_stress.c
> > >  create mode 100644 app/test/test_ring_rts_stress.c
> > >  create mode 100644 app/test/test_ring_stress.c
> > >  create mode 100644 app/test/test_ring_stress.h
> > >  create mode 100644 lib/librte_ring/rte_ring_hts_generic.h
> > >  create mode 100644 lib/librte_ring/rte_ring_rts_generic.h
> > >
> > > --
> > > 2.17.1