> From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com] > Sent: Wednesday, 1 March 2023 16.50 > > On 2023-03-01 14:31, Morten Brørup wrote: > >> From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com] > >> Sent: Wednesday, 1 March 2023 12.18 > >> > >> On 2023-02-28 17:01, Morten Brørup wrote: > >>>> From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com] > >>>> Sent: Tuesday, 28 February 2023 10.39 > >>> > >>> I have been looking for a high performance timer library (for use in a fast > >> path TCP stack), and this looks very useful, Mattias. > >>> > >>> My initial feedback is based on quickly skimming the patch source code, and > >> reading this cover letter. > >>> > >>>> > >>>> This patchset is an attempt to introduce a high-performance, highly > >>>> scalable timer facility into DPDK. > >>>> > >>>> More specifically, the goals for the htimer library are: > >>>> > >>>> * Efficient handling of a handful up to hundreds of thousands of > >>>> concurrent timers. > >>>> * Reduced overhead of adding and canceling timers. > >>>> * Provide a service functionally equivalent to that of > >>>> <rte_timer.h>. API/ABI backward compatibility is secondary. > >>>> > >>>> In the author's opinion, there are two main shortcomings with the > >>>> current DPDK timer library (i.e., rte_timer.[ch]). > >>>> > >>>> One is the synchronization overhead, where heavy-weight full-barrier > >>>> type synchronization is used. rte_timer.c uses per-EAL/lcore skip > >>>> lists, but any thread may add or cancel (or otherwise access) timers > >>>> managed by another lcore (and thus reside in its timer skip list). > >>>> > >>>> The other is an algorithmic shortcoming, with rte_timer.c's reliance > >>>> on a skip list, which, seemingly, is less efficient than certain > >>>> alternatives. > >>>> > >>>> This patchset implements a hierarchical timer wheel (HWT, in > >>> > >>> Typo: HWT or HTW? > >> > >> Yes.
I don't understand how I could have managed to make so many such HTW > -> > >> HWT typos. At least I got the filenames (rte_htw.[ch]) correct. > >> > >>> > >>>> rte_htw.c), as per the Varghese and Lauck paper "Hashed and > >>>> Hierarchical Timing Wheels: Data Structures for the Efficient > >>>> Implementation of a Timer Facility". A HWT is a data structure > >>>> purposely designed for this task, and used by many operating system > >>>> kernel timer facilities. > >>>> > >>>> To further improve the solution described by Varghese and Lauck, a > >>>> bitset is placed in front of each of the timer wheels in the HWT, > >>>> reducing overhead of rte_htimer_mgr_manage() (i.e., progressing time > >>>> and expiry processing). > >>>> > >>>> Cycle-efficient scanning and manipulation of these bitsets are crucial > >>>> for the HWT's performance. > >>>> > >>>> The htimer module keeps a per-lcore (or per-registered EAL thread) HWT > >>>> instance, much like rte_timer.c keeps a per-lcore skip list. > >>>> > >>>> To avoid expensive synchronization overhead for thread-local timer > >>>> management, the HWTs are accessed only from the "owning" thread. Any > >>>> interaction any other thread has with a particular lcore's timer > >>>> wheel goes over a set of DPDK rings. A side-effect of this design is > >>>> that all operations working toward a "remote" HWT must be > >>>> asynchronous. > >>>> > >>>> The <rte_htimer.h> API is available only to EAL threads and registered > >>>> non-EAL threads. > >>>> > >>>> The htimer API allows the application to supply the current time, > >>>> useful in case it already has retrieved this for other purposes, > >>>> saving the cost of a rdtsc instruction (or its equivalent). > >>>> > >>>> Relative htimer does not retrieve a new time, but reuses the current > >>>> time (as known via/at-the-time of the manage-call), again to shave off > >>>> some cycles of overhead. > >>> > >>> I have a comment to the two points above.
> >>> > >>> I agree that the application should supply the current time. > >>> > >>> This should be the concept throughout the library. I don't understand why > >> TSC is used in the library at all? > >>> > >>> Please use a unit-less tick, and let the application decide what one tick > >> means. > >>> > >> > >> I suspect the design of rte_htimer_mgr.h (and rte_timer.h) makes more > >> sense if you think of the user of the API as not just a "monolithic" > >> application, but rather a set of different modules, developed by > >> different organizations, and reused across a set of applications. The > >> idea behind the API design is that they should all be able to share one timer > >> service instance. > >> > >> The different parts of the application and any future DPDK platform > >> modules that use the htimer service need to agree on what a tick means in > >> terms of actual wall-time, if it's not mandated by the API. > > > > I see. Then those non-monolithic applications can agree that the unit of time is nanoseconds, or whatever makes sense for those applications. And then they can instantiate one shared HTW for that purpose. > > > > <rte_htimer_mgr.h> contains nothing but shared HTWs. > > > There is no need to impose such an API limit on other users of the library. > > > >> > >> There might be room for module-specific timer wheels as well, with > >> different resolution or other characteristics. The event timer adapter's > >> use of a timer wheel could be one example (although I'm not sure it is). > > > > We are not using the event device, and I have not looked into it, so I have no qualified comments on this. > > > >> > >> If timer-wheel-as-a-private-lego-piece is also a valid use case, then > >> one could consider making the <rte_htw.h> API public as well.
That is > what > >> I think you are asking for here: a generic timer wheel that doesn't know > >> anything about time sources, time source time -> tick conversion, or > >> time source time -> monotonic wall time conversion, and maybe is also > >> not bound to a particular thread. > > > > Yes, that is what I had been searching the Internet for. > > > > (I'm not sure what you mean by "not bound to a particular thread". > Your per-thread design seems good to me.) > > > > I don't want more stuff in the EAL. What I want is high-performance > DPDK libraries we can use in our applications. > > > >> > >> I picked TSC because it seemed like a good "universal time unit" for > >> DPDK. rdtsc (and its equivalent) is also very precise (especially on > >> x86) and cheap to retrieve (especially on ARM, from what I understand). > > > > The TSC does have excellent performance, but on all other parameters it is a horrible timekeeper: The measurement unit depends on the underlying hardware, the TSC drifts depending on temperature, it cannot be PTP synchronized, the list is endless! > > > >> > >> That said, at the moment, I'm leaning toward nanoseconds (uint64_t > >> format) being the default for timer expiration time instead of TSC. > >> TSC could still be an option for passing the current time, since TSC > >> will be a common time source, and it shaves off one conversion. > > > > There are many reasons why nanoseconds is a much better choice than TSC. > > > >> > >>> A unit-less tick will also let the application instantiate a HTW with higher > >> resolution than the TSC. (E.g. think about oversampling in audio processing, > >> or Bresenham's line drawing algorithm for 2D visuals - oversampling can sound > >> and look better.) > > > > Some of the timing data in our application have a resolution orders of magnitude higher than one nanosecond.
If we combined that with a HTW > library with nanosecond resolution, we would need to keep these timer > values in two locations: The original high-res timer in our data > structure, and the shadow low-res (nanosecond) timer in the HTW. > > > > There is no way you will meet timers with anything approaching > pico-second-level precision.
Correct. Our sub-nanosecond timers don't need to meet the exact time, but the higher resolution prevents loss of accuracy when a number has been added to it many times. Think of it like a special fixed-point number, where the least significant part is included to ensure accuracy in calculations, while the actual timer only considers the most significant part of the number. > You will also get into a value range issue, > since you will wrap around a 64-bit integer in a matter of days. Yes. We use timers with different scales for individual purposes. Our highest resolution is sub-nanosecond. > > The HTW only stores the timeout in ticks, not TSC, nanoseconds or > picoseconds. Excellent. Then I'm happy. > Generally, you don't want pico-second-level tick > granularity, since it increases the overhead of advancing the wheel(s). We currently use proprietary algorithms for our bandwidth scheduling. It seems that a HTW is not a good fit for this purpose. Perhaps you are offering a hammer, and it's not a good replacement for my screwdriver. I suppose that nanosecond resolution suffices for a TCP stack, which is the use case I have been looking for a timer library for. :-) > The first (lowest-significance) few wheels will pretty much always be > empty. > > > We might also need to frequently update the HTW timers to prevent drifting away from the high-res timers. E.g. 1.2 + 1.2 is still 2 when rounded, but adding 1.2 once more makes it 3, when it should have been 4 (3 * 1.2 = 3.6, which rounds to 4). This level of drifting would also make periodic timers in the HTW useless. > > > > Useless, for a certain class of applications. What application would that be? Sorry about being unclear there. Yes, I only meant the specific application I was talking about, i.e. our application for high precision bandwidth management. For reference, 1 bit at 100 Gbit/s is 10 picoseconds.
> > > Please note: I haven't really considered merging the high-res timing in our application with this HTW, and I'm also not saying that PERIODIC timers in the HTW are required or even useful for our application. I'm only providing arguments for a unit-less time! > > > >>> > >>> For reference (supporting my suggestion), the dynamic timestamp field in the > >> rte_mbuf structure is also defined as being unit-less. (I think NVIDIA > >> implements it as nanoseconds, but that's an implementation-specific choice.) > >>> > >>>> > >>>> A semantic improvement compared to the <rte_timer.h> API is that the > >>>> htimer library can give a definite answer to the question of whether the timer > >>>> expiry callback was called, after a timer has been canceled. > >>>> > >>>> Below is performance data from DPDK's 'app/test' micro benchmarks, > >>>> using 10k concurrent timers. The benchmarks (test_timer_perf.c and > >>>> test_htimer_mgr_perf.c) aren't identical in their structure, but the > >>>> numbers give some indication of the difference. > >>>> > >>>> Use case htimer timer > >>>> ------------------------------------ > >>>> Add timer 28 253 > >>>> Cancel timer 10 412 > >>>> Async add (source lcore) 64 > >>>> Async add (target lcore) 13 > >>>> > >>>> (AMD 5900X CPU. Time in TSC.) > >>>> > >>>> Prototype integration of the htimer library into real, timer-heavy > >>>> applications indicates that htimer may result in significant > >>>> application-level performance gains. > >>>> > >>>> The bitset implementation which the HWT implementation depends upon > >>>> seemed generic enough and potentially useful outside the world of > >>>> HWTs, to justify being located in the EAL. > >>>> > >>>> This patchset is very much an RFC, and the author is yet to form an > >>>> opinion on many important issues.
> >>>> > >>>> * If deemed a suitable replacement, should the htimer replace the > >>>> current DPDK timer library in some particular (ABI-breaking) > >>>> release, or should it live side-by-side with the then-legacy > >>>> <rte_timer.h> API? A lot of things in and outside DPDK depend on > >>>> <rte_timer.h>, so coexistence may be required to facilitate a smooth > >>>> transition. > >>> > >>> It's my immediate impression that they are totally different in both design > >> philosophy and API. > >>> > >>> Personal opinion: I would call it an entirely different library. > >>> > >>>> > >>>> * Should the htimer and htw-related files be colocated with rte_timer.c > >>>> in the timer library? > >>> > >>> Personal opinion: No. This is an entirely different library, and should live > >> by itself in a directory of its own. > >>> > >>>> > >>>> * Would it be useful for applications using asynchronous cancel to > >>>> have the option of having the timer callback run not only in case of > >>>> timer expiration, but also cancellation (on the target lcore)? The > >>>> timer cb signature would need to include an additional parameter in > >>>> that case. > >>> > >>> If one thread cancels something in another thread, some synchronization > >> between the threads is going to be required anyway. So we could rephrase your > >> question: Will the burden of the otherwise required synchronization between > >> the two threads be significantly reduced if the library has the ability to run > >> the callback on asynchronous cancel? > >>> > >> > >> Yes. > >> > >> Intuitively, it seems convenient that if you hand off a timer to a > >> different lcore, the timer callback will be called exactly once, > >> regardless of whether the timer was canceled or expired. > >> > >> But, as you indicate, you may still need synchronization to solve the > >> resource reclamation issue. > >> > >>> Is such a feature mostly "Must have" or "Nice to have"? > >>> > >>> More thoughts in this area...
> >>> > >>> If adding an additional callback parameter, it could be an enum, so the > >> callback could be expanded to support "timeout (a.k.a. timer fired)", "cancel" > >> and more events we have not yet come up with, e.g. "early kick". > >>> > >> > >> Yes, or an int. > >> > >>> Here's an idea off the top of my head: An additional callback parameter has > >> a (small) performance cost incurred with every timer fired (which is a very > >> large multiplier). It might not be required. As an alternative to a "what > >> happened" parameter to the callback, the callback could investigate the state > >> of the object for which the timer fired, and draw its own conclusion on how to > >> proceed. Obviously, this also has a performance cost, but perhaps the callback > >> works on the object's state anyway, making this cost insignificant. > >>> > >> > >> It's not obvious to me that you, in the timer callback, can determine > >> what happened, if the same callback is called both in the cancel and the > >> expired case. > >> > >> The cost of an extra integer passed in a register (or of checking a flag, > >> if the timer callback should be called at all at cancellation) is not > >> the concern for me; it's the extra bit of API complexity. > > > > Then introduce the library without this feature. More features can be added later. > > > > The library will be introduced as "experimental", so we are free to improve it and modify the ABI along the way. > > > >> > >>> Here's another alternative to adding a "what happened" parameter to the > >> callback: > >>> > >>> The rte_htimer could have one more callback pointer, which (if set) will be > >> called on cancellation of the timer. > >>> > >> > >> This will grow the timer struct by 16 bytes. > > > > If the rte_htimer struct stays within one cache line, it should be acceptable.
> > > > Timer structs are often embedded in other structures, and need not themselves be cache line aligned (although the "parent" struct may need to be, e.g. if it's dynamically allocated). > > So smaller is better. Just consider if you want your attosecond-level > time stamp in a struct: > > struct my_timer { > uint64_t high_precision_time_high_bits; > uint64_t high_precision_time_low_bits; > struct rte_htimer timer; > }; > > ...and you allocate those structs from a mempool. If rte_htimer is small > enough, you will fit on one cache line. Ahh... I somehow assumed they only existed as stand-alone elements inside the HTW. Then I obviously agree that smaller is better. > > > On the other hand, this approach is less generic than passing an additional parameter. (E.g. add yet another callback pointer for "early kick"?) > > > > BTW, async cancel is a form of inter-thread communication. Does this library really need to provide any inter-thread communication mechanisms? Doesn't an inter-thread communication mechanism belong in a separate library? > > > > Yes, <rte_htimer_mgr.h> needs this because: > 1) Being able to schedule timers on a remote lcore is a useful feature > (especially since we don't have much else in terms of deferred work > mechanisms in DPDK). Although remote procedures are a useful feature, providing such a feature doesn't necessarily belong in a library that uses remote procedures. > 2) htimer aspires to be a plug-in replacement for <rte_timer.h> (albeit > an ABI-breaking one). This is a good argument. But I would much rather have a highly tuned stand-alone HTW library than a plug-in replacement of the old <rte_timer.h>. > > The pure HTW is in rte_htw.[ch]. > > Plus, with the current design, async operations basically come for free > (if you don't use them), from a performance perspective. The extra > overhead boils down to occasionally polling an empty ring, which is an > inexpensive operation. OK. Then no worries.
> > >> > >>>> > >>>> * Should the rte_htimer be a nested struct, so the htw parts are separated > >>>> from the htimer parts? > >>>> > >>>> * <rte_htimer.h> is kept separate from <rte_htimer_mgr.h>, so that > >>>> <rte_htw.h> may avoid a dependency on <rte_htimer_mgr.h>. Should it > >>>> be so? > >>>> > >>>> * rte_htimer struct is only supposed to be used by the application to > >>>> give an indication of how much memory it needs to allocate, and > >>>> its members are not supposed to be directly accessed (w/ the possible > >>>> exception of the owner_lcore_id field). Should there be a dummy > >>>> struct, or a #define RTE_HTIMER_MEMSIZE or a rte_htimer_get_memsize() > >>>> function instead, serving the same purpose? Better encapsulation, > >>>> but more inconvenient for applications. Run-time dynamic sizing > >>>> would force application-level dynamic allocations. > >>>> > >>>> * Asynchronous cancellation is a little tricky to use for the > >>>> application (primarily due to timer memory reclamation/race > >>>> issues). Should this functionality be removed? > >>>> > >>>> * Should rte_htimer_mgr_init() also retrieve the current time? If so, > >>>> there should be a variant which allows the user to specify the > >>>> time (to match rte_htimer_mgr_manage_time()). One pitfall with the > >>>> current proposed API is an application calling rte_htimer_mgr_init() > >>>> and then immediately adding a timer with a relative timeout, in > >>>> which case the current absolute time used is 0, which might be a > >>>> surprise. > >>>> > >>>> * Should libdivide (optionally) be used to avoid the div in the TSC -> > >>>> tick conversion? (Doesn't improve performance on Zen 3, but may > >>>> do on other CPUs.) Consider <rte_reciprocal.h> as well. > >>>> > >>>> * Should the TSC-per-tick be rounded up to a power of 2, so shifts can be > >>>> used for conversion? Very minor performance gains to be found there, > >>>> at least on Zen 3 cores.
> >>>> > >>>> * Should it be possible to supply the time in rte_htimer_mgr_add() > >>>> and/or rte_htimer_mgr_manage_time() functions as ticks, rather > than > >>>> as TSC? Should it be possible to also use nanoseconds? > >>>> rte_htimer_mgr_manage_time() would need a flags parameter in > that > >>>> case. > >>> > >>> Do not use TSC anywhere in this library. Let the application decide > the > >> meaning of a tick. > >>> > >>>> > >>>> * Would the event timer adapter be best off using <rte_htw.h> > >>>> directly, or <rte_htimer.h>? In the latter case, there needs to > be a > >>>> way to instantiate more HWTs (similar to the "alt" functions of > >>>> <rte_timer.h>)? > >>>> > >>>> * Should the PERIODICAL flag (and the complexity it brings) be > >>>> removed? And leave the application with only single-shot > timers, and > >>>> the option to re-add them in the timer callback. > >>> > >>> First thought: Yes, keep it lean and remove the periodical stuff. > >>> > >>> Second thought: This needs a more detailed analysis. > >>> > >>> From one angle: > >>> > >>> How many PERIODICAL versus ONESHOT timers do we expect? > >>> > >> > >> I suspect you should be prepared for the ratio being anything. > > > > In theory, anything is possible. But I'm asking that we consider > realistic use cases. > > > >> > >>> Intuitively, I would use this library for ONESHOT timers, and > perhaps > >> implement my periodical timers by other means. > >>> > >>> If the PERIODICAL:ONESHOT ratio is low, we can probably live with > the extra > >> cost of cancel+add for a few periodical timers. > >>> > >>> From another angle: > >>> > >>> What is the performance gain with the PERIODICAL flag? > >>> > >> > >> None, pretty much. It's just there for convenience. > > > > OK, then I suggest that you remove it, unless you get objections. > > > > The library can be expanded with useful features at any time later. 
> Useless features are (nearly) impossible to remove, once they are in > there - they are just "technical debt" with associated maintenance > costs, added complexity weaving into other features, etc. > > > >> > >>> Without a periodical timer, cancel+add costs 10+28 cycles. How many cycles > >> would a "move" function, performing both cancel and add, use? > >>> > >>> And then compare that to the cost (in cycles) of repeating a timer with > >> PERIODICAL? > >>> > >>> Furthermore, not having the PERIODICAL flag probably improves the > >> performance for non-periodical timers. How many cycles could we gain here? > >>> > >>> > >>> Another, vaguely related, idea: > >>> > >>> The callback pointer might not need to be stored per rte_htimer, but could > >> instead be common for the rte_htw. > >>> > >> > >> Do you mean rte_htw, or rte_htimer_mgr? > >> > >> If you make one common callback, all the different parts of the > >> application need to be coordinated (in a big switch-statement, or > >> something of that sort), or have some convention for using an > >> application-specific wrapper structure (accessed via container_of()). > >> > >> This is a problem if the timer service API consumer is a set of largely > >> uncoordinated software modules. > >> > >> Btw, the eventdev API has the same issue, and the proposed event > >> dispatcher is one way to help facilitate application-internal decoupling. > >> > >> For a module-private rte_htw instance your suggestion may work, but not > >> for <rte_htimer_mgr.h>. > > > > I was speculating that a common callback pointer might provide a performance benefit for single-purpose HTW instances. (The same concept applies if there are multiple callbacks, e.g. a "Timer Fired", a "Timer Cancelled", and an "Early Kick" callback pointer - i.e. having the callback pointers per HTW instance, instead of per timer.)
> > > >> > >>> When a timer fires, the callback probably needs to check/update the state of > >> the object for which the timer fired anyway, so why not just let the > >> application use that state to determine the appropriate action. This might > >> provide some performance benefit. > >>> > >>> It might complicate using one HTW for multiple different purposes, though. > >> Probably a useless idea, but I wanted to share it anyway. It might > >> trigger other, better ideas in the community. > >>> > >>>> > >>>> * Should the async result codes and the sync cancel error codes be merged > >>>> into one set of result codes? > >>>> > >>>> * Should the rte_htimer_mgr_async_add() have a flag which allows > >>>> buffering add request messages until rte_htimer_mgr_process() is > >>>> called? Or any manage function. Would reduce ring signaling overhead > >>>> (i.e., burst enqueue operations instead of single-element > >>>> enqueue). Could also be a rte_htimer_mgr_async_add_burst() function, > >>>> solving the same "problem" a different way. (The signature of such > >>>> a function would not be pretty.) > >>>> > >>>> * Does the functionality provided by the rte_htimer_mgr_process() > >>>> function match its use cases? Should there be a clearer > >>>> separation between expiry processing and asynchronous operation > >>>> processing? > >>>> > >>>> * Should the patchset be split into more commits? If so, how? > >>>> > >>>> Thanks to Erik Carrillo for his assistance. > >>>> > >>>> Mattias Rönnblom (2): > >>>> eal: add bitset type > >>>> eal: add high-performance timer facility > >