> From: Jerin Jacob [mailto:jerinjac...@gmail.com]
> Sent: Thursday, 6 January 2022 17.55
>
> On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup <m...@smartsharesystems.com> wrote:
> >
> > A flush threshold for the mempool cache was introduced in DPDK version
> > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > then.
> >
> > The incompleteness did not cause any functional bugs, so this patch
> > could be considered refactoring for the purpose of cleaning up.
> >
> > This patch completes the update of rte_mempool_do_generic_get() as
> > follows:
> >
> > 1. A few comments were misplaced or no longer correct.
> > Some comments have been updated/added/corrected.
> >
> > 2. The code that initially screens the cache request was not updated.
> > The initial screening compared the request length to the cache size,
> > which was correct before, but became irrelevant with the introduction of
> > the flush threshold. E.g. the cache can hold up to flushthresh objects,
> > which is more than its size, so some requests were not served from the
> > cache, even though they could be.
> > The initial screening has now been corrected to match the initial
> > screening in rte_mempool_do_generic_put(), which verifies that a cache
> > is present, and that the length of the request does not overflow the
> > memory allocated for the cache.
> >
> > 3. The code flow for satisfying the request from the cache was weird.
> > The likely code path, where the objects are simply served from the cache,
> > was treated as unlikely; now it is treated as likely.
> > And in the code path where the cache was backfilled first, numbers were
> > added to and subtracted from the cache length; now this code path simply
> > sets the cache length to its final value.
> >
> > 4. The objects were returned in reverse order.
> > Returning the objects in reverse order is not necessary, so rte_memcpy()
> > is now used instead.
>
> Have you checked the performance with network workload?
> IMO, reverse order makes sense (LIFO vs. FIFO).
> The LIFO makes the cache warm, as the same buffers are reused
> frequently.
I have not done any performance testing. We probably agree that the only major difference lies in how the objects are returned. And we probably also agree that rte_memcpy() is faster than the copy loop it replaced, especially when n is constant at compile time. So the performance difference mainly depends on the application, which I will discuss below.

Let's first consider LIFO vs. FIFO. The key argument for the rte_memcpy() optimization is that we are still getting the burst of objects from the top of the stack (LIFO); only the order of the objects inside the burst is no longer reversed.

Here is an example: The cache initially contains 8 objects: 01234567. 8 more objects are put into the cache: 89ABCDEF. The cache now holds: 0123456789ABCDEF. Getting 4 objects from the cache gives us CDEF instead of FEDC, i.e. we are still getting the 4 objects most recently put into the cache.

Furthermore, if the application works with fixed-size bursts, it will usually put and get bursts of the same size, i.e. put the burst 89ABCDEF into the cache, and then get the burst 89ABCDEF from the cache again.

Here is an example of an unfavorable scenario: The cache initially contains 4 objects, which have gone cold: 0123. 4 more objects, which happen to be hot, are put into the cache: 4567. Getting 8 objects from the cache gives us 01234567 instead of 76543210. Now, if the application only processes the first 4 of the 8 objects in the burst, it would have benefited from those objects being the hot 7654 objects instead of the cold 0123 objects. However, I think that most applications process the complete burst, so I consider this scenario unlikely. Similarly, a pipelined application doesn't process objects in reverse order at every other step in the pipeline, even though the previous step in the pipeline most recently touched the last object of the burst.
My overall conclusion is that the benefit of using rte_memcpy() outweighs the disadvantage of the unfavorable scenario, because I consider the probability of that scenario occurring very low. But again: it mainly depends on the application.

If anyone disagrees with the risk analysis described above, I will happily provide a version 2 of the patch, where the objects are still returned in reverse order. After all, the rte_memcpy() benefit is relatively small compared to the impact if the unfavorable scenario does occur.