On Fri, Jan 7, 2022 at 2:16 PM Morten Brørup <m...@smartsharesystems.com> wrote:
>
> > From: Jerin Jacob [mailto:jerinjac...@gmail.com]
> > Sent: Thursday, 6 January 2022 17.55
> >
> > On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup
> > <m...@smartsharesystems.com> wrote:
> > >
> > > A flush threshold for the mempool cache was introduced in DPDK
> > > version 1.3, but rte_mempool_do_generic_get() was not completely
> > > updated back then.
> > >
> > > The incompleteness did not cause any functional bugs, so this patch
> > > could be considered refactoring for the purpose of cleaning up.
> > >
> > > This patch completes the update of rte_mempool_do_generic_get() as
> > > follows:
> > >
> > > 1. A few comments were misplaced or no longer correct.
> > > Some comments have been updated/added/corrected.
> > >
> > > 2. The code that initially screens the cache request was not
> > > updated. The initial screening compared the request length to the
> > > cache size, which was correct before, but became irrelevant with
> > > the introduction of the flush threshold. E.g. the cache can hold up
> > > to flushthresh objects, which is more than its size, so some
> > > requests were not served from the cache, even though they could be.
> > > The initial screening has now been corrected to match the initial
> > > screening in rte_mempool_do_generic_put(), which verifies that a
> > > cache is present, and that the length of the request does not
> > > overflow the memory allocated for the cache.
> > >
> > > 3. The code flow for satisfying the request from the cache was
> > > weird. The likely code path, where the objects are simply served
> > > from the cache, was treated as unlikely; now it is treated as
> > > likely. And in the code path where the cache was backfilled first,
> > > numbers were added to and subtracted from the cache length; now
> > > this code path simply sets the cache length to its final value.
> > >
> > > 4. The objects were returned in reverse order.
> > > Returning the objects in reverse order is not necessary, so
> > > rte_memcpy() is now used instead.
> >
> > Have you checked the performance with network workload?
> > IMO, reverse order makes sense (LIFO vs FIFO).
> > The LIFO makes the cache warm as the same buffers are reused
> > frequently.
>
> I have not done any performance testing. We probably agree that the
> only major difference lies in how the objects are returned. And we
> probably also agree that rte_memcpy() is faster than the copy loop it
> replaced, especially when n is constant at compile time. So the
> performance difference mainly depends on the application, which I will
> discuss below.
>
> Let's first consider LIFO vs. FIFO.
>
> The key argument for the rte_memcpy() optimization is that we are
> still getting the burst of objects from the top of the stack (LIFO);
> only the order of the objects inside the burst is no longer reversed.
>
> Here is an example:
>
> The cache initially contains 8 objects: 01234567.
>
> 8 more objects are put into the cache: 89ABCDEF.
>
> The cache now holds: 0123456789ABCDEF.
Agree. However, I think it may matter on machines with smaller L1
caches and with larger burst sizes, where which objects can stay in L1
under each scheme plays a role. I would suggest splitting each
performance improvement into a separate patch, for better tracking and
to quantify each performance improvement. I think the mempool
performance test and the tx-only stream mode in testpmd can quantify
the patches.

> Getting 4 objects from the cache gives us CDEF instead of FEDC, i.e.
> we are still getting the 4 objects most recently put into the cache.
>
> Furthermore, if the application is working with fixed size bursts, it
> will usually put and get the same size burst, i.e. put the burst
> 89ABCDEF into the cache, and then get the burst 89ABCDEF from the
> cache again.
>
> Here is an example unfavorable scenario:
>
> The cache initially contains 4 objects, which have gone cold: 0123.
>
> 4 more objects, which happen to be hot, are put into the cache: 4567.
>
> Getting 8 objects from the cache gives us 01234567 instead of
> 76543210.
>
> Now, if the application only processes the first 4 of the 8 objects
> in the burst, it would have benefitted from those objects being the
> hot 7654 objects instead of the cold 0123 objects.
>
> However, I think that most applications process the complete burst,
> so I do consider this scenario unlikely.
>
> Similarly, a pipelined application doesn't process objects in reverse
> order at every other step in the pipeline, even though the previous
> step in the pipeline most recently touched the last object of the
> burst.
>
> My overall conclusion was that the benefit of using rte_memcpy()
> outweighs the disadvantage of the unfavorable scenario, because I
> consider the probability of the unfavorable scenario occurring very
> low. But again: it mainly depends on the application.
>
> If anyone disagrees with the risk analysis described above, I will
> happily provide a version 2 of the patch, where the objects are still
> returned in reverse order.
> After all, the rte_memcpy() benefit is relatively small compared to
> the impact if the unlikely scenario occurs.
>