> > On Mar 1, 2024, at 5:16 AM, Morten Brørup <m...@smartsharesystems.com> > > wrote: > > > >> From: Konstantin Ananyev [mailto:konstantin.anan...@huawei.com] > >> Sent: Thursday, 22 February 2024 17.16 > >> > >>> For some reason your email is not visible to me, even though it's in the > >>> archive. > >> > >> No worries. > >> > >>> > >>> On 02/11/202416:32,Konstantin Ananyev konstantin.v.ananyev wrote: > >>> > >>>> From one side the code itself is very small and straightforward, > from > >> other side - it is not clear to me what is intended usage for it > >>>> within DPDK and it's applianances? > >>>> Konstantin > >>> > >>> The intended usage is explained in the cover email (see below) and > >> demonstrated > >>> in the test supplied in the following patch - when sending arrays of > >> pointers > >>> between cores as it happens in a forwarding example. > >> > >> Yes, I saw that. The thing is that test is a 'synthetic' one. > >> My question was about how do you expect people to use it in more realistic > >> scenarios? > >> Let say user has a bunch of mbuf pointers, possibly from different > >> mempools. > >> How he can use this API: how to deduce the base pointer for all of them and > >> what to > >> do if it can't be done? > > > > I share Konstantin's concerns with this feature. > > > > If we want to compress mbuf pointers in applications with a few mbuf pools, > > e.g. an mbuf pool per CPU socket, the compression > algorithm would be different. > This feature is targeted for pipeline mode of applications. We see many > customers using pipeline mode. This feature helps in reducing > the cost of transferring the packets between cores by reducing the copies > involved.
I do understand the intention, and I am not arguing about usefulness of the pipeline model. My point is you are introducing new API: compress/decompress pointers, but don't provide (or even describe) any proper way for the developer to use it in a safe and predictable manner. Which from my perspective make it nearly useless and misleading. > For an application with multiple pools, it depends on how the applications > are using multiple pools. But, if there is a bunch of packets > belonging to multiple mempools, compressing those mbufs may not be possible. > But if those mbufs are grouped per mempool and > are transferred on different queues, then it is possible. Hence the APIs are > implemented very generically. Ok, let's consider even more simplistic scenario - all pointers belong to one mempool. AFAIK, even one mempool can contain elements from different memzones, and these memzones are not guaranteed to have consecutive VAs. So even one mempool, with total size <=4GB can contain elements with distances between them more than 4GB. Now let say at startup user created a mempool, how he can determine programmatically can he apply your compress API safely on it or not? I presume that if you are serious about this API usage, then such ability has to be provided. Something like: int compress_pointer_deduce_mempool_base(const struct rte_memepool *mp[], uint32_t nb_mp, uint32_t compress_size, uintptr_t *base_ptr); Or probably even more generic one: struct mem_buf {uintptr_t base, size_t len;}; int compress_pointer_deduce_base(const struct mem_buf *mem_buf[], uint32_t nb_membuf, uint32_t compress_size, uintptr_t *base_ptr); Even with these functions in-place, user has to be extra careful: - he can't add new memory chunks to these mempools (or he'll need to re-calcualte the new base_ptr) - he needs to make sure that pointers from only these mempools will be used by compress/decompress. But at least it provides some ability to use this feature in real apps. With such API in place it should be possible to make the auto-test more realistic: - allocate mempool - deduce base_pointer - then we can have a loop with producer/consumer to mimic realistic workload. As an example: producer(s): mempool_alloc(); <fill mbuf with some values>; ring_enqueue(); consumer(s): ring_dequeue(); <read_and_check_mbuf_data>; free_mbuf(); - free mempool Or probably you can go even further: take some existing pipeline sample app and make it use compress/decompress API. That will provide people with some ability to test it and measure it's perf impact. Again, it will provide an example of the amount of changes required to enable it. My speculation here that majority of users will find the effort too big, while the gain way too limited and fragile. But at least, there would be some realistic reference point for it and users can decide themselves is it worth it or not. > > > > I would like to add: > > If we want to offer optimizations specifically for applications with a > > single mbuf pool, I think it should be considered in a system-wide > context to determine if performance could be improved in more areas. > > E.g. removing the pool field from the rte_mbuf structure might free up > > space to move hot fields from the second cache line to the > first, so the second cache line rarely needs to be touched. (As an > alternative to removing the pool field, it could be moved to the > second cache line, only to be used if the global "single mbuf pool" is NULL.) > Agree on this. The feedback I have received is on similar lines, many are > using simple features. I also received feedback that 90% of > the applications use less than 4GB of memory for mbuf and burst sizes are up > to 256. Well, from my perspective the story is completely different: Majority of real-world apps I am aware do use multiple mempools, it is also not uncommon to have a mempools with size bigger then 4GB (8/16). Again, there are queries to make mempools growable/shrinkable on demand. > > > > On the other hand, I agree that pointer compression can be useful for some > > applications, so we should accept it. > > > > However, pointer compression has nothing to do with the underlying hardware > > or operating system, so it does not belong in the EAL > (which is already too bloated). It should be a separate library. > Yes, this is generic (though there is SIMD code). We could move it out of EAL. > > > > >> > >>> On 01/11/2023 18:12, Paul Szczepanek wrote: > >>> > >>>> This patchset is proposing adding a new EAL header with utility functions > >>>> that allow compression of arrays of pointers. > >>>> > >>>> When passing caches full of pointers between threads, memory containing > >>>> the pointers is copied multiple times which is especially costly between > >>>> cores. A compression method will allow us to shrink the memory size > >>>> copied. > >>>> > >>>> The compression takes advantage of the fact that pointers are usually > >>>> located in a limited memory region (like a mempool). We can compress them > >>>> by converting them to offsets from a base memory address. > >>>> > >>>> Offsets can be stored in fewer bytes (dictated by the memory region size > >>>> and alignment of the pointer). For example: an 8 byte aligned pointer > >>>> which is part of a 32GB memory pool can be stored in 4 bytes. The API is > >>>> very generic and does not assume mempool pointers, any pointer can be > >>>> passed in. > >>>> > >>>> Compression is based on few and fast operations and especially with > >>>> vector > >>>> instructions leveraged creates minimal overhead. > >>>> > >>>> The API accepts and returns arrays because the overhead means it only is > >>>> worth it when done in bulk. > >>>> > >>>> Test is added that shows potential performance gain from compression. In > >>>> this test an array of pointers is passed through a ring between two > >>>> cores. > >>>> It shows the gain which is dependent on the bulk operation size. In this > >>>> synthetic test run on ampere altra a substantial (up to 25%) performance > >>>> gain is seen if done in bulk size larger than 32. At 32 it breaks even > >>>> and > >>>> lower sizes create a small (less than 5%) slowdown due to overhead. > >>>> > >>>> In a more realistic mock application running the l3 forwarding dpdk > >>>> example that works in pipeline mode on two cores this translated into a > >>>> ~5% throughput increase on an ampere altra. > > > > Which burst size was used to achieve this ~5% throughput increase? > This is the stock L3fwd application which is split into 2 stages: RX, L3fwd, > TX. The default burst size 32 is used. > > > > >>>> > >>>> v2: > >>>> * addressed review comments (style, explanations and typos) > >>>> * lowered bulk iterations closer to original numbers to keep runtime > >>>> short > >>>> * fixed pointer size warning on 32-bit arch > >>>> v3: > >>>> * added 16-bit versions of compression functions and tests > >>>> * added documentation of these new utility functions in the EAL guide > >>>> v4: > >>>> * added unit test > >>>> * fix bug in NEON implementation of 32-bit decompress > >>>> v5: > >>>> * disable NEON and SVE implementation on AARCH32 due to wrong pointer > >>>> size > >>>> > >>>> Paul Szczepanek (4): > >>>> eal: add pointer compression functions > >>>> test: add pointer compress tests to ring perf test > >>>> docs: add pointer compression to the EAL guide > >>>> test: add unit test for ptr compression > >>>> > >>>> .mailmap | 1 + > >>>> app/test/meson.build | 1 + > >>>> app/test/test_eal_ptr_compress.c | 108 ++++++ > >>>> app/test/test_ring.h | 94 ++++- > >>>> app/test/test_ring_perf.c | 354 ++++++++++++------ > >>>> .../prog_guide/env_abstraction_layer.rst | 142 +++++++ > >>>> lib/eal/include/meson.build | 1 + > >>>> lib/eal/include/rte_ptr_compress.h | 266 +++++++++++++ > >>>> 8 files changed, 843 insertions(+), 124 deletions(-) > >>>> create mode 100644 app/test/test_eal_ptr_compress.c > >>>> create mode 100644 lib/eal/include/rte_ptr_compress.h > >>>> > >>>> -- > >>>> 2.25.1 > >>>>