Hi Konstantin,

> -----Original Message-----
> From: Konstantin Ananyev <konstantin.anan...@huawei.com>
> Sent: Thursday, January 4, 2024 09:47
>
> > > This is a blocker, showstopper for me.
> > +1
> >
> > > Have you considered having something like
> > >     rte_flow_create_bulk()
> > >
> > > or better yet a Linux iouring style API?
> > >
> > > A ring style API would allow for better mixed operations across the
> > > board and get rid of the I-cache overhead which is the root cause of the needing inline.
> >
> > Existing async flow API is somewhat close to the io_uring interface.
> > The difference being that queue is not directly exposed to the application.
> > Application interacts with the queue using rte_flow_async_* APIs (e.g., places operations in the queue, pushes them to the HW).
> > Such design has some benefits over a flow API which exposes the queue to the user:
> > - Easier to use - Applications do not manage the queue directly, they do it through exposed APIs.
> > - Consistent with other DPDK APIs - In other libraries, queues are manipulated through API, not directly by an application.
> > - Lower memory usage - only HW primitives are needed (e.g., HW queue on PMD side), no need to allocate separate application queues.
> >
> > Bulking of flow operations is a tricky subject.
> > Compared to packet processing, where it is desired to keep the manipulation of raw packet data to the minimum (e.g., only packet headers are accessed), during flow rule creation all items and actions must be processed by PMD to create a flow rule.
> > The amount of memory consumed by items and actions themselves during this process might be nonnegligible.
> > If flow rule operations were bulked, the size of working set of memory would increase, which could have negative consequences on the cache behavior.
> > So, it might be the case that by utilizing bulking the I-cache overhead is removed, but the D-cache overhead is added.
>
> Is rte_flow struct really that big?
> We do bulk processing for mbufs, crypto_ops, etc., and usually bulk processing improves performance, not degrades it.
> Of course bulk size has to be somewhat reasonable.

It does not really depend on the size of the rte_flow struct itself (it is opaque to the user), but on the sizes of the items and actions which are the parameters of flow operations.
To create a flow through the async flow API, the following is needed:

- an array of items and their specs,
- an array of actions and their configuration,
- a pointer to the template table,
- the indexes of the pattern and actions templates to be used.

If we assume a simple case of ETH/IPV4/TCP/END match and COUNT/RSS/END actions, then we have at most:

- 4 items (32B each) + 3 specs (20B each) = 188B
- 3 actions (16B each) + 2 configurations (4B and 40B) = 92B
- 8B for the table pointer
- 2B for the template indexes

In total: 290B. A bulk API could be designed in such a way that a single bulk operates on a single set of tables and templates - this would remove the 10B of table pointer and template indexes. Flow actions could be based on actions templates alone (so the 44B of action configuration would not be needed), but the items' specs are still required. That would leave us at 236B per flow, i.e. at least 4 cache lines (assuming everything is tightly packed) and almost twice the size of an mbuf. Depending on the bulk size, this might be a much more significant chunk of the cache.
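As a sanity check, the estimate above can be reproduced directly from the rte_flow struct definitions. A minimal sketch, assuming a 64-bit build (exact sizes may vary between DPDK versions):

#include <stdio.h>
#include <rte_flow.h>

int
main(void)
{
	/* ETH/IPV4/TCP/END match: 4 items, 3 of them carrying a spec. */
	size_t items = 4 * sizeof(struct rte_flow_item);      /* 4 * 32B */
	size_t specs = sizeof(struct rte_flow_item_eth) +     /* 20B */
		       sizeof(struct rte_flow_item_ipv4) +    /* 20B */
		       sizeof(struct rte_flow_item_tcp);      /* 20B */
	/* COUNT/RSS/END actions: 3 actions, 2 of them carrying a conf. */
	size_t actions = 3 * sizeof(struct rte_flow_action);  /* 3 * 16B */
	size_t confs = sizeof(struct rte_flow_action_count) + /* 4B */
		       sizeof(struct rte_flow_action_rss);    /* 40B */
	/* Per-flow table pointer plus two one-byte template indexes. */
	size_t refs = sizeof(struct rte_flow_template_table *) + 2;

	/* Expected to print ~290 bytes with the sizes quoted above. */
	printf("per-flow argument footprint: %zu B\n",
	       items + specs + actions + confs + refs);
	return 0;
}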
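For reference, this is roughly the enqueue/push/poll interaction of the existing async flow API discussed in this thread. A sketch only: port and queue configuration, the template table, and pattern/action arrays matching the templates at index 0 (an assumption here) are prepared elsewhere:

#include <rte_flow.h>

static int
create_one_flow_async(uint16_t port_id, uint32_t queue_id,
		      struct rte_flow_template_table *table,
		      const struct rte_flow_item pattern[],
		      const struct rte_flow_action actions[])
{
	const struct rte_flow_op_attr op_attr = { .postpone = 0 };
	struct rte_flow_op_result res;
	struct rte_flow_error error;
	struct rte_flow *flow;
	int n;

	/* Place the creation operation in the flow queue. */
	flow = rte_flow_async_create(port_id, queue_id, &op_attr, table,
				     pattern, 0, actions, 0, NULL, &error);
	if (flow == NULL)
		return -1;

	/* Push all queued operations to the HW. */
	if (rte_flow_push(port_id, queue_id, &error) < 0)
		return -1;

	/* Poll until the completion of the operation is delivered. */
	do {
		n = rte_flow_pull(port_id, queue_id, &res, 1, &error);
		if (n < 0)
			return -1;
	} while (n == 0);

	return res.status == RTE_FLOW_OP_SUCCESS ? 0 : -1;
}

In real use a single push/pull covers a whole batch of enqueued operations; pulling one result at a time is done here only to keep the sketch short.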
I don't want to dismiss the idea - I think it is worth evaluating. However, I am not entirely convinced that a bulking API would bring performance benefits.

Best regards,
Dariusz Sosnowski