Thanks for the thoughts, Morten. I believe we need benchmarks covering the different scenarios with different drivers.
31/10/2020 19:20, Morten Brørup:
> Thomas,
>
> Adding my thoughts to the already detailed feedback on this important patch...
>
> The first cache line is not inherently "hotter" than the second. The hotness
> depends on their usage.
>
> The mbuf cacheline1 marker has the following comment:
> /* second cache line - fields only used in slow path or on TX */
>
> In other words, the second cache line is intended not to be touched in fast
> path RX.
>
> I do not think this is true anymore. Not even with simple non-scattered RX.
> And regression testing probably didn't catch this, because the tests perform
> TX after RX, so the cache miss moved from TX to RX and became a cache hit in
> TX instead. (I may be wrong about this claim, but it's not important for the
> discussion.)
>
> I think the right question for this patch is: Can we achieve this - not using
> the second cache line for fast path RX - again by putting the right fields in
> the first cache line?
>
> Probably not in all cases, but perhaps for some...
>
> Consider the application scenarios.
>
> When a packet is received, one of three things happens to it:
> 1. It is immediately transmitted on one or more ports.
> 2. It is immediately discarded, e.g. by a firewall rule.
> 3. It is put in some sort of queue, e.g. a ring for the next pipeline stage,
> or in a QoS queue.
>
> 1. If the packet is immediately transmitted, the m->tx_offload field in the
> second cache line will be touched by the application and TX function anyway,
> so we don't need to optimize the mbuf layout for this scenario.
>
> 2. The second scenario touches m->pool no matter how it is implemented. The
> application can avoid touching m->next by using rte_mbuf_raw_free(), knowing
> that the mbuf came directly from RX and thus no other fields have been
> touched. In this scenario, we want m->pool in the first cache line.
>
> 3. Now, let's consider the third scenario, where RX is followed by enqueue
> into a ring. If the application does nothing but put the packet into a ring,
> we don't need to move anything into the first cache line. But applications
> usually do more... So it is application specific what would be good to move
> to the first cache line:
>
> A. If the application does not use segmented mbufs, and performs analysis and
> preparation for transmission in the initial pipeline stages, and only the
> last pipeline stage performs TX, we could move m->tx_offload to the first
> cache line, which would keep the second cache line cold until the actual TX
> happens in the last pipeline stage - maybe even after the packet has waited
> in a QoS queue for a long time, and its cache lines have gone cold.
>
> B. If the application uses segmented mbufs on RX, it might make sense to move
> m->next to the first cache line. (We don't use segmented mbufs, so I'm not
> sure about this.)
>
>
> However, reality perhaps beats theory:
>
> Looking at the E1000 PMD, it seems like even its non-scattered RX function,
> eth_igb_recv_pkts(), sets m->next. If it only kept its own free pool
> pre-initialized instead... I haven't investigated other PMDs, except briefly
> looking at the mlx5 PMD, and it seems like it doesn't touch m->next in RX.
>
> I haven't looked deeper into how m->pool is being used by RX in PMDs, but I
> suppose that it isn't touched in RX.
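For scenario 2 above, a minimal sketch of such a drop path, assuming non-scattered
RX so the mbufs still have next == NULL, nb_segs == 1 and refcnt == 1 (the port and
queue ids, burst size and drop decision are only placeholders):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static void
rx_and_drop(uint16_t port_id, uint16_t queue_id)
{
	struct rte_mbuf *pkts[BURST_SIZE];
	uint16_t nb_rx, i;

	nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SIZE);
	for (i = 0; i < nb_rx; i++) {
		/* e.g. a firewall rule decided to discard the packet;
		 * rte_mbuf_raw_free() returns the mbuf directly to m->pool,
		 * and unlike rte_pktmbuf_free() it does not walk m->next
		 * (only debug asserts read it).
		 */
		rte_mbuf_raw_free(pkts[i]);
	}
}

If m->pool were in the first cache line, this path would never need the second one.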
>
> <rant on>
> If only we had a performance test where RX was not immediately followed by
> TX, but the packets were passed through a large queue in-between, so RX cache
> misses were not free of charge because they transform TX cache misses into
> cache hits instead...
> <rant off>
>
> Whatever you choose, I am sure that most applications will find it more
> useful than the timestamp. :-)
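A rough sketch of the kind of test described in the rant, not an existing one:
two stages connected by a large ring, so RX cache misses are no longer hidden by
an immediately following TX. The port ids, burst size and the stage_ring argument
are assumptions; the ring and lcore setup are left out.

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

#define BURST_SIZE 32

static int
rx_stage(void *arg)
{
	struct rte_ring *stage_ring = arg;
	struct rte_mbuf *pkts[BURST_SIZE];
	unsigned int nb_enq;
	uint16_t nb_rx;

	for (;;) {
		nb_rx = rte_eth_rx_burst(0, 0, pkts, BURST_SIZE);
		if (nb_rx == 0)
			continue;
		/* a large ring between the stages lets the mbufs go cold
		 * before the TX stage picks them up */
		nb_enq = rte_ring_enqueue_burst(stage_ring,
				(void **)pkts, nb_rx, NULL);
		for (; nb_enq < nb_rx; nb_enq++)
			rte_pktmbuf_free(pkts[nb_enq]);
	}
	return 0;
}

static int
tx_stage(void *arg)
{
	struct rte_ring *stage_ring = arg;
	struct rte_mbuf *pkts[BURST_SIZE];
	unsigned int nb_deq;
	uint16_t nb_tx;

	for (;;) {
		nb_deq = rte_ring_dequeue_burst(stage_ring,
				(void **)pkts, BURST_SIZE, NULL);
		if (nb_deq == 0)
			continue;
		nb_tx = rte_eth_tx_burst(1, 0, pkts, nb_deq);
		for (; nb_tx < nb_deq; nb_tx++)
			rte_pktmbuf_free(pkts[nb_tx]);
	}
	return 0;
}

The two functions would be launched on separate lcores with rte_eal_remote_launch(),
and the ring created with rte_ring_create() large enough (several thousand entries)
that packets have a real chance of leaving the cache between the stages.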