Thanks for the thoughts, Morten.
I believe we need benchmarks of different scenarios with different drivers.


31/10/2020 19:20, Morten Brørup:
> Thomas,
> 
> Adding my thoughts to the already detailed feedback on this important patch...
> 
> The first cache line is not inherently "hotter" than the second. The hotness 
> depends on their usage.
> 
> The mbuf cacheline1 marker has the following comment:
> /* second cache line - fields only used in slow path or on TX */
> 
> In other words, the second cache line is intended not to be touched in fast 
> path RX.
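> 
> (Side note: one way to check where a field actually lands is a compile-time
> assert on its offset. A minimal sketch, assuming 64-byte cache lines and
> taking m->pool as the example:
> 
>   #include <stddef.h>
>   #include <rte_common.h>
>   #include <rte_mbuf.h>
> 
>   /* with today's layout, m->pool is expected in the second cache line */
>   _Static_assert(offsetof(struct rte_mbuf, pool) >= RTE_CACHE_LINE_MIN_SIZE,
>                  "m->pool expected in the second cache line");
> 
> Such an assert makes it visible at build time whether a reshuffle actually
> moved a field across the cache line boundary.)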
> 
> I do not think this is true anymore. Not even with simple non-scattered RX. 
> And regression testing probably didn't catch this, because the tests perform 
> TX after RX, so the cache miss moved from TX to RX and became a cache hit in 
> TX instead. (I may be wrong about this claim, but it's not important for the 
> discussion.)
> 
> I think the right question for this patch is: Can we achieve this - not using 
> the second cache line for fast path RX - again by putting the right fields in 
> the first cache line?
> 
> Probably not in all cases, but perhaps for some...
> 
> Consider the application scenarios.
> 
> When a packet is received, one of three things happens to it:
> 1. It is immediately transmitted on one or more ports.
> 2. It is immediately discarded, e.g. by a firewall rule.
> 3. It is put in some sort of queue, e.g. a ring for the next pipeline stage, 
> or in a QoS queue.
> 
> 1. If the packet is immediately transmitted, the m->tx_offload field in the 
> second cache line will be touched by the application and TX function anyway, 
> so we don't need to optimize the mbuf layout for this scenario.
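> 
> A rough sketch of this case, just to show why the second cache line gets
> touched anyway (port numbers, burst size and the l2_len/l3_len setup are
> placeholders, not taken from any particular application):
> 
>   #include <rte_ethdev.h>
>   #include <rte_ether.h>
>   #include <rte_ip.h>
>   #include <rte_mbuf.h>
> 
>   /* scenario 1: immediate forwarding; preparing TX offloads and the TX
>    * function itself both touch the second cache line */
>   static void fwd_burst(uint16_t rx_port, uint16_t tx_port)
>   {
>       struct rte_mbuf *pkts[32];
>       uint16_t nb = rte_eth_rx_burst(rx_port, 0, pkts, 32);
> 
>       for (uint16_t i = 0; i < nb; i++) {
>           pkts[i]->l2_len = sizeof(struct rte_ether_hdr); /* in m->tx_offload */
>           pkts[i]->l3_len = sizeof(struct rte_ipv4_hdr);
>       }
> 
>       uint16_t sent = rte_eth_tx_burst(tx_port, 0, pkts, nb);
> 
>       for (uint16_t i = sent; i < nb; i++)
>           rte_pktmbuf_free(pkts[i]);
>   }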
> 
> 2. The second scenario touches m->pool no matter how it is implemented. The 
> application can avoid touching m->next by using rte_mbuf_raw_free(), knowing 
> that the mbuf came directly from RX and thus no other fields have been 
> touched. In this scenario, we want m->pool in the first cache line.
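> 
> A minimal sketch of that drop path (fw_drop() and next_stage() are
> hypothetical application hooks, only declared to keep the example complete):
> 
>   #include <stdbool.h>
>   #include <rte_ethdev.h>
>   #include <rte_mbuf.h>
> 
>   static bool fw_drop(const struct rte_mbuf *m); /* placeholder */
>   static void next_stage(struct rte_mbuf *m);    /* placeholder */
> 
>   /* scenario 2: rte_mbuf_raw_free() skips the next/refcnt handling of
>    * rte_pktmbuf_free(), but it still has to read m->pool */
>   static void filter_burst(uint16_t port)
>   {
>       struct rte_mbuf *pkts[32];
>       uint16_t nb = rte_eth_rx_burst(port, 0, pkts, 32);
> 
>       for (uint16_t i = 0; i < nb; i++) {
>           if (fw_drop(pkts[i]))
>               /* valid only because the mbuf comes straight from RX:
>                * single segment, refcnt == 1 */
>               rte_mbuf_raw_free(pkts[i]);
>           else
>               next_stage(pkts[i]);
>       }
>   }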
> 
> 3. Now, let's consider the third scenario, where RX is followed by enqueue 
> into a ring. If the application does nothing but put the packet into a ring, 
> we don't need to move anything into the first cache line. But applications 
> usually do more... So it is application-specific what would be good to move
> to the first cache line:
> 
> A. If the application does not use segmented mbufs, and performs analysis and 
> preparation for transmission in the initial pipeline stages, and only the 
> last pipeline stage performs TX, we could move m->tx_offload to the first
> cache line (see the sketch after point B below), which would keep the second
> cache line cold until the actual TX happens in the last pipeline stage -
> maybe even after the packet has waited
> in a QoS queue for a long time, and its cache lines have gone cold.
> 
> B. If the application uses segmented mbufs on RX, it might make sense to move
> m->next to the first cache line. (We don't use segmented mbufs, so I'm not
> sure about this.)
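> 
> For scenario A, the pattern I have in mind looks roughly like the sketch
> below (ring setup, burst sizes and the particular offload flags are
> placeholders; it assumes IPv4 checksum offload is enabled on the TX port):
> 
>   #include <rte_ethdev.h>
>   #include <rte_ether.h>
>   #include <rte_ip.h>
>   #include <rte_mbuf.h>
>   #include <rte_ring.h>
> 
>   /* early stage: fill the TX offload fields now and park the packets in a
>    * ring; they may sit there long enough for their cache lines to go cold */
>   static void classify_stage(uint16_t port, struct rte_ring *stage_ring)
>   {
>       struct rte_mbuf *pkts[32];
>       uint16_t nb = rte_eth_rx_burst(port, 0, pkts, 32);
> 
>       for (uint16_t i = 0; i < nb; i++) {
>           pkts[i]->l2_len = sizeof(struct rte_ether_hdr); /* m->tx_offload */
>           pkts[i]->l3_len = sizeof(struct rte_ipv4_hdr);
>           pkts[i]->ol_flags |= PKT_TX_IP_CKSUM | PKT_TX_IPV4;
>       }
> 
>       unsigned int q = rte_ring_enqueue_burst(stage_ring, (void **)pkts, nb, NULL);
> 
>       for (unsigned int i = q; i < nb; i++)
>           rte_pktmbuf_free(pkts[i]);
>   }
> 
>   /* last stage: only here is the TX function called */
>   static void tx_stage(uint16_t port, struct rte_ring *stage_ring)
>   {
>       struct rte_mbuf *pkts[32];
>       unsigned int nb = rte_ring_dequeue_burst(stage_ring, (void **)pkts, 32, NULL);
>       uint16_t sent = rte_eth_tx_burst(port, 0, pkts, nb);
> 
>       for (unsigned int i = sent; i < nb; i++)
>           rte_pktmbuf_free(pkts[i]);
>   }
> 
> With today's layout both stages pull in the second cache line; with
> m->tx_offload in the first cache line, only the final TX stage would.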
> 
> 
> However, reality perhaps beats theory:
> 
> Looking at the E1000 PMD, it seems like even its non-scattered RX function, 
> eth_igb_recv_pkts(), sets m->next. If only it kept its own free pool
> pre-initialized instead... I haven't investigated other PMDs, except briefly
> looking at the mlx5 PMD, and it seems like it doesn't touch m->next in RX.
> 
> I haven't looked deeper into how m->pool is being used by RX in PMDs, but I 
> suppose that it isn't touched in RX.
> 
> <rant on>
> If only we had a performance test where RX was not immediately followed by 
> TX, but the packets were passed through a large queue in-between, so RX cache 
> misses would not appear free of charge simply because they turn would-be TX
> cache misses into cache hits instead...
> <rant off>
> 
> Whatever you choose, I am sure that most applications will find it more 
> useful than the timestamp. :-)


