On 4/20/06, David S. Miller <[EMAIL PROTECTED]> wrote:
> Yes, and it means that the memory bandwidth costs are equivalent
> between I/O AT and cpu copy.
The following is a response from the I/OAT architects. I only point out that this is not coming directly from me because I have not seen the data to verify the claims regarding the speed of a copy vs. a load and the cost of the rep mov instruction. I'll encourage more direct participation in this discussion from the architects moving forward.

- Chris

Let's talk about the caching benefits that are seemingly lost when using the DMA engine. The intent of the DMA engine is to save the CPU cycles spent copying data (rep mov). In cases where the destination is already warm in the cache (due to destination buffer re-use) and the source is in memory, the cycles spent in a host copy are due not just to the cache misses it encounters while bringing in the source, but also to the execution of rep mov itself within the host core. If you contrast this with simply touching (loading) the data residing in memory, the cost of that load is primarily the cost of the cache misses and not so much CPU execution time (a rough copy-vs-touch sketch follows at the end of this message). Given this, some of the following points are noteworthy:

1. While the DMA engine forces the destination to be in memory, and touching it may cause the same number of observable cache misses as a host copy with a cache-warm destination, the cost of the host copy (in terms of CPU cycles) is much higher than the cost of the touch.

2. CPU hardware prefetchers do a pretty good job of staying ahead of the fetch stream to minimize cache misses. So for loads of medium to large buffers, cache misses form a much smaller component of the data fetch time; most of it is dominated by front side bus (FSB) or memory bandwidth. For small buffers we do not use the DMA engine, but if we had to, we would insert SW prefetches that do reasonably well.

3. If the destination wasn't already warm in the cache, i.e. it was in memory or in some other CPU's cache, the host copy will have to snoop and bring the destination in, and will encounter additional misses on the destination buffer as well. These misses are the same as those encountered in #1 above when using the DMA engine and touching the data afterwards, so in effect it becomes a wash compared to the DMA engine's behavior. The case where the destination is already warm in the cache is common in benchmarks such as iperf, ttcp etc., where the same buffer is reused over and over again. Real applications typically will not exhibit this aggressive buffer re-use behavior.

4. It may take a large number of packets (and several interrupts) to satisfy a large posted buffer (say 64KB). Even if you use a host copy to warm the cache with the destination, there is no guarantee that some or all of the destination will stay in the cache before the application has a chance to read the data.

5. The source data payload (skb->data) is typically needed only once, for the copy, and has no use later. The host copy brings it into the cache, potentially polluting the cache and consuming FSB bandwidth, whereas the DMA engine avoids this altogether.

The IxChariot data posted earlier, which touches the data and yet shows an I/OAT benefit, is due to some of the reasons above. Bottom line: I agree with the cache-benefit argument for host copy with small buffers (64B to 512B), but for larger buffers and certain application scenarios (destination in memory), the DMA engine will show better performance regardless of where the destination buffer resided to begin with and where it is accessed from.
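Purely as an illustration of the copy-vs-touch distinction above, and not the measurement behind the architects' claims: here is a minimal userspace C sketch that times a host copy of a buffer against a read-only touch of the same data. The buffer size, iteration count and clock_gettime() timing are my own assumptions, and whether glibc's memcpy() actually ends up as rep mov depends on the CPU and library version, so treat any numbers it produces as rough.

/*
 * Copy-vs-touch microbenchmark sketch (userspace, illustrative only).
 * Compares the cost of a host copy (memcpy, which may be implemented
 * with rep mov on x86) against merely loading the same amount of data,
 * the way an application would read a buffer that a DMA engine had
 * already filled in memory.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (64 * 1024)   /* 64KB, like the posted buffer mentioned above */
#define ITERS    10000

static uint64_t now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
	char *src = malloc(BUF_SIZE);
	char *dst = malloc(BUF_SIZE);
	volatile uint64_t sink = 0;
	uint64_t t0, t_copy, t_touch;
	size_t off;
	int i;

	memset(src, 0xa5, BUF_SIZE);
	memset(dst, 0, BUF_SIZE);

	/* Host copy: the CPU core executes the data movement itself. */
	t0 = now_ns();
	for (i = 0; i < ITERS; i++)
		memcpy(dst, src, BUF_SIZE);
	t_copy = now_ns() - t0;

	/* Touch only: load one word per 64-byte cache line, no store. */
	t0 = now_ns();
	for (i = 0; i < ITERS; i++)
		for (off = 0; off < BUF_SIZE; off += 64)
			sink += *(volatile uint64_t *)(src + off);
	t_touch = now_ns() - t0;

	printf("copy : %llu ns\n", (unsigned long long)t_copy);
	printf("touch: %llu ns\n", (unsigned long long)t_touch);
	return 0;
}

With both buffers cache-resident this mostly exposes the execution cost of the copy itself; running it with buffers larger than the last-level cache shifts the comparison toward the miss/bandwidth-dominated case the architects describe.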