On Thu, Dec 01, 2016 at 08:19:42PM -0500, Zhihong Wang wrote:
> This patch optimizes Vhost performance for large packets when the
> Mergeable Rx buffer feature is enabled. It introduces a dedicated
> memcpy function for vhost enqueue/dequeue to replace rte_memcpy.
>
> The reason is that rte_memcpy is for general cases, it handles
> unaligned copies and make store aligned, it even makes load aligned
> for micro architectures like Ivy Bridge. However alignment handling
> comes at a price: It introduces extra load/store instructions.
>
> Vhost memcpy is rather special: The copy is aligned, and remote,
> and there is header write along which is also remote. In this case
> the memcpy instruction stream should be simplified, to reduce extra
> load/store, therefore reduce the probability of load/store buffer
> full caused pipeline stall, to let the actual memcpy instructions
> be issued and let H/W prefetcher goes to work as early as possible.

...

> +/**
> + * This function is used to for vhost memcpy, to replace rte_memcpy.
> + * The reason is that rte_memcpy is for general cases, where vhost
> + * memcpy is a rather special case: The copy is aligned, and remote,
> + * and there is header write along which is also remote. In this case
> + * the memcpy instruction stream should be simplified to reduce extra
> + * load/store, therefore reduce the probability of load/store buffer
> + * full caused pipeline stall, to let the actual memcpy instructions
> + * be issued and let H/W prefetcher goes to work as early as possible.
> + */
> +static inline void __attribute__((always_inline))
> +vhost_memcpy(void *dst, const void *src, size_t n)
I like this function a lot, since it's really simple and straightforward!
Moreover, it performs better.

But I don't quite like how this function is proposed:

- rte_movX are more like internal helper functions that should be used
  only in the corresponding rte_memcpy.h file.

- It's a good optimization; however, it will not benefit other use
  cases, even though vhost is the most typical case here.

- The optimization proves to be good for x86, but I think there is no
  guarantee it behaves well on other platforms, say ARM.

I would still suggest you go this way: move this function into x86's
rte_memcpy.h and call it when the data is well aligned.

	--yliu

> +{
> +	/* Copy size <= 16 bytes */
> +	if (n < 16) {
> +		if (n & 0x01) {
> +			*(uint8_t *)dst = *(const uint8_t *)src;
> +			src = (const uint8_t *)src + 1;
> +			dst = (uint8_t *)dst + 1;
> +		}
> +		if (n & 0x02) {
> +			*(uint16_t *)dst = *(const uint16_t *)src;
> +			src = (const uint16_t *)src + 1;
> +			dst = (uint16_t *)dst + 1;
> +		}
> +		if (n & 0x04) {
> +			*(uint32_t *)dst = *(const uint32_t *)src;
> +			src = (const uint32_t *)src + 1;
> +			dst = (uint32_t *)dst + 1;
> +		}
> +		if (n & 0x08)
> +			*(uint64_t *)dst = *(const uint64_t *)src;
> +
> +		return;
> +	}
> +
> +	/* Copy 16 <= size <= 32 bytes */
> +	if (n <= 32) {
> +		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov16((uint8_t *)dst - 16 + n,
> +			(const uint8_t *)src - 16 + n);
> +
> +		return;
> +	}
> +
> +	/* Copy 32 < size <= 64 bytes */
> +	if (n <= 64) {
> +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov32((uint8_t *)dst - 32 + n,
> +			(const uint8_t *)src - 32 + n);
> +
> +		return;
> +	}
> +
> +	/* Copy 64 bytes blocks */
> +	for (; n >= 64; n -= 64) {
> +		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> +		dst = (uint8_t *)dst + 64;
> +		src = (const uint8_t *)src + 64;
> +	}
> +
> +	/* Copy whatever left */
> +	rte_mov64((uint8_t *)dst - 64 + n,
> +		(const uint8_t *)src - 64 + n);
> +}