On 6/26/2020 1:41 PM, Van Haaren, Harry wrote: >> -----Original Message----- >> From: Yigit, Ferruh <ferruh.yi...@intel.com> >> Sent: Friday, June 26, 2020 1:08 PM >> To: Morten Brørup <m...@smartsharesystems.com>; dev@dpdk.org >> Cc: Olivier Matz <olivier.m...@6wind.com>; Van Haaren, Harry >> <harry.van.haa...@intel.com>; Ananyev, Konstantin >> <konstantin.anan...@intel.com> >> Subject: Re: [dpdk-dev] rte_ether_addr_copy() strange comment >> >> On 6/25/2020 4:45 PM, Morten Brørup wrote: >>> The function rte_ether_addr_copy() checks for __INTEL_COMPILER and has a >> comment about "a strange gcc warning". It says: >>> >>> static inline void rte_ether_addr_copy(const struct rte_ether_addr *ea_from, >>> struct rte_ether_addr *ea_to) >>> { >>> #ifdef __INTEL_COMPILER >>> uint16_t *from_words = (uint16_t *)(ea_from->addr_bytes); >>> uint16_t *to_words = (uint16_t *)(ea_to->addr_bytes); >>> >>> to_words[0] = from_words[0]; >>> to_words[1] = from_words[1]; >>> to_words[2] = from_words[2]; >>> #else >>> /* >>> * Use the common way, because of a strange gcc warning. >>> */ >>> *ea_to = *ea_from; >>> #endif >>> } >>> >>> I can see that from_words discards the const qualifier. Is that the >>> "strange" gcc >> warning the comment is referring to? >>> >>> This goes back to before the first public release of DPDK in 2013, ref. >> https://git.dpdk.org/dpdk/log/lib/librte_ether/rte_ether.h >>> >>> >>> It should be fixed as follows: >>> >>> -uint16_t *from_words = (uint16_t *)(ea_from->addr_bytes); >>> -uint16_t *to_words = (uint16_t *)(ea_to->addr_bytes); >>> +const uint16_t *from_words = (const uint16_t *)ea_from; >>> +uint16_t *to_words = (uint16_t *)ea_to; >>> >>> And the consequential question: Is copying the three shorts faster than >> copying the struct? In other words: Should we get rid of the #ifdef and use >> the >> first method only? >> >> >> I tried to investigate this in godbolt: https://godbolt.org/z/YSmvDn > > There was a small hiccup in the struct mac definition, it is aligned to 2, > not 16 as the above Godbolt link... > With the aligned attribute changed to 2 (as per DPDK header > https://git.dpdk.org/dpdk/tree/lib/librte_net/rte_ether.h#n59 ) > we get the required (but less performant) smaller stores: > > WORD_COPY = 0, Aligned = 16 > NOTE: Incorrect alignment provided, and invalid ASM as it stores over the 10 > bytes after eth addr. > This code is from a GodBolt test only, and this bug is NOT present in > upstream DPDK.
Thanks for the clarification. (not sure how I end up testing 16 byte alignment) <...> > PS: For extra bonus points, here's a SIMD version that only uses one store > https://godbolt.org/z/VAR2La. Unless you intend on copying billions of > L1 resident eth addrs, this may or may not be a useful optimization. > Note that it requires the 10 bytes after the ether addr to be valid to read. > It loads 16B across both SRC and DST, blends 48 bits of SRC into DST and > writes the result back to DST. > movdqu (%rsi), %xmm0 > movdqu (%rdi), %xmm1 > pblendw $7, %xmm1, %xmm0 > movups %xmm0, (%rdi) > ret > > Actually, its possible to do this using a uint64_t (8 byte scalar) load/store > too, > with some masking and bitwise OR... left as an exercise to the reader? :) > Does below work? (not for real life usage, just to experiment single store solutions :) [https://godbolt.org/z/TmqwQh] movzwl 6(%rdi), %eax salq $48, %rax orq (%rsi), %rax movq %rax, (%rdi) ret ---- void copy(struct mac *dst, const struct mac *src) { uint64_t *s = (uint64_t *) &src->addr; uint64_t *d = (uint64_t *) &dst->addr; uint16_t dd = ((uint16_t *)d)[3]; *d = (*s & ~(0xffffUL<48)) | ((uint64_t)dd << 48); }