> From: Emil Berg [mailto:emil.b...@ericsson.com]
> Sent: Monday, 27 June 2022 14.51
>
> > From: Emil Berg
> > Sent: 27 June 2022 14:46
> >
> > > From: Mattias Rönnblom <hof...@lysator.liu.se>
> > > Sent: 27 June 2022 14:28
> > >
> > > On 2022-06-23 14:51, Morten Brørup wrote:
> > > >> From: Morten Brørup [mailto:m...@smartsharesystems.com]
> > > >> Sent: Thursday, 23 June 2022 14.39
> > > >>
> > > >> With this patch, the checksum can be calculated on an unaligned
> > > >> buffer. I.e. the buf parameter is no longer required to be 16 bit
> > > >> aligned.
> > > >>
> > > >> The checksum is still calculated using a 16 bit aligned pointer,
> > > >> so the compiler can auto-vectorize the function's inner loop.
> > > >>
> > > >> When the buffer is unaligned, the first byte of the buffer is
> > > >> handled separately. Furthermore, the calculated checksum of the
> > > >> buffer is byte shifted before being added to the initial checksum,
> > > >> to compensate for the checksum having been calculated on the
> > > >> buffer shifted by one byte.
> > > >>
> > > >> v4:
> > > >> * Add copyright notice.
> > > >> * Include stdbool.h (Emil Berg).
> > > >> * Use RTE_PTR_ADD (Emil Berg).
> > > >> * Fix one more typo in commit message. Is 'unligned' even a word?
> > > >> v3:
> > > >> * Remove braces from single statement block.
> > > >> * Fix typo in commit message.
> > > >> v2:
> > > >> * Do not assume that the buffer is part of an aligned packet buffer.
> > > >>
> > > >> Bugzilla ID: 1035
> > > >> Cc: sta...@dpdk.org
> > > >>
> > > >> Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
> > > >> ---
> > > >>  lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
> > > >>  1 file changed, 27 insertions(+), 5 deletions(-)
> > > >>
> > > >> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> > > >> index b502481670..738d643da0 100644
> > > >> --- a/lib/net/rte_ip.h
> > > >> +++ b/lib/net/rte_ip.h
> > > >> @@ -3,6 +3,7 @@
> > > >>   * The Regents of the University of California.
> > > >>   * Copyright(c) 2010-2014 Intel Corporation.
> > > >>   * Copyright(c) 2014 6WIND S.A.
> > > >> + * Copyright(c) 2022 SmartShare Systems.
> > > >>   * All rights reserved.
> > > >>   */
> > > >>
> > > >> @@ -15,6 +16,7 @@
> > > >>   * IP-related defines
> > > >>   */
> > > >>
> > > >> +#include <stdbool.h>
> > > >>  #include <stdint.h>
> > > >>
> > > >>  #ifdef RTE_EXEC_ENV_WINDOWS
> > > >> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> > > >>  {
> > > >>  	/* extend strict-aliasing rules */
> > > >>  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > > >> -	const u16_p *u16_buf = (const u16_p *)buf;
> > > >> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > >> +	const u16_p *u16_buf;
> > > >> +	const u16_p *end;
> > > >> +	uint32_t bsum = 0;
> > > >> +	const bool unaligned = (uintptr_t)buf & 1;
> > > >> +
> > > >> +	/* if buffer is unaligned, keeping it byte order independent */
> > > >> +	if (unlikely(unaligned)) {
> > > >> +		uint16_t first = 0;
> > > >> +		if (unlikely(len == 0))
> > > >> +			return 0;
> > > >> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> > > >> +		bsum += first;
> > > >> +		buf = RTE_PTR_ADD(buf, 1);
> > > >> +		len--;
> > > >> +	}
> > > >>
> > > >> +	/* aligned access for compiler auto-vectorization */
> > >
> > > The compiler will be able to auto vectorize even unaligned accesses,
> > > just with different instructions. From what I can tell, there's no
> > > performance impact, at least not on the x86_64 systems I tried on.
> > >
> > > I think you should remove the first special case conditional and use
> > > memcpy() instead of the cumbersome __may_alias__ construct to
> > > retrieve the data.
> >
> > Here:
> > https://www.agner.org/optimize/instruction_tables.pdf
> > it lists the latency of vmovdqa (aligned) as 6 cycles and the latency
> > for vmovdqu (unaligned) as 7 cycles. So I guess there can be some
> > difference. Although in practice I'm not sure what difference it
> > makes. I've not seen any difference in runtime between the two
> > versions.
>
> Correction to my comment:
> Those stats are for some older CPU. For some newer CPUs, such as Tiger
> Lake, the stats seem to be the same regardless of aligned or unaligned.
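To make the discussion concrete, the memcpy() variant Mattias suggests
could look roughly like the sketch below. This is illustrative only, not
a posted patch; the function name raw_cksum_memcpy and its structure are
made up. The idea is that a two-byte memcpy() is a well-defined way to
read a possibly unaligned 16-bit word, and compilers typically fold it
into a plain load, so the loop can still auto-vectorize without the
__may_alias__ cast:

#include <stdint.h>
#include <string.h>

static uint32_t
raw_cksum_memcpy(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;

	/* sum all complete 16-bit words */
	while (len >= sizeof(uint16_t)) {
		uint16_t v;

		memcpy(&v, p, sizeof(v)); /* well-defined unaligned read */
		sum += v;
		p += sizeof(v);
		len -= sizeof(v);
	}

	/* zero-pad an odd trailing byte, keeping it byte order independent */
	if (len == 1) {
		uint16_t last = 0;

		*(unsigned char *)&last = *p;
		sum += last;
	}

	return sum;
}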
I agree that the memcpy method is more elegant and easy to read. However, we would need to performance test the modified checksum function on a large number of CPUs to prove that we don't introduce a performance regression on any CPU architecture still supported by DPDK. And Emil already found a CPU where it costs 1 extra cycle per 16 bytes, which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP packet.

So I opted for a solution with zero changes to the inner loop, so no performance retesting is required (for the previously supported use cases, where the buffer is aligned).

I have previously submitted a couple of patches to fix some minor bugs in the mempool cache functions [1] and [2], while also refactoring those functions for readability. But after I had incorporated various feedback, nobody wanted to proceed with the patches, probably due to fear of performance regressions. I didn't want to risk the same with this patch.

[1] http://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35d87...@smartserver.smartshare.dk/
[2] http://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35d86...@smartserver.smartshare.dk/
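For readers wondering why the unaligned case can be compensated for
without touching the inner loop: the 16-bit one's-complement sum is byte
order independent (see RFC 1071), so summing the same bytes shifted by
one position produces the byte-swapped result. The small self-contained
demo below illustrates that property; it is not DPDK code, and cksum16
is a hypothetical helper:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* one's-complement sum over len bytes, folding end-around carries */
static uint16_t
cksum16(const unsigned char *p, size_t len)
{
	uint32_t sum = 0;

	while (len >= 2) {
		uint16_t v;

		memcpy(&v, p, 2);
		sum += v;
		p += 2;
		len -= 2;
	}
	if (len) { /* odd trailing byte, zero-padded */
		uint16_t v = 0;

		memcpy(&v, p, 1);
		sum += v;
	}
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

int
main(void)
{
	unsigned char data[] = { 0x45, 0x00, 0x00, 0x3c, 0x1c, 0x46 };
	/* same payload preceded by a zero byte: every data byte now
	 * sits at the opposite (odd/even) offset */
	unsigned char shifted[] = { 0x00, 0x45, 0x00, 0x00, 0x3c, 0x1c, 0x46 };
	uint16_t a = cksum16(data, sizeof(data));
	uint16_t b = cksum16(shifted, sizeof(shifted));

	/* the shifted sum equals the aligned sum with its bytes swapped */
	assert(b == (uint16_t)((a << 8) | (a >> 8)));
	printf("aligned 0x%04x, shifted 0x%04x\n", a, b);

	return 0;
}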