> -----Original Message-----
> From: Emil Berg
> Sent: 27 June 2022 14:46
> To: Mattias Rönnblom <hof...@lysator.liu.se>; Morten Brørup
> <m...@smartsharesystems.com>; bruce.richard...@intel.com; dev@dpdk.org
> Cc: step...@networkplumber.org; sta...@dpdk.org; bugzi...@dpdk.org;
> olivier.m...@6wind.com
> Subject: RE: [PATCH v4] net: fix checksum with unaligned buffer
>
>
> > -----Original Message-----
> > From: Mattias Rönnblom <hof...@lysator.liu.se>
> > Sent: 27 June 2022 14:28
> > To: Morten Brørup <m...@smartsharesystems.com>; Emil Berg
> > <emil.b...@ericsson.com>; bruce.richard...@intel.com; dev@dpdk.org
> > Cc: step...@networkplumber.org; sta...@dpdk.org; bugzi...@dpdk.org;
> > olivier.m...@6wind.com
> > Subject: Re: [PATCH v4] net: fix checksum with unaligned buffer
> >
> > On 2022-06-23 14:51, Morten Brørup wrote:
> > >> From: Morten Brørup [mailto:m...@smartsharesystems.com]
> > >> Sent: Thursday, 23 June 2022 14.39
> > >>
> > >> With this patch, the checksum can be calculated on an unaligned buffer.
> > >> I.e. the buf parameter is no longer required to be 16 bit aligned.
> > >>
> > >> The checksum is still calculated using a 16 bit aligned pointer, so
> > >> the compiler can auto-vectorize the function's inner loop.
> > >>
> > >> When the buffer is unaligned, the first byte of the buffer is
> > >> handled separately. Furthermore, the calculated checksum of the
> > >> buffer is byte shifted before being added to the initial checksum,
> > >> to compensate for the checksum having been calculated on the buffer
> > >> shifted by one byte.
> > >>
> > >> v4:
> > >> * Add copyright notice.
> > >> * Include stdbool.h (Emil Berg).
> > >> * Use RTE_PTR_ADD (Emil Berg).
> > >> * Fix one more typo in commit message. Is 'unligned' even a word?
> > >> v3:
> > >> * Remove braces from single statement block.
> > >> * Fix typo in commit message.
> > >> v2:
> > >> * Do not assume that the buffer is part of an aligned packet buffer.
> > >>
> > >> Bugzilla ID: 1035
> > >> Cc: sta...@dpdk.org
> > >>
> > >> Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
> > >> ---
> > >>  lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
> > >>  1 file changed, 27 insertions(+), 5 deletions(-)
> > >>
> > >> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> > >> index b502481670..738d643da0 100644
> > >> --- a/lib/net/rte_ip.h
> > >> +++ b/lib/net/rte_ip.h
> > >> @@ -3,6 +3,7 @@
> > >>   * The Regents of the University of California.
> > >>   * Copyright(c) 2010-2014 Intel Corporation.
> > >>   * Copyright(c) 2014 6WIND S.A.
> > >> + * Copyright(c) 2022 SmartShare Systems.
> > >>   * All rights reserved.
> > >>   */
> > >>
> > >> @@ -15,6 +16,7 @@
> > >>   * IP-related defines
> > >>   */
> > >>
> > >> +#include <stdbool.h>
> > >>  #include <stdint.h>
> > >>
> > >>  #ifdef RTE_EXEC_ENV_WINDOWS
> > >> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len,
> > >> uint32_t sum)
> > >>  {
> > >>  	/* extend strict-aliasing rules */
> > >>  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > >> -	const u16_p *u16_buf = (const u16_p *)buf;
> > >> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > >> +	const u16_p *u16_buf;
> > >> +	const u16_p *end;
> > >> +	uint32_t bsum = 0;
> > >> +	const bool unaligned = (uintptr_t)buf & 1;
> > >> +
> > >> +	/* if buffer is unaligned, keeping it byte order independent */
> > >> +	if (unlikely(unaligned)) {
> > >> +		uint16_t first = 0;
> > >> +		if (unlikely(len == 0))
> > >> +			return 0;
> > >> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> > >> +		bsum += first;
> > >> +		buf = RTE_PTR_ADD(buf, 1);
> > >> +		len--;
> > >> +	}
> > >>
> > >> +	/* aligned access for compiler auto-vectorization */
> >
> > The compiler will be able to auto-vectorize even unaligned accesses,
> > just with different instructions. From what I can tell, there's no
> > performance impact, at least not on the x86_64 systems I tried on.
> >
> > I think you should remove the first special-case conditional and use
> > memcpy() instead of the cumbersome __may_alias__ construct to retrieve
> > the data.
> >
>
> The instruction tables here:
> https://www.agner.org/optimize/instruction_tables.pdf
> list the latency of vmovdqa (aligned) as 6 cycles and the latency of
> vmovdqu (unaligned) as 7 cycles, so there can be some difference on
> paper, although I'm not sure what difference it makes in practice: I've
> not seen any difference in runtime between the two versions.
>
Correction to my comment: those figures are for an older CPU. For newer
CPUs, such as Tiger Lake, the listed latencies seem to be the same
regardless of alignment.

> > >> +	u16_buf = (const u16_p *)buf;
> > >> +	end = u16_buf + len / sizeof(*u16_buf);
> > >>  	for (; u16_buf != end; ++u16_buf)
> > >> -		sum += *u16_buf;
> > >> +		bsum += *u16_buf;
> > >>
> > >>  	/* if length is odd, keeping it byte order independent */
> > >>  	if (unlikely(len % 2)) {
> > >>  		uint16_t left = 0;
> > >>  		*(unsigned char *)&left = *(const unsigned char *)end;
> > >> -		sum += left;
> > >> +		bsum += left;
> > >>  	}
> > >>
> > >> -	return sum;
> > >> +	/* if buffer is unaligned, swap the checksum bytes */
> > >> +	if (unlikely(unaligned))
> > >> +		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
> > >> +
> > >> +	return sum + bsum;
> > >>  }
> > >>
> > >>  /**
> > >> --
> > >> 2.17.1
> > >
> > > @Emil, thank you for thoroughly reviewing the previous versions.
> > >
> > > If your test succeeds and you are satisfied with the patch, remember
> > > to reply with a "Tested-by" tag for patchwork.
> > >
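
A sketch of the memcpy() alternative suggested above (illustrative only,
not the proposed patch; the helper name sum16 is made up): a fixed-size
memcpy() into a local uint16_t is typically optimized to a single 16-bit
load by GCC and clang, places no alignment requirement on the buffer,
and avoids the __may_alias__ cast, while the loop remains
auto-vectorizable:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum a buffer as 16-bit words without any alignment requirement.
 * The odd tail byte, if any, goes into the first byte of a zeroed
 * word, as in the patch, keeping the sum byte order independent. */
static uint32_t
sum16(const void *buf, size_t len, uint32_t sum)
{
	const uint8_t *p = buf;

	while (len >= sizeof(uint16_t)) {
		uint16_t w;

		memcpy(&w, p, sizeof(w)); /* alignment-safe 16-bit load */
		sum += w;
		p += sizeof(w);
		len -= sizeof(w);
	}
	if (len) {
		uint16_t left = 0;

		memcpy(&left, p, 1);
		sum += left;
	}
	return sum;
}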
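Continuing the sketch, a standalone check of the byte-swap compensation
used in the patch (sum16_shifted and fold16 are made-up names, reusing
sum16 from above): the first byte is stored as if it were the second
byte of a 16-bit word, the remainder is summed from the next byte
onward, and the accumulator's byte pairs are swapped to undo the
one-byte shift. Note that the results only have to match after folding
to a 16-bit ones' complement sum; the raw 32-bit accumulators can
differ because carries propagate differently:

#include <assert.h>

/* Fold a 32-bit accumulator to a 16-bit ones' complement sum. */
static uint16_t
fold16(uint32_t sum)
{
	sum = (sum & 0xFFFF) + (sum >> 16);
	sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t)sum;
}

/* Checksum the way the patch handles an odd buf pointer: first byte
 * in the odd position of its word, remainder summed as 16-bit words,
 * then the byte pairs of the accumulator swapped. */
static uint32_t
sum16_shifted(const uint8_t *p, size_t len)
{
	uint16_t first = 0;
	uint32_t bsum;

	if (len == 0)
		return 0;
	((uint8_t *)&first)[1] = p[0]; /* first byte takes the odd slot */
	bsum = sum16(p + 1, len - 1, first);
	return (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
}

int
main(void)
{
	uint8_t buf[65];
	size_t i;

	for (i = 0; i < sizeof(buf); i++)
		buf[i] = (uint8_t)(251 * i + 17);
	/* shifted and direct sums must agree for every length */
	for (i = 1; i < sizeof(buf); i++)
		assert(fold16(sum16_shifted(buf + 1, i)) ==
		       fold16(sum16(buf + 1, i, 0)));
	return 0;
}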