> -----Original Message----- > From: Morten Brørup <m...@smartsharesystems.com> > Sent: den 17 juni 2022 11:07 > To: Emil Berg <emil.b...@ericsson.com> > Cc: sta...@dpdk.org; bugzi...@dpdk.org; hof...@lysator.liu.se; > olivier.m...@6wind.com; dev@dpdk.org > Subject: RE: [PATCH] net: fix checksum with unaligned buffer > > > From: Morten Brørup [mailto:m...@smartsharesystems.com] > > Sent: Friday, 17 June 2022 10.45 > > > > With this patch, the checksum can be calculated on an unligned part of > > a packet buffer. > > I.e. the buf parameter is no longer required to be 16 bit aligned. > > > > The DPDK invariant that packet buffers must be 16 bit aligned remains > > unchanged. > > This invariant also defines how to calculate the 16 bit checksum on an > > unaligned part of a packet buffer. > > > > Bugzilla ID: 1035 > > Cc: sta...@dpdk.org > > > > Signed-off-by: Morten Brørup <m...@smartsharesystems.com> > > --- > > lib/net/rte_ip.h | 17 +++++++++++++++-- > > 1 file changed, 15 insertions(+), 2 deletions(-) > > > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index > > b502481670..8e301d9c26 100644 > > --- a/lib/net/rte_ip.h > > +++ b/lib/net/rte_ip.h > > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf, size_t len, > > uint32_t sum) { > > /* extend strict-aliasing rules */ > > typedef uint16_t __attribute__((__may_alias__)) u16_p; > > - const u16_p *u16_buf = (const u16_p *)buf; > > - const u16_p *end = u16_buf + len / sizeof(*u16_buf); > > + const u16_p *u16_buf; > > + const u16_p *end; > > + > > + /* if buffer is unaligned, keeping it byte order independent */ > > + if (unlikely((uintptr_t)buf & 1)) { > > + uint16_t first = 0; > > + if (unlikely(len == 0)) > > + return 0; > > + ((unsigned char *)&first)[1] = *(const unsigned > char *)buf; > > + sum += first; > > + buf = (const void *)((uintptr_t)buf + 1); > > + len--; > > + } > > > > + u16_buf = (const u16_p *)buf; > > + end = u16_buf + len / sizeof(*u16_buf); > > for (; u16_buf != end; ++u16_buf) > > sum += *u16_buf; > > > > -- > > 2.17.1 > > @Emil, can you please test this patch with an unaligned buffer on your > application to confirm that it produces the expected result.
Hi! I tested the patch. It doesn't seem to produce the same results. I think the problem is that it always starts summing from an even address, the sum should always start from the first byte according to the checksum specification. Can I instead propose something Mattias Rönnblom sent me? -------------------------------------------------------------------------------------------------------------- const void *end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) * sizeof(uint16_t)); for (; buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) { uint16_t v; memcpy(&v, buf, sizeof(uint16_t)); sum += v; } /* if length is odd, keeping it byte order independent */ if (unlikely(len % 2)) { uint16_t left = 0; *(unsigned char *)&left = *(const unsigned char *)end; sum += left; } -------------------------------------------------------------------------------------------------------------- Note that the last block is the same as before. Amazingly I see no measurable performance hit from this compared to the previous one (-O3, march=native). Looking at the previous the loop body may compile to (x86): -------------------------------------------------------------------------------------------------------------- vmovdqa (%rdx),%xmm1 vpmovzxwd %xmm1,%xmm0 vpsrldq $0x8,%xmm1,%xmm1 vpmovzxwd %xmm1,%xmm1 vpaddd %xmm1,%xmm0,%xmm0 cmp $0xf,%rax jbe 0x7ff7a0dfb1a9 -------------------------------------------------------------------------------------------------------------- while Mattias' memcpy solution: -------------------------------------------------------------------------------------------------------------- vmovdqu (%rcx),%ymm0 add $0x20,%rcx vpmovzxwd %xmm0,%ymm1 vextracti128 $0x1,%ymm0,%xmm0 vpmovzxwd %xmm0,%ymm0 vpaddd %ymm0,%ymm1,%ymm0 vpaddd %ymm0,%ymm2,%ymm2 cmp %r9,%rcx jne 0x555555556380 -------------------------------------------------------------------------------------------------------------- Thus two extra instructions in the loop, but I suspect it may be memory bound, leading to no measurable performance difference. Any comments? /Emil