On Thu, Jun 30, 2022 at 6:32 PM Morten Brørup <m...@smartsharesystems.com> wrote: > > > From: Mattias Rönnblom [mailto:hof...@lysator.liu.se] > > Sent: Tuesday, 28 June 2022 08.28 > > > > On 2022-06-27 22:21, Morten Brørup wrote: > > >> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se] > > >> Sent: Monday, 27 June 2022 19.23 > > >> > > >> On 2022-06-27 15:22, Morten Brørup wrote: > > >>>> From: Emil Berg [mailto:emil.b...@ericsson.com] > > >>>> Sent: Monday, 27 June 2022 14.51 > > >>>> > > >>>>> From: Emil Berg > > >>>>> Sent: den 27 juni 2022 14:46 > > >>>>> > > >>>>>> From: Mattias Rönnblom <hof...@lysator.liu.se> > > >>>>>> Sent: den 27 juni 2022 14:28 > > >>>>>> > > >>>>>> On 2022-06-23 14:51, Morten Brørup wrote: > > >>>>>>>> From: Morten Brørup [mailto:m...@smartsharesystems.com] > > >>>>>>>> Sent: Thursday, 23 June 2022 14.39 > > >>>>>>>> > > >>>>>>>> With this patch, the checksum can be calculated on an > > unaligned > > >>>> buffer. > > >>>>>>>> I.e. the buf parameter is no longer required to be 16 bit > > >>>> aligned. > > >>>>>>>> > > >>>>>>>> The checksum is still calculated using a 16 bit aligned > > pointer, > > >>>> so > > >>>>>>>> the compiler can auto-vectorize the function's inner loop. > > >>>>>>>> > > >>>>>>>> When the buffer is unaligned, the first byte of the buffer is > > >>>>>>>> handled separately. Furthermore, the calculated checksum of > > the > > >>>>>>>> buffer is byte shifted before being added to the initial > > >>>> checksum, > > >>>>>>>> to compensate for the checksum having been calculated on the > > >>>> buffer > > >>>>>>>> shifted by one byte. > > >>>>>>>> > > >>>>>>>> v4: > > >>>>>>>> * Add copyright notice. > > >>>>>>>> * Include stdbool.h (Emil Berg). > > >>>>>>>> * Use RTE_PTR_ADD (Emil Berg). > > >>>>>>>> * Fix one more typo in commit message. Is 'unligned' even a > > >>>> word? > > >>>>>>>> v3: > > >>>>>>>> * Remove braces from single statement block. > > >>>>>>>> * Fix typo in commit message. > > >>>>>>>> v2: > > >>>>>>>> * Do not assume that the buffer is part of an aligned packet > > >>>> buffer. > > >>>>>>>> > > >>>>>>>> Bugzilla ID: 1035 > > >>>>>>>> Cc: sta...@dpdk.org > > >>>>>>>> > > >>>>>>>> Signed-off-by: Morten Brørup <m...@smartsharesystems.com> > > > > > > [...] > > > > > >>>>>> > > >>>>>> The compiler will be able to auto vectorize even unaligned > > >>>> accesses, > > >>>>>> just with different instructions. From what I can tell, there's > > no > > >>>>>> performance impact, at least not on the x86_64 systems I tried > > on. > > >>>>>> > > >>>>>> I think you should remove the first special case conditional and > > >>>> use > > >>>>>> memcpy() instead of the cumbersome __may_alias__ construct to > > >>>> retrieve > > >>>>>> the data. > > >>>>>> > > >>>>> > > >>>>> Here: > > >>>>> https://www.agner.org/optimize/instruction_tables.pdf > > >>>>> it lists the latency of vmovdqa (aligned) as 6 cycles and the > > >> latency > > >>>> for > > >>>>> vmovdqu (unaligned) as 7 cycles. So I guess there can be some > > >>>> difference. > > >>>>> Although in practice I'm not sure what difference it makes. I've > > >> not > > >>>> seen any > > >>>>> difference in runtime between the two versions. > > >>>>> > > >>>> > > >>>> Correction to my comment: > > >>>> Those stats are for some older CPU. For some newer CPUs such as > > >> Tiger > > >>>> Lake the stats seem to be the same regardless of aligned or > > >> unaligned. > > >>>> > > >>> > > >>> I agree that the memcpy method is more elegant and easy to read. > > >>> > > >>> However, we would need to performance test the modified checksum > > >> function with a large number of CPUs to prove that we don't > > introduce a > > >> performance regression on any CPU architecture still supported by > > DPDK. > > >> And Emil already found a CPU where it costs 1 extra cycle per 16 > > bytes, > > >> which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP > > >> packet. > > >>> > > >> > > >> I think you've misunderstood what latency means in such tables. It's > > a > > >> data dependency thing, not a measure of throughput. The throughput > > is > > >> *much* higher. My guess would be two such instruction per clock. > > >> > > >> For your 1460 bytes example, my Zen3 AMD needs performs identical > > with > > >> both the current DPDK implementation, your patch, and a memcpy()- > > ified > > >> version of the current implementation. They all need ~130 clock > > >> cycles/packet, with warm caches. IPC is 3 instructions per cycle, > > but > > >> obvious not all instructions are SIMD. > > > > > > You're right, I wasn't thinking deeper about it before extrapolating. > > > > > > Great to see some real numbers! I wish someone would do the same > > testing on an old ARM CPU, so we could also see the other end of the > > scale. > > > > > > > I've ran it on an ARM A72. For the aligned 1460 bytes case I got: > > Current DPDK ~572 cc. Your patch: ~578 cc. Memcpy-fied: ~573 cc. They > > performed about the same for all unaligned/aligned and sizes I tested. > > This platform (or could be GCC version as well) doesn't suffer from the > > unaligned performance degradation your patch showed on my AMD machine. > > > > >> The main issue with checksumming on the CPU is, in my experience, > > not > > >> that you don't have enough compute, but that you trash the caches. > > > > > > Agree. I have noticed that x86 has "non-temporal" instruction > > variants to load/store data without trashing the cache entirely. > > > > > > A variant of the checksum function using such instructions might be > > handy. > > > > > > > Yes, although you may need to prefetch the payload for good > > performance. > > > > > Variants of the memcpy function using such instructions might also be > > handy for some purposes, e.g. copying the contents of packets, where > > the original and/or copy will not accessed shortly thereafter. > > > > > > > Indeed and I think it's been discussed on the list. There's some work > > to > > get it right, since alignment requirement and the fact a different > > memory model is used for those SIMD instructions causes trouble for a > > generic implementation. (For x86_64.) > > I just posted an RFC [1] for such memcpy() and memset() functions, > so let's see how it fans out. > > [1] > http://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35d87...@smartserver.smartshare.dk/T/#u > > > > > >>> So I opted for a solution with zero changes to the inner loop, so > > no > > >> performance retesting is required (for the previously supported use > > >> cases, where the buffer is aligned). > > >>> > > >> > > >> You will see performance degradation with this solution as well, > > under > > >> certain conditions. For unaligned 100 bytes of data, the current > > DPDK > > >> implementation and the memcpy()-fied version needs ~21 cc/packet. > > Your > > >> patch needs 54 cc/packet. > > > > > > Yes, it's a tradeoff. I exclusively aimed at maintaining performance > > for the case with aligned buffers (under all circumstances, with all > > CPUs etc.), and ignored how it affects the performance for the case > > with unaligned buffers. > > > > > > Unlike this patch, the memcpy() variant has no additional branches > > for the unaligned case, so its performance should be generally > > unaffected by the buffer being aligned or not. However, I don't have > > sufficient in-depth CPU knowledge to say if this also applies to RISCV > > and older ARM CPUs still supported by DPDK. > > > > > > > I don't think avoiding RISCV non-catastrophic regressions triumphs > > improving performance on mainstream CPUs and avoiding code quality > > regressions. > +1 +1. In general RISC-V spec leaves the unaligned load/store handling to implementation (it might fault, it might not). The U74 core that I have at hand allows unaligned reads/writes. Though it's not a platform for performance evaluation (time measurement causes a trap to firmware), so I won't say anything on that.
-- Best Regards, Stanisław Kardach