On 2022-06-27 22:21, Morten Brørup wrote:
From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
Sent: Monday, 27 June 2022 19.23

On 2022-06-27 15:22, Morten Brørup wrote:
From: Emil Berg [mailto:emil.b...@ericsson.com]
Sent: Monday, 27 June 2022 14.51

From: Emil Berg
Sent: den 27 juni 2022 14:46

From: Mattias Rönnblom <hof...@lysator.liu.se>
Sent: den 27 juni 2022 14:28

On 2022-06-23 14:51, Morten Brørup wrote:
From: Morten Brørup [mailto:m...@smartsharesystems.com]
Sent: Thursday, 23 June 2022 14.39

With this patch, the checksum can be calculated on an unaligned
buffer.
I.e. the buf parameter is no longer required to be 16 bit
aligned.

The checksum is still calculated using a 16 bit aligned pointer,
so
the compiler can auto-vectorize the function's inner loop.

When the buffer is unaligned, the first byte of the buffer is
handled separately. Furthermore, the calculated checksum of the
buffer is byte shifted before being added to the initial
checksum,
to compensate for the checksum having been calculated on the
buffer
shifted by one byte.

v4:
* Add copyright notice.
* Include stdbool.h (Emil Berg).
* Use RTE_PTR_ADD (Emil Berg).
* Fix one more typo in commit message. Is 'unligned' even a
word?
v3:
* Remove braces from single statement block.
* Fix typo in commit message.
v2:
* Do not assume that the buffer is part of an aligned packet
buffer.

Bugzilla ID: 1035
Cc: sta...@dpdk.org

Signed-off-by: Morten Brørup <m...@smartsharesystems.com>

[...]


The compiler will be able to auto vectorize even unaligned
accesses,
just with different instructions. From what I can tell, there's no
performance impact, at least not on the x86_64 systems I tried on.

I think you should remove the first special case conditional and
use
memcpy() instead of the cumbersome __may_alias__ construct to
retrieve
the data.


Here:
https://www.agner.org/optimize/instruction_tables.pdf
it lists the latency of vmovdqa (aligned) as 6 cycles and the
latency
for
vmovdqu (unaligned) as 7 cycles. So I guess there can be some
difference.
Although in practice I'm not sure what difference it makes. I've
not
seen any
difference in runtime between the two versions.


Correction to my comment:
Those stats are for some older CPU. For some newer CPUs such as
Tiger
Lake the stats seem to be the same regardless of aligned or
unaligned.


I agree that the memcpy method is more elegant and easy to read.

However, we would need to performance test the modified checksum
function with a large number of CPUs to prove that we don't introduce a
performance regression on any CPU architecture still supported by DPDK.
And Emil already found a CPU where it costs 1 extra cycle per 16 bytes,
which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP
packet.


I think you've misunderstood what latency means in such tables. It's a
data dependency thing, not a measure of throughput. The throughput is
*much* higher. My guess would be two such instruction per clock.

For your 1460 bytes example, my Zen3 AMD needs performs identical with
both the current DPDK implementation, your patch, and a memcpy()-ified
version of the current implementation. They all need ~130 clock
cycles/packet, with warm caches. IPC is 3 instructions per cycle, but
obvious not all instructions are SIMD.

You're right, I wasn't thinking deeper about it before extrapolating.

Great to see some real numbers! I wish someone would do the same testing on an 
old ARM CPU, so we could also see the other end of the scale.


I've ran it on an ARM A72. For the aligned 1460 bytes case I got: Current DPDK ~572 cc. Your patch: ~578 cc. Memcpy-fied: ~573 cc. They performed about the same for all unaligned/aligned and sizes I tested. This platform (or could be GCC version as well) doesn't suffer from the unaligned performance degradation your patch showed on my AMD machine.

The main issue with checksumming on the CPU is, in my experience, not
that you don't have enough compute, but that you trash the caches.

Agree. I have noticed that x86 has "non-temporal" instruction variants to 
load/store data without trashing the cache entirely.

A variant of the checksum function using such instructions might be handy.


Yes, although you may need to prefetch the payload for good performance.

Variants of the memcpy function using such instructions might also be handy for 
some purposes, e.g. copying the contents of packets, where the original and/or 
copy will not accessed shortly thereafter.


Indeed and I think it's been discussed on the list. There's some work to get it right, since alignment requirement and the fact a different memory model is used for those SIMD instructions causes trouble for a generic implementation. (For x86_64.)

So I opted for a solution with zero changes to the inner loop, so no
performance retesting is required (for the previously supported use
cases, where the buffer is aligned).


You will see performance degradation with this solution as well, under
certain conditions. For unaligned 100 bytes of data, the current DPDK
implementation and the memcpy()-fied version needs ~21 cc/packet. Your
patch needs 54 cc/packet.

Yes, it's a tradeoff. I exclusively aimed at maintaining performance for the 
case with aligned buffers (under all circumstances, with all CPUs etc.), and 
ignored how it affects the performance for the case with unaligned buffers.

Unlike this patch, the memcpy() variant has no additional branches for the 
unaligned case, so its performance should be generally unaffected by the buffer 
being aligned or not. However, I don't have sufficient in-depth CPU knowledge 
to say if this also applies to RISCV and older ARM CPUs still supported by DPDK.


I don't think avoiding RISCV non-catastrophic regressions triumphs improving performance on mainstream CPUs and avoiding code quality regressions.

I don't have strong feelings for this patch, so you can provide a memcpy() 
based alternative patch, and we will let the community decide which one they 
prefer.


This is what my conversion looks like:

static inline uint32_t
__rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
{
        const void *end;

        for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) * sizeof(uint16_t));
             buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
                uint16_t v;

                memcpy(&v, buf, sizeof(uint16_t));
                sum += v;
        }

        /* if length is odd, keeping it byte order independent */
        if (unlikely(len % 2)) {
                uint8_t last;
                uint16_t left = 0;

                memcpy(&last, end, 1);
                *(unsigned char *)&left = last;
                sum += left;
        }

        return sum;
}

Maybe the maintainer can have an opinion, before I provide a real patch, to save me some work. I would really like to contribute some performance autotests in that case. I mean, you could be right and the performance could be totally off on some platform.

But the old version didn't support unaligned accesses? In many compiler
flag/machine combinations it did.

It crashed in some cases. That was the point of the bug report [1], and the 
reason to provide a patch.

[1] https://bugs.dpdk.org/show_bug.cgi?id=1035

Reply via email to