Re: [PATCH v2] eal: fix unaligned loads/stores in rte_memcpy_generic

2022-01-24 Thread Georg Sauthoff
Hello,

Stephen Hemminger wrote (Sun, 16 Jan 2022 08:32:20 -0800):

> I would propose that DPDK have same kind of define as the kernel
> for SAFE_UNALIGNED_ACCESS.  The C standard has to apply to all architectures
> but DPDK will make the choice to be fast rather than standards conformant.

perhaps it's worth mentioning that the Linux Kernel is compiled with
-fno-strict-aliasing, because it contains code which violates the C
strict aliasing rules. Such code yields undefined behavior and is thus
unsafe when compiling with optmizing compilers such as GCC and Clang, by
default. But since the Linux supplies -fno-strict-aliasing its code is
safe, in the Linux Kernel context.

In contrast, DPDK isn't compiled with -fno-strict-aliasing, in general.
Only a few drivers are built with -Wno-strict-aliasing.

Thus, one has to be careful when importing (or being inspired) by the
above mentioned kernel defines.

Especially, it looks like version 5 of this patch yields undefined
behavior because it violates strict-aliasing rules.
It's possible that GCC makes some extra guarantees for the used
constructs, even when compiling without -fno-strict-aliasing. But I'm
not aware of any.

Please note that there is a reason GCC enables -fstrict-aliasing when
compiling with optimizations: Performance
IOW, -fno-strict-aliasing has performance implications, thus one is
advised to not use it, if possible.
IOW, when violating strict-aliasng rules one can easily end up with
low-performing and/or unsafe (buggy) code.

I haven't inspected the DPDK drivers which supply -Wno-strict-aliasing -
but I wouldn't be surprised if there aren't better alternatives. Meaning
writing the code in a way that doesn't require -Wno-strict-aliasing.

BTW, are there still systems that DPDK supports but have a memcpy() which
performs worse than rte_memcpy()?


Best regards,
Georg


Re: [dpdk-dev] [PATCH 1/1] net: fix aliasing issue in checksum computation

2021-10-16 Thread Georg Sauthoff
Hello,

On Fri, Oct 15, 2021 at 04:39:02PM +0200, Olivier Matz wrote:
> On Sat, Sep 18, 2021 at 01:49:30PM +0200, Georg Sauthoff wrote:
> > That means a superfluous cast is removed and aliasing through a uint8_t
> > pointer is eliminated. Note that uint8_t doesn't have the same
> > strict-aliasing properties as unsigned char.
> 
> Interesting. Out of curiosity, do you have links that explains
> this?

yes, I do. https://stefansf.de/post/type-based-alias-analysis/ has some
nice examples and explains some things. Especially, it makes the point
that it's the access that matters for yielding undefined behaviour (i.e.
when violating strict-aliasing rules) and not the cast itself:

"N.B. the standard only speaks about the type of an object and the type
of an lvalue in order to access an object. Thus a pointer to an object
x may be converted arbitrarily often to arbitrary object pointer
types, and therefore even to incompatible types, as long as every
access to x is done through an lvalue which type conforms to C11
section 6.5 paragraph 7."

Section 'Character Type' in that article also addresses how uint8_t
isn't special as unsigned char while quoting the standard and
referencing below Bugzilla bug.

Another good article on strict aliasing:

https://gustedt.wordpress.com/2016/08/17/effective-types-and-aliasing/

 
> I found these, but these are just discussions:
>   
> https://stackoverflow.com/questions/16138237/when-is-uint8-t-%E2%89%A0-unsigned-char
>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66110

I like the Bugzilla link as it shows how some code benefits from
uint8_t not having the same aliasing requirements as e.g. unsigned char.
Thus, it's an example of why compiler developers might be motivated to
decide against making uint8_t a typedef of unsigned char, since the
standard doesn't require it.

> What about rewording the sentence "uint8_t doesn't have the same
> strict-aliasing properties as unsigned char" to clarify that unsigned
> char may alias, but uint8_t may not?

I can change

"That means a superfluous cast is removed and aliasing through a uint8_t
pointer is eliminated. Note that uint8_t doesn't have the same
strict-aliasing properties as unsigned char."

to

"That means a superfluous cast is removed and aliasing through a uint8_t
pointer is eliminated. NB: The C standard specifies that a unsigned char
pointer may alias while the C standard doesn't include such requirement
for uint8_t pointers."

Better?


Best regards
Georg

-- 
'This function is not fully implemented in some standard libraries. For
 example, gcc and clang always return zero even though the device is
 non-deterministic. In comparison, Visual Studio always returns 32, and
 boost.random returns 10.'
  (http://en.cppreference.com/w/cpp/numeric/random/random_device/entropy, 2014)


Re: [dpdk-dev] [PATCH 1/1] net: fix aliasing issue in checksum computation

2021-10-16 Thread Georg Sauthoff
Hello,

On Sat, Oct 16, 2021 at 10:21:03AM +0200, Morten Brørup wrote:
> I have given this some more thoughts.
> 
> Most bytes transferred in real life are transferred in large packets,
> so faster processing of large packets is a great improvement!
> 
> Furthermore, a quick analysis of a recent packet sample from an ISP
> customer of ours shows that less than 8 % of the packets are odd size.
> Would you consider adding an unlikely() to the branch handling the odd
> byte at the end?

sure, I don't see a problem with adding unlikely() there.

I'll post a version 2 of that patch then, tomorrow.

Best regards
Georg

-- 
"No one can write decently who is distrustful of the reader's
intelligence, or whose attitude is patronizing." (William Strunk,
Jr. and E.B. White, The Elements of Style, p. 70, 1959)


[dpdk-dev] [PATCH v2 0/1] net: fix aliasing issue in checksum computation

2021-10-17 Thread Georg Sauthoff
The current __rte_raw_cksum() function violates the C strict-aliasing
rules since it uses a uint8_t pointer to access a trailing byte.

This patch also fixes a superfluous cast, i.e.:

uintptr_t ptr = (uintptr_t)buf;
typedef uint16_t __attribute__((__may_alias__)) u16_p;
const u16_p *u16_buf = (const u16_p *)ptr;

Transitive casting involving uintptr_t doesn't solve anything here.
It also doesn't help with fixing a strict-aliasing issue here.

The patch also simplifies the main loop, i.e. it eliminates the manually
unrolled loop

while (len >= (sizeof(*u16_buf) * 4)) {
sum += u16_buf[0];
sum += u16_buf[1];
sum += u16_buf[2];
sum += u16_buf[3];
len -= sizeof(*u16_buf) * 4;
u16_buf += 4;
}

since modern C compilers are in a better position to decide which level of
unrolling is optimal for the target architecture.

See also https://godbolt.org/z/6rYbYGnj7 which shows how GCC auto-vectorizes
the simplified loop using AVX instructions, when compiling for Haswell. When
looking at the number of instructions in the compiled code, the new version
is half as big as the existing one.

Signed-off-by: Georg Sauthoff 
---
v2:
* Reword commit message (detail aliasing implications of uint8_t)
* Add unlikely()

Georg Sauthoff (1):
  net: fix aliasing issue in checksum computation

 lib/net/rte_ip.h | 27 ---
 1 file changed, 8 insertions(+), 19 deletions(-)

-- 
2.31.1



[dpdk-dev] [PATCH v2 1/1] net: fix aliasing issue in checksum computation

2021-10-17 Thread Georg Sauthoff
That means a superfluous cast is removed and aliasing through a uint8_t
pointer is eliminated. NB: The C standard specifies that a unsigned char
pointer may alias while the C standard doesn't include such requirement
for uint8_t pointers.

Also simplified the loop since a modern C compiler can speed up (i.e.
auto-vectorize) it in a similar way. For example, GCC auto-vectorizes it
for Haswell using AVX registers while halving the number of instructions
in the generated code.

Signed-off-by: Georg Sauthoff 
---
 lib/net/rte_ip.h | 27 ---
 1 file changed, 8 insertions(+), 19 deletions(-)

diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
index 05948b69b7..1b8c6519a9 100644
--- a/lib/net/rte_ip.h
+++ b/lib/net/rte_ip.h
@@ -141,29 +141,18 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr *ipv4_hdr)
 static inline uint32_t
 __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
-   /* workaround gcc strict-aliasing warning */
-   uintptr_t ptr = (uintptr_t)buf;
+   /* extend strict-aliasing rules */
typedef uint16_t __attribute__((__may_alias__)) u16_p;
-   const u16_p *u16_buf = (const u16_p *)ptr;
-
-   while (len >= (sizeof(*u16_buf) * 4)) {
-   sum += u16_buf[0];
-   sum += u16_buf[1];
-   sum += u16_buf[2];
-   sum += u16_buf[3];
-   len -= sizeof(*u16_buf) * 4;
-   u16_buf += 4;
-   }
-   while (len >= sizeof(*u16_buf)) {
+   const u16_p *u16_buf = (const u16_p *)buf;
+   const u16_p *end = u16_buf + len / sizeof(*u16_buf);
+
+   for (; u16_buf != end; ++u16_buf)
sum += *u16_buf;
-   len -= sizeof(*u16_buf);
-   u16_buf += 1;
-   }
 
-   /* if length is in odd bytes */
-   if (len == 1) {
+   /* if length is odd, keeping it byte order independent */
+   if (unlikely(len % 2)) {
uint16_t left = 0;
-   *(uint8_t *)&left = *(const uint8_t *)u16_buf;
+   *(unsigned char*)&left = *(const unsigned char *)end;
sum += left;
}
 
-- 
2.31.1



[dpdk-dev] [PATCH 0/1] net: fix aliasing issue in checksum computation

2021-09-18 Thread Georg Sauthoff
The current __rte_raw_cksum() function violates the C strict-aliasing
rules since it uses a uint8_t pointer to access a trailing byte.

This patch also fixes a superfluous cast, i.e.:

uintptr_t ptr = (uintptr_t)buf;
typedef uint16_t __attribute__((__may_alias__)) u16_p;
const u16_p *u16_buf = (const u16_p *)ptr;

Transitive casting involving uintptr_t doesn't solve anything here.
It also doesn't help with fixing a strict-aliasing issue here.

The patch also simplifies the main loop, i.e. it eliminates the manually
unrolled loop

while (len >= (sizeof(*u16_buf) * 4)) {
sum += u16_buf[0];
sum += u16_buf[1];
sum += u16_buf[2];
sum += u16_buf[3];
len -= sizeof(*u16_buf) * 4;
u16_buf += 4;
}

since modern C compilers are in a better position to decide which level of
unrolling is optimal for the target architecture.

See also https://godbolt.org/z/6rYbYGnj7 which shows how GCC auto-vectorizes
the simplified loop using AVX instructions, when compiling for Haswell. When
looking at the number of instructions in the compiled code, the new version
is half as big as the existing one.

Georg Sauthoff (1):
  net: fix aliasing issue in checksum computation

 lib/net/rte_ip.h | 27 ---
 1 file changed, 8 insertions(+), 19 deletions(-)

-- 
2.31.1



[dpdk-dev] [PATCH 1/1] net: fix aliasing issue in checksum computation

2021-09-18 Thread Georg Sauthoff
That means a superfluous cast is removed and aliasing through a uint8_t
pointer is eliminated. Note that uint8_t doesn't have the same
strict-aliasing properties as unsigned char.

Also simplified the loop since a modern C compiler can speed up (i.e.
auto-vectorize) it in a similar way. For example, GCC auto-vectorizes it
for Haswell using AVX registers while halving the number of instructions
in the generated code.

Signed-off-by: Georg Sauthoff 
---
 lib/net/rte_ip.h | 27 ---
 1 file changed, 8 insertions(+), 19 deletions(-)

diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
index 05948b69b7..386db94c85 100644
--- a/lib/net/rte_ip.h
+++ b/lib/net/rte_ip.h
@@ -141,29 +141,18 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr *ipv4_hdr)
 static inline uint32_t
 __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
-   /* workaround gcc strict-aliasing warning */
-   uintptr_t ptr = (uintptr_t)buf;
+   /* extend strict-aliasing rules */
typedef uint16_t __attribute__((__may_alias__)) u16_p;
-   const u16_p *u16_buf = (const u16_p *)ptr;
-
-   while (len >= (sizeof(*u16_buf) * 4)) {
-   sum += u16_buf[0];
-   sum += u16_buf[1];
-   sum += u16_buf[2];
-   sum += u16_buf[3];
-   len -= sizeof(*u16_buf) * 4;
-   u16_buf += 4;
-   }
-   while (len >= sizeof(*u16_buf)) {
+   const u16_p *u16_buf = (const u16_p *)buf;
+   const u16_p *end = u16_buf + len / sizeof(*u16_buf);
+
+   for (; u16_buf != end; ++u16_buf)
sum += *u16_buf;
-   len -= sizeof(*u16_buf);
-   u16_buf += 1;
-   }
 
-   /* if length is in odd bytes */
-   if (len == 1) {
+   /* if length is odd, keeping it byte order independent */
+   if (len % 2) {
uint16_t left = 0;
-   *(uint8_t *)&left = *(const uint8_t *)u16_buf;
+   *(unsigned char*)&left = *(const unsigned char *)end;
sum += left;
}
 
-- 
2.31.1