> X86 always allows unaligned access. Irregardless of what tools say.
> Why impose additional overhead in performance critical code.

Let me preface my response by saying that I'm not a C compiler developer. Hopefully someone who is will read this and chime in.

I agree that x86 allows unaligned loads and stores. However, the C standard doesn't; it says that it's undefined behavior. This means that the code relies on undefined behavior. It may do the right thing all the time, almost all the time, some of the time... it's undefined. It may work now but stop working in the future.

Here's a good discussion on SO about unaligned accesses in C on x86:

https://stackoverflow.com/questions/46790550/c-undefined-behavior-strict-aliasing-rule-or-incorrect-alignment/46790815#46790815

There's no way to do the unaligned store/load in C (that I know of) without invoking undefined behavior. I can see two options: either write the code in assembly, or use some other C construct that doesn't rely on undefined behavior.

While the for loop may seem slower than the other options, it surprisingly results in fewer load/store operations in certain scenarios. For example, if n == 15 and it's known at compile time, the compiler will generate 2 overlapping qword load/store operations (rather than the 4 that the current code performs).

All that being said, I can go back to something similar to my first patch, using inline assembly and making sure this time that it works for 32-bit too. I will post a patch in a few minutes that does exactly that. Maintainers can then chime in with their preferred option.
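
For illustration, here is a minimal sketch of the kind of byte-wise for loop discussed above. The name copy_bytes is hypothetical and this is not the code from either patch; it only shows why the construct avoids the alignment problem: every access goes through an unsigned char pointer, which has no alignment requirement.

```c
#include <stddef.h>

/*
 * Hypothetical helper for illustration only (not taken from the patch).
 * Copies n bytes one at a time. Because every access is made through an
 * unsigned char pointer, no alignment requirement applies and the loop
 * does not depend on undefined behavior, whatever the alignment of
 * dst and src.
 */
static inline void copy_bytes(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t i;

	for (i = 0; i < n; i++)
		d[i] = s[i];
}
```

When n is a compile-time constant, an optimizing compiler is free to replace the loop with wider accesses, which is the n == 15 case described above. The exact code generated depends on the compiler, its version, and the optimization flags, so it is worth checking the disassembly rather than assuming.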