Joseph S. Myers wrote:
>I don't know if it was proposed in this context, but the ARM EABI has
>various __aeabi_mem* functions for calls known to have particular
>alignment and the idea is relevant to other platforms if you provide such
>functions with the compiler. The compiler could also generate calls to
>different functions depending on the -march options and so save the
>runtime CPU check cost (you could have options to call either generic
>versions, or versions for a particular CPU, depending on whether you are
>building a generic binary for CPU-X-or-newer or a binary just for CPU X).

The memcpy implementations in the Intel and Mac libraries, as well as in my own code, have different branches for different alignments and different CPU instruction sets. The runtime cost of this branching is negligible compared to the gain, even when the byte count is small. There is no need to bother the programmer with different versions.
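To illustrate the kind of branching meant here, a minimal sketch in portable C follows. It is not the code from the Intel or Mac libraries, nor my own library code; the names sketch_memcpy, copy_words and copy_bytes are hypothetical, and the 8-byte word size is an assumption. It dispatches only on alignment. A real implementation would add a further branch on the CPU instruction set, which is left out here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: plain byte-by-byte copy, used as the fallback path. */
static void copy_bytes(unsigned char *d, const unsigned char *s, size_t n)
{
    while (n--)
        *d++ = *s++;
}

/* Hypothetical helper: copy 8 bytes at a time, then the tail bytewise.
 * The fixed-size memcpy calls avoid aliasing problems; compilers
 * typically turn them into a single load and a single store. */
static void copy_words(unsigned char *d, const unsigned char *s, size_t n)
{
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, s, 8);
        memcpy(d, &w, 8);
        d += 8;
        s += 8;
        n -= 8;
    }
    copy_bytes(d, s, n);
}

/* Sketch of a memcpy that picks a branch at run time from the alignment
 * of its arguments. The check is one XOR, one AND and one compare. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((((uintptr_t)d ^ (uintptr_t)s) & 7) == 0) {
        /* Source and destination share the same misalignment:
         * copy single bytes up to the next 8-byte boundary,
         * then continue with whole words. */
        while (n > 0 && ((uintptr_t)d & 7) != 0) {
            *d++ = *s++;
            n--;
        }
        copy_words(d, s, n);
    } else {
        /* Different misalignment: fall back to the bytewise path. */
        copy_bytes(d, s, n);
    }
    return dst;
}
```

The caller always uses the one entry point, e.g. sketch_memcpy(buf + 3, data + 3, len), and the function selects the path itself.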
You can just copy the code from the Mac library, or from me.