Hi,

This is another comment on the builtin/header discussion. For alignment/size bounds there is no reason to use a builtin, as gcc should optimize a header version just as well with if (__builtin_constant_p (n < 42) && n < 42).
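To spell that pattern out, a header-only wrapper could look roughly like the sketch below. This is only an illustration: my_memcpy and memcpy_small are made-up names, and the small path is just a byte loop.

#include <string.h>

/* Hypothetical small-size path, for illustration only.  */
static inline void *
memcpy_small (void *dst, const void *src, size_t n)
{
  char *d = dst;
  const char *s = src;
  while (n--)
    *d++ = *s++;
  return dst;
}

/* Header-style wrapper: when gcc can prove n < 42 at compile time,
   both the __builtin_constant_p test and the comparison fold to
   constants, so the small path is selected with no runtime check,
   which is the same effect a builtin would give.  */
static inline void *
my_memcpy (void *dst, const void *src, size_t n)
{
  if (__builtin_constant_p (n < 42) && n < 42)
    return memcpy_small (dst, src, n);
  return memcpy (dst, src, n);
}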
There is, however, information that you can only get with gcc's help: while speculative loads are easy, a library cannot do speculative writes.

One of the performance issues is that copying a single byte with a library routine is expensive, mainly because it sits at the tail of unpredicted branches, since a bigger size range is more probable. An extreme case is my following benchmark, where copying one byte with an avx2 memcpy is slower than copying 700 bytes, because the benchmark made the main loop likely and well predicted while a one-byte copy is quite unlikely:

http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand/result.html

Now for the proposal: in practice, rounding the memcpy size up to a multiple of 8 would most of the time improve performance and would not change the semantics of the program, as the extra bytes are never accessed by the application. The same holds for strcpy, which could just copy 8-byte blocks instead of spending time finding the exact size. However, that would be hard to do in the library alone. This also applies to the vectorizer in general, which with freshly allocated memory could afford a simpler path that does a few extra writes.

So instead there could be a gcc optimization that detects this by checking that the destination is just-allocated memory, so that writing beyond the copied size only overwrites bytes that are uninitialized anyway. For example, here you could do only one 8-byte load/store instead of three:

#include <string.h>

char foo[8];

char *
fo (void)
{
  char bar[8];
  memcpy (bar, foo, 7);
  return strdup (bar);
}

When the exact size isn't known at compile time, gcc would instead call a function, say memcpy_j/strcpy_j, that is allowed to write extra bytes up to the next 8-byte boundary. I plan to use these to speed up strdup.

So, comments? Could this be generalized further, or is it too much work for too little benefit?
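Below is a minimal sketch of what such a helper could look like, together with a strdup-style caller. The names memcpy_j and my_strdup and the whole implementation are only illustrative; the caller has to guarantee that the destination really has space up to the rounded-up size and that the extra bytes are never read (e.g. freshly allocated memory).

#include <stdlib.h>
#include <string.h>

/* Copy n bytes, but allow writes up to the next 8-byte boundary past n.
   Illustrative only; a real version would use wide loads/stores.  */
static void *
memcpy_j (void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  size_t i = 0;

  /* Full 8-byte blocks.  */
  for (; i + 8 <= n; i += 8)
    memcpy (d + i, s + i, 8);

  if (i < n)
    {
      /* Tail: read only the valid source bytes, but do one full
         8-byte store.  The extra destination bytes are garbage,
         which is fine when the caller never reads them.  */
      unsigned char tmp[8] = { 0 };
      memcpy (tmp, s + i, n - i);
      memcpy (d + i, tmp, 8);
    }
  return dst;
}

/* strdup-style use: allocate the rounded-up size so memcpy_j never
   writes outside the allocation, then copy with 8-byte stores only.  */
static char *
my_strdup (const char *s)
{
  size_t len = strlen (s) + 1;            /* include the terminating NUL */
  char *p = malloc ((len + 7) & ~(size_t) 7);
  if (p == NULL)
    return NULL;
  return memcpy_j (p, s, len);
}

The sketch only shows the contract; the actual win would come when gcc can emit the rounded copy inline or select such an entry point automatically, as proposed above.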