http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58863
--- Comment #4 from Ali Baharev <ali.baharev at gmail dot com> ---
My mistake, sorry. So, you are saying that the default alignment for loops is 8 bytes?

The funny thing is, this code runs 15% faster if any of the following are passed:

  -Os -O2 -fno-align-loops -fno-align-functions -O2 -fno-omit-frame-pointer

At least on my machine and in this case, 16-byte alignment (or any multiple of 16 bytes) is better. -march=native has no effect on the performance.