https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363
--- Comment #16 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Vineet Gupta from comment #15)
> The problem is indeed gone. I need to analyze the assembly fully how it
> prevents the bad case. e.g. I'm still not comfortable seeing the loop
> entered with following and it doing 8 byte ldd/std when we know it should
> only do 2 at a time.

Why? It is called a "vectorization" optimization, where we are vectorizing the
2-byte loads/stores into 4x2-byte vector loads/stores.
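
For illustration only, here is a minimal C sketch of the transformation being described. It is not the code from this bug report: the function names `copy_u16_scalar` and `copy_u16_by4` are hypothetical, and the second function is only a hand-written approximation of what a vectorizer might emit, assuming `dst` and `src` do not overlap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar form: one 2-byte load and one 2-byte store per iteration. */
void copy_u16_scalar(uint16_t *dst, const uint16_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Hand-written approximation of a vectorized copy (hypothetical, for
 * illustration): four 2-byte elements are moved per iteration as one
 * 8-byte chunk, and a scalar tail handles the leftover elements.
 * Assumes dst and src do not overlap. */
void copy_u16_by4(uint16_t *dst, const uint16_t *src, size_t n)
{
    size_t i = 0;

    /* Main loop: 4 elements x 2 bytes = 8 bytes per load/store. */
    for (; i + 4 <= n; i += 4) {
        uint64_t chunk;
        memcpy(&chunk, &src[i], sizeof chunk);  /* 8-byte load  */
        memcpy(&dst[i], &chunk, sizeof chunk);  /* 8-byte store */
    }

    /* Tail loop: the remaining 0 to 3 elements, 2 bytes at a time. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

Both functions leave the same bytes in `dst`; only the width of the individual memory accesses differs. The `memcpy` calls make no alignment assumption, so whether a real compiler may use a single 8-byte instruction here depends on what alignment it can assume for the pointers.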