https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665
--- Comment #16 from wilco at gcc dot gnu.org ---
(In reply to wilco from comment #14)
> (In reply to PeteVine from comment #13)
> > Still, the 5% regression must have happened very recently. The fast gcc was
> > built on 20170220 and the slow one yesterday, using the original patch. Once
> > again, switching away from Cortex-A53 codegen restores the expected
> > performance.
>
> The issue is due to inefficient code generated for unsigned modulo:
>
>         umull   x0, w0, w4
>         umull   x1, w1, w4
>         lsr     x0, x0, 32
>         lsr     x1, x1, 32
>         lsr     w0, w0, 6
>         lsr     w1, w1, 6
>
> It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
> manually remove the redundant shifts I get a 15% speedup. I'll have a look.

See https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html
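
For readers unfamiliar with the quoted sequence: an unsigned division or modulo by a constant is commonly lowered to a widening multiply by a "magic" reciprocal followed by right shifts. The C sketch below mirrors that shape. The divisor 100, its constant 0x51EB851F and the shift amount 5 are assumptions chosen for illustration, since the thread does not state the divisor (the quoted code shifts by 6). It shows that the two-shift form and a single 64-bit shift give the same quotient, which appears to be the redundancy mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: divisor 100 and this constant are assumptions,
   not values taken from the bug report. 0x51EB851F == ceil(2^37 / 100). */
#define MAGIC_DIV100 0x51EB851Fu

/* Two-step form, mirroring the umull / lsr #32 / lsr #n shape. */
static uint32_t div100_two_shifts(uint32_t n)
{
    uint64_t product = (uint64_t)n * MAGIC_DIV100; /* like umull x, w, w */
    uint32_t high = (uint32_t)(product >> 32);     /* like lsr x, x, 32  */
    return high >> 5;                              /* like lsr w, w, 5   */
}

/* Same quotient with both shifts folded into one 64-bit shift. */
static uint32_t div100_one_shift(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * MAGIC_DIV100) >> 37);
}

/* Unsigned modulo built from the quotient: n - (n / 100) * 100. */
static uint32_t mod100(uint32_t n)
{
    return n - div100_one_shift(n) * 100u;
}

/* Compare both forms against the plain C operators on a few samples. */
static void self_check(void)
{
    static const uint32_t samples[] = { 0u, 1u, 99u, 100u, 12345u, UINT32_MAX };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t n = samples[i];
        assert(div100_two_shifts(n) == n / 100u);
        assert(div100_one_shift(n) == n / 100u);
        assert(mod100(n) == n % 100u);
    }
}
```

Calling self_check() exercises both variants. Whether a compiler emits one shift or two for such code depends on the target and tuning options, so this is only a model of the arithmetic, not of the scheduler issue itself.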