https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665
--- Comment #17 from wilco at gcc dot gnu.org ---
(In reply to wilco from comment #16)
> (In reply to wilco from comment #14)
> > (In reply to PeteVine from comment #13)
> > > Still, the 5% regression must have happened very recently. The fast gcc
> > > was built on 20170220 and the slow one yesterday, using the original
> > > patch. Once again, switching away from Cortex-A53 codegen restores the
> > > expected performance.
> >
> > The issue is due to inefficient code generated for unsigned modulo:
> >
> >         umull   x0, w0, w4
> >         umull   x1, w1, w4
> >         lsr     x0, x0, 32
> >         lsr     x1, x1, 32
> >         lsr     w0, w0, 6
> >         lsr     w1, w1, 6
> >
> > It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
> > manually remove the redundant shifts I get a 15% speedup. I'll have a look.
>
> See https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html

The redundant LSRs and SDIV are removed on latest trunk. Although my patch
above hasn't gone in, I get a 15% speedup on Cortex-A53 with
-mcpu=cortex-a53 and 8% with -mcpu=cortex-a72.
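
For readers unfamiliar with the pattern, here is a rough C sketch of how an unsigned division or modulo by a constant is typically lowered to a widening multiply followed by right shifts, which is the shape of the umull/lsr sequence quoted above. The divisor 10, its magic constant, and all function names are illustrative assumptions: the report does not show the constant held in w4, so this is not the code from the benchmark. It only shows why a shift by 32 followed by a second shift can be folded into a single shift of the 64-bit product.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example divisor: 10. The real constant in w4 is not given
 * in the report. 0xCCCCCCCD is ceil(2^35 / 10). */
#define MAGIC_DIV10 0xCCCCCCCDu

/* Two-step form, shaped like umull + lsr #32 + lsr #3. */
static uint32_t udiv10_two_shifts(uint32_t x)
{
    uint64_t product = (uint64_t)x * MAGIC_DIV10; /* widening multiply */
    uint32_t high = (uint32_t)(product >> 32);    /* keep the high word */
    return high >> 3;                             /* second, separate shift */
}

/* Merged form: one shift of the 64-bit product by 32 + 3 = 35. */
static uint32_t udiv10_one_shift(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * MAGIC_DIV10) >> 35);
}

/* Unsigned modulo built from the quotient: x - (x / 10) * 10. */
static uint32_t umod10(uint32_t x)
{
    return x - udiv10_one_shift(x) * 10u;
}

/* Spot check that both forms agree with the plain C operators. */
static void self_check(void)
{
    const uint32_t samples[] = { 0u, 9u, 10u, 1234567u, 4294967295u };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t x = samples[i];
        assert(udiv10_two_shifts(x) == x / 10u);
        assert(udiv10_one_shift(x) == x / 10u);
        assert(umod10(x) == x % 10u);
    }
}
```

Calling self_check() exercises both forms on a few sample values. The two forms compute the same quotient, so the second shift in the two-step version is the kind of redundancy the comment describes.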