[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's

wilco at gcc dot gnu.org Fri, 05 May 2017 04:22:26 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665


--- Comment #17 from wilco at gcc dot gnu.org ---
(In reply to wilco from comment #16)
> (In reply to wilco from comment #14)
> > (In reply to PeteVine from comment #13)
> > > Still, the 5% regression must have happened very recently. The fast gcc
> > > was built on 20170220 and the slow one yesterday, using the original
> > > patch. Once again, switching away from Cortex-A53 codegen restores the
> > > expected performance.
> > 
> > The issue is due to inefficient code generated for unsigned modulo:
> > 
> >         umull   x0, w0, w4
> >         umull   x1, w1, w4
> >         lsr     x0, x0, 32
> >         lsr     x1, x1, 32
> >         lsr     w0, w0, 6
> >         lsr     w1, w1, 6
> > 
> > It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
> > manually remove the redundant shifts I get a 15% speedup. I'll have a look.
> 
> See https://gcc.gnu.org/ml/gcc-patches/2017-04/msg01415.html

The redundant LSRs and SDIV are removed on latest trunk. Although my patch
above hasn't gone in, I get a 15% speedup on Cortex-A53 with -mcpu=cortex-a53
and 8% with -mcpu=cortex-a72.
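
For readers following along, here is a rough C sketch of what the quoted
umull/lsr sequence appears to compute. It rests on assumptions that are not
stated in this thread: that the divisor is the 200 from the bug title, and that
w4 holds the usual reciprocal constant 0x51EB851F (that is, ceil(2^38 / 200)).
The function names are hypothetical. The first function mirrors the three
instructions one by one. The second folds the two shifts into one, which is one
possible reading of "removing the redundant shifts" and not necessarily what
the patch does.

```c
#include <stdint.h>

/* Assumed reciprocal constant: ceil(2^38 / 200). Illustrative only;
   the actual value held in w4 is not shown in the thread. */
#define DIV200_MAGIC 0x51EB851Fu

/* Mirrors the quoted sequence: widening multiply, then two shifts. */
static uint32_t div200_two_shifts(uint32_t x)
{
    uint64_t prod = (uint64_t)x * DIV200_MAGIC; /* umull x0, w0, w4 */
    uint32_t hi = (uint32_t)(prod >> 32);       /* lsr   x0, x0, 32 */
    return hi >> 6;                             /* lsr   w0, w0, 6  */
}

/* Same result with the two shifts folded into one (32 + 6 = 38). */
static uint32_t div200_one_shift(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * DIV200_MAGIC) >> 38);
}
```

Both forms should agree with x / 200u for every 32-bit unsigned x, since
200 * 0x51EB851F exceeds 2^38 by only 56, which is below the 2^6 slack that
the 38-bit shift allows for a 32-bit input.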

[Bug middle-end/79665] gcc's signed (x*x)/200 is slower than clang's
