[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

rsandifo at gcc dot gnu.org via Gcc-bugs Tue, 29 Sep 2020 05:27:41 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789


rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #35 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to rguent...@suse.de from comment #24)
> On September 27, 2020 4:56:43 AM GMT+02:00, crazylht at gmail dot com
> <gcc-bugzi...@gcc.gnu.org> wrote:
> >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
> >
> >--- Comment #22 from Hongtao.liu <crazylht at gmail dot com> ---
> >>One of my workmates found that if we disable vectorization for
> >>SPEC2017 525.x264_r function sub4x4_dct in source file
> >>x264_src/common/dct.c with explicit function attribute
> >>__attribute__((optimize("no-tree-vectorize"))), it can speed up by 4%.
> >
> >For CLX, if we disable slp vectorization in sub4x4_dct by 
> >__attribute__((optimize("no-tree-slp-vectorize"))), it can also speed
> >up by 4%.
> >
> >> Thanks Richi! Should we take care of this case? or neglect this kind of
> >> extension as "no instruction"? I was intent to handle it in target specific
> >> code, but it isn't recorded into cost vector while it seems too heavy to do
> >> the bb_info slp_instances revisits in finish_cost.
> >
> >For i386 backend unsigned char --> unsigned short is no "no instruction",
> >but in this case
> >---
> >_134 = MEM[(pixel *)pix1_295 + 2B];
> >_135 = (short unsigned int) _134;
> >---
> >
> >It could be combined and optimized to 
> >---
> >movzbl  19(%rcx), %r8d
> >---
> >
> >So, if "unsigned char" variable is loaded from memory, then the conversion
> >would also be "no instruction", I'm not sure if backend cost model could
> >handle such situation.
> 
> I think all attempts to address this from the side of the scalar cost is
> going to be difficult and fragile..
Agreed FWIW.  Even in rtl, the kinds of conversion we're talking
about could be removed, such as by proving that the upper bits are
already correct, by combining the extension with other instructions
so that it becomes “free” again, or by the ree pass.  Proving that the upper
bits are already correct isn't uncommon: gimple has to make a choice
between signed and unsigned types even if both choices would be
correct, whereas rtl is sign-agnostic for storage.

So it's not obvious to me that trying to model things at this level is
going to be right more often than it's wrong.
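
To make the point concrete, here is a small C sketch.  It is not taken
from x264 or GCC: the function names and the pixel typedef are made up
for illustration.  Both functions widen a byte loaded from memory and
subtract, once with an unsigned 16-bit result and once with a signed
one.  The stored 16 bits are the same either way, which is the sense in
which the signed/unsigned choice can be arbitrary at the gimple level.

```c
#include <stdint.h>

typedef uint8_t pixel;  /* assumption: 8-bit pixels, as in the quoted dump */

/* Widen two bytes loaded from memory and subtract, keeping the result
   in an unsigned 16-bit type.  */
uint16_t diff_unsigned(const pixel *pix1, const pixel *pix2, int i)
{
    uint16_t a = pix1[i];   /* load + zero extension */
    uint16_t b = pix2[i];
    return (uint16_t)(a - b);
}

/* Same computation, but the result type is signed.  The difference of
   two bytes lies in [-255, 255], so it always fits in int16_t.  */
int16_t diff_signed(const pixel *pix1, const pixel *pix2, int i)
{
    int16_t a = pix1[i];    /* the value is in [0, 255], so the upper
                               bits are zero whichever extension is used */
    int16_t b = pix2[i];
    return (int16_t)(a - b);
}
```

With pix1[0] = 3 and pix2[0] = 5, diff_unsigned gives 65534 and
diff_signed gives -2, which are the same 16-bit pattern.  On x86-64 I
would expect each load-plus-widen to come out as a single movzbl, as in
comment #22, though that depends on the compiler version and options.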

