AVX2: missed simplification — low32×low32→u64 vectorized multiply expands to generic u64-mul sequence instead of single vpmuludq

pinskia at gcc dot gnu.org via Gcc-bugs Fri, 27 Feb 2026 22:18:43 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124271


--- Comment #8 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Hongtao Liu from comment #7)
> (In reply to Hongtao Liu from comment #6)
> > > 
> > > > +      unsigned int inner_prec = outer_prec / 2;
> > > > +      poly_uint64 outer_nelts = TYPE_VECTOR_SUBPARTS (type);
> > > > +      tree inner_scalar = build_nonstandard_integer_type (inner_prec, 1);
> > > 
> > > I think 1 here should be `TYPE_UNSIGNED (TREE_TYPE (type))`; otherwise
> > > you get a type mismatch.
> 
> static inline long mul32(long a, long b) {
>   return (a & 0xffffffff) * (b & 0xffffffff);
> }
> 
> void many_mul3(long* __restrict a,
>                const long* __restrict b)
> {
>     for (int i = 0; i < 1024; i++)
>         a[i] = mul32(a[i], b[i]);
> }
> 
> It looks like vec_widen_mult_even_expr allows unsigned inputs and a signed output:
>   <bb 3> [local count: 1063004408]:
>   # ivtmp.20_32 = PHI <ivtmp.20_19(5), 0(2)>
>   vect__4.11_22 = MEM <const vector(4) long int> [(const long int *)b_10(D) + ivtmp.20_32 * 1];
>   vect__6.7_26 = MEM <vector(4) long int> [(long int *)a_11(D) + ivtmp.20_32 * 1];
>   _17 = VIEW_CONVERT_EXPR<vector(8) unsigned int>(vect__4.11_22);
>   _24 = VIEW_CONVERT_EXPR<vector(8) unsigned int>(vect__6.7_26);
>   vect__16.13_18 = WIDEN_MULT_EVEN_EXPR <_17, _24>;
>   MEM <vector(4) long int> [(long int *)a_11(D) + ivtmp.20_32 * 1] = vect__16.13_18;
>   ivtmp.20_19 = ivtmp.20_32 + 32;
>   if (ivtmp.20_19 != 8192)
> 
> 
> But using TYPE_UNSIGNED (TREE_TYPE (type)) instead of 1 will generate
> vec_widen_smult_even_optab, which is incorrect.

Oh yes, I misunderstood the code :). Yes, this is a zero-extending widening multiply.
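To make the signedness point concrete, here is a minimal scalar model in C of what the GIMPLE above computes. It assumes a little-endian target such as x86, where the even 32-bit lanes of the view-converted vector(8) unsigned int are the low halves of the original 64-bit elements. The helper names (view_as_u32, widen_umult_even, widen_smult_even, check_lanes) are hypothetical and only illustrate lane semantics; they are not GCC code.

```c
#include <assert.h>
#include <stdint.h>

#define N64 4         /* vector(4) long int: four 64-bit lanes */
#define N32 (2 * N64) /* vector(8) unsigned int after VIEW_CONVERT_EXPR */

/* Model VIEW_CONVERT_EXPR<vector(8) unsigned int> on a little-endian target
   (assumption): 64-bit lane i becomes 32-bit lanes 2*i (low half) and
   2*i+1 (high half). */
static void view_as_u32(const uint64_t in[N64], uint32_t out[N32]) {
  for (int i = 0; i < N64; i++) {
    out[2 * i] = (uint32_t)in[i];
    out[2 * i + 1] = (uint32_t)(in[i] >> 32);
  }
}

/* Unsigned widening multiply of the even lanes (the umult_even form):
   zero-extend each even lane to 64 bits, then multiply. This is the
   per-lane operation vpmuludq performs. */
static void widen_umult_even(const uint32_t x[N32], const uint32_t y[N32],
                             uint64_t r[N64]) {
  for (int i = 0; i < N64; i++)
    r[i] = (uint64_t)x[2 * i] * (uint64_t)y[2 * i];
}

/* Signed counterpart (the smult_even form), for contrast: even lanes are
   sign-extended, so a lane with bit 31 set is treated as negative.
   The int32_t conversion is implementation-defined before C23; GCC wraps. */
static void widen_smult_even(const uint32_t x[N32], const uint32_t y[N32],
                             int64_t r[N64]) {
  for (int i = 0; i < N64; i++)
    r[i] = (int64_t)(int32_t)x[2 * i] * (int64_t)(int32_t)y[2 * i];
}

/* Assert that the unsigned form equals (a & 0xffffffff) * (b & 0xffffffff)
   in every lane. uint64_t keeps the model free of signed-overflow concerns. */
static void check_lanes(const uint64_t a[N64], const uint64_t b[N64]) {
  uint32_t va[N32], vb[N32];
  uint64_t r[N64];
  view_as_u32(a, va);
  view_as_u32(b, vb);
  widen_umult_even(va, vb, r);
  for (int i = 0; i < N64; i++)
    assert(r[i] == (a[i] & 0xffffffffu) * (b[i] & 0xffffffffu));
}
```

For a lane whose low half is 0x80000000 multiplied by 2, the unsigned form yields 0x100000000 while the signed form yields -0x100000000, which is why selecting the smult optab here would miscompile.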



> 
> 
> The other changes were made as suggested.
> 
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 7f16fd4e081..d99d83bab84 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -436,6 +436,32 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>                                      { zeros; })
>                            { ones; } { zeros; })))))))))
> 
> +#if GIMPLE
> +/*    (x & lowhalf_mask) * (y & lowhalf_mask)
> +    -> VEC_WIDEN_MULT_EVEN (VIEW_CONVERT (x), VIEW_CONVERT (y)).  */
> +(simplify
> + (mult (bit_and @0 VECTOR_CST@2)
> +      (bit_and @1 @2))
> + (if (uniform_vector_p (@2)
> +      && TYPE_VECTOR_SUBPARTS (type).is_constant ()
> +      && TYPE_VECTOR_SUBPARTS (type).to_constant () > 1)

I was going to ask why the `> 1` check is there, and then I remembered the `/ 2` below.
Also, why not just:
known_gt (TYPE_VECTOR_SUBPARTS (type), 1)

I am not sure you need to know whether it is a constant or not, just that it has
at least 2 elements.
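The known_gt suggestion can be illustrated with a toy model of a two-coefficient poly_uint64, where an element count is c0 + c1 * x for a runtime-only x >= 0 (roughly how scalable vectors such as SVE's are described). The struct and functions below are hypothetical stand-ins, not GCC's poly-int.h API; they only show that known_gt (n, 1) accepts counts that is_constant () rejects while still guaranteeing at least two elements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model (not GCC's API): value = c0 + c1 * x, with x >= 0 known only
   at run time. */
typedef struct {
  uint64_t c0; /* constant part */
  uint64_t c1; /* multiple of the runtime indeterminate */
} poly_u64_model;

/* Analogue of is_constant (): no runtime-dependent part. */
static bool model_is_constant(poly_u64_model n) { return n.c1 == 0; }

/* Analogue of known_gt (n, k) for a constant k: n > k must hold for every
   x >= 0. Coefficients are unsigned, so the minimum is at x == 0 and only
   c0 needs checking. */
static bool model_known_gt(poly_u64_model n, uint64_t k) { return n.c0 > k; }

/* Guard as posted: constant element count greater than one. */
static bool guard_as_posted(poly_u64_model n) {
  return model_is_constant(n) && n.c0 > 1;
}

/* Guard as suggested: at least two elements for any runtime length. */
static bool guard_suggested(poly_u64_model n) { return model_known_gt(n, 1); }
```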

> +   (with
> +    {
> +      auto elem = wi::to_wide (uniform_vector_p (@2));
> +      unsigned int outer_prec = TYPE_PRECISION (TREE_TYPE (type));
> +      unsigned int inner_prec = outer_prec / 2;
> +      poly_uint64 outer_nelts = TYPE_VECTOR_SUBPARTS (type);
> +      tree inner_scalar = build_nonstandard_integer_type (inner_prec, 1);
> +      tree inner_type = build_vector_type (inner_scalar, outer_nelts * 2);
> +    }
> +    (if (GET_MODE_CLASS (TYPE_MODE (inner_type)) == MODE_VECTOR_INT
> +        && elem == wi::mask (inner_prec, false, outer_prec)
> +        && optab_handler (vec_widen_umult_even_optab,
> +                          TYPE_MODE (inner_type)) != CODE_FOR_nothing)
> +       (vec_widen_mult_even (view_convert:inner_type @0)
> +                           (view_convert:inner_type @1))))))
> +#endif
> +

Yes, this looks simpler and looks good.
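The pattern's guard requires the uniform AND constant to be exactly the low-half mask, because the rewrite drops the AND entirely. Below is a minimal C sketch of that check for element precisions up to 64 bits. low_mask and mask_allows_umult_even are hypothetical helpers standing in for wi::mask and the match.pd condition, umult_even_lane models one 64-bit lane of the rewritten multiply, and masked_mult_lane models the original expression with an arbitrary mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for wi::mask (inner_prec, false, outer_prec) when
   outer_prec <= 64: only the low inner_prec bits set. */
static uint64_t low_mask(unsigned inner_prec) {
  return inner_prec >= 64 ? UINT64_MAX : ((uint64_t)1 << inner_prec) - 1;
}

/* Mirrors the pattern's condition: inner_prec = outer_prec / 2 and the
   uniform AND constant must equal exactly the low-half mask. */
static bool mask_allows_umult_even(uint64_t elem, unsigned outer_prec) {
  unsigned inner_prec = outer_prec / 2;
  return elem == low_mask(inner_prec);
}

/* One 64-bit lane of the rewritten form: no AND; the low 32 bits of each
   operand are zero-extended and multiplied. */
static uint64_t umult_even_lane(uint64_t x, uint64_t y) {
  return (uint64_t)(uint32_t)x * (uint32_t)y;
}

/* One 64-bit lane of the original form with an arbitrary mask m. */
static uint64_t masked_mult_lane(uint64_t x, uint64_t y, uint64_t m) {
  return (x & m) * (y & m);
}
```

With m == 0xffffffff the two lane functions agree; with a wider mask such as 0x1ffffffff or a narrower one such as 0xffff they can differ, so accepting anything but the exact low-half mask would be wrong once the AND is removed.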

[Bug target/124271] x86/AVX2: missed simplification — low32×low32→u64 vectorized multiply expands to generic u64-mul sequence instead of single vpmuludq
