On Mon, Dec 8, 2025 at 8:45 AM Feng Xue OS <[email protected]> wrote:
>
> >>>> A possible approach to reference is detection of table-based CTZ in
> >>>> the SSA forward-propagation pass, which might be a suitable position
> >>>> for this pattern, in that the pass is invoked more than once and some
> >>>> of its invocations happen before the reassociation pass. But
> >>>> simplification of the 64x64->128 pattern should be classified as a
> >>>> mathematics-related optimization, so it is better placed in
> >>>> tree-ssa-math-opts.cc for logical consistency. In that file there is
> >>>> pass_optimize_widening_mul, which is very close to what we want, but
> >>>> it runs too late to keep the pattern intact across reassociation. So
> >>>> I am considering adding a dedicated pass like pass_cse_reciprocals
> >>>> and placing it prior to reassociation; the matching would mostly be
> >>>> hand-coded, with only some leaf patterns defined via match.pd.
> >>
> >>> As you figured, we already have highpart multiplication recognition.
> >>> It's the late reassoc pass that is the problem, given it applies
> >>> reassoc_width? Or is the early reassoc pass also problematic? We do
> >>> have some special-casing of FMA patterns in reassoc, so in theory we
> >>> could handle widening-mult candidates similarly, but then doing
> >>> widen-mult detection where we do cse_sincos/reciprocals (those
> >>> back-to-back passes should ideally share their CFG walk) is a viable
> >>> approach. I would suggest against the SSA forward-propagation pass,
> >>> esp. against trying to introduce highpart multiplication before
> >>> vectorization or other loop optimizations.
> >>
> >> Current highpart multiplication and widen-mul recognition only handle
> >> cases where the operation involves truncation from a large integer
> >> type to a small one. An example from the regression testcases:
> >>
> >> unsigned long long foo(unsigned long long x, unsigned long long y)
> >> {
> >>   return ((__uint128_t)x * (__uint128_t)y) >> 64;
> >> }
> >>
> >> That is, __uint128_t is used explicitly in the source code, no
> >> addition is involved, and the pattern is not affected by
> >> reassociation. However, the pattern in question is emulated entirely
> >> with uint64_t operations and never mentions a 128-bit integer type
> >> literally. It is synthesized from the small type up to the large one;
> >> that is the difference.
> >>
> >> I'm afraid we cannot special-case this pattern in reassoc the way FMA
> >> is handled, since it is much more complex.
> >
> > I think it's important to identify the parts of your
> >
> >  void multiply_uint64_to_uint128(uint64_t op0, uint64_t op1, uint64_t *res)
> >  {
> >    uint64_t op0_lo = op0 & 0x00000000FFFFFFFF;
> >    uint64_t op1_lo = op1 & 0x00000000FFFFFFFF;
> >    uint64_t op0_hi = op0 >> 32;
> >    uint64_t op1_hi = op1 >> 32;
> >    uint64_t mul_lo = op0_lo * op1_lo;
> >    uint64_t mul_mi_01 = op0_hi * op1_lo;
> >    uint64_t mul_mi_10 = op1_hi * op0_lo;
> >    uint64_t mul_hi = op0_hi * op1_hi;
> >    uint64_t sum_mi = mul_mi_01 + mul_mi_10;
> >    uint64_t sum_mi_carry = sum_mi < mul_mi_01;
> >    uint64_t sum_32 = (mul_lo >> 32) + (sum_mi & 0x00000000FFFFFFFF);
> >
> >    res[1] = mul_hi + (sum_mi_carry << 32) + (sum_mi >> 32) + (sum_32 >> 32);
> >    res[0] = (sum_32 << 32) | (mul_lo & 0x00000000FFFFFFFF);
> >  }
> >
> > example that make up the highpart multiply to recognize, since, for
> > example, changing the sink from a store of the two 64-bit parts to
> > something else may disrupt things.
> > If we take away the lowpart of the result, one key "element" is the
> > computation of the carry and the fact that we operate on the 32-bit
> > halves of the two inputs. There is another pass which I think faces
> > similar issues, and that's the CRC (loop) recognition, which uses
> > symbolic execution to recognize the overall effect of a series of
> > operations. Maybe that could be leveraged, starting from the op0/op1
> > definition side, and it would also be less dependent on the particular
> > association of the individual operations?
> >
> > Richard.
>
> I went through how the CRC pattern is recognized via symbolic execution.
> My feeling is that adopting the same approach for this mult pattern does
> not resolve the problem that the pattern's statements may be shuffled
> together with irrelevant statements by reassoc,
>
>   pattern_part1 + (irrelevant_part + pattern_part2) + ...
>
> It is still not easy to filter out the irrelevant statements in
> straight-line code, whereas a CRC candidate is a loop, so a pattern
> match is all-or-nothing with respect to the statements in the loop.
>
> Moreover, symbolic execution is based on per-bit analysis, which
> naturally fits CRC, since it mainly consists of xor bit operations.
> But for a multiplication, the output of symbolic execution becomes a
> series of bit-wise operations expressed as the low-level boolean logic
> by which an ALU performs a multiply, and that kind of bit-level pattern
> tends to be rather complex.
>
> In your previous comments you mentioned that it is better to place
> mult-highpart recognition after loop optimizations; is that because
> doing it earlier would disrupt vect pattern formation?

Yes, there is less risk of regressions if you do not move it across the
loop optimization pipeline.  We also don't have to worry about the case
where we can vectorize the original sequence but not a highpart multiply.

> I see that mult-highpart is only generated in vect_recog_divmod_pattern,
> while there is no overlap between the emulated mult pattern and the
> divmod pattern.  Or is there any other potential issue I missed?
>
> Regards,
> Feng
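
To make the association concern above concrete, here is a hypothetical
variant (the function and its "bias" operand are made up for illustration,
not taken from the testcases) in which the highpart is not stored directly
but added to an unrelated value, so the final addition chain becomes one
big reassociation candidate:

#include <stdint.h>

uint64_t mul_hi_plus_bias (uint64_t op0, uint64_t op1, uint64_t bias)
{
  uint64_t op0_lo = op0 & 0x00000000FFFFFFFF, op0_hi = op0 >> 32;
  uint64_t op1_lo = op1 & 0x00000000FFFFFFFF, op1_hi = op1 >> 32;

  uint64_t mul_lo = op0_lo * op1_lo;
  uint64_t mul_mi_01 = op0_hi * op1_lo;
  uint64_t mul_mi_10 = op1_hi * op0_lo;
  uint64_t mul_hi = op0_hi * op1_hi;

  uint64_t sum_mi = mul_mi_01 + mul_mi_10;
  uint64_t sum_mi_carry = sum_mi < mul_mi_01;
  uint64_t sum_32 = (mul_lo >> 32) + (sum_mi & 0x00000000FFFFFFFF);

  /* The chain below is one reassociation candidate; reassoc may regroup
     it as e.g. mul_hi + (bias + (sum_mi >> 32)) + ..., i.e. the
     pattern_part1 + (irrelevant_part + pattern_part2) shape mentioned
     above, so the emulated multiply is no longer a self-contained group
     of statements.  */
  return mul_hi + (sum_mi_carry << 32) + (sum_mi >> 32) + (sum_32 >> 32)
         + bias;
}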

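For reference, this is the form the emulated sequence would ideally
collapse to once recognized (a sketch only; it assumes a target where
__uint128_t is available, so the existing highpart/lowpart recognition,
as in foo() above, can take over and reassociation cannot tear the
pattern apart):

#include <stdint.h>

/* Semantically equivalent to multiply_uint64_to_uint128 above; the name
   is made up for this sketch.  */
void multiply_uint64_to_uint128_collapsed (uint64_t op0, uint64_t op1,
                                           uint64_t *res)
{
  __uint128_t prod = (__uint128_t) op0 * op1;
  res[0] = (uint64_t) prod;         /* low 64 bits  */
  res[1] = (uint64_t) (prod >> 64); /* high 64 bits */
}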