>>>> A possible approach to take as a reference is the detection of
>>>> table-based CTZ in the ssa-forward pass, which might also be a suitable
>>>> place for this pattern, since the ssa-forward pass is invoked more than
>>>> once and some of its invocations happen before the reassociation pass.
>>>> But simplification of the 64x64->128 pattern should be classified as a
>>>> mathematics-related optimization, so for logical consistency it is
>>>> better placed in tree-ssa-math-opts.cc. That file already contains
>>>> pass_optimize_widening_mul, which is very close to what we want, but it
>>>> runs too late to keep the pattern intact against reassociation. So I am
>>>> considering adding a dedicated pass like pass_cse_reciprocals, placed
>>>> prior to reassociation, where the bulk of the matching is hand-coded
>>>> and only some leaf patterns are defined via match.pd.
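
For context: the table-based CTZ idiom referred to here, which the
ssa-forward pass already recognizes, is essentially the classic de Bruijn
lookup.  A minimal sketch (my own illustration, not an actual testcase):

  #include <stdint.h>

  /* Table-based CTZ: isolate the lowest set bit, multiply by a de Bruijn
     constant and use the top five bits to index a lookup table.
     Sketch only; x must be non-zero.  */
  static const int ctz_tab[32] = {
     0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
  };

  int ctz32 (uint32_t x)
  {
    return ctz_tab[((x & -x) * 0x077CB531u) >> 27];
  }
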
>>
>>> As you figured, we already have highpart multiplication recognition.
>>> It's the late reassoc pass that is the problem, given it applies
>>> reassoc_width?  Or is the early reassoc pass also problematic?  We do
>>> have some special-casing of fma patterns in reassoc, so
>>> in theory we could handle widening mult candidates similarly, but then
>>> doing widen-mult detection where we do cse_sincos/reciprocals
>>> (those back-to-back passes should ideally share their CFG walk) is a
>>> viable approach.  I would suggest against the ssa-forward pass,
>>> esp. against trying to introduce highpart multiplication before
>>> vectorization or other loop optimizations.
>>
>> Current highpart multiplication and widen-mul recognition only handle cases
>> where the operations involve truncation from a large integer type to a
>> small one. An example from the regression testcases:
>>
>> unsigned long long foo(unsigned long long x, unsigned long long y)
>> {
>>   return ((__uint128_t)x * (__uint128_t)y) >> 64;
>> }
>>
>> That is, __uint128_t is explicitly used in the source code, so no addition
>> is involved and reassociation has no impact. However, the pattern I am
>> targeting is completely emulated with uint64_t operations and literally
>> contains no 128-bit integer operation at all. It is synthesized from the
>> small type up to the large one; that is the difference.
>>
>> I'm afraid we cannot special-case the pattern in reassoc the way fma is
>> handled, because it is much more complex.
> 
> I think it's important to identify the parts of your
> 
> void multiply_uint64_to_uint128(uint64_t op0, uint64_t op1, uint64_t *res)
>     {
>       uint64_t op0_lo = op0 & 0x00000000FFFFFFFF;
>       uint64_t op1_lo = op1 & 0x00000000FFFFFFFF;
>       uint64_t op0_hi = op0 >> 32;
>       uint64_t op1_hi = op1 >> 32;
>       uint64_t mul_lo = op0_lo * op1_lo;
>       uint64_t mul_mi_01 = op0_hi * op1_lo;
>       uint64_t mul_mi_10 = op1_hi * op0_lo;
>       uint64_t mul_hi = op0_hi * op1_hi;
>       uint64_t sum_mi = mul_mi_01 + mul_mi_10;
>       uint64_t sum_mi_carry = sum_mi < mul_mi_01;
>       uint64_t sum_32 = (mul_lo >> 32) + (sum_mi & 0x00000000FFFFFFFF);
> 
>       res[1] = mul_hi + (sum_mi_carry << 32) + (sum_mi >> 32) + (sum_32 >> 32);
>       res[0] = (sum_32 << 32) | (mul_lo & 0x00000000FFFFFFFF);
>     }
> 
> example that make up the highpart multiply to recognize, since, for example,
> changing the sink from a store of the two 64-bit parts to something else
> could disrupt things.  If we take away the lowpart of the result, one key
> "element" is the computation of the carry and the fact that we operate on
> the 32-bit halves of the two inputs.  There is another pass which I think
> faces similar issues, and that's the CRC (loop) recognition, which uses
> symbolic execution to recognize the overall effect of a series of
> operations.  Maybe that could be leveraged here, starting from the op0/op1
> definition side, and it would also be less dependent on the particular
> association of the individual operations?
> 
> Richard.
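
As a side note, it helped me to spell out what any recognizer ultimately has
to prove here: that the sequence above computes exactly the native 128-bit
product.  A minimal self-check (my own sketch, assuming a target that also
provides __uint128_t) could be:

  #include <assert.h>
  #include <stdint.h>

  /* Compare the quoted multiply_uint64_to_uint128() against the native
     128-bit multiply; only meaningful where __uint128_t is available.  */
  static void check_mul128 (uint64_t x, uint64_t y)
  {
    uint64_t res[2];
    multiply_uint64_to_uint128 (x, y, res);
    __uint128_t ref = (__uint128_t) x * y;
    assert (res[0] == (uint64_t) ref);
    assert (res[1] == (uint64_t) (ref >> 64));
  }

  int main (void)
  {
    check_mul128 (0, 0);
    check_mul128 (~(uint64_t) 0, ~(uint64_t) 0);
    check_mul128 (0x0123456789abcdefull, 0xfedcba9876543210ull);
    return 0;
  }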

I went through how the CRC pattern is recognized via symbolic execution. My
feeling is that adopting the same approach for this mult pattern does not
resolve the problem that the pattern's statements might be shuffled together
with irrelevant statements by reassoc, e.g.

   pattern_part1 + (irrelevant_part + pattern_part2) + ...

It is still not easy to filter out the irrelevant statements in straight-line
code, whereas a CRC candidate is a loop, so a pattern hit is all-or-nothing
with respect to the statements in the loop.
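
To make the concern concrete, a reduced sketch of my own (not an actual
testcase): once the recognizable high-part sum feeds a larger addition,
reassoc is free to regroup the terms, and the unrelated addend can end up
interleaved with the carry chain.

  #include <stdint.h>

  /* Sketch: the high-part computation from the quoted emulation, with one
     unrelated addend mixed into the final sum.  After reassociation the
     grouping may become e.g. (bias + (sum_mi >> 32)) + mul_hi + ..., so
     the original mul_hi/carry/mid subtree no longer exists as such.  */
  uint64_t mul_highpart_plus_bias (uint64_t op0, uint64_t op1, uint64_t bias)
  {
    uint64_t op0_lo = op0 & 0xFFFFFFFFull, op1_lo = op1 & 0xFFFFFFFFull;
    uint64_t op0_hi = op0 >> 32, op1_hi = op1 >> 32;
    uint64_t mul_lo = op0_lo * op1_lo;
    uint64_t mul_mi_01 = op0_hi * op1_lo;
    uint64_t mul_mi_10 = op1_hi * op0_lo;
    uint64_t mul_hi = op0_hi * op1_hi;
    uint64_t sum_mi = mul_mi_01 + mul_mi_10;
    uint64_t sum_mi_carry = sum_mi < mul_mi_01;
    uint64_t sum_32 = (mul_lo >> 32) + (sum_mi & 0xFFFFFFFFull);
    return bias + mul_hi + (sum_mi_carry << 32) + (sum_mi >> 32)
           + (sum_32 >> 32);
  }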

Moreover, symbolic execution is based on per-bit analysis, which naturally
fits the CRC scenario, since CRC is mainly composed of xor bit operations.
But for a multiplication, the output of symbolic execution becomes a series
of bit-wise operations expressed in the low-level boolean logic with which an
ALU performs a multiply, and this kind of bit-level pattern tends to be
rather complex.
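
Just to illustrate the blow-up: even a 2x2-bit multiply, written purely in
the AND/XOR form that symbolic execution would produce, already needs
explicit partial products and carry terms (a toy example of mine, nothing
GCC-specific):

  /* 2-bit x 2-bit multiply expressed only with AND/XOR on single bits;
     the 4-bit product is reassembled at the end.  For a 64x64->128
     multiply the same expansion yields thousands of such terms.  */
  unsigned mul2x2_bits (unsigned a, unsigned b)
  {
    unsigned a0 = a & 1, a1 = (a >> 1) & 1;
    unsigned b0 = b & 1, b1 = (b >> 1) & 1;
    unsigned p0 = a0 & b0;
    unsigned c1 = (a1 & b0) & (a0 & b1);  /* carry out of bit 1 */
    unsigned p1 = (a1 & b0) ^ (a0 & b1);
    unsigned p2 = (a1 & b1) ^ c1;
    unsigned p3 = (a1 & b1) & c1;         /* carry out of bit 2 */
    return p0 | (p1 << 1) | (p2 << 2) | (p3 << 3);
  }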

In the previous comments, you mentioned it is better to place mult-highpart
recognition after loop optimizations. Is that because it would otherwise
disrupt vect pattern formation? As far as I can see, mult-highpart is only
generated by vect_recog_divmod_pattern, and there is no overlap between the
emulated mult pattern and the divmod pattern. Or is there another potential
issue I missed?
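
For reference, the kind of case I mean is a division by a constant in a
vectorizable loop, roughly as in the sketch below (my own example); there
the vectorizer, if the target supports it, implements the division via a
highpart multiply with a magic constant.

  #include <stdint.h>

  /* Sketch: unsigned division by a non-power-of-two constant in a
     countable loop; as far as I can see this is where
     vect_recog_divmod_pattern introduces a highpart multiply.  */
  void div_by_three (uint64_t *restrict dst, const uint64_t *restrict src,
                     int n)
  {
    for (int i = 0; i < n; i++)
      dst[i] = src[i] / 3;
  }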

Regards,
Feng
