Hi,

> -----Original Message-----
> From: Andrew Stubbs <a...@codesourcery.com>
> Sent: Monday, April 11, 2022 12:19 PM
> To: GCC Development <gcc@gcc.gnu.org>
> Cc: Tamar Christina <tamar.christ...@arm.com>
> Subject: Complex multiply optimization working?
> 
> Hi all,
> 
> I've been looking at implementing the complex multiply patterns for the
> amdgcn port, but I'm not getting the code I was hoping for. When I try to use
> the patterns on x86_64 or AArch64 they don't seem to work there either, so
> is there something wrong with the middle-end? I've tried both current HEAD
> and GCC 11.

They work fine in both GCC 11 and HEAD: https://godbolt.org/z/Mxxz6qWbP
Did you actually enable the instructions?
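
For reference, a minimal loop of the shape the complex-multiply pattern matching targets, as a sketch only (the function and array names are illustrative, not taken from the attached testcase):

```c
#include <complex.h>

/* Element-wise complex multiply: the loop shape the vectorizer's
   complex-multiply pattern detection is meant to recognize.  */
void
cmul (double _Complex *restrict c, const double _Complex *restrict a,
      const double _Complex *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    c[i] = a[i] * b[i];
}
```

On targets with complex-arithmetic instructions enabled, this is the form the godbolt links above show being matched.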

The fully unrolled form doesn't get detected at -Ofast because the SLP
vectorizer doesn't detect TWO_OPERAND nodes as a constructor; see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104406

note:   Final SLP tree for instance 0x2debde0:
note:   node 0x2cdf900 (max_nunits=2, refcnt=2) vector(2) double
note:   op template: _463 = _457 * _460;
note:           stmt 0 _463 = _457 * _460;
note:           stmt 1 _464 = _458 * _459;
note:           children 0x2cdf990 0x2cdfa20
note:   node 0x2cdf990 (max_nunits=2, refcnt=2) vector(2) double
note:   op template: _457 = REALPART_EXPR <MEM[(complexT *)a_101(D) + 512B]>;
note:           stmt 0 _457 = REALPART_EXPR <MEM[(complexT *)a_101(D) + 512B]>;
note:           stmt 1 _458 = IMAGPART_EXPR <MEM[(complexT *)a_101(D) + 512B]>;
note:           load permutation { 64 65 }
note:   node 0x2cdfa20 (max_nunits=2, refcnt=2) vector(2) double
note:   op template: _460 = IMAGPART_EXPR <MEM[(complexT *)b_102(D) + 512B]>;
note:           stmt 0 _460 = IMAGPART_EXPR <MEM[(complexT *)b_102(D) + 512B]>;
note:           stmt 1 _459 = REALPART_EXPR <MEM[(complexT *)b_102(D) + 512B]>;
note:           load permutation { 65 64 }

In the general case, were these scalars, the benefit would be dubious
because of the data movement between register files.
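
For reference, the open-coded ("fully unrolled") form in question looks roughly like this, as a sketch only (names are illustrative):

```c
#include <complex.h>

/* Complex multiply with the real/imaginary arithmetic written out
   explicitly.  This open-coded shape is what the SLP vectorizer
   currently fails to match at -Ofast (PR 104406).  */
void
cmul_open (double _Complex *restrict c, const double _Complex *restrict a,
           const double _Complex *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    {
      double ar = creal (a[i]), ai = cimag (a[i]);
      double br = creal (b[i]), bi = cimag (b[i]);
      /* (ar + ai*i)(br + bi*i) = (ar*br - ai*bi) + (ar*bi + ai*br)*i */
      c[i] = (ar * br - ai * bi) + (ar * bi + ai * br) * I;
    }
}
```

The REALPART_EXPR/IMAGPART_EXPR loads with the { 64 65 } and { 65 64 } permutations in the SLP dump above correspond to the creal/cimag accesses here.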

At -O3 it works fine (no -Ofast canonicalization rules rewriting the form),
but the cost of the loop is too high to be profitable. You have to disable
the cost model to get it to vectorize where it would use them:
https://godbolt.org/z/MsGq84WP9
And the vectorizer is right here: the scalar code is cheaper.

The various canonicalization differences at -Ofast produce many different
forms, e.g. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408

So yes, detection is working as intended, but some -Ofast cases are not
detected yet.

> 
> The example shown in the internals manual is a simple loop multiplying two
> arrays of complex numbers, and writing the results to a third. I had expected
> that it would use the largest vectorization factor available, with the
> real/imaginary numbers in even/odd lanes as described, but the
> vectorization factor is only 2 (so, a single complex number), and I have to 
> set
> -fvect-cost-model=unlimited to get even that.
> 
> I tried another example with SLP and that too uses the cmul patterns only for
> a single real/imaginary pair.
> 
> Did proper vectorization of cmul ever really work? There is a case in the
> testsuite for the pattern match, but it isn't in a loop.
> 

There are both SLP and LOOP variants in the testsuite; all the patterns are
inside a loop. The mul tests are generated from
https://github.com/gcc-mirror/gcc/blob/master/gcc/testsuite/gcc.dg/vect/complex/complex-mul-template.c

The tests that use this template instruct the vectorizer to fully unroll
some cases, while others are kept as a loop. So both forms are tested in
the testsuite.

> Thanks
> 
> Andrew
> 
> P.S. I attached my testcase, in case I'm doing something stupid.

Both work: https://godbolt.org/z/Mxxz6qWbP and https://godbolt.org/z/MsGq84WP9

Regards,
Tamar

> 
> P.P.S. The manual says the pattern is "cmulm4", etc., but it's actually
> "cmulm3" in the implementation.
