https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008
--- Comment #14 from Chris Elrod ---
To me, an "inline" function is one that the compiler inlines.
It just happens that the `inline` keyword also means both comdat semantics, and
possibly hiding the symbol to make it internal (-fvisibility-inline
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027
--- Comment #9 from Chris Elrod ---
> Interestingly this seems to be only reproducible on Arch Linux. Other gcc
> 13.1.1 builds, Fedora for instance, seem to behave correctly.
I haven't tried that reproducer on Fedora with gcc 13.2.1, which c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114276
--- Comment #1 from Chris Elrod ---
Created attachment 57652
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57652&action=edit
assembly from adding `-S`
Version: 13.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: elrodc at gmail dot com
Target Milestone: ---
Created attachment 57651
--> https://gcc.gnu.org/bugzi
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824
--- Comment #8 from Chris Elrod ---
> If it's designed the way you want it to be, another issue would be like,
> should we lower 512-bit vector builtins/intrinsic to ymm/xmm when
> -mprefer-vector-width=256, the answer is we'd rather not.
To
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824
--- Comment #6 from Chris Elrod ---
Hongtao Liu, I do think that one should ideally be able to get optimal codegen
when using 512-bit builtin vectors or vector intrinsics, without needing to set
`-mprefer-vector-width=512` (and, currently, also
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824
--- Comment #3 from Chris Elrod ---
> I thought I hit the important cases, but my non-minimal example still gets
> unnecessary register splits and stack spills, so maybe I missed places, or
> perhaps there's another issue.
Adding the unroll p
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824
--- Comment #2 from Chris Elrod ---
https://godbolt.org/z/3648aMTz8
Perhaps a simpler diff is that you can reproduce by uncommenting the pragma,
but codegen becomes good with it.
template
constexpr auto operator*(OuterDualUA2 a, OuterDualUA2
b
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824
--- Comment #1 from Chris Elrod ---
Here I have added a godbolt example where I manually unroll the array, where
GCC generates excellent code https://godbolt.org/z/sd4bhGW7e
I'm not sure it is 100% optimal, but with an inner Dual size of `7`, on
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: elrodc at gmail dot com
Target Milestone: ---
I am not sure which component to place this under, but selected
tree-optimization as I suspect this is some sort of alias analysis failure
preventing the
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111493
--- Comment #2 from Chris Elrod ---
Note that it also shows up in gcc-13. I put gcc-14 as the version to indicate
that I confirmed it is still a problem on latest trunk. Not sure what the
policy is on which version we should report.
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: elrodc at gmail dot com
Target Milestone: ---
Two example programs:
> #include
> constexpr auto foo(const auto &A, int i, int j)
> requires(requires(decltype
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929
--- Comment #32 from Chris Elrod ---
Ha, I accidentally misreported my gcc version. I was already using 12.1.1.
Using x86-64-v4 worked, excellent! Thanks.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929
--- Comment #30 from Chris Elrod ---
> #if defined(__clang__)
> #define MULTIVERSION \
> __attribute__((target_clones("avx512dq", "avx2", "default")))
> #else
> #define MULTIVERSION
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929
Chris Elrod changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |elrodc at gmail dot com
--- Comment #29
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899
--- Comment #2 from Chris Elrod ---
Interesting. Compiling with:
gcc -march=native -fvariable-expansion-in-unroller -Ofast -funroll-loops -S dot.c -o dot.s
Yields:
```
.L4:
vmovupd (%rdi,%r11), %zmm9
vmovupd 64(%rdi,%r11), %zmm
Version: 10.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: elrodc at gmail dot com
Target Milestone: ---
Created attachment 48784
--> https://gcc.gnu.org/bugzi
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #54 from Chris Elrod ---
I commented elsewhere, but I built trunk a few days ago with H.J.Lu's patches
(attached here) and Thomas Koenig's inlining patches.
With these patches, g++ and all versions of the Fortran code produced excelle
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #35 from Chris Elrod ---
> rsqrt:
> .LFB12:
> .cfi_startproc
> vrsqrt28ps (%rsi), %zmm0
> vmovups %zmm0, (%rdi)
> vzeroupper
> ret
>
> (huh? isn't there a NR step missing?)
>
I assume
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #32 from Chris Elrod ---
(In reply to Marc Glisse from comment #31)
> (In reply to Chris Elrod from comment #30)
> > gcc calculates the rsqrt directly
>
> No, vrsqrt14ps is just the first step in calculating sqrt here (slightly
> dif
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #30 from Chris Elrod ---
gcc still (In reply to Marc Glisse from comment #29)
> The main difference I can see is that clang computes rsqrt directly, while
> gcc first computes sqrt and then computes the inverse. Also gcc seems afraid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #28 from Chris Elrod ---
Created attachment 45501
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45501&action=edit
Minimum working example of the rsqrt problem. Can be compiled with: gcc -Ofast
-S -march=skylake-avx512 -mprefer-
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #27 from Chris Elrod ---
g++ -mrecip=all -O3 -fno-signed-zeros -fassociative-math -freciprocal-math
-fno-math-errno -ffinite-math-only -fno-trapping-math -fdump-tree-optimized -S
-march=native -shared -fPIC -mprefer-vector-width=512
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #26 from Chris Elrod ---
> You can try enabling -mrecip to see RSQRT in .optimized - there's
> probably late 1/sqrt optimization on RTL.
No luck. The full commands I used:
gfortran -Ofast -mrecip -S -fdump-tree-optimized -march=nati
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #24 from Chris Elrod ---
The dump looks like this:
vect__67.78_217 = SQRT (vect__213.77_225);
vect_ui33_68.79_248 = { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0,
1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #22 from Chris Elrod ---
Okay. I did that, and the time went from about 4.25 microseconds down to 4.0
microseconds. So that is an improvement, but accounts for only a small part of
the difference with the LLVM-compilers.
-O3 -fno-mat
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #20 from Chris Elrod ---
To add a little more:
I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in
Julia. Without adding a Newton step, the answers are wrong beyond just a couple
significant digits.
With the N
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #19 from Chris Elrod ---
To add a little more:
I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in
Julia. Without adding a Newton step, the answers are wrong beyond just a couple
significant digits.
With the N
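
The effect of the Newton step mentioned above can be sketched in scalar C++. This is only an illustration under stated assumptions, not code from the bug report: `coarse_rsqrt`, `refine_rsqrt`, and `rel_error` are hypothetical helper names, and `coarse_rsqrt` emulates a roughly 14-bit estimate by truncating mantissa bits rather than issuing `vrsqrt14ps` itself.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a hardware estimate such as vrsqrt14ps:
// compute 1/sqrt(x), then clear the low 9 mantissa bits so that only
// about 14 of the 23 mantissa bits remain accurate.
inline float coarse_rsqrt(float x) {
    float y = 1.0f / std::sqrt(x);
    std::uint32_t bits;
    std::memcpy(&bits, &y, sizeof bits);
    bits &= ~((1u << 9) - 1u);
    std::memcpy(&y, &bits, sizeof y);
    return y;
}

// One Newton-Raphson step for f(y) = 1/y^2 - x, which simplifies to
// y_new = y * (1.5 - 0.5 * x * y^2).
inline float refine_rsqrt(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
}

// Relative error of y against a double-precision 1/sqrt(x).
inline double rel_error(float x, float y) {
    const double exact = 1.0 / std::sqrt(static_cast<double>(x));
    return std::fabs(static_cast<double>(y) - exact) / exact;
}
```

Comparing `rel_error(x, coarse_rsqrt(x))` with `rel_error(x, refine_rsqrt(x, coarse_rsqrt(x)))` shows the point: the raw estimate is good to only a few significant digits, while a single step brings it close to single-precision accuracy.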
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #18 from Chris Elrod ---
I can confirm that the inlined packing does allow gfortran to vectorize the
loop. So allowing packing to inline does seem (to me) like an optimization well
worth making.
However, performance seems to be ab
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #14 from Chris Elrod ---
It's not really reproducible across runs:
$ time ./gfortvectests
Transpose benchmark completed in 22.7010765
SIMD benchmark completed in 1.37529969
All are equal: F
All are approximately equa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #12 from Chris Elrod ---
Created attachment 45363
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45363&action=edit
Fortran program for running benchmarks.
Okay, thank you.
I attached a Fortran program you can run to benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #10 from Chris Elrod ---
(In reply to Thomas Koenig from comment #9)
> Hm.
>
> It would help if your benchmark was complete, so I could run it.
>
I don't suppose you happen to have and be familiar with Julia? If you (or
someone els
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #8 from Chris Elrod ---
Created attachment 45358
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45358&action=edit
gfortran compiled assembly for the transposed version of the original code.
Here is the assembly for the loop body
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #7 from Chris Elrod ---
Created attachment 45357
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45357&action=edit
Assembly generated by Flang compiler on the original version of the code.
This is the main loop body in the Flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #6 from Chris Elrod ---
Created attachment 45356
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45356&action=edit
Code to demonstrate that transposing makes things slower.
Thomas Koenig, I am well aware that Fortran is column m
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #3 from Chris Elrod ---
Created attachment 45353
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45353&action=edit
g++ assembly output
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #2 from Chris Elrod ---
Created attachment 45352
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45352&action=edit
gfortran assembly output
Component: fortran
Assignee: unassigned at gcc dot gnu.org
Reporter: elrodc at gmail dot com
Target Milestone: ---
Created attachment 45350
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45350&action=edit
Fortran version of vectorization test.
I am attaching Fortra
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #1 from Chris Elrod ---
Created attachment 45351
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45351&action=edit
C++ version of the vectorization test case.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57992
--- Comment #4 from Chris Elrod ---
Created attachment 45016
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45016&action=edit
Assembly from compiling gfortran_internal_pack_test.f90
The code takes in sets of 3-length vectors and 3x3 symmet
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57992
Chris Elrod changed:
What|Removed |Added
CC||elrodc at gmail dot com
--- Comment #3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625
--- Comment #7 from Chris Elrod ---
(In reply to Chris Elrod from comment #6)
> However, for column 23 (2944/128 = 23) with -O3 and column 25 for -O2 of the
> 32 columns of A
Correction: it was the 16x13 version that used stack data after loadin
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625
--- Comment #6 from Chris Elrod ---
(In reply to Richard Biener from comment #3)
> If you see spilling on the manually unrolled loop register pressure is
> somehow an issue.
In the matmul kernel:
D = A * X
where D is 16x14, A is 16xN, and X is N
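
For reference, the shape of such a kernel can be sketched as a plain scalar loop nest. This is a hypothetical reference version, not the attached code: the name `matmul_kernel` and the column-major layout are assumptions, and X is taken to be N x 14 because the shapes of D and A imply it (the line above is cut off in the archive).

```cpp
#include <cstddef>

// Assumed tile shape: D is M x P, A is M x N, X is N x P (all column-major).
constexpr std::size_t M = 16;
constexpr std::size_t P = 14;

// Hypothetical scalar reference kernel computing D = A * X.
void matmul_kernel(double* D, const double* A, const double* X, std::size_t N) {
    // Accumulators for the whole D tile. A vectorizing compiler would
    // ideally keep these in registers for the entire loop over n.
    double acc[M * P] = {};
    for (std::size_t n = 0; n < N; ++n) {
        for (std::size_t p = 0; p < P; ++p) {
            const double x = X[n + p * N];  // one element of X, reused for 16 rows
            for (std::size_t m = 0; m < M; ++m) {
                acc[m + p * M] += A[m + n * M] * x;
            }
        }
    }
    // Store the finished tile once, after the reduction over n.
    for (std::size_t i = 0; i < M * P; ++i) {
        D[i] = acc[i];
    }
}
```

With 8 doubles per 512-bit register, a 16 x 14 tile of D corresponds to 2 x 14 = 28 accumulator registers out of the 32 zmm registers, which leaves little headroom for the loads from A and the broadcasts from X.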
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625
--- Comment #5 from Chris Elrod ---
Created attachment 44424
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44424&action=edit
Smaller avx512 kernel that still spills into the stack
This generated 18 total `vmovapd` (I think there'd ideally
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625
--- Comment #4 from Chris Elrod ---
Created attachment 44423
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44423&action=edit
8x16 * 16x6 kernel for avx2.
Here is a scaled down version to reproduce most of the problem for
avx2-capable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625
--- Comment #2 from Chris Elrod ---
Created attachment 44418
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44418&action=edit
Code to reproduce slow vectorization pattern and unnecessary loads & stores
(Sorry if this goes to the bottom ins
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: elrodc at gmail dot com
Target Milestone: ---
I wasn't sure where to put this.
I posted in the Fortran gcc mailing l