[Bug target/94135] New: PPC: subfic instead of neg used for rotate right
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94135

Bug ID: 94135
Summary: PPC: subfic instead of neg used for rotate right
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

unsigned int rotr32(unsigned int v, unsigned int r)
{
    return (v>>r)|(v<<(32-r));
}

unsigned long long rotr64(unsigned long long v, unsigned long long r)
{
    return (v>>r)|(v<<(64-r));
}

Command line: gcc -O2 -save-temps rotr.C

Output:

_Z6rotr32jj:
.LFB0:
    .cfi_startproc
    subfic 4,4,32
    rotlw 3,3,4
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0
    .cfi_endproc
_Z6rotr64yy:
.LFB1:
    .cfi_startproc
    subfic 4,4,64
    rotld 3,3,4
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0
    .cfi_endproc

subfic is a 2-cycle instruction, but it can be replaced by the 1-cycle instruction neg:

rotr32(v,r) = rotl32(v,32-r) = rotl32(v,(32-r)%32) = rotl32(v,(-r)%32) = rotl32(v,-r)

as long as you have a modulo rotate like rotlw/rlwnm. The same holds for 64-bit.
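As a sanity check of the identity chain above, a minimal sketch (not part of the report; rotl32_mod is a hypothetical helper) that verifies rotating left by -r equals rotating right by r once the count is reduced modulo 32, as rlwnm does:

#include <assert.h>
#include <stdint.h>

static uint32_t rotl32_mod(uint32_t v, uint32_t r)
{
    r &= 31;                      /* modulo rotate, as rlwnm masks the count */
    return r ? (v << r) | (v >> (32 - r)) : v;
}

int main(void)
{
    uint32_t v = 0x12345678u;
    for (uint32_t r = 1; r < 32; r++) {
        uint32_t rotr = (v >> r) | (v << (32 - r));
        assert(rotl32_mod(v, (uint32_t)-r) == rotr);  /* rotl by -r == rotr by r */
    }
    return 0;
}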
[Bug target/94135] PPC: subfic instead of neg used for rotate right
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94135

--- Comment #2 from Jens Seifert ---
From the POWER8 Processor User's Manual for the Single-Chip Module:

addi addis add add. subf subf. addic subfic adde addme subfme addze. subfze neg neg. nego
    1 - 2 cycles (GPR), 2 cycles (XER), 5 cycles (CR)
    6/cycle, 2/cycle (with XER or CR updates)

CA is part of XER. So: 1-2 cycles versus 2 cycles.
[Bug target/94135] PPC: subfic instead of neg used for rotate right
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94135

--- Comment #4 from Jens Seifert ---
Setting CA in XER increases issue-to-issue latency by 1 on Power8; see Table 10-14, "Issue-to-Issue Latencies". In addition, setting CA restricts instruction reordering.
[Bug c++/94297] New: std::replace internal compiler error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94297

Bug ID: 94297
Summary: std::replace internal compiler error
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

#include <algorithm>
#include <string>

void patch(std::string& s)
{
    std::replace(s.begin(), s.end(), '.', '-');
}

gcc replace.C

In file included from /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/uniform_int_dist.h:35,
                 from /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_algo.h:66,
                 from /opt/rh/devtoolset-8/root/usr/include/c++/8/algorithm:62,
                 from replace.C:1:
/opt/rh/devtoolset-8/root/usr/include/c++/8/limits:1677:7: internal compiler error: Segmentation fault
   max() _GLIBCXX_USE_NOEXCEPT { return __DBL_MAX__; }
       ^~~
Please submit a full bug report, with preprocessed source if appropriate.
See <http://bugzilla.redhat.com/bugzilla> for instructions.
Preprocessed source stored into /tmp/ccFTVYLT.out file, please attach this to your bugreport.
[Bug target/94297] PPCLE std::replace internal compiler error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94297

Jens Seifert changed:
  What      |Removed                  |Added
  Summary   |std::replace internal    |PPCLE std::replace internal
            |compiler error           |compiler error
  Component |c++                      |target
  CC        |                         |wschmidt at gcc dot gnu.org

--- Comment #1 from Jens Seifert ---
The same code compiles on x86 with the same version of the compiler.
[Bug target/94297] PPCLE std::replace internal compiler error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94297

--- Comment #3 from Jens Seifert ---
Created attachment 48110
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48110&action=edit
Pre-processed file created using -save-temps
[Bug target/94297] PPCLE std::replace internal compiler error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94297

--- Comment #5 from Jens Seifert ---
No options. Same failure with -O2. System is RHEL 7.5.

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-8/root/usr/libexec/gcc/ppc64le-redhat-linux/8/lto-wrapper
Target: ppc64le-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-8/root/usr --mandir=/opt/rh/devtoolset-8/root/usr/share/man --infodir=/opt/rh/devtoolset-8/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-targets=powerpcle-linux --disable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-8.3.1-20190311/obj-ppc64le-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --enable-secureplt --with-long-double-128 --with-cpu-32=power8 --with-tune-32=power8 --with-cpu-64=power8 --with-tune-64=power8 --build=ppc64le-redhat-linux
Thread model: posix
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)

No error with:
gcc -std=gnu++98 replace.C
gcc -std=gnu++03 replace.C

Error with:
gcc -std=gnu++11 replace.C
gcc -std=gnu++17 replace.C
[Bug target/94519] New: PPC: ICE: Segmentation fault on -DBL_MAX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94519

Bug ID: 94519
Summary: PPC: ICE: Segmentation fault on -DBL_MAX
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

#include <float.h>
static const double dsmall[] = { -DBL_MAX };

gcc ccerr.C

ccerr.C:3:1: internal compiler error: Segmentation fault
 static const double dsmall[] = { -DBL_MAX };
 ^~
Please submit a full bug report, with preprocessed source if appropriate.
See <http://bugzilla.redhat.com/bugzilla> for instructions.
Preprocessed source stored into /tmp/cc3rsmv0.out file, please attach this to your bugreport.
[Bug target/94519] PPC: ICE: Segmentation fault on -DBL_MAX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94519

Jens Seifert changed:
  What        |Removed     |Added
  Status      |RESOLVED    |CLOSED

--- Comment #2 from Jens Seifert ---
Environment issue: gcc was compiled with GMP 6.0.0 support but is picking up GMP 5.0.1.
[Bug target/94297] PPCLE std::replace internal compiler error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94297

Jens Seifert changed:
  What        |Removed      |Added
  Status      |UNCONFIRMED  |RESOLVED
  Resolution  |---          |FIXED

--- Comment #7 from Jens Seifert ---
Too old libgmp got picked up. Setting LD_LIBRARY_PATH=/lib64 solved the issue.
[Bug target/94297] PPCLE std::replace internal compiler error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94297

--- Comment #8 from Jens Seifert ---
Too old libgmp got picked up. Setting LD_LIBRARY_PATH=/lib64 solved the issue.
[Bug target/94297] PPCLE std::replace internal compiler error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94297

Jens Seifert changed:
  What        |Removed     |Added
  Status      |RESOLVED    |CLOSED

--- Comment #9 from Jens Seifert ---
Too old libgmp got picked up. Setting LD_LIBRARY_PATH=/lib64 solved the issue.
[Bug target/95704] New: PPC: int128 shifts should be implemented branchless
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

Bug ID: 95704
Summary: PPC: int128 shifts should be implemented branchless
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Created attachment 48741
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48741&action=edit
input with branchless 128-bit shifts

PowerPC processors don't like branches, and branch mispredicts lead to large overhead. Shift left/right of unsigned __int128 can be implemented in 8 instructions which can be processed on 2 pipelines almost in parallel, leading to ~5 cycle latency on Power7 and Power8. Shift right algebraic of __int128 can be implemented in 10 instructions. Overall this is comparable to the latency of the branching code. The attached file contains the branchless implementations in C. I know that this relies on undefined behavior, but the resulting assembly is the interesting part. The rldicl 8,5,0,32 at the beginning of the routines is also unnecessary.
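The attachment is not reproduced in this digest; the following is a hypothetical reconstruction of the branchless shift-left, matching the assembly the reporter shows in comment #3 below. As the report notes, it deliberately relies on PowerPC sld/srd semantics (counts are taken from the low 7 bits and counts of 64..127 yield 0), which is undefined behavior in portable C:

typedef unsigned long long u64;

/* Shift left of a 128-bit value by sh, 0 <= sh < 128.
   UB in portable C for sh == 0 and sh >= 64; well-defined with PPC sld/srd. */
static unsigned __int128 shl_branch_less(unsigned __int128 v, u64 sh)
{
    u64 lo = (u64)v;
    u64 hi = (u64)(v >> 64);
    u64 hi2 = (hi << sh)            /* sld 4,4,5                           */
            | (lo >> (64 - sh))     /* srd 9,3,9   (0 when sh == 0 on PPC) */
            | (lo << (sh - 64));    /* sld 10,3,10 (0 when sh < 64 on PPC) */
    u64 lo2 = lo << sh;             /* sld 3,3,5   (0 when sh >= 64 on PPC) */
    return ((unsigned __int128)hi2 << 64) | lo2;
}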
[Bug target/95704] PPC: int128 shifts should be implemented branchless
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

--- Comment #1 from Jens Seifert ---
Created attachment 48742
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48742&action=edit
assembly
[Bug target/95704] PPC: int128 shifts should be implemented branchless
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

--- Comment #3 from Jens Seifert ---
GCC 8.3 generates:

_Z3shloy:
.LFB0:
    .cfi_startproc
    addi 9,5,-64
    cmpwi 7,9,0
    blt 7,.L2
    sld 4,3,9
    li 3,0
    blr
    .p2align 4,,15
.L2:
    srdi 9,3,1
    subfic 10,5,63
    sld 4,4,5
    srd 9,9,10
    sld 3,3,5
    or 4,9,4
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0
    .cfi_endproc

8 instructions if taking .L2. The branch-free code I proposed:

_Z15shl_branch_lessoy:
.LFB1:
    .cfi_startproc
    rldicl 5,5,0,32
    subfic 9,5,64
    addi 10,5,-64
    sld 10,3,10
    srd 9,3,9
    sld 4,4,5
    or 9,9,10
    or 4,9,4
    sld 3,3,5
    blr

8 instructions, no branch (not counting the rldicl). Almost everything can be executed in parallel. The rldicl 5,5,0,32 gets added by gcc and is not necessary.
[Bug target/95704] PPC: int128 shifts should be implemented branchless
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

--- Comment #5 from Jens Seifert ---
The Power9 code is branch-free but not good at all:

_Z3shloy:
.LFB0:
    .cfi_startproc
    addi 8,5,-64
    subfic 6,5,63
    srdi 10,3,1
    li 7,0
    sld 4,4,5
    sld 5,3,5
    cmpwi 7,8,0
    srd 10,10,6
    sld 3,3,8
    or 4,10,4
    isel 5,5,7,28
    isel 4,4,3,28
    mr 3,5
    blr

13 instructions.
[Bug target/95737] New: PPC: Unnecessary extsw after negative less than
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95737

Bug ID: 95737
Summary: PPC: Unnecessary extsw after negative less than
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

unsigned long long negativeLessThan(unsigned long long a, unsigned long long b)
{
    return -(a < b);
}

gcc -m64 -O2 -save-temps negativeLessThan.C

creates:

_Z16negativeLessThanyy:
.LFB0:
    .cfi_startproc
    subfc 4,4,3
    subfe 3,3,3
    extsw 3,3
    blr

The extsw is not necessary.
[Bug target/95737] PPC: Unnecessary extsw after negative less than
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95737

Jens Seifert changed:
  What        |Removed      |Added
  Status      |RESOLVED     |UNCONFIRMED
  Resolution  |DUPLICATE    |---

--- Comment #3 from Jens Seifert ---
This is different, as the extsw also happens if the result gets used, e.g. followed by an andc, which is my case. I obviously oversimplified the sample. It has nothing to do with the function result and ABI requirements. gcc assumes that the result of -(a < b) implemented by subfc, subfe is signed 32-bit, but the result is already 64-bit.

unsigned long long branchlesconditional(unsigned long long a, unsigned long long b, unsigned long long c)
{
    unsigned long long mask = -(a < b);
    return c & ~mask;
}

results in:

_Z20branchlesconditionalyyy:
.LFB1:
    .cfi_startproc
    subfc 4,4,3
    subfe 3,3,3
    not 3,3
    extsw 3,3
    and 3,3,5
    blr

expected:

subfc
subfe
andc
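For context, a minimal portable-C model (my sketch, not from the report; selectIfGE is a hypothetical name) of why no sign extension is needed: the borrow-based mask is already all 64 bits wide, so the select only needs an andc:

#include <assert.h>

/* mask = -(a < b) is 0 or ~0ULL across all 64 bits; subfc/subfe compute
   exactly this via the carry, so a following extsw adds nothing. */
static unsigned long long selectIfGE(unsigned long long a,
                                     unsigned long long b,
                                     unsigned long long c)
{
    unsigned long long mask = 0ULL - (a < b);
    return c & ~mask;   /* expected: subfc, subfe, andc */
}

int main(void)
{
    assert(selectIfGE(1, 2, 0x1234) == 0);       /* a < b: mask all ones */
    assert(selectIfGE(2, 1, 0x1234) == 0x1234);  /* a >= b: mask zero */
    return 0;
}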
[Bug c++/93012] New: PPC: inefficient 64-bit constant generation (upper = lower)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93012

Bug ID: 93012
Summary: PPC: inefficient 64-bit constant generation (upper = lower)
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

unsigned long long msk66()
{
    return 0x6666666666666666ULL;
}

gcc -maix64 -O2 const.C -save-temps

Output:

._Z5msk66v:
LFB..0:
    lis 3,0x6666
    ori 3,3,0x6666
    sldi 3,3,32
    oris 3,3,0x6666
    ori 3,3,0x6666
    blr

Any 64-bit constant whose upper 32 bits match its lower 32 bits can be created using 3 instructions: construct the 32-bit lower part, then use rldimi to duplicate it into the upper part of the register. Sample:

lis 3, 26214
ori 3, 3, 26214
rldimi 3, 3, 32, 0
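The same upper=lower splat written out in portable C (my illustration; splat32 is a hypothetical helper):

/* Build a 64-bit constant whose two 32-bit halves are equal by
   constructing the low half and OR-ing in a copy shifted up by 32,
   which maps to lis/ori + rldimi on PowerPC. */
static unsigned long long splat32(unsigned int lo32)
{
    unsigned long long v = lo32;
    return v | (v << 32);   /* rldimi 3,3,32,0 */
}

/* splat32(0x66666666u) == 0x6666666666666666ULL */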
[Bug c++/93013] New: PPC: optimization around modulo leads to incorrect result
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93013

Bug ID: 93013
Summary: PPC: optimization around modulo leads to incorrect result
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

int mod(int x, int y, int &z)
{
    z = x % y;
    if (y == 0) { // division by zero
        return 1;
    }
    else if (y == -1) { // gcc removes this branch, which leads to wrong results for -2^31 % -1
        z = 0;
    }
    return 0;
}

gcc -maix64 -O2 modulo.C -save-temps

Output:

._Z3modiiRi:
LFB..0:
    divw 9,3,4
    mr 10,3
    cntlzw 3,4
    srwi 3,3,5
    mullw 9,9,4
    subf 9,9,10
    stw 9,0(5)
    blr

For input x=-2^31, y=-1, the result is expected to be 0. As modulo is emulated using x-(x/y)*y, and x/y for x=-2^31, y=-1 is undefined on PowerPC, the result is incorrect if the branch gets optimized away.
[Bug tree-optimization/93013] PPC: optimization around modulo leads to incorrect result
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93013

--- Comment #7 from Jens Seifert ---
The modulo at the beginning was done for optimization purposes. As the divide takes long and the special cases are extreme edge cases, it is wise to execute the divide as early as possible on PPC, since divide on PPC does not produce signals on bad input. Thank you for citing the C standard regarding another undefined behavior section.
[Bug c++/93123] New: Lacking basic optimization around __int128 bitwise operations against constants
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93123

Bug ID: 93123
Summary: Lacking basic optimization around __int128 bitwise operations against constants
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

unsigned __int128 and128WithConst(unsigned __int128 a)
{
    unsigned __int128 c128 = (((unsigned __int128)(~0ULL)) << 64) | ((unsigned __int128)(~0xFULL));
    return a & c128;
}

gcc -O2 -maix64 -save-temps int128.C

Output:

._Z12andWithConsto:
LFB..0:
    li 10,-1
    li 11,-16
    and 3,3,10
    and 4,4,11
    blr

Expected result: a single instruction, rldicr on the low part (register 4). Bitwise AND with 0xFF..FF is a no-op; bitwise AND with 0xFF..F0 can be done using rldicr.
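Decomposed into 64-bit halves (my illustration, not from the report), it is clear that only the low doubleword needs any work:

/* The constant's high doubleword is ~0ULL (AND is a no-op) and its low
   doubleword is ~0xFULL (clear the low 4 bits, one rldicr on PowerPC). */
static unsigned __int128 and128WithConst_halves(unsigned __int128 a)
{
    unsigned long long hi = (unsigned long long)(a >> 64);   /* unchanged */
    unsigned long long lo = (unsigned long long)a & ~0xFULL; /* rldicr 4,4,0,59 */
    return ((unsigned __int128)hi << 64) | lo;
}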
[Bug target/93126] New: PPC altivec -Wunused-but-set-variable false positive
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93126

Bug ID: 93126
Summary: PPC altivec -Wunused-but-set-variable false positive
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

#include <altivec.h>

double vmax(double a, double b)
{
#ifdef _BIG_ENDIAN
    const long PREF = 0;
#else
    const long PREF = 1;
#endif
    vector double va = vec_promote(a,PREF);
    vector double vb = vec_promote(b,PREF);
    return vec_extract(vec_max(va, vb), PREF);
}

gcc -O2 -maix64 -mcpu=power7 -maltivec -Wall

warn.C: In function 'double vmax(double, double)':
warn.C:6:15: warning: variable 'PREF' set but not used [-Wunused-but-set-variable]
     const long PREF = 0;
               ^~~~
[Bug target/93127] PPC altivec vec_promote creates unnecessary xxpermdi instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93127

Jens Seifert changed:
  What        |Removed     |Added
  Target      |            |powerpc-*-*-*

--- Comment #1 from Jens Seifert ---
Command line: gcc -O2 -maix64 -mcpu=power7 -maltivec warn.C -save-temps
[Bug target/93127] New: PPC altivec vec_promote creates unnecessary xxpermdi instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93127

Bug ID: 93127
Summary: PPC altivec vec_promote creates unnecessary xxpermdi instruction
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

vec_promote can leave half of the register undefined and therefore should not issue an extra instruction.

Input:

#include <altivec.h>

double vmax(double a, double b)
{
#ifdef _BIG_ENDIAN
    const int PREF = 0;
#else
    const int PREF = 1;
#endif
    vector double va = vec_promote(a,PREF);
    vector double vb = vec_promote(b,PREF);
    return vec_extract(vec_max(va, vb), PREF);
}

Output:

._Z4vmaxdd:
LFB..0:
    xxpermdi 2,2,2,0
    xxpermdi 1,1,1,0
    xvmaxdp 1,1,2
    # vec_extract to same register
    blr

Both xxpermdi are unnecessary.
[Bug target/93128] New: PPC small floating point constants can be constructed using vector operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93128

Bug ID: 93128
Summary: PPC small floating point constants can be constructed using vector operations
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

#include <altivec.h>

double d2() { return 2.0; }
vector double v2() { return vec_splats(2.0); }

gcc -O2 -maix64 -mcpu=power7 -maltivec const.C

gcc uses a load from the constant area. A better alternative for "integer" values -15.0..+16.0:

vspltisw 0,<n>
xvcvsxwdp 1,32

0.0 already gets constructed using xxlxor, which is great. Similar things can be done for

vector float v2f() { return vec_splats(2.0f); }
[Bug target/93129] New: PPC memset not using vector instruction on >= Power8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93129

Bug ID: 93129
Summary: PPC memset not using vector instruction on >= Power8
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

void memclear16(char *p) { memset(p, 0, 16); }
void memFF16(char *p) { memset(p, 0xFF, 16); }

Output:

._Z10memclear16Pc:
LFB..0:
    li 9,0
    std 9,0(3)
    std 9,8(3)
    blr
._Z7memFF16Pc:
LFB..1:
    mflr 0
    li 5,16
    li 4,255
    std 0,16(1)
    stdu 1,-112(1)
LCFI..0:
    bl .memset
    nop
    addi 1,1,112
LCFI..1:
    ld 0,16(1)
    mtlr 0
LCFI..2:
    blr

Expected output: vspltisb + vector store. Unaligned vector stores only perform well on >= Power8, so vector stores should be used only on >= Power8; on Power7 you would need to know the pointer is at least 8-byte aligned. vspltisb has the -16..+15 limit. On Power9 a splat for 0..255 exists.
[Bug target/93128] PPC small floating point constants can be constructed using vector operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93128

--- Comment #1 from Jens Seifert ---
Wrong number range given above; for Power7 it is -16..15.
[Bug target/93130] New: PPC simple memset not inlined
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93130

Bug ID: 93130
Summary: PPC simple memset not inlined
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

void memspace16(char *p) { memset(p, ' ', 16); }

Expected result:

li 4,0x2020
rldimi 4,4,16,0
rldimi 4,4,32,0
std 4,0(3)

Splatting the memset input to 64-bit can be done using li + 2x rldimi. But also

._Z13memspace16OptPc:
LFB..3:
    lis 9,0x2020
    ori 9,9,0x2020
    sldi 9,9,32
    oris 9,9,0x2020
    ori 9,9,0x2020
    std 9,0(3)
    std 9,8(3)
    blr

would perform better than the function call to memset.
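The splat step in portable C (my sketch; splat8 is a hypothetical helper): replicate the fill byte across 64 bits with shift-and-OR doublings, which is what the li + 2x rldimi sequence implements:

#include <stdint.h>

/* Replicate an 8-bit fill value across a 64-bit word:
   0x20 -> 0x2020 -> 0x20202020 -> 0x2020202020202020. */
static uint64_t splat8(uint8_t c)
{
    uint64_t v = c;
    v |= v << 8;    /* covered by li 4,0x2020 for a known byte */
    v |= v << 16;   /* rldimi 4,4,16,0 */
    v |= v << 32;   /* rldimi 4,4,32,0 */
    return v;
}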
[Bug target/93176] New: PPC: inefficient 64-bit constant consecutive ones
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93176

Bug ID: 93176
Summary: PPC: inefficient 64-bit constant consecutive ones
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

All 64-bit constants containing a sequence of consecutive ones can be constructed with 2 instructions (li/lis + rldicl). gcc creates up to 5 instructions.

Input:

unsigned long long onesLI()
{
    return 0x00FFFFFFFFFFFF00ULL; // expected: li 3,0xFF00 ; rldicl 3,3,0,8
}
unsigned long long onesLIS()
{
    return 0x00FFFFFFFF000000ULL; // expected: lis 3,0xFF00 ; rldicl 3,3,0,8
}
unsigned long long onesHI()
{
    return 0x00FFFF0000000000ULL; // expected: lis 3,0x ; rldicl 3,3,8,8
}

Command line: gcc -O2 -maix64 -save-temps const.C

Output:

._Z6onesLIv:
LFB..2:
    lis 3,0xff
    ori 3,3,0xffff
    sldi 3,3,32
    oris 3,3,0xffff
    ori 3,3,0xff00
    blr
._Z7onesLISv:
LFB..3:
    lis 3,0xff
    ori 3,3,0xffff
    sldi 3,3,32
    oris 3,3,0xff00
    blr
._Z6onesHIv:
LFB..4:
    lis 3,0xff
    ori 3,3,0xff00
    sldi 3,3,32
    blr
[Bug target/93178] New: PPC: inefficient 64-bit constant generation if msb is off in low 16 bit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93178

Bug ID: 93178
Summary: PPC: inefficient 64-bit constant generation if msb is off in low 16 bit
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

unsigned long long hi16msbon_low16msboff()
{
    return 0x87654321ULL; // expected: li 3,0x4321 ; oris 3,0x8765
}

Command line: gcc -O2 -maix64 -save-temps const.C

Output:

._Z21hi16msbon_low16msboffv:
LFB..1:
    lis 3,0x8765
    ori 3,3,0x4321
    rldicl 3,3,0,32
    blr
[Bug c++/93448] New: PPC: missing builtin for DFP quantize(dqua,dquai,dquaq,dquaiq)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93448

Bug ID: 93448
Summary: PPC: missing builtin for DFP quantize(dqua,dquai,dquaq,dquaiq)
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

I am currently porting an application to PPCLE and found that I am lacking compiler builtins for decimal floating point quantize on _Decimal128/_Decimal64. Any plans to add them? Any workarounds? E.g. did I miss an inline asm constraint for a _Decimal128 register even/odd pair?
[Bug target/93449] New: PPC: Missing conversion builtin from vector to _Decimal128 and vice versa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93449

Bug ID: 93449
Summary: PPC: Missing conversion builtin from vector to _Decimal128 and vice versa
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

I am currently porting an application from AIX to PPCLE and found that I am lacking compiler builtins for transforming vector input into _Decimal128 and vice versa. This can be done on PowerPC using xxlor + xxpermdi (2 instructions on >= Power7). The conversion routines __builtin_denbcdq/__builtin_ddedpdq are based on _Decimal128 input and output. The missing piece is vector to _Decimal128 and vice versa.

Background: From and to vector registers I can load/store variable-length BCD decimals. A BCD decimal can be converted to _Decimal128; then I can perform multiply and divide; then it can be converted back to BCD. But then again I want to store a BCD decimal, which might not be 16 bytes in size.
[Bug target/93453] New: PPC: rldimi not taken into account to avoid shift+or
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93453

Bug ID: 93453
Summary: PPC: rldimi not taken into account to avoid shift+or
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

2 samples:

unsigned long long load8r(unsigned long long *in)
{
    return __builtin_bswap64(*in);
}
unsigned long long rldimi(unsigned int hi, unsigned int lo)
{
    return (((unsigned long long)hi) << 32) | ((unsigned long long)lo);
}

Command line: gcc -maix64 -mcpu=power6 -save-temps -O2 rldimi.C

Even when the number ranges are known not to cause conflicts, shift+or does not get replaced by rldimi.

Output:

._Z6load8rPy:
LFB..0:
    addi 9,3,4
    lwbrx 3,0,3
    lwbrx 10,0,9
    sldi 10,10,32
    or 3,3,10
    blr
._Z6rldimijj:
LFB..1:
    sldi 3,3,32
    or 3,3,4
    blr
[Bug target/93449] PPC: Missing conversion builtin from vector to _Decimal128 and vice versa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93449

--- Comment #2 from Jens Seifert ---

#include <string.h>

typedef float _Decimal128 __attribute__((mode(TD)));

_Decimal128 bcdtodpd(vector double v)
{
    _Decimal128 res;
    memcpy(&res, &v, sizeof(res));
    res = __builtin_denbcdq(0, res);
    return res;
}

_Decimal128 bcdtodpd_opt(vector double bcd)
{
    _Decimal128 res;
    __asm__ volatile("xxlor 4,%x1,%x1;\n"
                     "xxpermdi 5,%x1,%x1,3;\n"
                     "denbcdq 0,%0,4":"=d"(res):"v"(bcd):"vs36","vs37");
    return res;
}

vector double dpdtobcd(_Decimal128 dpd)
{
    _Decimal128 bcd = __builtin_ddedpdq(0, dpd);
    vector double res;
    memcpy(&res, &bcd, sizeof(res));
    return res;
}

vector double dpdtobcd_opt(_Decimal128 dpd)
{
    vector double res;
    __asm__ volatile("ddedpdq 0,4,%1;\n"
                     "xxpermdi %x0,4,5,0":"=v"(res):"d"(dpd):"vs36","vs37");
    return res;
}

The non-inline-assembly versions show a store/load (very slow). The assembly versions do the conversion from vector to _Decimal128 with the optimal sequence for Power7 and above.
[Bug target/93448] PPC: missing builtin for DFP quantize(dqua,dquai,dquaq,dquaiq)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93448

--- Comment #4 from Jens Seifert ---
The inline asm constraint "d" works. Thank you.
[Bug target/93449] PPC: Missing conversion builtin from vector to _Decimal128 and vice versa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93449

--- Comment #4 from Jens Seifert ---
Power8 has bcdadd, which can only be combined with _Decimal128 if you have some kind of conversion between BCDs stored in vector registers and _Decimal128. On Power9 vec_load_len/vec_store_len can be used to load variable-length BCDs. On Power7/8 I can load variable-length BCDs as well (with more instructions), but overall it is desirable to have the possibility to convert vector to _Decimal128 and vice versa.

I suppose I can survive with inline assembly like that in comment #2. The assembly works for p7-p9 with optimal speed. The memcpy between vector and _Decimal128 is not optimal for -mcpu=power7-9: always a store/load (lacking XNOP), ending up in a load-hit-store issue.
[Bug target/93570] New: PPC: __builtin_mtfsf does not return a value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93570

Bug ID: 93570
Summary: PPC: __builtin_mtfsf does not return a value
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Documentation says:

double __builtin_mtfsf(const int, double)

Not documented in 8.3.0, but somehow works; nevertheless it looks like the prototype is wrong and should be

void __builtin_mtfsf(const int, double)

double mtfsf(double x)
{
    return __builtin_mtfsf(0xFF, x);
}

returns:

flm.C:9:34: error: void value not ignored as it ought to be
     return __builtin_mtfsf(0xFF, x);

Is __builtin_mtfsf returning void? Is it safe to use __builtin_mtfsf with 8.3.0?
[Bug target/93571] New: PPC: fmr gets used instead of faster xxlor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93571

Bug ID: 93571
Summary: PPC: fmr gets used instead of faster xxlor
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

fmr is a 6-cycle instruction on Power8. Why is gcc not using the 2-cycle xxlor instruction?

Input:

double setflm(double x)
{
    double r = __builtin_mffs();
    __builtin_mtfsf(0xFF, x);
    return r;
}

Command line: gcc -maix64 -O2 -save-temps flm.C -mcpu=power8

Output:

._Z6setflmd:
LFB..0:
    mffs 0
    mtfsf 255,1
    fmr 1,0
    blr
[Bug target/70928] Load simple float constants via VSX operations on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70928

Jens Seifert changed:
  What        |Removed     |Added
  CC          |            |jens.seifert at de dot ibm.com

--- Comment #4 from Jens Seifert ---

Values -16.0..+15.0:
vspltisw 0,<n>
xvcvsxwdp 32,32

Values -16.0f..+15.0f:
vspltisw 0,<n>
xvcvsxwsp 32,32

-0.0 / 0x8000000000000000:
xxlxor 32,32,32
xvnabsdp 32,32 (or xvnegdp 32,32)

-0.0f / 0x80000000:
xxlxor 32,32,32
xvnabssp 32,32 (or xvnegsp 32,32)

0x7FFFFFFFFFFFFFFF:
vspltisw 0,-1
xvabsdp 32,32

0x7FFFFFFF:
vspltisw 0,-1
xvabssp 32,32
[Bug target/98020] New: PPC: mfvsrwz+extsw not merged to mtvsrwa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98020

Bug ID: 98020
Summary: PPC: mfvsrwz+extsw not merged to mtvsrwa
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

int extract(vector signed int v)
{
    return v[2];
}

Command line: gcc -mcpu=power8 -maltivec -m64 -O3 -save-temps extract.C

Output:

_Z7extractDv4_i:
.LFB0:
    .cfi_startproc
    mfvsrwz 3,34
    extsw 3,3
    blr
[Bug target/98124] New: Z: Load and test LTDBR instruction not used for comparison against 0.0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98124

Bug ID: 98124
Summary: Z: Load and test LTDBR instruction not used for comparison against 0.0
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

#include <math.h>

double sign(double in)
{
    return in == 0.0 ? 0.0 : copysign(1.0, in);
}

Command line: gcc -m64 -O2 -save-temps copysign.C

Output:

_Z4signd:
.LFB234:
    .cfi_startproc
    larl %r5,.L8
    lzdr %f2
    cdbr %f0,%f2
    je .L6
    ld %f2,.L9-.L8(%r5)
    cpsdr %f0,%f0,%f2
    br %r14

Use of LTDBR expected instead of lzdr %f2 + cdbr %f0,%f2.
[Bug target/98020] PPC: mfvsrwz+extsw not merged to mtvsrwa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98020

Jens Seifert changed:
  What        |Removed     |Added
  Status      |WAITING     |RESOLVED
  Resolution  |---         |INVALID

--- Comment #2 from Jens Seifert ---
I thought they are symmetric.
[Bug target/100693] New: PPC: missing 64-bit addg6s
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100693

Bug ID: 100693
Summary: PPC: missing 64-bit addg6s
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

gcc only provides

unsigned int __builtin_addg6s (unsigned int, unsigned int);

but addg6s is a 64-bit operation. I require

unsigned long long __builtin_addg6s (unsigned long long, unsigned long long);

For now I use inline assembly.
[Bug target/100694] New: PPC: initialization of __int128 is very inefficient
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100694

Bug ID: 100694
Summary: PPC: initialization of __int128 is very inefficient
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Initializing an __int128 from two 64-bit integers is implemented very inefficiently. The most natural code, which works well on all other platforms, generates 2 additional li 0 and 2 additional or instructions.

void test2(unsigned __int128* res, unsigned long long hi, unsigned long long lo)
{
    unsigned __int128 i = hi;
    i <<= 64;
    i |= lo;
    *res = i;
}

_Z5test2Poyy:
.LFB15:
    .cfi_startproc
    li 8,0
    li 11,0
    or 10,5,8
    or 11,11,4
    std 10,0(3)
    std 11,8(3)
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0
    .cfi_endproc

While for the above sample "+" instead of "|" solves the issue, it generates addc+addze in other more complicated scenarios. The ugliest workaround I can think of is what I now use:

void test4(unsigned __int128* res, unsigned long long hi, unsigned long long lo)
{
    union
    {
        unsigned __int128 i;
        struct
        {
            unsigned long long lo;
            unsigned long long hi;
        } s;
    } u;
    u.s.lo = lo;
    u.s.hi = hi;
    *res = u.i;
}

This generates the expected code sequence in all cases I have looked at:

_Z5test4Poyy:
.LFB17:
    .cfi_startproc
    std 5,0(3)
    std 4,8(3)
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0
    .cfi_endproc

Please merge li 0 + or into a nop.
[Bug c/100808] New: PPC: ISA 3.1 builtin documentation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100808

Bug ID: 100808
Summary: PPC: ISA 3.1 builtin documentation
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

https://gcc.gnu.org/onlinedocs/gcc/Basic-PowerPC-Built-in-Functions-Available-on-ISA-3_002e1.html#Basic-PowerPC-Built-in-Functions-Available-on-ISA-3_002e1

Please improve the documentation:
- Avoid the additional "int": unsigned long long int => unsigned long long
- Add missing line breaks between builtins
- Remove semicolons
[Bug c++/100809] New: PPC: __int128 divide/modulo does not use P10 instructions vdivsq/vdivuq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100809

Bug ID: 100809
Summary: PPC: __int128 divide/modulo does not use P10 instructions vdivsq/vdivuq
Product: gcc
Version: 10.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

unsigned __int128 div(unsigned __int128 a, unsigned __int128 b)
{
    return a/b;
}

__int128 div(__int128 a, __int128 b)
{
    return a/b;
}

gcc -mcpu=power10 -save-temps -O2 int128.C

Output:

_Z3divoo:
.LFB0:
    .cfi_startproc
    .localentry _Z3divoo,1
    mflr 0
    std 0,16(1)
    stdu 1,-32(1)
    .cfi_def_cfa_offset 32
    .cfi_offset 65, 16
    bl __udivti3@notoc
    addi 1,1,32
    .cfi_def_cfa_offset 0
    ld 0,16(1)
    mtlr 0
    .cfi_restore 65
    blr
    .long 0
    .byte 0,9,0,1,128,0,0,0
    .cfi_endproc
.LFE0:
    .size _Z3divoo,.-_Z3divoo
    .globl __divti3
    .align 2
    .p2align 4,,15
    .globl _Z3divnn
    .type _Z3divnn, @function
_Z3divnn:
.LFB1:
    .cfi_startproc
    .localentry _Z3divnn,1
    mflr 0
    std 0,16(1)
    stdu 1,-32(1)
    .cfi_def_cfa_offset 32
    .cfi_offset 65, 16
    bl __divti3@notoc
    addi 1,1,32
    .cfi_def_cfa_offset 0
    ld 0,16(1)
    mtlr 0
    .cfi_restore 65
    blr
    .long 0
    .byte 0,9,0,1,128,0,0,0
    .cfi_endproc

Expected is the use of vdivsq/vdivuq.

GCC version:

/opt/rh/devtoolset-10/root/usr/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/opt/rh/devtoolset-10/root/usr/bin/gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-10/root/usr/libexec/gcc/ppc64le-redhat-linux/10/lto-wrapper
Target: ppc64le-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-10/root/usr --mandir=/opt/rh/devtoolset-10/root/usr/share/man --infodir=/opt/rh/devtoolset-10/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-targets=powerpcle-linux --disable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-10.2.1-20200804/obj-ppc64le-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --enable-secureplt --with-long-double-128 --with-cpu-32=power8 --with-tune-32=power8 --with-cpu-64=power8 --with-tune-64=power8 --build=ppc64le-redhat-linux
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.2.1 20200804 (Red Hat 10.2.1-2) (GCC)
[Bug c++/100809] PPC: __int128 divide/modulo does not use P10 instructions vdivsq/vdivuq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100809

--- Comment #1 from Jens Seifert ---
Same applies to modulo.
[Bug c/100808] PPC: ISA 3.1 builtin documentation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100808

--- Comment #1 from Jens Seifert ---
https://gcc.gnu.org/onlinedocs/gcc/PowerPC-AltiVec-Built-in-Functions-Available-on-ISA-3_002e1.html

vector unsigned long long int vec_gnb (vector unsigned __int128, const unsigned char)

should be

unsigned long long int vec_gnb (vector unsigned __int128, const unsigned char)

The vgnb instruction returns its result in a GPR.
[Bug target/100866] New: PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

Bug ID: 100866
Summary: PPC: Inefficient code for vec_revb(vector unsigned short) < P9
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

vector unsigned short revb(vector unsigned short a)
{
    return vec_revb(a);
}

creates:

_Z4revbDv8_t:
.LFB1:
    .cfi_startproc
.LCF1:
0:  addis 2,12,.TOC.-.LCF1@ha
    addi 2,2,.TOC.-.LCF1@l
    .localentry _Z4revbDv8_t,.-_Z4revbDv8_t
    addis 9,2,.LC1@toc@ha
    addi 9,9,.LC1@toc@l
    lvx 0,0,9
    xxlnor 32,32,32
    vperm 2,2,2,0
    blr

Optimal code sequence:

vector unsigned short revb_pwr7(vector unsigned short a)
{
    return vec_rl(a, vec_splats((unsigned short)8));
}

_Z9revb_pwr7Dv8_t:
.LFB2:
    .cfi_startproc
    .localentry _Z9revb_pwr7Dv8_t,1
    vspltish 0,8
    vrlh 2,2,0
    blr
[Bug target/100867] New: z13: Inefficient code for vec_revb(vector unsigned short)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100867

Bug ID: 100867
Summary: z13: Inefficient code for vec_revb(vector unsigned short)
Product: gcc
Version: 10.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

vector unsigned short revb(vector unsigned short a)
{
    return vec_revb(a);
}

Creates:

_Z4revbDv4_j:
.LFB1:
    .cfi_startproc
    larl %r5,.L4
    vl %v0,.L5-.L4(%r5),3
    vperm %v24,%v24,%v24,%v0
    br %r14

Optimal code sequence:

vector unsigned short revb_z13(vector unsigned short a)
{
    return vec_rli(a, 8);
}

Creates:

_Z8revb_z13Dv8_t:
.LFB5:
    .cfi_startproc
    verllh %v24,%v24,8
    br %r14
    .cfi_endproc
[Bug target/100868] New: PPC: Inefficient code for vec_reve(vector double)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100868

Bug ID: 100868
Summary: PPC: Inefficient code for vec_reve(vector double)
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

vector double reve(vector double a)
{
    return vec_reve(a);
}

creates:

_Z4reveDv2_d:
.LFB3:
    .cfi_startproc
.LCF3:
0:  addis 2,12,.TOC.-.LCF3@ha
    addi 2,2,.TOC.-.LCF3@l
    .localentry _Z4reveDv2_d,.-_Z4reveDv2_d
    addis 9,2,.LC2@toc@ha
    addi 9,9,.LC2@toc@l
    lvx 0,0,9
    xxlnor 32,32,32
    vperm 2,2,2,0
    blr

The optimal sequence would be:

vector double reve_pwr7(vector double a)
{
    return vec_xxpermdi(a,a,2);
}

which creates:

_Z9reve_pwr7Dv2_d:
.LFB4:
    .cfi_startproc
    xxpermdi 34,34,34,2
    blr
[Bug target/100869] New: z13: Inefficient code for vec_reve(vector double)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100869

Bug ID: 100869
Summary: z13: Inefficient code for vec_reve(vector double)
Product: gcc
Version: 10.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

vector double reve(vector double a)
{
    return vec_reve(a);
}

creates:

_Z4reveDv2_d:
.LFB3:
    .cfi_startproc
    larl %r5,.L12
    vl %v0,.L13-.L12(%r5),3
    vperm %v24,%v24,%v24,%v0
    br %r14

Optimal code sequence:

vector double reve_z13(vector double a)
{
    return vec_permi(a,a,2);
}

creates:

_Z6reve_2Dv2_d:
.LFB6:
    .cfi_startproc
    vpdi %v24,%v24,%v24,4
    br %r14
    .cfi_endproc
[Bug target/100871] New: z14: vec_doublee maps to wrong builtin in vecintrin.h
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100871

Bug ID: 100871
Summary: z14: vec_doublee maps to wrong builtin in vecintrin.h
Product: gcc
Version: 10.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

#include <vecintrin.h>

vector double doublee(vector float a)
{
    return vec_doublee(a);
}

causes a compile error:

vec.C: In function '__vector(2) double doublee(__vector(4) float)':
vec.C:43:10: error: '__builtin_s390_vfll' was not declared in this scope; did you mean '__builtin_s390_vflls'?
   43 |    return vec_doublee(a);
      |           ^~~~
      |           __builtin_s390_vflls

vec_doublee in vecintrin.h should call __builtin_s390_vflls:

vector double doublee_fix(vector float a)
{
    return __builtin_s390_vflls(a);
}
[Bug target/100808] PPC: ISA 3.1 builtin documentation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100808

--- Comment #3 from Jens Seifert ---
> > - Avoid the additional "int": unsigned long long int => unsigned long long
> Why? Those are exactly the same types!

Yes, but the rest of the documentation uses unsigned long long. This is just for consistency with the existing documentation.
[Bug target/100926] New: PPCLE: Inefficient code for vec_xl_be(unsigned short *) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100926

Bug ID: 100926
Summary: PPCLE: Inefficient code for vec_xl_be(unsigned short *) < P9
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

vector unsigned short load_be(unsigned short *c)
{
    return vec_xl_be(0L, c);
}

creates:

_Z7load_bePt:
.LFB6:
    .cfi_startproc
.LCF6:
0:  addis 2,12,.TOC.-.LCF6@ha
    addi 2,2,.TOC.-.LCF6@l
    .localentry _Z7load_bePt,.-_Z7load_bePt
    addis 9,2,.LC4@toc@ha
    lxvw4x 34,0,3
    addi 9,9,.LC4@toc@l
    lvx 0,0,9
    vperm 2,2,2,0
    blr

Optimal sequence:

vector unsigned short load_be_opt2(unsigned short *c)
{
    vector signed int vneg16;
    __asm__("vspltisw %0,-16":"=v"(vneg16));
    vector unsigned int tmp = vec_xl_be(0L, (unsigned int *)c);
    tmp = vec_rl(tmp, (vector unsigned int)vneg16);
    return (vector unsigned short)tmp;
}

creates:

_Z12load_be_opt2Pt:
.LFB8:
    .cfi_startproc
    lxvw4x 34,0,3
#APP
 # 77 "vec.C" 1
    vspltisw 0,-16
 # 0 "" 2
#NO_APP
    vrlw 2,2,0
    blr

rotate left (-16) = rotate right (+16), as only the low 5 bits of the count are evaluated. Please note that the inline assembly is required because vec_splats(-16) gets converted into a very inefficient constant generation:

vector unsigned short load_be_opt(unsigned short *c)
{
    vector signed int vneg16 = vec_splats(-16);
    vector unsigned int tmp = vec_xl_be(0L, (unsigned int *)c);
    tmp = vec_rl(tmp, (vector unsigned int)vneg16);
    return (vector unsigned short)tmp;
}

creates:

_Z11load_be_optPt:
.LFB7:
    .cfi_startproc
    li 9,48
    lxvw4x 34,0,3
    vspltisw 0,0
    mtvsrd 33,9
    xxspltw 33,33,1
    vsubuwm 0,0,1
    vrlw 2,2,0
    blr
[Bug target/100930] New: PPC: Missing builtins for P9 vextsb2w, vextsb2d, vextsh2w, vextsh2d, vextsw2d
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100930

Bug ID: 100930
Summary: PPC: Missing builtins for P9 vextsb2w, vextsb2d, vextsh2w, vextsh2d, vextsw2d
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Using the same names as xlC would be appreciated: vec_extsbd, vec_extsbw, vec_extshd, vec_extshw, vec_extswd.
[Bug target/101041] New: z13: Inefficient handling of vector register passed to function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101041

Bug ID: 101041
Summary: z13: Inefficient handling of vector register passed to function
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

#include <vecintrin.h>

vector unsigned long long mul64(vector unsigned long long a, vector unsigned long long b)
{
    return a * b;
}

creates:

_Z5mul64Dv2_yS_:
.LFB9:
    .cfi_startproc
    ldgr %f4,%r15
    .cfi_register 15, 18
    lay %r15,-192(%r15)
    .cfi_def_cfa_offset 352
    vst %v24,160(%r15),3
    vst %v26,176(%r15),3
    lg %r2,160(%r15)
    lg %r1,176(%r15)
    lgr %r4,%r2
    lg %r0,168(%r15)
    lgr %r2,%r1
    lg %r1,184(%r15)
    lgr %r5,%r0
    lgr %r3,%r1
    vlvgp %v2,%r4,%r5
    vlvgp %v0,%r2,%r3
    vlgvg %r4,%v2,0
    vlgvg %r1,%v2,1
    vlgvg %r2,%v0,0
    vlgvg %r3,%v0,1
    msgr %r2,%r4
    msgr %r1,%r3
    lgdr %r15,%f4
    .cfi_restore 15
    .cfi_def_cfa_offset 160
    vlvgp %v24,%r2,%r1
    br %r14

It stores v24 and v26 to the stack, then does lg+lgr for all 4 parts, then constructs new vector registers v0 and v2, and then extracts the 4 elements again using vlgvg. Expected: 4 * vlgvg + 2 * msgr + vlvgp.
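A sketch of the expected scalarization (my illustration, using GNU C vector subscripting and initialization rather than any particular builtin; assumes <vecintrin.h> and -mzvector as in the report):

/* Extract both lanes, multiply in GPRs, repack:
   4 x vlgvg + 2 x msgr + 1 x vlvgp. */
static vector unsigned long long mul64_scalarized(vector unsigned long long a,
                                                  vector unsigned long long b)
{
    unsigned long long e0 = a[0] * b[0];  /* vlgvg, vlgvg, msgr */
    unsigned long long e1 = a[1] * b[1];  /* vlgvg, vlgvg, msgr */
    return (vector unsigned long long){e0, e1};  /* vlvgp */
}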
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #7 from Jens Seifert ---
Regarding vec_revb for vector unsigned int: I agree that

revb:
.LFB0:
    .cfi_startproc
    vspltish %v1,8
    vspltisw %v0,-16
    vrlh %v2,%v2,%v1
    vrlw %v2,%v2,%v0
    blr

works. But in this case I would prefer the vperm approach, assuming that the loaded constant for the permute vector can be re-used multiple times. But please get rid of the xxlnor 32,32,32; that does not make sense after loading a constant. Change the constant that needs to be loaded.
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #9 from Jens Seifert ---
I know that if I used the vec_perm builtin as an end user, you would then need to fulfill the LE specification, but you can always optimize the code as you like as long as it creates correct results afterwards. load constant + xxlnor of that constant can always be transformed into loading the inverse constant.
[Bug target/108396] New: PPCLE: vec_vsubcuq missing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108396

Bug ID: 108396
Summary: PPCLE: vec_vsubcuq missing
Product: gcc
Version: 12.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

#include <altivec.h>

vector unsigned __int128 vsubcuq(vector unsigned __int128 a, vector unsigned __int128 b)
{
    return vec_vsubcuq(a, b);
}

Command line: gcc -m64 -O2 -maltivec -mcpu=power8 text.C

Output:

<source>: In function '__vector unsigned __int128 vsubcuq(__vector unsigned __int128, __vector unsigned __int128)':
<source>:6:12: error: 'vec_vsubcuq' was not declared in this scope; did you mean 'vec_vsubcuqP'?
    6 |   return vec_vsubcuq(a, b);
      |          ^~~
      |          vec_vsubcuqP
Compiler returned: 1
[Bug c++/108560] New: __builtin_va_arg_pack_len is documented to return size_t, but actually returns int
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108560

Bug ID: 108560
Summary: __builtin_va_arg_pack_len is documented to return size_t, but actually returns int
Product: gcc
Version: 12.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

#include <stddef.h>

bool test(const char *fmt, size_t numTokens, ...)
{
    return __builtin_va_arg_pack_len() != numTokens;
}

Compiled with -Wsign-compare results in:

<source>: In function 'bool test(const char*, size_t, ...)':
<source>:5:40: warning: comparison of integer expressions of different signedness: 'int' and 'size_t' {aka 'long unsigned int'} [-Wsign-compare]
    5 |   return __builtin_va_arg_pack_len() != numTokens;
      |          ^~~~
<source>:5:37: error: invalid use of '__builtin_va_arg_pack_len ()'
    5 |   return __builtin_va_arg_pack_len() != numTokens;
      |          ~^~
Compiler returned: 1

Documentation: https://gcc.gnu.org/onlinedocs/gcc/Constructing-Calls.html indicates a size_t return type:

Built-in Function: size_t __builtin_va_arg_pack_len ()
[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #4 from Jens Seifert ---
PPCLE with no special option means -mcpu=power8 -maltivec (altivecle to be more precise). vec_promote(, 1) should be a no-op on ppcle, but the value gets splatted to both the left and right part of the vector register => 2 unnecessary xxpermdi. The rest of the operations are done on the left and right parts. vec_extract(, 1) should be a no-op on ppcle, but the value gets taken from the right part of the register, which requires an xxpermdi. Overall 3 unnecessary xxpermdi. I don't know why the right part of the register gets "preferred".
[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #6 from Jens Seifert ---
The left parts of the VSX registers overlap with the floating point registers; that is why no xxpermdi is required and mfvsrd can access all (left) parts of VSX registers directly. The xxpermdi x,y,y,3 indicates to me that gcc prefers the right part of the register, which might also cause the xxpermdi at the beginning. In the end the mystery is why gcc adds 3 xxpermdi to the code.
[Bug target/106525] New: s390: Inefficient branchless conditionals for unsigned long long
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106525

Bug ID: 106525
Summary: s390: Inefficient branchless conditionals for unsigned long long
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Created attachment 53409
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53409&action=edit
source code

1) -(a > b)

clgr %r2,%r3
lhi %r2,0
alcr %r2,%r2
sllg %r2,%r2,63
srag %r2,%r2,63

The last 2 could be merged to LCDFR. But optimal is:

slgrk %r2,%r3,%r2
slbgr %r2,%r2
lgfr %r2,%r2

Note: lgfr is not required => 2 instructions only.

2) -(a <= b)

slgr %r3,%r2
lhi %r2,0
alcr %r2,%r2
sllg %r2,%r2,63
srag %r2,%r2,63

The last 2 could be merged to LCDFR. But optimal is:

clgr %r2,%r3
slbgr %r2,%r2
lgfr %r2,%r2

Note: lgfr is not required => 2 instructions only.

3) unsigned 64-bit compare for qsort: (a > b) - (a < b)

clgr %r2,%r3
lhi %r1,0
alcr %r1,%r1
clgr %r3,%r2
lhi %r2,0
alcr %r2,%r2
srk %r2,%r1,%r2
lgfr %r2,%r2

Optimal:

slgrk %r1,%r2,%r3
slgrk 0,%r3,%r2
slbgr %r2,%r3
slbgr %r1,%r2
lgfr %r2,%r1

Note: lgfr not required => 4 instructions only.
[Bug target/106536] New: P9: gcc does not detect setb pattern
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106536

Bug ID: 106536
Summary: P9: gcc does not detect setb pattern
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

int compare2(unsigned long long a, unsigned long long b)
{
    return (a > b ? 1 : (a < b ? -1 : 0));
}

Output:

_Z8compare2yy:
    cmpld 0,3,4
    bgt 0,.L5
    mfcr 3,128
    rlwinm 3,3,1,1
    neg 3,3
    blr
.L5:
    li 3,1
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0

clang generates:

_Z8compare2yy: # @_Z8compare2yy
    cmpld 3, 4
    setb 3, 0
    extsw 3, 3
    blr
    .long 0
    .quad 0
[Bug target/106592] New: s390: Inefficient branchless conditionals for long long
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106592

Bug ID: 106592
Summary: s390: Inefficient branchless conditionals for long long
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Created attachment 53443
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53443&action=edit
source code

long long gtRef(long long a, long long b)
{
    return a > b;
}

Generates:

cgr %r2,%r3
lghi %r1,0
lghi %r2,1
locgrnh %r2,%r1

Better sequence:

cgr %r2,%r3
lghi %r2,0
alcgr %r2,%r2

long long leMaskRef(long long a, long long b)
{
    return -(a <= b);
}

Generates:

cgr %r2,%r3
lhi %r1,0
lhi %r2,1
locrnle %r2,%r1
sllg %r2,%r2,63
srag %r2,%r2,63

Better sequence:

cgr %r2,%r3
slbgr %r2,%r2

long long gtMaskRef(long long a, long long b)
{
    return -(a > b);
}

Generates:

cgr %r2,%r3
lhi %r1,0
lhi %r2,1
locrnh %r2,%r1
sllg %r2,%r2,63
srag %r2,%r2,63

Better sequence:

cgr %r2,%r3
lghi %r2,0
alcgr %r2,%r2
lcgr %r2,%r2
[Bug target/106598] New: s390: Inefficient branchless conditionals for int
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106598

Bug ID: 106598
Summary: s390: Inefficient branchless conditionals for int
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

int lt(int a, int b)
{
    return a < b;
}

generates:

cr %r2,%r3
lhi %r1,1
lhi %r2,0
locrnl %r1,%r2
lgfr %r2,%r1
br %r14

int ltOpt(int a, int b)
{
    long long x = a;
    long long y = b;
    return ((unsigned long long)(x - y)) >> 63;
}

better:

sgr %r2,%r3
srlg %r2,%r2,63
br %r14

int ltMask(int a, int b)
{
    return -(a < b);
}

generates:

cr %r2,%r3
lhi %r1,1
lhi %r2,0
locrnl %r1,%r2
sllg %r1,%r1,63
srag %r2,%r1,63

int ltMaskOpt(int a, int b)
{
    long long x = a;
    long long y = b;
    return (x - y) >> 63;
}

better:

sgr %r2,%r3
srag %r2,%r2,63
br %r14

int leMask(int a, int b)
{
    return -(a <= b);
}

generates:

cr %r2,%r3
lhi %r1,1
lhi %r2,0
locrnle %r1,%r2
sllg %r1,%r1,63
srag %r2,%r1,63
br %r14

int leMaskOpt(int a, int b)
{
    int c;
    __asm__("cr %1,%2\n\tslbgr %0,%0":"=r"(c):"r"(a),"r"(b):"cc");
    // slbgr creates a 64-bit mask => lgfr would not be required
    return c;
}

better:

cr %r2,%r3
slbgr %r2,%r2
lgfr %r2,%r2 <= not necessary
br %r14

int le(int a, int b)
{
    return a <= b;
}

generates:

cr %r2,%r3
lhi %r1,1
lhi %r2,0
locrnle %r1,%r2
lgfr %r2,%r1
br %r14

int leOpt(int a, int b)
{
    unsigned long long c;
    __asm__("cr %1,%2\n\tslbgr %0,%0":"=r"(c):"r"(a),"r"(b):"cc");
    return (c >> 63);
}

better:

cr %r2,%r3
slbgr %r2,%r2
srlg %r2,%r2,63
br %r14
[Bug target/106701] New: s390: Compiler does not take into account number range limitation to avoid subtract from immediate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106701

Bug ID: 106701
Summary: s390: Compiler does not take into account number range limitation to avoid subtract from immediate
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

unsigned long long subfic(unsigned long long a)
{
    if (a > 15) __builtin_unreachable();
    return 15 - a;
}

With clang on x86, subtract from immediate gets translated to xor:

_Z6subficy: # @_Z6subficy
    mov rax, rdi
    xor rax, 15
    ret

Platforms like s390 and x86, which have no subtract-from-immediate, would benefit from this optimization. gcc currently generates:

_Z6subficy:
    lghi %r1,15
    sgr %r1,%r2
    lgr %r2,%r1
    br %r14
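Why the xor is valid (my sketch, not from the report): when a <= 15, every set bit of a lies inside the mask 15, so the subtraction borrows nothing and 15 - a equals 15 ^ a. A quick exhaustive check:

#include <assert.h>

int main(void)
{
    /* For 0 <= a <= 15 no borrow crosses bit 3, so subtracting from the
       all-ones mask 15 is the same as complementing within the mask. */
    for (unsigned long long a = 0; a <= 15; a++)
        assert(15 - a == (15 ^ a));
    return 0;
}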
[Bug target/106769] New: PPCLE: vec_extract(vector unsigned int) unnecessary rldicl after mfvsrwz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106769

Bug ID: 106769
Summary: PPCLE: vec_extract(vector unsigned int) unnecessary rldicl after mfvsrwz
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

#include <altivec.h>

unsigned int extr(vector unsigned int v)
{
    return vec_extract(v, 2);
}

Generates:

_Z4extrDv4_j:
.LFB1:
    .cfi_startproc
    mfvsrwz 3,34
    rldicl 3,3,0,32
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0
    .cfi_endproc

The rldicl is not necessary, as mfvsrwz already wiped out the upper 32 bits of the register.
[Bug target/106770] New: PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

Bug ID: 106770
Summary: PPCLE: Unnecessary xxpermdi before mfvsrd
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

#include <altivec.h>

int cmp2(double a, double b)
{
    vector double va = vec_promote(a, 1);
    vector double vb = vec_promote(b, 1);
    vector long long vlt = (vector long long)vec_cmplt(va, vb);
    vector long long vgt = (vector long long)vec_cmplt(vb, va);
    vector signed long long vr = vec_sub(vlt, vgt);
    return vec_extract(vr, 1);
}

Generates:

_Z4cmp2dd:
.LFB1:
    .cfi_startproc
    xxpermdi 1,1,1,0
    xxpermdi 2,2,2,0
    xvcmpgtdp 33,2,1
    xvcmpgtdp 32,1,2
    vsubudm 0,1,0
    xxpermdi 0,32,32,3
    mfvsrd 3,0
    extsw 3,3
    blr

The unnecessary xxpermdi for vec_promote are already reported in another bugzilla case. mfvsrd can access all 64 vector registers directly, so the xxpermdi is not required:

mfvsrd 3,32

is expected instead of

xxpermdi 0,32,32,3 + mfvsrd 3,0
[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #1 from Jens Seifert ---
vec_extract(vr, 1) should extract the left element, but xxpermdi x,x,x,3 extracts the right element. Looks like a bug in vec_extract for PPCLE, not a problem of an unnecessary xxpermdi.
[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #2 from Jens Seifert ---
vec_extract(vr, 1) should extract the left element, but xxpermdi x,x,x,3 extracts the right element. Looks like a bug in vec_extract for PPCLE, not a problem of an unnecessary xxpermdi.

Using assembly for the subtract:

int cmp3(double a, double b)
{
    vector double va = vec_promote(a, 0);
    vector double vb = vec_promote(b, 0);
    vector long long vlt = (vector long long)vec_cmplt(va, vb);
    vector long long vgt = (vector long long)vec_cmplt(vb, va);
    vector signed long long vr;
    __asm__ volatile("vsubudm %0,%1,%2":"=v"(vr):"v"(vlt),"v"(vgt):);
    //vector signed long long vr = vec_sub(vlt, vgt);
    return vec_extract(vr, 1);
}

generates:

_Z4cmp3dd:
.LFB2:
    .cfi_startproc
    xxpermdi 1,1,1,0
    xxpermdi 2,2,2,0
    xvcmpgtdp 32,2,1
    xvcmpgtdp 33,1,2
#APP
 # 34 "cmpdouble.C" 1
    vsubudm 0,0,1
 # 0 "" 2
#NO_APP
    mfvsrd 3,32
    extsw 3,3

Looks like the compiler knows about vec_promote doing a splat, and at the end it extracts the non-preferred right element instead of the expected left element.
[Bug target/104268] New: 390: inefficient vec_popcnt for 16-bit for z13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104268 Bug ID: 104268 Summary: 390: inefficient vec_popcnt for 16-bit for z13 Product: gcc Version: 10.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

#include <vecintrin.h>

vector unsigned short popcnt(vector unsigned short a)
{
    return vec_popcnt(a);
}

Generates with -march=z13:

_Z6popcntDv8_t:
.LFB1:
        .cfi_startproc
        vzero   %v0
        vpopct  %v24,%v24,0
        vleib   %v0,8,7
        vsrlb   %v0,%v24,%v0
        vab     %v24,%v24,%v0
        vgbm    %v0,21845
        vn      %v24,%v24,%v0
        br      %r14
        .cfi_endproc

Optimal sequence would be:

vector unsigned short popcnt_opt(vector unsigned short a)
{
    vector unsigned short r = (vector unsigned short)vec_popcnt((vector unsigned char)a);
    vector unsigned short b = vec_rli(r, 8);
    r = r + b;
    r = r >> 8;
    return r;
}

_Z10popcnt_optDv8_t:
.LFB3:
        .cfi_startproc
        vpopct  %v24,%v24,0
        verllh  %v0,%v24,8
        vah     %v24,%v0,%v24
        vesrlh  %v24,%v24,8
        br      %r14
        .cfi_endproc
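A scalar model of the halfword reduction used by popcnt_opt (my illustration, not part of the report):

    #include <stdint.h>

    /* vpopct ...,0 gives per-byte counts; rotating each halfword by 8 and
       adding sums the two byte counts (at most 8 + 8 = 16, so no byte
       overflow); shifting right by 8 moves the sum into place. */
    uint16_t popcnt16_model(uint16_t x)
    {
        uint16_t r = (uint16_t)(__builtin_popcount(x & 0xFF)
                   | (__builtin_popcount(x >> 8) << 8));  /* per-byte counts */
        uint16_t b = (uint16_t)((r << 8) | (r >> 8));     /* verllh r,8 */
        r = (uint16_t)(r + b);                            /* vah */
        return (uint16_t)(r >> 8);                        /* vesrlh 8 */
    }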
[Bug target/103106] New: PPC: Missing builtin for P9 vmsumudm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103106 Bug ID: 103106 Summary: PPC: Missing builtin for P9 vmsumudm Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

I can't find a builtin for the vmsumudm instruction. I also found nothing in the Power vector intrinsic programming reference. https://openpowerfoundation.org/?resource_lib=power-vector-intrinsic-programming-reference
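Until a builtin exists, a hedged inline-asm workaround sketch (the wrapper name msumudm is mine, not a GCC builtin, and it assumes GCC's vector unsigned __int128 type and the "v" register constraint); vmsumudm computes the sum of the two unsigned doubleword products of a and b plus the 128-bit addend c:

    #include <altivec.h>

    static inline vector unsigned __int128
    msumudm(vector unsigned long long a, vector unsigned long long b,
            vector unsigned __int128 c)
    {
        vector unsigned __int128 r;
        __asm__("vmsumudm %0,%1,%2,%3" : "=v"(r) : "v"(a), "v"(b), "v"(c));
        return r;
    }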
[Bug target/103731] New: 390: inefficient 64-bit constant generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103731 Bug ID: 103731 Summary: 390: inefficient 64-bit constant generation Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

unsigned long long M8()
{
    return 0x;
}

Generates:

.LC0:
        .quad 0x
        .text
        .align 8
.globl _Z2M8v
        .type _Z2M8v, @function
_Z2M8v:
.LFB0:
        .cfi_startproc
        lgrl    %r2,.LC0
        br      %r14
        .cfi_endproc

Expected 2 instructions: load immediate + insert immediate (IIHF) instead of the load from the literal pool.
[Bug target/103743] New: PPC: Inefficient equality compare for large 64-bit constants having only 16-bit relevant bits in high part
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103743 Bug ID: 103743 Summary: PPC: Inefficient equality compare for large 64-bit constants having only 16-bit relevant bits in high part Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

int overflow();

int negOverflow(long long in)
{
    if (in == 0x8000000000000000LL)
    {
        return overflow();
    }
    return 0;
}

Generates:

negOverflow(long long):
        .quad .L.negOverflow(long long),.TOC.@tocbase,0
.L.negOverflow(long long):
        li 9,-1
        rldicr 9,9,0,0
        cmpd 0,3,9
        beq 0,.L10
        li 3,0
        blr
.L10:
        mflr 0
        std 0,16(1)
        stdu 1,-112(1)
        bl overflow()
        nop
        addi 1,1,112
        ld 0,16(1)
        mtlr 0
        blr
        .long 0
        .byte 0,9,0,1,128,0,0,0

Instead of:

        li 9,-1
        rldicr 9,9,0,0
        cmpd 0,3,9

expected output:

        rotldi 3,3,1
        cmpdi 0,3,1

This should only be applied if the relevant bits of the constant fit into 16 bits and those 16 bits lie in the upper 32 bits.
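A sketch of the suggested transformation in C (my illustration; overflow() is the extern declared in the report): rotating left by one maps the constant 0x8000000000000000 to 1, which fits the 16-bit signed immediate of cmpdi, and rotation is a bijection, so the compare stays exact.

    int overflow();

    int negOverflow_opt(long long in)
    {
        unsigned long long r = ((unsigned long long)in << 1)
                             | ((unsigned long long)in >> 63);  /* rotldi 3,3,1 */
        if (r == 1)                                             /* cmpdi 0,3,1 */
        {
            return overflow();
        }
        return 0;
    }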
[Bug target/102117] New: s390: Inefficient code for 64x64=128 signed multiply for <= z13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102117 Bug ID: 102117 Summary: s390: Inefficient code for 64x64=128 signed multiply for <= z13 Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

__int128 imul128(long long a, long long b)
{
    return (__int128)a * (__int128)b;
}

creates a sequence with 3 multiplies:

_Z7imul128xx:
.LFB0:
        .cfi_startproc
        ldgr    %f2,%r12
        .cfi_register 12, 17
        ldgr    %f0,%r13
        .cfi_register 13, 16
        lgr     %r13,%r3
        mlgr    %r12,%r4
        srag    %r1,%r3,63
        msgr    %r1,%r4
        srag    %r4,%r4,63
        msgr    %r4,%r3
        agr     %r4,%r1
        agr     %r12,%r4
        stmg    %r12,%r13,0(%r2)
        lgdr    %r13,%f0
        .cfi_restore 13
        lgdr    %r12,%f2
        .cfi_restore 12
        br      %r14
        .cfi_endproc

The following sequence only requires 1 multiply:

__int128 imul128_opt(long long a, long long b)
{
    unsigned __int128 x = (unsigned __int128)(unsigned long long)a;
    unsigned __int128 y = (unsigned __int128)(unsigned long long)b;
    unsigned long long t1 = (a >> 63) & a;
    unsigned long long t2 = (b >> 63) & b;
    unsigned __int128 u128 = x * y;
    unsigned long long hi = (u128 >> 64) - (t1 + t2);
    unsigned long long lo = (unsigned long long)u128;
    unsigned __int128 res = hi;
    res <<= 64;
    res |= lo;
    return (__int128)res;
}

_Z11imul128_optxx:
.LFB1:
        .cfi_startproc
        ldgr    %f2,%r12
        .cfi_register 12, 17
        ldgr    %f0,%r13
        .cfi_register 13, 16
        lgr     %r13,%r3
        mlgr    %r12,%r4
        lgr     %r1,%r3
        srag    %r3,%r3,63
        ngr     %r3,%r1
        srag    %r1,%r4,63
        ngr     %r4,%r1
        agr     %r3,%r4
        sgrk    %r3,%r12,%r3
        stg     %r13,8(%r2)
        lgdr    %r12,%f2
        .cfi_restore 12
        lgdr    %r13,%f0
        .cfi_restore 13
        stg     %r3,0(%r2)
        br      %r14
        .cfi_endproc
[Bug target/102117] s390: Inefficient code for 64x64=128 signed multiply for <= z13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102117 --- Comment #1 from Jens Seifert ---

Sorry, small bug in the optimal sequence. Corrected:

__int128 imul128_opt(long long a, long long b)
{
    unsigned __int128 x = (unsigned __int128)(unsigned long long)a;
    unsigned __int128 y = (unsigned __int128)(unsigned long long)b;
    unsigned long long t1 = (a >> 63) & b;
    unsigned long long t2 = (b >> 63) & a;
    unsigned __int128 u128 = x * y;
    unsigned long long hi = (u128 >> 64) - (t1 + t2);
    unsigned long long lo = (unsigned long long)u128;
    unsigned __int128 res = hi;
    res <<= 64;
    res |= lo;
    return (__int128)res;
}

_Z11imul128_optxx:
.LFB1:
        .cfi_startproc
        ldgr    %f2,%r12
        .cfi_register 12, 17
        ldgr    %f0,%r13
        .cfi_register 13, 16
        lgr     %r13,%r3
        mlgr    %r12,%r4
        srag    %r1,%r3,63
        ngr     %r1,%r4
        srag    %r4,%r4,63
        ngr     %r4,%r3
        agr     %r4,%r1
        sgrk    %r4,%r12,%r4
        stg     %r13,8(%r2)
        lgdr    %r12,%f2
        .cfi_restore 12
        lgdr    %r13,%f0
        .cfi_restore 13
        stg     %r4,0(%r2)
        br      %r14
        .cfi_endproc
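For reference, a sketch of why the single-multiply sequence is correct (my derivation, not part of the report). Read the same 64-bit patterns both ways, a_s signed and a_u unsigned, so that

    a_s = a_u - 2^{64}\,[a_s < 0], \qquad b_s = b_u - 2^{64}\,[b_s < 0]

    a_s b_s \equiv a_u b_u - 2^{64}\bigl([a_s < 0]\,b_u + [b_s < 0]\,a_u\bigr) \pmod{2^{128}}

since the cross term 2^{128}[a_s < 0][b_s < 0] vanishes modulo 2^{128}. The arithmetic shift (a >> 63) is an all-ones mask exactly when a_s < 0, so t1 = (a >> 63) & b and t2 = (b >> 63) & a are those two correction terms, and only the high doubleword of the unsigned product needs (t1 + t2) subtracted.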
[Bug target/102265] New: s390: Inefficient code for __builtin_ctzll
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102265 Bug ID: 102265 Summary: s390: Inefficient code for __builtin_ctzll Product: gcc Version: 10.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

unsigned long long ctzll(unsigned long long x)
{
    return __builtin_ctzll(x);
}

creates:

        lcgr    %r1,%r2
        ngr     %r2,%r1
        lghi    %r1,63
        flogr   %r2,%r2
        sgrk    %r2,%r1,%r2
        lgfr    %r2,%r2
        br      %r14

The optimal sequence for z15 uses population count; for all older architectures, use ^ 63 instead of 63 -.

unsigned long long ctzll_opt(unsigned long long x)
{
#if __ARCH__ >= 13
    return __builtin_popcountll((x-1) & ~x);
#else
    return __builtin_clzll(x & -x) ^ 63;
#endif
}

< z15:

        lcgr    %r1,%r2
        ngr     %r2,%r1
        flogr   %r2,%r2
        xilf    %r2,63
        lgfr    %r2,%r2
        br      %r14

=> 1 instruction saved.

z15:

        .cfi_startproc
        lay     %r1,-1(%r2)
        ncgrk   %r2,%r1,%r2
        popcnt  %r2,%r2,8
        br      %r14
        .cfi_endproc

=> On z15 only 3 instructions are required.
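Two identities make the < z15 variant of ctzll_opt work; a minimal check under the assumption x != 0 (my illustration, not from the report):

    #include <assert.h>

    /* x & -x isolates the lowest set bit, so clzll(x & -x) == 63 - ctzll(x);
       and for 0 <= m <= 63, m ^ 63 == 63 - m, because 63 is a 6-bit all-ones
       mask.  Together: ctzll(x) == clzll(x & -x) ^ 63, which lets the cheap
       xilf 63 replace the subtract from 63. */
    int main(void)
    {
        for (unsigned long long x = 1; x <= (1ULL << 20); ++x)
            assert(__builtin_ctzll(x) == (__builtin_clzll(x & -x) ^ 63));
        return 0;
    }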
[Bug target/86160] Implement isinf on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86160 --- Comment #4 from Jens Seifert --- I am looking forward to the Power9 optimization using xststdcdp etc.
[Bug target/107757] New: PPCLE: Inefficient vector constant creation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107757 Bug ID: 107757 Summary: PPCLE: Inefficient vector constant creation Product: gcc Version: 12.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Because vslw, vsld, vsrd, ... use only the shift amount modulo the element bit width, the combination with an all-ones (0xFF..FF) vector can be used to create the vector constants

vec_splats(-0.0) or vec_splats(1ULL << 63), and scalar -0.0
vec_splats(-0.0f) or vec_splats(1U << 31)
vec_splats((short)0x8000)

with only 2 2-cycle vector instructions.

Sample:

vector long long lsb64()
{
    return vec_splats(1LL);
}

creates:

lsb64():
.LCF5:
        addi 2,2,.TOC.-.LCF5@l
        addis 9,2,.LC12@toc@ha
        addi 9,9,.LC12@toc@l
        lvx 2,0,9
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

while:

vector long long lsb64_opt()
{
    vector long long a = vec_splats(~0LL);
    __asm__("vsrd %0,%0,%0":"=v"(a):"v"(a),"v"(a));
    return a;
}

creates:

lsb64_opt():
        vspltisw 2,-1
        vsrd 2,2,2
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0
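Using the same modulo-shift trick, hedged sketches for the other constants listed above (the function names are mine; the asm mirrors the lsb64_opt pattern): shifting an all-ones element left by itself shifts by width-1, leaving only the sign bit.

    #include <altivec.h>

    vector unsigned int msb32()       /* vec_splats(1U << 31), i.e. -0.0f bits */
    {
        vector unsigned int a = vec_splats(~0U);
        __asm__("vslw %0,%1,%1" : "=v"(a) : "v"(a));  /* shift by 31 (mod 32) */
        return a;
    }

    vector unsigned long long msb64() /* vec_splats(1ULL << 63), i.e. -0.0 bits */
    {
        vector unsigned long long a = vec_splats(~0ULL);
        __asm__("vsld %0,%1,%1" : "=v"(a) : "v"(a));  /* shift by 63 (mod 64) */
        return a;
    }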
[Bug target/107949] New: PPC: Unnecessary rlwinm after lbzx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949 Bug ID: 107949 Summary: PPC: Unnecessary rlwinm after lbzx Product: gcc Version: 12.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

extern unsigned char magic1[256];

unsigned int hash(const unsigned char inp[4])
{
    const unsigned long long INIT = 0x1ULL;
    unsigned long long h1 = INIT;
    h1 = magic1[((unsigned long long)inp[0]) ^ h1];
    h1 = magic1[((unsigned long long)inp[1]) ^ h1];
    h1 = magic1[((unsigned long long)inp[2]) ^ h1];
    h1 = magic1[((unsigned long long)inp[3]) ^ h1];
    return h1;
}

#ifdef __powerpc__
#define lbzx(b,c) ({ unsigned long long r; __asm__("lbzx %0,%1,%2":"=r"(r):"b"(b),"r"(c)); r; })

unsigned int hash2(const unsigned char inp[4])
{
    const unsigned long long INIT = 0x1ULL;
    unsigned long long h1 = INIT;
    h1 = lbzx(magic1, inp[0] ^ h1);
    h1 = lbzx(magic1, inp[1] ^ h1);
    h1 = lbzx(magic1, inp[2] ^ h1);
    h1 = lbzx(magic1, inp[3] ^ h1);
    return h1;
}
#endif

Extra rlwinm instructions get added:

hash(unsigned char const*):
.LCF0:
        addi 2,2,.TOC.-.LCF0@l
        lbz 9,0(3)
        addis 10,2,.LC0@toc@ha
        ld 10,.LC0@toc@l(10)
        lbz 6,1(3)
        lbz 7,2(3)
        lbz 8,3(3)
        xori 9,9,0x1
        lbzx 9,10,9
        xor 9,9,6
        rlwinm 9,9,0,0xff    <= not necessary
        lbzx 9,10,9
        xor 9,9,7
        rlwinm 9,9,0,0xff    <= not necessary
        lbzx 9,10,9
        xor 9,9,8
        rlwinm 9,9,0,0xff    <= not necessary
        lbzx 3,10,9
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

hash2(unsigned char const*):
.LCF1:
        addi 2,2,.TOC.-.LCF1@l
        lbz 7,0(3)
        lbz 8,1(3)
        lbz 10,2(3)
        lbz 6,3(3)
        addis 9,2,.LC1@toc@ha
        ld 9,.LC1@toc@l(9)
        xori 7,7,0x1
        lbzx 7,9,7
        xor 8,8,7
        lbzx 8,9,8
        xor 10,10,8
        lbzx 10,9,10
        xor 10,6,10
        lbzx 3,9,10
        rldicl 3,3,0,32
        blr

Tiny sample:

unsigned long long tiny(const unsigned char *inp)
{
    return inp[0] ^ inp[1];
}

tiny(unsigned char const*):
        lbz 9,0(3)
        lbz 10,1(3)
        xor 3,9,10
        rlwinm 3,3,0,0xff
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

unsigned long long tiny2(const unsigned char *inp)
{
    unsigned long long a = inp[0];
    unsigned long long b = inp[1];
    return a ^ b;
}

tiny2(unsigned char const*):
        lbz 9,0(3)
        lbz 10,1(3)
        xor 3,9,10
        rlwinm 3,3,0,0xff
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

lbz/lbzx produce a value 0 <= x < 256, and the xor of two such values does not leave that range.
[Bug target/107949] PPC: Unnecessary rlwinm after lbzx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949 --- Comment #1 from Jens Seifert --- hash2 is only provided to show how the code should look (without rlwinm).
[Bug target/108048] New: PPCLE: gcc does not recognize that lbzx does zero extend
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108048 Bug ID: 108048 Summary: PPCLE: gcc does not recognize that lbzx does zero extend Product: gcc Version: 12.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

extern unsigned char magic1[256];

unsigned int hash(const unsigned char inp[4])
{
    const unsigned long long INIT = 0x1ULL;
    unsigned long long h1 = INIT;
    h1 = magic1[((unsigned long long)inp[0]) ^ h1];
    h1 = magic1[((unsigned long long)inp[1]) ^ h1];
    h1 = magic1[((unsigned long long)inp[2]) ^ h1];
    h1 = magic1[((unsigned long long)inp[3]) ^ h1];
    return h1;
}

Generates:

hash(unsigned char const*):
.LCF0:
        addi 2,2,.TOC.-.LCF0@l
        lbz 9,0(3)
        addis 10,2,.LC0@toc@ha
        ld 10,.LC0@toc@l(10)
        lbz 6,1(3)
        lbz 7,2(3)
        lbz 8,3(3)
        xori 9,9,0x1
        lbzx 9,10,9
        xor 9,9,6
        rlwinm 9,9,0,0xff    <= unnecessary
        lbzx 9,10,9
        xor 9,9,7
        rlwinm 9,9,0,0xff    <= unnecessary
        lbzx 9,10,9
        xor 9,9,8
        rlwinm 9,9,0,0xff    <= unnecessary
        lbzx 3,10,9
        blr

All XOR operations are done on unsigned long long (64-bit). gcc adds an unnecessary rlwinm; lbz and lbzx do zero extension (no cleanup of the upper bits is required).
[Bug rtl-optimization/107949] PPC: Unnecessary rlwinm after lbzx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949 --- Comment #3 from Jens Seifert --- *** Bug 108048 has been marked as a duplicate of this bug. ***
[Bug target/108048] PPCLE: gcc does not recognize that lbzx does zero extend
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108048

Jens Seifert changed:

           What      |Removed     |Added
           --------------------------------------
           Status    |UNCONFIRMED |RESOLVED
           Resolution|---         |DUPLICATE

--- Comment #1 from Jens Seifert ---
duplicate of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949

*** This bug has been marked as a duplicate of bug 107949 ***
[Bug target/108049] New: s390: Compiler adds extra zero extend after xoring 2 zero extended values
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108049 Bug ID: 108049 Summary: s390: Compiler adds extra zero extend after xoring 2 zero extended values Product: gcc Version: 12.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Same issue for PPC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949

extern unsigned char magic1[256];

unsigned int hash(const unsigned char inp[4])
{
    const unsigned long long INIT = 0x1ULL;
    unsigned long long h1 = INIT;
    h1 = magic1[((unsigned long long)inp[0]) ^ h1];
    h1 = magic1[((unsigned long long)inp[1]) ^ h1];
    h1 = magic1[((unsigned long long)inp[2]) ^ h1];
    h1 = magic1[((unsigned long long)inp[3]) ^ h1];
    return h1;
}

hash(unsigned char const*):
        llgc    %r4,1(%r2)     <= zero extends to 64-bit
        lgrl    %r1,.LC0
        llgc    %r3,0(%r2)     <= zero extends to 64-bit
        xilf    %r3,1
        llgc    %r3,0(%r3,%r1)
        xr      %r3,%r4        <= should be 64-bit xor
        llgc    %r4,2(%r2)     <= zero extends to 64-bit
        llgcr   %r3,%r3        <= unnecessary
        llgc    %r2,3(%r2)
        llgc    %r3,0(%r3,%r1)
        xr      %r3,%r4        <= should be 64-bit xor
        llgcr   %r3,%r3        <= unnecessary
        llgc    %r3,0(%r3,%r1) <= zero extends to 64-bit
        xrk     %r2,%r3,%r2    <= should be 64-bit xor
        llgcr   %r2,%r2        <= unnecessary
        llgc    %r2,0(%r2,%r1)
        br      %r14

Smaller sample:

unsigned long long tiny2(const unsigned char *inp)
{
    unsigned long long a = inp[0];
    unsigned long long b = inp[1];
    return a ^ b;
}

tiny2(unsigned char const*):
        llgc    %r1,0(%r2)
        llgc    %r2,1(%r2)
        xrk     %r2,%r1,%r2
        llgcr   %r2,%r2
        br      %r14
[Bug target/108049] s390: Compiler adds extra zero extend after xoring 2 zero extended values
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108049 --- Comment #1 from Jens Seifert --- The sample above was compiled with -march=z196.
[Bug c/106043] New: Power10: lacking vec_blendv builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106043 Bug ID: 106043 Summary: Power10: lacking vec_blendv builtins Product: gcc Version: 11.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Missing builtins for the vector instructions xxblendvb, xxblendvh, xxblendvw, xxblendvd.

#include <altivec.h>

vector int blendv(vector int a, vector int b, vector int c)
{
    return vec_blendv(a, b, c);
}
[Bug target/106043] Power10: lacking vec_blendv builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106043 --- Comment #1 from Jens Seifert --- Found in documentation: https://gcc.gnu.org/onlinedocs/gcc-11.3.0/gcc/PowerPC-AltiVec-Built-in-Functions-Available-on-ISA-3_002e1.html#PowerPC-AltiVec-Built-in-Functions-Available-on-ISA-3_002e1
[Bug target/106043] Power10: lacking vec_blendv builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106043

Jens Seifert changed:

           What      |Removed     |Added
           --------------------------------------
           Status    |UNCONFIRMED |RESOLVED
           Resolution|---         |INVALID

--- Comment #2 from Jens Seifert ---
Also found in altivec.h
[Bug target/114376] New: s390: Inefficient __builtin_bswap16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114376 Bug ID: 114376 Summary: s390: Inefficient __builtin_bswap16 Product: gcc Version: 13.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

unsigned short swap16(unsigned short in)
{
    return __builtin_bswap16(in);
}

generates with -O3 -march=z196:

swap16(unsigned short):
        lrvr    %r2,%r2
        srl     %r2,16
        llghr   %r2,%r2
        br      %r14

More efficient for 64-bit is:

unsigned short swap16_2(unsigned short in)
{
    return __builtin_bswap64(in) >> 48;
}

which generates:

swap16_2(unsigned short):
        lrvgr   %r2,%r2
        srlg    %r2,%r2,48
        br      %r14

For 31-bit, lrvr should be used.
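A quick exhaustive check of the identity swap16_2 relies on (my illustration, not part of the report): bswap64 places the two low-order bytes, swapped, into the top 16 bits, so shifting right by 48 yields the zero-extended bswap16 result with one fewer instruction.

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        for (uint32_t v = 0; v <= 0xFFFF; ++v)
            assert(__builtin_bswap16((uint16_t)v) ==
                   (uint16_t)(__builtin_bswap64((uint16_t)v) >> 48));
        return 0;
    }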
[Bug target/93176] PPC: inefficient 64-bit constant consecutive ones
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93176 --- Comment #7 from Jens Seifert --- What happened? Still waiting for an improvement.
[Bug target/93176] PPC: inefficient 64-bit constant consecutive ones
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93176 --- Comment #10 from Jens Seifert ---

Looks like no patch in this area got delivered. I did a small test:

unsigned long long c()
{
    return 0x0000FFFF00000000ULL;
}

gcc 13.2.0:

        li 3,0
        ori 3,3,0xFFFF
        sldi 3,3,32

expected:

        li 3,-1
        rldic 3,3,32,16

All consecutive-ones constants can be created with li + rldic. The rotate eliminates the bits on the right and the clear mask eliminates the bits on the left, as described below (MB and ME are the big-endian indices of the first and last one bit):

        li t,-1
        rldic d,t,63-ME,MB
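A hypothetical helper (name and interface made up for illustration) showing how the SH and MB fields of the li + rldic pair can be derived from a non-wrapping consecutive-ones mask:

    #include <stdint.h>

    /* rldic d,t,SH,MB with t = -1 produces ones from big-endian bit MB
       through bit 63-SH.  For a mask with ones at BE bits MB..ME that
       means SH = 63 - ME = ctz(mask) and MB = clz(mask).
       Returns 0 on success, -1 if the ones are not consecutive. */
    int rldic_args(uint64_t mask, unsigned *sh, unsigned *mb)
    {
        if (mask == 0)
            return -1;
        unsigned tz = __builtin_ctzll(mask);
        uint64_t run = mask >> tz;
        if (run & (run + 1))      /* ones are not consecutive */
            return -1;
        *sh = tz;
        *mb = (unsigned)__builtin_clzll(mask);
        return 0;
    }

For 0x0000FFFF00000000 this gives sh = 32 and mb = 16, matching the expected rldic 3,3,32,16 above.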
[Bug target/115973] PPCLE: Inefficient code for __builtin_uaddll_overflow and __builtin_addcll
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115973 --- Comment #2 from Jens Seifert ---

Assembly that integrates better:

unsigned long long addc_opt(unsigned long long a, unsigned long long b, unsigned long long *res)
{
    unsigned long long rc;
    __asm__("addc %0,%2,%3;\n\tsubfe %1,%1,%1":"=r"(*res),"=r"(rc):"r"(a),"r"(b):"xer");
    return rc + 1;
}

(After the addc, subfe rc,rc,rc computes ~rc + rc + CA = CA - 1, so rc + 1 yields the carry.)

Output:

.L.addc_opt(unsigned long long, unsigned long long, unsigned long long*):
        addc 9,3,4; subfe 3,3,3
        std 9,0(5)
        addi 3,3,1
        blr

Power10 code for __builtin_uaddll_overflow is okay:

unsigned long long addc(unsigned long long a, unsigned long long b, unsigned long long *res)
{
    return __builtin_uaddll_overflow(a, b, res);
}

.L.addc(unsigned long long, unsigned long long, unsigned long long*):
        add 4,3,4
        cmpld 0,4,3
        std 4,0(5)
        setbc 3,0
        blr
[Bug target/116649] New: PPC: Suboptimal code for __builtin_bcdadd_ovf on Power10
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116649 Bug ID: 116649 Summary: PPC: Suboptimal code for __builtin_bcdadd_ovf on Power10 Product: gcc Version: 14.1.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

unsigned long long bcdadd(vector __int128 a, vector __int128 b, vector __int128 *c)
{
    return __builtin_bcdadd_ov(a, b, 0);
}

creates:

bcdadd(__int128 __vector(1), __int128 __vector(1), __int128 __vector(1)*):
        .quad .L.bcdadd(__int128 __vector(1), __int128 __vector(1), __int128 __vector(1)*),.TOC.@tocbase,0
.L.bcdadd(__int128 __vector(1), __int128 __vector(1), __int128 __vector(1)*):
        bcdadd. 2,2,3,0
        mfcr 3,2
        rlwinm 3,3,28,1
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

while setbc is expected:

        bcdadd. 2,2,3,0
        setbc 3,27
        blr
[Bug target/115355] New: PPCLE: Auto-vectorization creates wrong code for Power9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115355 Bug ID: 115355 Summary: PPCLE: Auto-vectorization creates wrong code for Power9 Product: gcc Version: 12.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Input setToIdentity.C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void setToIdentityGOOD(unsigned long long *mVec, unsigned int mLen)
{
    for (unsigned long long i = 0; i < mLen; i++)
    {
        mVec[i] = i;
    }
}

void setToIdentityBAD(unsigned long long *mVec, unsigned int mLen)
{
    for (unsigned int i = 0; i < mLen; i++)
    {
        mVec[i] = i;
    }
}

unsigned long long vec1[100];
unsigned long long vec2[100];

int main(int argc, char *argv[])
{
    unsigned int l = argc > 1 ? atoi(argv[1]) : 29;
    setToIdentityGOOD(vec1, l);
    setToIdentityBAD(vec2, l);
    if (memcmp(vec1, vec2, l*sizeof(vec1[0])) != 0)
    {
        for (unsigned int i = 0; i < l; i++)
        {
            printf("%llu %llu\n", vec1[i], vec2[i]);
        }
    }
    else
    {
        printf("match\n");
    }
    return 0;
}

Fails:
gcc -O3 -mcpu=power9 -m64 setToIdentity.C -save-temps -fverbose-asm -o pwr9.exe -mno-isel

Good:
gcc -O3 -mcpu=power8 -m64 setToIdentity.C -save-temps -fverbose-asm -o pwr8.exe -mno-isel

"-mno-isel" is only specified to reduce the diff.

Failing output of pwr9.exe:

0 0
1 1
2 0
3 4294967296
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28

The 3rd and 4th elements (indices 2 and 3) contain wrong data.
[Bug target/115355] PPCLE: Auto-vectorization creates wrong code for Power9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115355 --- Comment #1 from Jens Seifert --- Same issue with gcc 13.2.1
[Bug target/115355] [12/13/14/15 Regression] vectorization exposes wrong code on P9 LE starting from r12-4496
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115355 --- Comment #10 from Jens Seifert --- Does this affect both loop vectorization and SLP vectorization? -fno-tree-loop-vectorize prevents loop vectorization from being performed and works around this issue. Does the same problem also affect SLP vectorization, which does not take place in this sample? In other words, do I need -fno-tree-loop-vectorize or -fno-tree-vectorize to work around this bug?