[Bug target/94135] New: PPC: subfic instead of neg used for rotate right
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94135

Bug ID: 94135
Summary: PPC: subfic instead of neg used for rotate right
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

unsigned int rotr32(unsigned int v, unsigned int r)
{
    return (v>>r)|(v<<(32-r));
}

unsigned long long rotr64(unsigned long long v, unsigned long long r)
{
    return (v>>r)|(v<<(64-r));
}

Command line: gcc -O2 -save-temps rotr.C

Output:

_Z6rotr32jj:
.LFB0:
    .cfi_startproc
    subfic 4,4,32
    rotlw 3,3,4
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0
    .cfi_endproc
_Z6rotr64yy:
.LFB1:
    .cfi_startproc
    subfic 4,4,64
    rotld 3,3,4
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0
    .cfi_endproc

subfic is a 2-cycle instruction, but it can be replaced by the 1-cycle instruction neg:

rotr32(v,r) = rotl32(v,32-r) = rotl32(v,(32-r)%32) = rotl32(v,(-r)%32) = rotl32(v,-r)

as long as you have a modulo rotate like rotlw/rlwnm. The same holds for 64-bit.
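As a sanity check of the identity chain above, a minimal sketch (not part of the report; rotl32_mod is a hypothetical helper) that verifies rotating left by -r equals rotating right by r once the count is reduced modulo 32, as rlwnm does:

#include <assert.h>
#include <stdint.h>

static uint32_t rotl32_mod(uint32_t v, uint32_t r)
{
    r &= 31;                      /* modulo rotate, as rlwnm masks the count */
    return r ? (v << r) | (v >> (32 - r)) : v;
}

int main(void)
{
    uint32_t v = 0x12345678u;
    for (uint32_t r = 1; r < 32; r++) {
        uint32_t rotr = (v >> r) | (v << (32 - r));
        assert(rotl32_mod(v, (uint32_t)-r) == rotr);  /* rotl by -r == rotr by r */
    }
    return 0;
}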
[Bug target/94135] PPC: subfic instead of neg used for rotate right
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94135

--- Comment #2 from Jens Seifert ---
From the POWER8 Processor User's Manual for the Single-Chip Module:

addi addis add add. subf subf. addic subfic adde addme subfme addze. subfze neg neg. nego
    1 - 2 cycles (GPR), 2 cycles (XER), 5 cycles (CR)
    6/cycle, 2/cycle (with XER or CR updates)

CA is part of XER. So: 1-2 cycles versus 2 cycles.
[Bug target/94135] PPC: subfic instead of neg used for rotate right
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94135

--- Comment #4 from Jens Seifert ---
Setting CA in XER increases issue-to-issue latency by 1 on Power8; see Table 10-14, "Issue-to-Issue Latencies". In addition, setting CA restricts instruction reordering.
[Bug c++/94297] New: std::replace internal compiler error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94297

Bug ID: 94297
Summary: std::replace internal compiler error
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

#include <algorithm>
#include <string>

void patch(std::string& s)
{
    std::replace(s.begin(), s.end(), '.', '-');
}

gcc replace.C

In file included from /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/uniform_int_dist.h:35,
                 from /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/stl_algo.h:66,
                 from /opt/rh/devtoolset-8/root/usr/include/c++/8/algorithm:62,
                 from replace.C:1:
/opt/rh/devtoolset-8/root/usr/include/c++/8/limits:1677:7: internal compiler error: Segmentation fault
   max() _GLIBCXX_USE_NOEXCEPT { return __DBL_MAX__; }
       ^~~
Please submit a full bug report, with preprocessed source if appropriate.
See <http://bugzilla.redhat.com/bugzilla> for instructions.
Preprocessed source stored into /tmp/ccFTVYLT.out file, please attach this to your bugreport.
[Bug target/94297] PPCLE std::replace internal compiler error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94297

Jens Seifert changed:
  What      |Removed                  |Added
  Summary   |std::replace internal    |PPCLE std::replace internal
            |compiler error           |compiler error
  Component |c++                      |target
  CC        |                         |wschmidt at gcc dot gnu.org

--- Comment #1 from Jens Seifert ---
The same code compiles on x86 with the same version of the compiler.
[Bug target/94297] PPCLE std::replace internal compiler error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94297

--- Comment #3 from Jens Seifert ---
Created attachment 48110
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48110&action=edit
Pre-processed file created using -save-temps
[Bug target/94297] PPCLE std::replace internal compiler error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94297

--- Comment #5 from Jens Seifert ---
No options. Same failure with -O2. System is RHEL 7.5.

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-8/root/usr/libexec/gcc/ppc64le-redhat-linux/8/lto-wrapper
Target: ppc64le-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-8/root/usr --mandir=/opt/rh/devtoolset-8/root/usr/share/man --infodir=/opt/rh/devtoolset-8/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-targets=powerpcle-linux --disable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-8.3.1-20190311/obj-ppc64le-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --enable-secureplt --with-long-double-128 --with-cpu-32=power8 --with-tune-32=power8 --with-cpu-64=power8 --with-tune-64=power8 --build=ppc64le-redhat-linux
Thread model: posix
gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC)

No error with:
gcc -std=gnu++98 replace.C
gcc -std=gnu++03 replace.C

Error with:
gcc -std=gnu++11 replace.C
gcc -std=gnu++17 replace.C
[Bug target/94519] New: PPC: ICE: Segmentation fault on -DBL_MAX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94519

Bug ID: 94519
Summary: PPC: ICE: Segmentation fault on -DBL_MAX
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

#include <float.h>
static const double dsmall[] = { -DBL_MAX };

gcc ccerr.C

ccerr.C:3:1: internal compiler error: Segmentation fault
 static const double dsmall[] = { -DBL_MAX };
 ^~
Please submit a full bug report, with preprocessed source if appropriate.
See <http://bugzilla.redhat.com/bugzilla> for instructions.
Preprocessed source stored into /tmp/cc3rsmv0.out file, please attach this to your bugreport.
[Bug target/94519] PPC: ICE: Segmentation fault on -DBL_MAX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94519

Jens Seifert changed:
  What        |Removed     |Added
  Status      |RESOLVED    |CLOSED

--- Comment #2 from Jens Seifert ---
Environment issue: gcc was compiled with GMP 6.0.0 support but is picking up GMP 5.0.1.
[Bug target/94297] PPCLE std::replace internal compiler error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94297

Jens Seifert changed:
  What        |Removed      |Added
  Status      |UNCONFIRMED  |RESOLVED
  Resolution  |---          |FIXED

--- Comment #7 from Jens Seifert ---
Too old libgmp got picked up. Setting LD_LIBRARY_PATH=/lib64 solved the issue.
[Bug target/94297] PPCLE std::replace internal compiler error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94297

--- Comment #8 from Jens Seifert ---
Too old libgmp got picked up. Setting LD_LIBRARY_PATH=/lib64 solved the issue.
[Bug target/94297] PPCLE std::replace internal compiler error
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94297

Jens Seifert changed:
  What        |Removed     |Added
  Status      |RESOLVED    |CLOSED

--- Comment #9 from Jens Seifert ---
Too old libgmp got picked up. Setting LD_LIBRARY_PATH=/lib64 solved the issue.
[Bug target/95704] New: PPC: int128 shifts should be implemented branchless
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

Bug ID: 95704
Summary: PPC: int128 shifts should be implemented branchless
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Created attachment 48741
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48741&action=edit
input with branchless 128-bit shifts

PowerPC processors don't like branches, and branch mispredicts lead to large overhead. Shift left/right of unsigned __int128 can be implemented in 8 instructions which can be processed on 2 pipelines almost in parallel, leading to ~5 cycle latency on Power7 and Power8. Shift right algebraic of __int128 can be implemented in 10 instructions. Overall this is comparable to the latency of the branching code. The attached file contains the branchless implementations in C. I know that this relies on undefined behavior, but the resulting assembly is the interesting part. The rldicl 8,5,0,32 at the beginning of the routines is also unnecessary.
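The attachment is not reproduced in this digest; the following is a hypothetical reconstruction of the branchless shift-left, matching the assembly the reporter shows in comment #3 below. As the report notes, it deliberately relies on PowerPC sld/srd semantics (counts are taken from the low 7 bits and counts of 64..127 yield 0), which is undefined behavior in portable C:

typedef unsigned long long u64;

/* Shift left of a 128-bit value by sh, 0 <= sh < 128.
   UB in portable C for sh == 0 and sh >= 64; well-defined with PPC sld/srd. */
static unsigned __int128 shl_branch_less(unsigned __int128 v, u64 sh)
{
    u64 lo = (u64)v;
    u64 hi = (u64)(v >> 64);
    u64 hi2 = (hi << sh)            /* sld 4,4,5                           */
            | (lo >> (64 - sh))     /* srd 9,3,9   (0 when sh == 0 on PPC) */
            | (lo << (sh - 64));    /* sld 10,3,10 (0 when sh < 64 on PPC) */
    u64 lo2 = lo << sh;             /* sld 3,3,5   (0 when sh >= 64 on PPC) */
    return ((unsigned __int128)hi2 << 64) | lo2;
}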
[Bug target/95704] PPC: int128 shifts should be implemented branchless
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

--- Comment #1 from Jens Seifert ---
Created attachment 48742
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48742&action=edit
assembly
[Bug target/95704] PPC: int128 shifts should be implemented branchless
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

--- Comment #3 from Jens Seifert ---
GCC 8.3 generates:

_Z3shloy:
.LFB0:
    .cfi_startproc
    addi 9,5,-64
    cmpwi 7,9,0
    blt 7,.L2
    sld 4,3,9
    li 3,0
    blr
    .p2align 4,,15
.L2:
    srdi 9,3,1
    subfic 10,5,63
    sld 4,4,5
    srd 9,9,10
    sld 3,3,5
    or 4,9,4
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0
    .cfi_endproc

8 instructions if taking .L2. The branch-free code I proposed:

_Z15shl_branch_lessoy:
.LFB1:
    .cfi_startproc
    rldicl 5,5,0,32
    subfic 9,5,64
    addi 10,5,-64
    sld 10,3,10
    srd 9,3,9
    sld 4,4,5
    or 9,9,10
    or 4,9,4
    sld 3,3,5
    blr

8 instructions, no branch (not counting the rldicl). Almost everything can be executed in parallel. The rldicl 5,5,0,32 gets added by gcc and is not necessary.
[Bug target/95704] PPC: int128 shifts should be implemented branchless
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

--- Comment #5 from Jens Seifert ---
The Power9 code is branch-free but not good at all:

_Z3shloy:
.LFB0:
    .cfi_startproc
    addi 8,5,-64
    subfic 6,5,63
    srdi 10,3,1
    li 7,0
    sld 4,4,5
    sld 5,3,5
    cmpwi 7,8,0
    srd 10,10,6
    sld 3,3,8
    or 4,10,4
    isel 5,5,7,28
    isel 4,4,3,28
    mr 3,5
    blr

13 instructions.
[Bug target/95737] New: PPC: Unnecessary extsw after negative less than
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95737

Bug ID: 95737
Summary: PPC: Unnecessary extsw after negative less than
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

unsigned long long negativeLessThan(unsigned long long a, unsigned long long b)
{
    return -(a < b);
}

gcc -m64 -O2 -save-temps negativeLessThan.C

creates:

_Z16negativeLessThanyy:
.LFB0:
    .cfi_startproc
    subfc 4,4,3
    subfe 3,3,3
    extsw 3,3
    blr

The extsw is not necessary.
[Bug target/95737] PPC: Unnecessary extsw after negative less than
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95737

Jens Seifert changed:
  What        |Removed      |Added
  Status      |RESOLVED     |UNCONFIRMED
  Resolution  |DUPLICATE    |---

--- Comment #3 from Jens Seifert ---
This is different, as the extsw also happens if the result gets used, e.g. followed by an andc, which is my case. I obviously oversimplified the sample. It has nothing to do with the function result and ABI requirements. gcc assumes that the result of -(a < b) implemented by subfc, subfe is signed 32-bit, but the result is already 64-bit.

unsigned long long branchlesconditional(unsigned long long a, unsigned long long b, unsigned long long c)
{
    unsigned long long mask = -(a < b);
    return c & ~mask;
}

results in:

_Z20branchlesconditionalyyy:
.LFB1:
    .cfi_startproc
    subfc 4,4,3
    subfe 3,3,3
    not 3,3
    extsw 3,3
    and 3,3,5
    blr

expected:

subfc
subfe
andc
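For context, a minimal portable-C model (my sketch, not from the report; selectIfGE is a hypothetical name) of why no sign extension is needed: the borrow-based mask is already all 64 bits wide, so the select only needs an andc:

#include <assert.h>

/* mask = -(a < b) is 0 or ~0ULL across all 64 bits; subfc/subfe compute
   exactly this via the carry, so a following extsw adds nothing. */
static unsigned long long selectIfGE(unsigned long long a,
                                     unsigned long long b,
                                     unsigned long long c)
{
    unsigned long long mask = 0ULL - (a < b);
    return c & ~mask;   /* expected: subfc, subfe, andc */
}

int main(void)
{
    assert(selectIfGE(1, 2, 0x1234) == 0);       /* a < b: mask all ones */
    assert(selectIfGE(2, 1, 0x1234) == 0x1234);  /* a >= b: mask zero */
    return 0;
}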
[Bug c++/93012] New: PPC: inefficient 64-bit constant generation (upper = lower)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93012

Bug ID: 93012
Summary: PPC: inefficient 64-bit constant generation (upper = lower)
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

unsigned long long msk66()
{
    return 0x6666666666666666ULL;
}

gcc -maix64 -O2 const.C -save-temps

Output:

._Z5msk66v:
LFB..0:
    lis 3,0x6666
    ori 3,3,0x6666
    sldi 3,3,32
    oris 3,3,0x6666
    ori 3,3,0x6666
    blr

Any 64-bit constant whose upper 32 bits match its lower 32 bits can be created using 3 instructions: construct the 32-bit lower part, then use rldimi to duplicate it into the upper part of the register. Sample:

lis 3, 26214
ori 3, 3, 26214
rldimi 3, 3, 32, 0
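The same upper=lower splat written out in portable C (my illustration; splat32 is a hypothetical helper):

/* Build a 64-bit constant whose two 32-bit halves are equal by
   constructing the low half and OR-ing in a copy shifted up by 32,
   which maps to lis/ori + rldimi on PowerPC. */
static unsigned long long splat32(unsigned int lo32)
{
    unsigned long long v = lo32;
    return v | (v << 32);   /* rldimi 3,3,32,0 */
}

/* splat32(0x66666666u) == 0x6666666666666666ULL */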
[Bug c++/93013] New: PPC: optimization around modulo leads to incorrect result
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93013

Bug ID: 93013
Summary: PPC: optimization around modulo leads to incorrect result
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

int mod(int x, int y, int &z)
{
    z = x % y;
    if (y == 0) { // division by zero
        return 1;
    }
    else if (y == -1) { // gcc removes this branch, which leads to wrong results for -2^31 % -1
        z = 0;
    }
    return 0;
}

gcc -maix64 -O2 modulo.C -save-temps

Output:

._Z3modiiRi:
LFB..0:
    divw 9,3,4
    mr 10,3
    cntlzw 3,4
    srwi 3,3,5
    mullw 9,9,4
    subf 9,9,10
    stw 9,0(5)
    blr

For input x=-2^31, y=-1, the result is expected to be 0. As modulo is emulated using x-(x/y)*y, and x/y for x=-2^31, y=-1 is undefined on PowerPC, the result is incorrect if the branch gets optimized away.
[Bug tree-optimization/93013] PPC: optimization around modulo leads to incorrect result
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93013

--- Comment #7 from Jens Seifert ---
The modulo at the beginning was done for optimization purposes. As the divide takes long and the special cases are extreme edge cases, it is wise to execute the divide as early as possible on PPC, since divide on PPC does not produce signals on bad input. Thank you for citing the C standard regarding another undefined behavior section.
[Bug c++/93123] New: Lacking basic optimization around __int128 bitwise operations against constants
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93123

Bug ID: 93123
Summary: Lacking basic optimization around __int128 bitwise operations against constants
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

unsigned __int128 and128WithConst(unsigned __int128 a)
{
    unsigned __int128 c128 = (((unsigned __int128)(~0ULL)) << 64) | ((unsigned __int128)(~0xFULL));
    return a & c128;
}

gcc -O2 -maix64 -save-temps int128.C

Output:

._Z12andWithConsto:
LFB..0:
    li 10,-1
    li 11,-16
    and 3,3,10
    and 4,4,11
    blr

Expected result: a single instruction, rldicr on the low part (register 4). Bitwise AND with 0xFF..FF is a no-op; bitwise AND with 0xFF..F0 can be done using rldicr.
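Decomposed into 64-bit halves (my illustration, not from the report), it is clear that only the low doubleword needs any work:

/* The constant's high doubleword is ~0ULL (AND is a no-op) and its low
   doubleword is ~0xFULL (clear the low 4 bits, one rldicr on PowerPC). */
static unsigned __int128 and128WithConst_halves(unsigned __int128 a)
{
    unsigned long long hi = (unsigned long long)(a >> 64);   /* unchanged */
    unsigned long long lo = (unsigned long long)a & ~0xFULL; /* rldicr 4,4,0,59 */
    return ((unsigned __int128)hi << 64) | lo;
}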
[Bug target/93126] New: PPC altivec -Wunused-but-set-variable false positive
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93126

Bug ID: 93126
Summary: PPC altivec -Wunused-but-set-variable false positive
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

#include <altivec.h>

double vmax(double a, double b)
{
#ifdef _BIG_ENDIAN
    const long PREF = 0;
#else
    const long PREF = 1;
#endif
    vector double va = vec_promote(a,PREF);
    vector double vb = vec_promote(b,PREF);
    return vec_extract(vec_max(va, vb), PREF);
}

gcc -O2 -maix64 -mcpu=power7 -maltivec -Wall

warn.C: In function 'double vmax(double, double)':
warn.C:6:15: warning: variable 'PREF' set but not used [-Wunused-but-set-variable]
     const long PREF = 0;
               ^~~~
[Bug target/93127] PPC altivec vec_promote creates unnecessary xxpermdi instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93127

Jens Seifert changed:
  What        |Removed     |Added
  Target      |            |powerpc-*-*-*

--- Comment #1 from Jens Seifert ---
Command line: gcc -O2 -maix64 -mcpu=power7 -maltivec warn.C -save-temps
[Bug target/93127] New: PPC altivec vec_promote creates unnecessary xxpermdi instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93127

Bug ID: 93127
Summary: PPC altivec vec_promote creates unnecessary xxpermdi instruction
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

vec_promote can leave half of the register undefined and therefore should not issue an extra instruction.

Input:

#include <altivec.h>

double vmax(double a, double b)
{
#ifdef _BIG_ENDIAN
    const int PREF = 0;
#else
    const int PREF = 1;
#endif
    vector double va = vec_promote(a,PREF);
    vector double vb = vec_promote(b,PREF);
    return vec_extract(vec_max(va, vb), PREF);
}

Output:

._Z4vmaxdd:
LFB..0:
    xxpermdi 2,2,2,0
    xxpermdi 1,1,1,0
    xvmaxdp 1,1,2
    # vec_extract to same register
    blr

Both xxpermdi are unnecessary.
[Bug target/93128] New: PPC small floating point constants can be constructed using vector operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93128

Bug ID: 93128
Summary: PPC small floating point constants can be constructed using vector operations
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

#include <altivec.h>

double d2() { return 2.0; }
vector double v2() { return vec_splats(2.0); }

gcc -O2 -maix64 -mcpu=power7 -maltivec const.C

gcc uses a load from the constant area. A better alternative for "integer" values -15.0..+16.0:

vspltisw 0,<n>
xvcvsxwdp 1,32

0.0 already gets constructed using xxlxor, which is great. Similar things can be done for

vector float v2f() { return vec_splats(2.0f); }
[Bug target/93129] New: PPC memset not using vector instruction on >= Power8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93129

Bug ID: 93129
Summary: PPC memset not using vector instruction on >= Power8
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

void memclear16(char *p) { memset(p, 0, 16); }
void memFF16(char *p) { memset(p, 0xFF, 16); }

Output:

._Z10memclear16Pc:
LFB..0:
    li 9,0
    std 9,0(3)
    std 9,8(3)
    blr
._Z7memFF16Pc:
LFB..1:
    mflr 0
    li 5,16
    li 4,255
    std 0,16(1)
    stdu 1,-112(1)
LCFI..0:
    bl .memset
    nop
    addi 1,1,112
LCFI..1:
    ld 0,16(1)
    mtlr 0
LCFI..2:
    blr

Expected output: vspltisb + vector store. Unaligned vector stores only perform well on >= Power8, so vector stores should be used only on >= Power8; on Power7 you would need to know the pointer is at least 8-byte aligned. vspltisb has the -16..+15 limit. On Power9 a splat for 0..255 exists.
[Bug target/93128] PPC small floating point constants can be constructed using vector operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93128

--- Comment #1 from Jens Seifert ---
Wrong number range given above; for Power7 it is -16..15.
[Bug target/93130] New: PPC simple memset not inlined
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93130

Bug ID: 93130
Summary: PPC simple memset not inlined
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

void memspace16(char *p) { memset(p, ' ', 16); }

Expected result:

li 4,0x2020
rldimi 4,4,16,0
rldimi 4,4,32,0
std 4,0(3)

Splatting the memset input to 64-bit can be done using li + 2x rldimi. But also

._Z13memspace16OptPc:
LFB..3:
    lis 9,0x2020
    ori 9,9,0x2020
    sldi 9,9,32
    oris 9,9,0x2020
    ori 9,9,0x2020
    std 9,0(3)
    std 9,8(3)
    blr

would perform better than the function call to memset.
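The splat step in portable C (my sketch; splat8 is a hypothetical helper): replicate the fill byte across 64 bits with shift-and-OR doublings, which is what the li + 2x rldimi sequence implements:

#include <stdint.h>

/* Replicate an 8-bit fill value across a 64-bit word:
   0x20 -> 0x2020 -> 0x20202020 -> 0x2020202020202020. */
static uint64_t splat8(uint8_t c)
{
    uint64_t v = c;
    v |= v << 8;    /* covered by li 4,0x2020 for a known byte */
    v |= v << 16;   /* rldimi 4,4,16,0 */
    v |= v << 32;   /* rldimi 4,4,32,0 */
    return v;
}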
[Bug target/93176] New: PPC: inefficient 64-bit constant consecutive ones
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93176

Bug ID: 93176
Summary: PPC: inefficient 64-bit constant consecutive ones
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

All 64-bit constants containing a sequence of consecutive ones can be constructed with 2 instructions (li/lis + rldicl). gcc creates up to 5 instructions.

Input:

unsigned long long onesLI()
{
    return 0x00FFFFFFFFFFFF00ULL; // expected: li 3,0xFF00 ; rldicl 3,3,0,8
}
unsigned long long onesLIS()
{
    return 0x00FFFFFFFF000000ULL; // expected: lis 3,0xFF00 ; rldicl 3,3,0,8
}
unsigned long long onesHI()
{
    return 0x00FFFF0000000000ULL; // expected: lis 3,0x ; rldicl 3,3,8,8
}

Command line: gcc -O2 -maix64 -save-temps const.C

Output:

._Z6onesLIv:
LFB..2:
    lis 3,0xff
    ori 3,3,0xffff
    sldi 3,3,32
    oris 3,3,0xffff
    ori 3,3,0xff00
    blr
._Z7onesLISv:
LFB..3:
    lis 3,0xff
    ori 3,3,0xffff
    sldi 3,3,32
    oris 3,3,0xff00
    blr
._Z6onesHIv:
LFB..4:
    lis 3,0xff
    ori 3,3,0xff00
    sldi 3,3,32
    blr
[Bug target/93178] New: PPC: inefficient 64-bit constant generation if msb is off in low 16 bit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93178

Bug ID: 93178
Summary: PPC: inefficient 64-bit constant generation if msb is off in low 16 bit
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

unsigned long long hi16msbon_low16msboff()
{
    return 0x87654321ULL; // expected: li 3,0x4321 ; oris 3,0x8765
}

Command line: gcc -O2 -maix64 -save-temps const.C

Output:

._Z21hi16msbon_low16msboffv:
LFB..1:
    lis 3,0x8765
    ori 3,3,0x4321
    rldicl 3,3,0,32
    blr
[Bug c++/93448] New: PPC: missing builtin for DFP quantize(dqua,dquai,dquaq,dquaiq)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93448

Bug ID: 93448
Summary: PPC: missing builtin for DFP quantize(dqua,dquai,dquaq,dquaiq)
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

I am currently porting an application to PPCLE and found that I am lacking compiler builtins for decimal floating point quantize on _Decimal128/_Decimal64. Any plans to add them? Any workarounds? E.g. did I miss an inline asm constraint for a _Decimal128 register even/odd pair?
[Bug target/93449] New: PPC: Missing conversion builtin from vector to _Decimal128 and vice versa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93449

Bug ID: 93449
Summary: PPC: Missing conversion builtin from vector to _Decimal128 and vice versa
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

I am currently porting an application from AIX to PPCLE and found that I am lacking compiler builtins for transforming vector input into _Decimal128 and vice versa. This can be done on PowerPC using xxlor + xxpermdi (2 instructions on >= Power7). The conversion routines __builtin_denbcdq/__builtin_ddedpdq are based on _Decimal128 input and output. The missing piece is vector to _Decimal128 and vice versa.

Background: From and to vector registers I can load/store variable-length BCD decimals. A BCD decimal can be converted to _Decimal128; then I can perform multiply and divide; then it can be converted back to BCD. But then again I want to store a BCD decimal, which might not be 16 bytes in size.
[Bug target/93453] New: PPC: rldimi not taken into account to avoid shift+or
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93453

Bug ID: 93453
Summary: PPC: rldimi not taken into account to avoid shift+or
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

2 samples:

unsigned long long load8r(unsigned long long *in)
{
    return __builtin_bswap64(*in);
}
unsigned long long rldimi(unsigned int hi, unsigned int lo)
{
    return (((unsigned long long)hi) << 32) | ((unsigned long long)lo);
}

Command line: gcc -maix64 -mcpu=power6 -save-temps -O2 rldimi.C

Even when the number ranges are known not to cause conflicts, shift+or does not get replaced by rldimi.

Output:

._Z6load8rPy:
LFB..0:
    addi 9,3,4
    lwbrx 3,0,3
    lwbrx 10,0,9
    sldi 10,10,32
    or 3,3,10
    blr
._Z6rldimijj:
LFB..1:
    sldi 3,3,32
    or 3,3,4
    blr
[Bug target/93449] PPC: Missing conversion builtin from vector to _Decimal128 and vice versa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93449

--- Comment #2 from Jens Seifert ---

#include <string.h>

typedef float _Decimal128 __attribute__((mode(TD)));

_Decimal128 bcdtodpd(vector double v)
{
    _Decimal128 res;
    memcpy(&res, &v, sizeof(res));
    res = __builtin_denbcdq(0, res);
    return res;
}

_Decimal128 bcdtodpd_opt(vector double bcd)
{
    _Decimal128 res;
    __asm__ volatile("xxlor 4,%x1,%x1;\n"
                     "xxpermdi 5,%x1,%x1,3;\n"
                     "denbcdq 0,%0,4":"=d"(res):"v"(bcd):"vs36","vs37");
    return res;
}

vector double dpdtobcd(_Decimal128 dpd)
{
    _Decimal128 bcd = __builtin_ddedpdq(0, dpd);
    vector double res;
    memcpy(&res, &bcd, sizeof(res));
    return res;
}

vector double dpdtobcd_opt(_Decimal128 dpd)
{
    vector double res;
    __asm__ volatile("ddedpdq 0,4,%1;\n"
                     "xxpermdi %x0,4,5,0":"=v"(res):"d"(dpd):"vs36","vs37");
    return res;
}

The non-inline-assembly versions show a store/load (very slow). The assembly versions do the conversion from vector to _Decimal128 with the optimal sequence for Power7 and above.
[Bug target/93448] PPC: missing builtin for DFP quantize(dqua,dquai,dquaq,dquaiq)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93448

--- Comment #4 from Jens Seifert ---
The inline asm constraint "d" works. Thank you.
[Bug target/93449] PPC: Missing conversion builtin from vector to _Decimal128 and vice versa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93449

--- Comment #4 from Jens Seifert ---
Power8 has bcdadd, which can only be combined with _Decimal128 if you have some kind of conversion between BCDs stored in vector registers and _Decimal128. On Power9 vec_load_len/vec_store_len can be used to load variable-length BCDs. On Power7/8 I can load variable-length BCDs as well (with more instructions), but overall it is desirable to have the possibility to convert vector to _Decimal128 and vice versa.

I suppose I can survive with inline assembly like that in comment #2. The assembly works for p7-p9 with optimal speed. The memcpy between vector and _Decimal128 is not optimal for -mcpu=power7-9: always a store/load (lacking XNOP), ending up in a load-hit-store issue.
[Bug target/93570] New: PPC: __builtin_mtfsf does not return a value
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93570

Bug ID: 93570
Summary: PPC: __builtin_mtfsf does not return a value
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Documentation says:

double __builtin_mtfsf(const int, double)

Not documented in 8.3.0, but somehow works; nevertheless it looks like the prototype is wrong and should be

void __builtin_mtfsf(const int, double)

double mtfsf(double x)
{
    return __builtin_mtfsf(0xFF, x);
}

returns:

flm.C:9:34: error: void value not ignored as it ought to be
     return __builtin_mtfsf(0xFF, x);

Is __builtin_mtfsf returning void? Is it safe to use __builtin_mtfsf with 8.3.0?
[Bug target/93571] New: PPC: fmr gets used instead of faster xxlor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93571

Bug ID: 93571
Summary: PPC: fmr gets used instead of faster xxlor
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

fmr is a 6-cycle instruction on Power8. Why is gcc not using the 2-cycle xxlor instruction?

Input:

double setflm(double x)
{
    double r = __builtin_mffs();
    __builtin_mtfsf(0xFF, x);
    return r;
}

Command line: gcc -maix64 -O2 -save-temps flm.C -mcpu=power8

Output:

._Z6setflmd:
LFB..0:
    mffs 0
    mtfsf 255,1
    fmr 1,0
    blr
[Bug target/70928] Load simple float constants via VSX operations on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70928

Jens Seifert changed:
  What        |Removed     |Added
  CC          |            |jens.seifert at de dot ibm.com

--- Comment #4 from Jens Seifert ---

Values -16.0..+15.0:
vspltisw 0,<n>
xvcvsxwdp 32,32

Values -16.0f..+15.0f:
vspltisw 0,<n>
xvcvsxwsp 32,32

-0.0 / 0x8000000000000000:
xxlxor 32,32,32
xvnabsdp 32,32 (or xvnegdp 32,32)

-0.0f / 0x80000000:
xxlxor 32,32,32
xvnabssp 32,32 (or xvnegsp 32,32)

0x7FFFFFFFFFFFFFFF:
vspltisw 0,-1
xvabsdp 32,32

0x7FFFFFFF:
vspltisw 0,-1
xvabssp 32,32
[Bug target/98020] New: PPC: mfvsrwz+extsw not merged to mtvsrwa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98020

Bug ID: 98020
Summary: PPC: mfvsrwz+extsw not merged to mtvsrwa
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

int extract(vector signed int v)
{
    return v[2];
}

Command line: gcc -mcpu=power8 -maltivec -m64 -O3 -save-temps extract.C

Output:

_Z7extractDv4_i:
.LFB0:
    .cfi_startproc
    mfvsrwz 3,34
    extsw 3,3
    blr
[Bug target/98124] New: Z: Load and test LTDBR instruction not used for comparison against 0.0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98124

Bug ID: 98124
Summary: Z: Load and test LTDBR instruction not used for comparison against 0.0
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

#include <math.h>

double sign(double in)
{
    return in == 0.0 ? 0.0 : copysign(1.0, in);
}

Command line: gcc -m64 -O2 -save-temps copysign.C

Output:

_Z4signd:
.LFB234:
    .cfi_startproc
    larl %r5,.L8
    lzdr %f2
    cdbr %f0,%f2
    je .L6
    ld %f2,.L9-.L8(%r5)
    cpsdr %f0,%f0,%f2
    br %r14

Use of LTDBR expected instead of lzdr %f2 + cdbr %f0,%f2.
[Bug target/98020] PPC: mfvsrwz+extsw not merged to mtvsrwa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98020

Jens Seifert changed:
  What        |Removed     |Added
  Status      |WAITING     |RESOLVED
  Resolution  |---         |INVALID

--- Comment #2 from Jens Seifert ---
I thought they are symmetric.
[Bug target/100693] New: PPC: missing 64-bit addg6s
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100693

Bug ID: 100693
Summary: PPC: missing 64-bit addg6s
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

gcc only provides

unsigned int __builtin_addg6s (unsigned int, unsigned int);

but addg6s is a 64-bit operation. I require

unsigned long long __builtin_addg6s (unsigned long long, unsigned long long);

For now I use inline assembly.
[Bug target/100694] New: PPC: initialization of __int128 is very inefficient
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100694

Bug ID: 100694
Summary: PPC: initialization of __int128 is very inefficient
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Initializing an __int128 from two 64-bit integers is implemented very inefficiently. The most natural code, which works well on all other platforms, generates 2 additional li 0 and 2 additional or instructions.

void test2(unsigned __int128* res, unsigned long long hi, unsigned long long lo)
{
    unsigned __int128 i = hi;
    i <<= 64;
    i |= lo;
    *res = i;
}

_Z5test2Poyy:
.LFB15:
    .cfi_startproc
    li 8,0
    li 11,0
    or 10,5,8
    or 11,11,4
    std 10,0(3)
    std 11,8(3)
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0
    .cfi_endproc

While for the above sample "+" instead of "|" solves the issue, it generates addc+addze in other more complicated scenarios. The ugliest workaround I can think of is what I now use:

void test4(unsigned __int128* res, unsigned long long hi, unsigned long long lo)
{
    union
    {
        unsigned __int128 i;
        struct
        {
            unsigned long long lo;
            unsigned long long hi;
        } s;
    } u;
    u.s.lo = lo;
    u.s.hi = hi;
    *res = u.i;
}

This generates the expected code sequence in all cases I have looked at:

_Z5test4Poyy:
.LFB17:
    .cfi_startproc
    std 5,0(3)
    std 4,8(3)
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0
    .cfi_endproc

Please merge li 0 + or into a nop.
[Bug c/100808] New: PPC: ISA 3.1 builtin documentation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100808

Bug ID: 100808
Summary: PPC: ISA 3.1 builtin documentation
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

https://gcc.gnu.org/onlinedocs/gcc/Basic-PowerPC-Built-in-Functions-Available-on-ISA-3_002e1.html#Basic-PowerPC-Built-in-Functions-Available-on-ISA-3_002e1

Please improve the documentation:
- Avoid the additional "int": unsigned long long int => unsigned long long
- Add missing line breaks between builtins
- Remove semicolons
[Bug c++/100809] New: PPC: __int128 divide/modulo does not use P10 instructions vdivsq/vdivuq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100809

Bug ID: 100809
Summary: PPC: __int128 divide/modulo does not use P10 instructions vdivsq/vdivuq
Product: gcc
Version: 10.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

unsigned __int128 div(unsigned __int128 a, unsigned __int128 b)
{
    return a/b;
}

__int128 div(__int128 a, __int128 b)
{
    return a/b;
}

gcc -mcpu=power10 -save-temps -O2 int128.C

Output:

_Z3divoo:
.LFB0:
    .cfi_startproc
    .localentry _Z3divoo,1
    mflr 0
    std 0,16(1)
    stdu 1,-32(1)
    .cfi_def_cfa_offset 32
    .cfi_offset 65, 16
    bl __udivti3@notoc
    addi 1,1,32
    .cfi_def_cfa_offset 0
    ld 0,16(1)
    mtlr 0
    .cfi_restore 65
    blr
    .long 0
    .byte 0,9,0,1,128,0,0,0
    .cfi_endproc
.LFE0:
    .size _Z3divoo,.-_Z3divoo
    .globl __divti3
    .align 2
    .p2align 4,,15
    .globl _Z3divnn
    .type _Z3divnn, @function
_Z3divnn:
.LFB1:
    .cfi_startproc
    .localentry _Z3divnn,1
    mflr 0
    std 0,16(1)
    stdu 1,-32(1)
    .cfi_def_cfa_offset 32
    .cfi_offset 65, 16
    bl __divti3@notoc
    addi 1,1,32
    .cfi_def_cfa_offset 0
    ld 0,16(1)
    mtlr 0
    .cfi_restore 65
    blr
    .long 0
    .byte 0,9,0,1,128,0,0,0
    .cfi_endproc

Expected is the use of vdivsq/vdivuq.

GCC version:

/opt/rh/devtoolset-10/root/usr/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/opt/rh/devtoolset-10/root/usr/bin/gcc
COLLECT_LTO_WRAPPER=/opt/rh/devtoolset-10/root/usr/libexec/gcc/ppc64le-redhat-linux/10/lto-wrapper
Target: ppc64le-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/opt/rh/devtoolset-10/root/usr --mandir=/opt/rh/devtoolset-10/root/usr/share/man --infodir=/opt/rh/devtoolset-10/root/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-targets=powerpcle-linux --disable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --with-default-libstdcxx-abi=gcc4-compatible --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-10.2.1-20200804/obj-ppc64le-redhat-linux/isl-install --disable-libmpx --enable-gnu-indirect-function --enable-secureplt --with-long-double-128 --with-cpu-32=power8 --with-tune-32=power8 --with-cpu-64=power8 --with-tune-64=power8 --build=ppc64le-redhat-linux
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.2.1 20200804 (Red Hat 10.2.1-2) (GCC)
[Bug c++/100809] PPC: __int128 divide/modulo does not use P10 instructions vdivsq/vdivuq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100809

--- Comment #1 from Jens Seifert ---
Same applies to modulo.
[Bug c/100808] PPC: ISA 3.1 builtin documentation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100808

--- Comment #1 from Jens Seifert ---
https://gcc.gnu.org/onlinedocs/gcc/PowerPC-AltiVec-Built-in-Functions-Available-on-ISA-3_002e1.html

vector unsigned long long int vec_gnb (vector unsigned __int128, const unsigned char)

should be

unsigned long long int vec_gnb (vector unsigned __int128, const unsigned char)

The vgnb instruction returns its result in a GPR.
[Bug target/100866] New: PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

Bug ID: 100866
Summary: PPC: Inefficient code for vec_revb(vector unsigned short) < P9
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

vector unsigned short revb(vector unsigned short a)
{
    return vec_revb(a);
}

creates:

_Z4revbDv8_t:
.LFB1:
    .cfi_startproc
.LCF1:
0:  addis 2,12,.TOC.-.LCF1@ha
    addi 2,2,.TOC.-.LCF1@l
    .localentry _Z4revbDv8_t,.-_Z4revbDv8_t
    addis 9,2,.LC1@toc@ha
    addi 9,9,.LC1@toc@l
    lvx 0,0,9
    xxlnor 32,32,32
    vperm 2,2,2,0
    blr

Optimal code sequence:

vector unsigned short revb_pwr7(vector unsigned short a)
{
    return vec_rl(a, vec_splats((unsigned short)8));
}

_Z9revb_pwr7Dv8_t:
.LFB2:
    .cfi_startproc
    .localentry _Z9revb_pwr7Dv8_t,1
    vspltish 0,8
    vrlh 2,2,0
    blr
[Bug target/100867] New: z13: Inefficient code for vec_revb(vector unsigned short)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100867

Bug ID: 100867
Summary: z13: Inefficient code for vec_revb(vector unsigned short)
Product: gcc
Version: 10.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

vector unsigned short revb(vector unsigned short a)
{
    return vec_revb(a);
}

Creates:

_Z4revbDv4_j:
.LFB1:
    .cfi_startproc
    larl %r5,.L4
    vl %v0,.L5-.L4(%r5),3
    vperm %v24,%v24,%v24,%v0
    br %r14

Optimal code sequence:

vector unsigned short revb_z13(vector unsigned short a)
{
    return vec_rli(a, 8);
}

Creates:

_Z8revb_z13Dv8_t:
.LFB5:
    .cfi_startproc
    verllh %v24,%v24,8
    br %r14
    .cfi_endproc
[Bug target/100868] New: PPC: Inefficient code for vec_reve(vector double)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100868

Bug ID: 100868
Summary: PPC: Inefficient code for vec_reve(vector double)
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

vector double reve(vector double a)
{
    return vec_reve(a);
}

creates:

_Z4reveDv2_d:
.LFB3:
    .cfi_startproc
.LCF3:
0:  addis 2,12,.TOC.-.LCF3@ha
    addi 2,2,.TOC.-.LCF3@l
    .localentry _Z4reveDv2_d,.-_Z4reveDv2_d
    addis 9,2,.LC2@toc@ha
    addi 9,9,.LC2@toc@l
    lvx 0,0,9
    xxlnor 32,32,32
    vperm 2,2,2,0
    blr

The optimal sequence would be:

vector double reve_pwr7(vector double a)
{
    return vec_xxpermdi(a,a,2);
}

which creates:

_Z9reve_pwr7Dv2_d:
.LFB4:
    .cfi_startproc
    xxpermdi 34,34,34,2
    blr
[Bug target/100869] New: z13: Inefficient code for vec_reve(vector double)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100869

Bug ID: 100869
Summary: z13: Inefficient code for vec_reve(vector double)
Product: gcc
Version: 10.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

vector double reve(vector double a)
{
    return vec_reve(a);
}

creates:

_Z4reveDv2_d:
.LFB3:
    .cfi_startproc
    larl %r5,.L12
    vl %v0,.L13-.L12(%r5),3
    vperm %v24,%v24,%v24,%v0
    br %r14

Optimal code sequence:

vector double reve_z13(vector double a)
{
    return vec_permi(a,a,2);
}

creates:

_Z6reve_2Dv2_d:
.LFB6:
    .cfi_startproc
    vpdi %v24,%v24,%v24,4
    br %r14
    .cfi_endproc
[Bug target/100871] New: z14: vec_doublee maps to wrong builtin in vecintrin.h
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100871

Bug ID: 100871
Summary: z14: vec_doublee maps to wrong builtin in vecintrin.h
Product: gcc
Version: 10.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

#include <vecintrin.h>

vector double doublee(vector float a)
{
    return vec_doublee(a);
}

causes a compile error:

vec.C: In function '__vector(2) double doublee(__vector(4) float)':
vec.C:43:10: error: '__builtin_s390_vfll' was not declared in this scope; did you mean '__builtin_s390_vflls'?
   43 |    return vec_doublee(a);
      |           ^~~~
      |           __builtin_s390_vflls

vec_doublee in vecintrin.h should call __builtin_s390_vflls:

vector double doublee_fix(vector float a)
{
    return __builtin_s390_vflls(a);
}
[Bug target/100808] PPC: ISA 3.1 builtin documentation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100808

--- Comment #3 from Jens Seifert ---
> > - Avoid the additional "int": unsigned long long int => unsigned long long
> Why? Those are exactly the same types!

Yes, but the rest of the documentation uses unsigned long long. This is just for consistency with the existing documentation.
[Bug target/100926] New: PPCLE: Inefficient code for vec_xl_be(unsigned short *) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100926

Bug ID: 100926
Summary: PPCLE: Inefficient code for vec_xl_be(unsigned short *) < P9
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

vector unsigned short load_be(unsigned short *c)
{
    return vec_xl_be(0L, c);
}

creates:

_Z7load_bePt:
.LFB6:
    .cfi_startproc
.LCF6:
0:  addis 2,12,.TOC.-.LCF6@ha
    addi 2,2,.TOC.-.LCF6@l
    .localentry _Z7load_bePt,.-_Z7load_bePt
    addis 9,2,.LC4@toc@ha
    lxvw4x 34,0,3
    addi 9,9,.LC4@toc@l
    lvx 0,0,9
    vperm 2,2,2,0
    blr

Optimal sequence:

vector unsigned short load_be_opt2(unsigned short *c)
{
    vector signed int vneg16;
    __asm__("vspltisw %0,-16":"=v"(vneg16));
    vector unsigned int tmp = vec_xl_be(0L, (unsigned int *)c);
    tmp = vec_rl(tmp, (vector unsigned int)vneg16);
    return (vector unsigned short)tmp;
}

creates:

_Z12load_be_opt2Pt:
.LFB8:
    .cfi_startproc
    lxvw4x 34,0,3
#APP
 # 77 "vec.C" 1
    vspltisw 0,-16
 # 0 "" 2
#NO_APP
    vrlw 2,2,0
    blr

rotate left (-16) = rotate right (+16), as only the low 5 bits of the count are evaluated. Please note that the inline assembly is required because vec_splats(-16) gets converted into a very inefficient constant generation:

vector unsigned short load_be_opt(unsigned short *c)
{
    vector signed int vneg16 = vec_splats(-16);
    vector unsigned int tmp = vec_xl_be(0L, (unsigned int *)c);
    tmp = vec_rl(tmp, (vector unsigned int)vneg16);
    return (vector unsigned short)tmp;
}

creates:

_Z11load_be_optPt:
.LFB7:
    .cfi_startproc
    li 9,48
    lxvw4x 34,0,3
    vspltisw 0,0
    mtvsrd 33,9
    xxspltw 33,33,1
    vsubuwm 0,0,1
    vrlw 2,2,0
    blr
[Bug target/100930] New: PPC: Missing builtins for P9 vextsb2w, vextsb2d, vextsh2w, vextsh2d, vextsw2d
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100930

Bug ID: 100930
Summary: PPC: Missing builtins for P9 vextsb2w, vextsb2d, vextsh2w, vextsh2d, vextsw2d
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Using the same names as xlC would be appreciated: vec_extsbd, vec_extsbw, vec_extshd, vec_extshw, vec_extswd.
[Bug target/101041] New: z13: Inefficient handling of vector register passed to function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101041

Bug ID: 101041
Summary: z13: Inefficient handling of vector register passed to function
Product: gcc
Version: 8.3.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

#include <vecintrin.h>

vector unsigned long long mul64(vector unsigned long long a, vector unsigned long long b)
{
    return a * b;
}

creates:

_Z5mul64Dv2_yS_:
.LFB9:
    .cfi_startproc
    ldgr %f4,%r15
    .cfi_register 15, 18
    lay %r15,-192(%r15)
    .cfi_def_cfa_offset 352
    vst %v24,160(%r15),3
    vst %v26,176(%r15),3
    lg %r2,160(%r15)
    lg %r1,176(%r15)
    lgr %r4,%r2
    lg %r0,168(%r15)
    lgr %r2,%r1
    lg %r1,184(%r15)
    lgr %r5,%r0
    lgr %r3,%r1
    vlvgp %v2,%r4,%r5
    vlvgp %v0,%r2,%r3
    vlgvg %r4,%v2,0
    vlgvg %r1,%v2,1
    vlgvg %r2,%v0,0
    vlgvg %r3,%v0,1
    msgr %r2,%r4
    msgr %r1,%r3
    lgdr %r15,%f4
    .cfi_restore 15
    .cfi_def_cfa_offset 160
    vlvgp %v24,%r2,%r1
    br %r14

It stores v24 and v26 to the stack, then does lg+lgr for all 4 parts, then constructs new vector registers v0 and v2, and then extracts the 4 elements again using vlgvg. Expected: 4 * vlgvg + 2 * msgr + vlvgp.
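A sketch of the expected scalarization (my illustration, using GNU C vector subscripting and initialization rather than any particular builtin; assumes <vecintrin.h> and -mzvector as in the report):

/* Extract both lanes, multiply in GPRs, repack:
   4 x vlgvg + 2 x msgr + 1 x vlvgp. */
static vector unsigned long long mul64_scalarized(vector unsigned long long a,
                                                  vector unsigned long long b)
{
    unsigned long long e0 = a[0] * b[0];  /* vlgvg, vlgvg, msgr */
    unsigned long long e1 = a[1] * b[1];  /* vlgvg, vlgvg, msgr */
    return (vector unsigned long long){e0, e1};  /* vlvgp */
}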
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #7 from Jens Seifert ---
Regarding vec_revb for vector unsigned int: I agree that

revb:
.LFB0:
    .cfi_startproc
    vspltish %v1,8
    vspltisw %v0,-16
    vrlh %v2,%v2,%v1
    vrlw %v2,%v2,%v0
    blr

works. But in this case I would prefer the vperm approach, assuming that the loaded constant for the permute vector can be re-used multiple times. But please get rid of the xxlnor 32,32,32; that does not make sense after loading a constant. Change the constant that needs to be loaded.
[Bug target/100866] PPC: Inefficient code for vec_revb(vector unsigned short) < P9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100866

--- Comment #9 from Jens Seifert ---
I know that if I used the vec_perm builtin as an end user, you would then need to fulfill the LE specification, but you can always optimize the code as you like as long as it creates correct results afterwards. load constant + xxlnor of that constant can always be transformed into loading the inverse constant.
[Bug target/108396] New: PPCLE: vec_vsubcuq missing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108396

Bug ID: 108396
Summary: PPCLE: vec_vsubcuq missing
Product: gcc
Version: 12.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Input:

#include <altivec.h>

vector unsigned __int128 vsubcuq(vector unsigned __int128 a, vector unsigned __int128 b)
{
    return vec_vsubcuq(a, b);
}

Command line: gcc -m64 -O2 -maltivec -mcpu=power8 text.C

Output:

<source>: In function '__vector unsigned __int128 vsubcuq(__vector unsigned __int128, __vector unsigned __int128)':
<source>:6:12: error: 'vec_vsubcuq' was not declared in this scope; did you mean 'vec_vsubcuqP'?
    6 |   return vec_vsubcuq(a, b);
      |          ^~~
      |          vec_vsubcuqP
Compiler returned: 1
[Bug c++/108560] New: __builtin_va_arg_pack_len is documented to return size_t, but actually returns int
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108560

Bug ID: 108560
Summary: __builtin_va_arg_pack_len is documented to return size_t, but actually returns int
Product: gcc
Version: 12.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

#include <stddef.h>

bool test(const char *fmt, size_t numTokens, ...)
{
    return __builtin_va_arg_pack_len() != numTokens;
}

Compiled with -Wsign-compare results in:

<source>: In function 'bool test(const char*, size_t, ...)':
<source>:5:40: warning: comparison of integer expressions of different signedness: 'int' and 'size_t' {aka 'long unsigned int'} [-Wsign-compare]
    5 |   return __builtin_va_arg_pack_len() != numTokens;
      |          ^~~~
<source>:5:37: error: invalid use of '__builtin_va_arg_pack_len ()'
    5 |   return __builtin_va_arg_pack_len() != numTokens;
      |          ~^~
Compiler returned: 1

Documentation: https://gcc.gnu.org/onlinedocs/gcc/Constructing-Calls.html indicates a size_t return type:

Built-in Function: size_t __builtin_va_arg_pack_len ()
[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #4 from Jens Seifert ---
PPCLE with no special option means -mcpu=power8 -maltivec (altivecle to be more precise). vec_promote(, 1) should be a no-op on ppcle, but the value gets splatted to both the left and right part of the vector register => 2 unnecessary xxpermdi. The rest of the operations are done on the left and right parts. vec_extract(, 1) should be a no-op on ppcle, but the value gets taken from the right part of the register, which requires an xxpermdi. Overall 3 unnecessary xxpermdi. I don't know why the right part of the register gets "preferred".
[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #6 from Jens Seifert ---
The left parts of the VSX registers overlap with the floating point registers; that is why no xxpermdi is required and mfvsrd can access all (left) parts of VSX registers directly. The xxpermdi x,y,y,3 indicates to me that gcc prefers the right part of the register, which might also cause the xxpermdi at the beginning. In the end the mystery is why gcc adds 3 xxpermdi to the code.
[Bug target/106525] New: s390: Inefficient branchless conditionals for unsigned long long
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106525

Bug ID: 106525
Summary: s390: Inefficient branchless conditionals for unsigned long long
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Created attachment 53409
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53409&action=edit
source code

1) -(a > b)

clgr %r2,%r3
lhi %r2,0
alcr %r2,%r2
sllg %r2,%r2,63
srag %r2,%r2,63

The last 2 could be merged to LCDFR. But optimal is:

slgrk %r2,%r3,%r2
slbgr %r2,%r2
lgfr %r2,%r2

Note: lgfr is not required => 2 instructions only.

2) -(a <= b)

slgr %r3,%r2
lhi %r2,0
alcr %r2,%r2
sllg %r2,%r2,63
srag %r2,%r2,63

The last 2 could be merged to LCDFR. But optimal is:

clgr %r2,%r3
slbgr %r2,%r2
lgfr %r2,%r2

Note: lgfr is not required => 2 instructions only.

3) unsigned 64-bit compare for qsort: (a > b) - (a < b)

clgr %r2,%r3
lhi %r1,0
alcr %r1,%r1
clgr %r3,%r2
lhi %r2,0
alcr %r2,%r2
srk %r2,%r1,%r2
lgfr %r2,%r2

Optimal:

slgrk %r1,%r2,%r3
slgrk 0,%r3,%r2
slbgr %r2,%r3
slbgr %r1,%r2
lgfr %r2,%r1

Note: lgfr not required => 4 instructions only.
[Bug target/106536] New: P9: gcc does not detect setb pattern
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106536

Bug ID: 106536
Summary: P9: gcc does not detect setb pattern
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

int compare2(unsigned long long a, unsigned long long b)
{
    return (a > b ? 1 : (a < b ? -1 : 0));
}

Output:

_Z8compare2yy:
    cmpld 0,3,4
    bgt 0,.L5
    mfcr 3,128
    rlwinm 3,3,1,1
    neg 3,3
    blr
.L5:
    li 3,1
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0

clang generates:

_Z8compare2yy: # @_Z8compare2yy
    cmpld 3, 4
    setb 3, 0
    extsw 3, 3
    blr
    .long 0
    .quad 0
[Bug target/106592] New: s390: Inefficient branchless conditionals for long long
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106592

Bug ID: 106592
Summary: s390: Inefficient branchless conditionals for long long
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Created attachment 53443
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53443&action=edit
source code

long long gtRef(long long a, long long b)
{
    return a > b;
}

Generates:

cgr %r2,%r3
lghi %r1,0
lghi %r2,1
locgrnh %r2,%r1

Better sequence:

cgr %r2,%r3
lghi %r2,0
alcgr %r2,%r2

long long leMaskRef(long long a, long long b)
{
    return -(a <= b);
}

Generates:

cgr %r2,%r3
lhi %r1,0
lhi %r2,1
locrnle %r2,%r1
sllg %r2,%r2,63
srag %r2,%r2,63

Better sequence:

cgr %r2,%r3
slbgr %r2,%r2

long long gtMaskRef(long long a, long long b)
{
    return -(a > b);
}

Generates:

cgr %r2,%r3
lhi %r1,0
lhi %r2,1
locrnh %r2,%r1
sllg %r2,%r2,63
srag %r2,%r2,63

Better sequence:

cgr %r2,%r3
lghi %r2,0
alcgr %r2,%r2
lcgr %r2,%r2
[Bug target/106598] New: s390: Inefficient branchless conditionals for int
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106598

Bug ID: 106598
Summary: s390: Inefficient branchless conditionals for int
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

int lt(int a, int b)
{
    return a < b;
}

generates:

cr %r2,%r3
lhi %r1,1
lhi %r2,0
locrnl %r1,%r2
lgfr %r2,%r1
br %r14

int ltOpt(int a, int b)
{
    long long x = a;
    long long y = b;
    return ((unsigned long long)(x - y)) >> 63;
}

better:

sgr %r2,%r3
srlg %r2,%r2,63
br %r14

int ltMask(int a, int b)
{
    return -(a < b);
}

generates:

cr %r2,%r3
lhi %r1,1
lhi %r2,0
locrnl %r1,%r2
sllg %r1,%r1,63
srag %r2,%r1,63

int ltMaskOpt(int a, int b)
{
    long long x = a;
    long long y = b;
    return (x - y) >> 63;
}

better:

sgr %r2,%r3
srag %r2,%r2,63
br %r14

int leMask(int a, int b)
{
    return -(a <= b);
}

generates:

cr %r2,%r3
lhi %r1,1
lhi %r2,0
locrnle %r1,%r2
sllg %r1,%r1,63
srag %r2,%r1,63
br %r14

int leMaskOpt(int a, int b)
{
    int c;
    __asm__("cr %1,%2\n\tslbgr %0,%0":"=r"(c):"r"(a),"r"(b):"cc");
    // slbgr creates a 64-bit mask => lgfr would not be required
    return c;
}

better:

cr %r2,%r3
slbgr %r2,%r2
lgfr %r2,%r2 <= not necessary
br %r14

int le(int a, int b)
{
    return a <= b;
}

generates:

cr %r2,%r3
lhi %r1,1
lhi %r2,0
locrnle %r1,%r2
lgfr %r2,%r1
br %r14

int leOpt(int a, int b)
{
    unsigned long long c;
    __asm__("cr %1,%2\n\tslbgr %0,%0":"=r"(c):"r"(a),"r"(b):"cc");
    return (c >> 63);
}

better:

cr %r2,%r3
slbgr %r2,%r2
srlg %r2,%r2,63
br %r14
[Bug target/106701] New: s390: Compiler does not take into account number range limitation to avoid subtract from immediate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106701

Bug ID: 106701
Summary: s390: Compiler does not take into account number range limitation to avoid subtract from immediate
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

unsigned long long subfic(unsigned long long a)
{
    if (a > 15) __builtin_unreachable();
    return 15 - a;
}

With clang on x86, subtract from immediate gets translated to xor:

_Z6subficy: # @_Z6subficy
    mov rax, rdi
    xor rax, 15
    ret

Platforms like s390 and x86, which have no subtract-from-immediate, would benefit from this optimization. gcc currently generates:

_Z6subficy:
    lghi %r1,15
    sgr %r1,%r2
    lgr %r2,%r1
    br %r14
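Why the xor is valid (my sketch, not from the report): when a <= 15, every set bit of a lies inside the mask 15, so the subtraction borrows nothing and 15 - a equals 15 ^ a. A quick exhaustive check:

#include <assert.h>

int main(void)
{
    /* For 0 <= a <= 15 no borrow crosses bit 3, so subtracting from the
       all-ones mask 15 is the same as complementing within the mask. */
    for (unsigned long long a = 0; a <= 15; a++)
        assert(15 - a == (15 ^ a));
    return 0;
}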
[Bug target/106769] New: PPCLE: vec_extract(vector unsigned int) unnecessary rldicl after mfvsrwz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106769

Bug ID: 106769
Summary: PPCLE: vec_extract(vector unsigned int) unnecessary rldicl after mfvsrwz
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

#include <altivec.h>

unsigned int extr(vector unsigned int v)
{
    return vec_extract(v, 2);
}

Generates:

_Z4extrDv4_j:
.LFB1:
    .cfi_startproc
    mfvsrwz 3,34
    rldicl 3,3,0,32
    blr
    .long 0
    .byte 0,9,0,0,0,0,0,0
    .cfi_endproc

The rldicl is not necessary, as mfvsrwz already wiped out the upper 32 bits of the register.
[Bug target/106770] New: PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

Bug ID: 106770
Summary: PPCLE: Unnecessary xxpermdi before mfvsrd
Product: gcc
Version: 11.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

#include <altivec.h>

int cmp2(double a, double b)
{
    vector double va = vec_promote(a, 1);
    vector double vb = vec_promote(b, 1);
    vector long long vlt = (vector long long)vec_cmplt(va, vb);
    vector long long vgt = (vector long long)vec_cmplt(vb, va);
    vector signed long long vr = vec_sub(vlt, vgt);
    return vec_extract(vr, 1);
}

Generates:

_Z4cmp2dd:
.LFB1:
    .cfi_startproc
    xxpermdi 1,1,1,0
    xxpermdi 2,2,2,0
    xvcmpgtdp 33,2,1
    xvcmpgtdp 32,1,2
    vsubudm 0,1,0
    xxpermdi 0,32,32,3
    mfvsrd 3,0
    extsw 3,3
    blr

The unnecessary xxpermdi for vec_promote are already reported in another bugzilla case. mfvsrd can access all 64 vector registers directly, so the xxpermdi is not required:

mfvsrd 3,32

is expected instead of

xxpermdi 0,32,32,3 + mfvsrd 3,0
[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #1 from Jens Seifert ---
vec_extract(vr, 1) should extract the left element, but xxpermdi x,x,x,3 extracts the right element. Looks like a bug in vec_extract for PPCLE, not a problem of an unnecessary xxpermdi.
[Bug target/106770] PPCLE: Unnecessary xxpermdi before mfvsrd
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106770

--- Comment #2 from Jens Seifert ---
vec_extract(vr, 1) should extract the left element, but xxpermdi x,x,x,3 extracts the right element. Looks like a bug in vec_extract for PPCLE, not a problem of an unnecessary xxpermdi.

Using assembly for the subtract:

int cmp3(double a, double b)
{
    vector double va = vec_promote(a, 0);
    vector double vb = vec_promote(b, 0);
    vector long long vlt = (vector long long)vec_cmplt(va, vb);
    vector long long vgt = (vector long long)vec_cmplt(vb, va);
    vector signed long long vr;
    __asm__ volatile("vsubudm %0,%1,%2":"=v"(vr):"v"(vlt),"v"(vgt):);
    //vector signed long long vr = vec_sub(vlt, vgt);
    return vec_extract(vr, 1);
}

generates:

_Z4cmp3dd:
.LFB2:
    .cfi_startproc
    xxpermdi 1,1,1,0
    xxpermdi 2,2,2,0
    xvcmpgtdp 32,2,1
    xvcmpgtdp 33,1,2
#APP
 # 34 "cmpdouble.C" 1
    vsubudm 0,0,1
 # 0 "" 2
#NO_APP
    mfvsrd 3,32
    extsw 3,3

Looks like the compiler knows about vec_promote doing a splat, and at the end it extracts the non-preferred right element instead of the expected left element.
[Bug target/104268] New: 390: inefficient vec_popcnt for 16-bit for z13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104268 Bug ID: 104268 Summary: 390: inefficient vec_popcnt for 16-bit for z13 Product: gcc Version: 10.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

#include <vecintrin.h>

vector unsigned short popcnt(vector unsigned short a)
{
    return vec_popcnt(a);
}

Generates with -march=z13:

_Z6popcntDv8_t:
.LFB1:
        .cfi_startproc
        vzero   %v0
        vpopct  %v24,%v24,0
        vleib   %v0,8,7
        vsrlb   %v0,%v24,%v0
        vab     %v24,%v24,%v0
        vgbm    %v0,21845
        vn      %v24,%v24,%v0
        br      %r14
        .cfi_endproc

Optimal sequence would be:

vector unsigned short popcnt_opt(vector unsigned short a)
{
    vector unsigned short r = (vector unsigned short)vec_popcnt((vector unsigned char)a);
    vector unsigned short b = vec_rli(r, 8);
    r = r + b;
    r = r >> 8;
    return r;
}

_Z10popcnt_optDv8_t:
.LFB3:
        .cfi_startproc
        vpopct  %v24,%v24,0
        verllh  %v0,%v24,8
        vah     %v24,%v0,%v24
        vesrlh  %v24,%v24,8
        br      %r14
        .cfi_endproc
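A scalar model of the halfword reduction used by popcnt_opt (my illustration, not part of the report):

    #include <stdint.h>

    /* vpopct ...,0 gives per-byte counts; rotating each halfword by 8 and
       adding sums the two byte counts (at most 8 + 8 = 16, so no byte
       overflow); shifting right by 8 moves the sum into place. */
    uint16_t popcnt16_model(uint16_t x)
    {
        uint16_t r = (uint16_t)(__builtin_popcount(x & 0xFF)
                   | (__builtin_popcount(x >> 8) << 8));  /* per-byte counts */
        uint16_t b = (uint16_t)((r << 8) | (r >> 8));     /* verllh r,8 */
        r = (uint16_t)(r + b);                            /* vah */
        return (uint16_t)(r >> 8);                        /* vesrlh 8 */
    }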
[Bug target/103106] New: PPC: Missing builtin for P9 vmsumudm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103106 Bug ID: 103106 Summary: PPC: Missing builtin for P9 vmsumudm Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

I can't find a builtin for the vmsumudm instruction. I also found nothing in the Power vector intrinsic programming reference. https://openpowerfoundation.org/?resource_lib=power-vector-intrinsic-programming-reference
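Until a builtin exists, a hedged inline-asm workaround sketch (the wrapper name msumudm is mine, not a GCC builtin, and it assumes GCC's vector unsigned __int128 type and the "v" register constraint); vmsumudm computes the sum of the two unsigned doubleword products of a and b plus the 128-bit addend c:

    #include <altivec.h>

    static inline vector unsigned __int128
    msumudm(vector unsigned long long a, vector unsigned long long b,
            vector unsigned __int128 c)
    {
        vector unsigned __int128 r;
        __asm__("vmsumudm %0,%1,%2,%3" : "=v"(r) : "v"(a), "v"(b), "v"(c));
        return r;
    }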
[Bug target/103731] New: 390: inefficient 64-bit constant generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103731 Bug ID: 103731 Summary: 390: inefficient 64-bit constant generation Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

unsigned long long M8()
{
    return 0x;
}

Generates:

.LC0:
        .quad 0x
        .text
        .align 8
.globl _Z2M8v
        .type _Z2M8v, @function
_Z2M8v:
.LFB0:
        .cfi_startproc
        lgrl    %r2,.LC0
        br      %r14
        .cfi_endproc

Expected 2 instructions: load immediate + insert immediate (IIHF) instead of the load from the literal pool.
[Bug target/103743] New: PPC: Inefficient equality compare for large 64-bit constants having only 16-bit relevant bits in high part
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103743 Bug ID: 103743 Summary: PPC: Inefficient equality compare for large 64-bit constants having only 16-bit relevant bits in high part Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

int overflow();

int negOverflow(long long in)
{
    if (in == 0x8000000000000000LL)
    {
        return overflow();
    }
    return 0;
}

Generates:

negOverflow(long long):
        .quad .L.negOverflow(long long),.TOC.@tocbase,0
.L.negOverflow(long long):
        li 9,-1
        rldicr 9,9,0,0
        cmpd 0,3,9
        beq 0,.L10
        li 3,0
        blr
.L10:
        mflr 0
        std 0,16(1)
        stdu 1,-112(1)
        bl overflow()
        nop
        addi 1,1,112
        ld 0,16(1)
        mtlr 0
        blr
        .long 0
        .byte 0,9,0,1,128,0,0,0

Instead of:

        li 9,-1
        rldicr 9,9,0,0
        cmpd 0,3,9

expected output:

        rotldi 3,3,1
        cmpdi 0,3,1

This should only be applied if the relevant bits of the constant fit into 16 bits and those 16 bits lie in the upper 32 bits.
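A sketch of the suggested transformation in C (my illustration; overflow() is the extern declared in the report): rotating left by one maps the constant 0x8000000000000000 to 1, which fits the 16-bit signed immediate of cmpdi, and rotation is a bijection, so the compare stays exact.

    int overflow();

    int negOverflow_opt(long long in)
    {
        unsigned long long r = ((unsigned long long)in << 1)
                             | ((unsigned long long)in >> 63);  /* rotldi 3,3,1 */
        if (r == 1)                                             /* cmpdi 0,3,1 */
        {
            return overflow();
        }
        return 0;
    }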
[Bug target/102117] New: s390: Inefficient code for 64x64=128 signed multiply for <= z13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102117 Bug ID: 102117 Summary: s390: Inefficient code for 64x64=128 signed multiply for <= z13 Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

__int128 imul128(long long a, long long b)
{
    return (__int128)a * (__int128)b;
}

creates a sequence with 3 multiplies:

_Z7imul128xx:
.LFB0:
        .cfi_startproc
        ldgr    %f2,%r12
        .cfi_register 12, 17
        ldgr    %f0,%r13
        .cfi_register 13, 16
        lgr     %r13,%r3
        mlgr    %r12,%r4
        srag    %r1,%r3,63
        msgr    %r1,%r4
        srag    %r4,%r4,63
        msgr    %r4,%r3
        agr     %r4,%r1
        agr     %r12,%r4
        stmg    %r12,%r13,0(%r2)
        lgdr    %r13,%f0
        .cfi_restore 13
        lgdr    %r12,%f2
        .cfi_restore 12
        br      %r14
        .cfi_endproc

The following sequence only requires 1 multiply:

__int128 imul128_opt(long long a, long long b)
{
    unsigned __int128 x = (unsigned __int128)(unsigned long long)a;
    unsigned __int128 y = (unsigned __int128)(unsigned long long)b;
    unsigned long long t1 = (a >> 63) & a;
    unsigned long long t2 = (b >> 63) & b;
    unsigned __int128 u128 = x * y;
    unsigned long long hi = (u128 >> 64) - (t1 + t2);
    unsigned long long lo = (unsigned long long)u128;
    unsigned __int128 res = hi;
    res <<= 64;
    res |= lo;
    return (__int128)res;
}

_Z11imul128_optxx:
.LFB1:
        .cfi_startproc
        ldgr    %f2,%r12
        .cfi_register 12, 17
        ldgr    %f0,%r13
        .cfi_register 13, 16
        lgr     %r13,%r3
        mlgr    %r12,%r4
        lgr     %r1,%r3
        srag    %r3,%r3,63
        ngr     %r3,%r1
        srag    %r1,%r4,63
        ngr     %r4,%r1
        agr     %r3,%r4
        sgrk    %r3,%r12,%r3
        stg     %r13,8(%r2)
        lgdr    %r12,%f2
        .cfi_restore 12
        lgdr    %r13,%f0
        .cfi_restore 13
        stg     %r3,0(%r2)
        br      %r14
        .cfi_endproc
[Bug target/102117] s390: Inefficient code for 64x64=128 signed multiply for <= z13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102117 --- Comment #1 from Jens Seifert ---

Sorry, small bug in the optimal sequence. Corrected:

__int128 imul128_opt(long long a, long long b)
{
    unsigned __int128 x = (unsigned __int128)(unsigned long long)a;
    unsigned __int128 y = (unsigned __int128)(unsigned long long)b;
    unsigned long long t1 = (a >> 63) & b;
    unsigned long long t2 = (b >> 63) & a;
    unsigned __int128 u128 = x * y;
    unsigned long long hi = (u128 >> 64) - (t1 + t2);
    unsigned long long lo = (unsigned long long)u128;
    unsigned __int128 res = hi;
    res <<= 64;
    res |= lo;
    return (__int128)res;
}

_Z11imul128_optxx:
.LFB1:
        .cfi_startproc
        ldgr    %f2,%r12
        .cfi_register 12, 17
        ldgr    %f0,%r13
        .cfi_register 13, 16
        lgr     %r13,%r3
        mlgr    %r12,%r4
        srag    %r1,%r3,63
        ngr     %r1,%r4
        srag    %r4,%r4,63
        ngr     %r4,%r3
        agr     %r4,%r1
        sgrk    %r4,%r12,%r4
        stg     %r13,8(%r2)
        lgdr    %r12,%f2
        .cfi_restore 12
        lgdr    %r13,%f0
        .cfi_restore 13
        stg     %r4,0(%r2)
        br      %r14
        .cfi_endproc
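For reference, a sketch of why the single-multiply sequence is correct (my derivation, not part of the report). Read the same 64-bit patterns both ways, a_s signed and a_u unsigned, so that

    a_s = a_u - 2^{64}\,[a_s < 0], \qquad b_s = b_u - 2^{64}\,[b_s < 0]

    a_s b_s \equiv a_u b_u - 2^{64}\bigl([a_s < 0]\,b_u + [b_s < 0]\,a_u\bigr) \pmod{2^{128}}

since the cross term 2^{128}[a_s < 0][b_s < 0] vanishes modulo 2^{128}. The arithmetic shift (a >> 63) is an all-ones mask exactly when a_s < 0, so t1 = (a >> 63) & b and t2 = (b >> 63) & a are those two correction terms, and only the high doubleword of the unsigned product needs (t1 + t2) subtracted.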
[Bug target/102265] New: s390: Inefficient code for __builtin_ctzll
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102265 Bug ID: 102265 Summary: s390: Inefficient code for __builtin_ctzll Product: gcc Version: 10.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

unsigned long long ctzll(unsigned long long x)
{
    return __builtin_ctzll(x);
}

creates:

        lcgr    %r1,%r2
        ngr     %r2,%r1
        lghi    %r1,63
        flogr   %r2,%r2
        sgrk    %r2,%r1,%r2
        lgfr    %r2,%r2
        br      %r14

The optimal sequence for z15 uses population count; for all older architectures, use ^ 63 instead of 63 -.

unsigned long long ctzll_opt(unsigned long long x)
{
#if __ARCH__ >= 13
    return __builtin_popcountll((x-1) & ~x);
#else
    return __builtin_clzll(x & -x) ^ 63;
#endif
}

< z15:

        lcgr    %r1,%r2
        ngr     %r2,%r1
        flogr   %r2,%r2
        xilf    %r2,63
        lgfr    %r2,%r2
        br      %r14

=> 1 instruction saved.

z15:

        .cfi_startproc
        lay     %r1,-1(%r2)
        ncgrk   %r2,%r1,%r2
        popcnt  %r2,%r2,8
        br      %r14
        .cfi_endproc

=> On z15 only 3 instructions are required.
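Two identities make the < z15 variant of ctzll_opt work; a minimal check under the assumption x != 0 (my illustration, not from the report):

    #include <assert.h>

    /* x & -x isolates the lowest set bit, so clzll(x & -x) == 63 - ctzll(x);
       and for 0 <= m <= 63, m ^ 63 == 63 - m, because 63 is a 6-bit all-ones
       mask.  Together: ctzll(x) == clzll(x & -x) ^ 63, which lets the cheap
       xilf 63 replace the subtract from 63. */
    int main(void)
    {
        for (unsigned long long x = 1; x <= (1ULL << 20); ++x)
            assert(__builtin_ctzll(x) == (__builtin_clzll(x & -x) ^ 63));
        return 0;
    }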
[Bug target/86160] Implement isinf on PowerPC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86160 --- Comment #4 from Jens Seifert --- I am looking forward to the Power9 optimization using xststdcdp etc.
[Bug target/107757] New: PPCLE: Inefficient vector constant creation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107757 Bug ID: 107757 Summary: PPCLE: Inefficient vector constant creation Product: gcc Version: 12.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Because vslw, vsld, vsrd, ... use only the shift amount modulo the element bit width, the combination with an all-ones (0xFF..FF) vector can be used to create the vector constants

vec_splats(-0.0) or vec_splats(1ULL << 63), and scalar -0.0
vec_splats(-0.0f) or vec_splats(1U << 31)
vec_splats((short)0x8000)

with only 2 2-cycle vector instructions.

Sample:

vector long long lsb64()
{
    return vec_splats(1LL);
}

creates:

lsb64():
.LCF5:
        addi 2,2,.TOC.-.LCF5@l
        addis 9,2,.LC12@toc@ha
        addi 9,9,.LC12@toc@l
        lvx 2,0,9
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

while:

vector long long lsb64_opt()
{
    vector long long a = vec_splats(~0LL);
    __asm__("vsrd %0,%0,%0":"=v"(a):"v"(a),"v"(a));
    return a;
}

creates:

lsb64_opt():
        vspltisw 2,-1
        vsrd 2,2,2
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0
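Using the same modulo-shift trick, hedged sketches for the other constants listed above (the function names are mine; the asm mirrors the lsb64_opt pattern): shifting an all-ones element left by itself shifts by width-1, leaving only the sign bit.

    #include <altivec.h>

    vector unsigned int msb32()       /* vec_splats(1U << 31), i.e. -0.0f bits */
    {
        vector unsigned int a = vec_splats(~0U);
        __asm__("vslw %0,%1,%1" : "=v"(a) : "v"(a));  /* shift by 31 (mod 32) */
        return a;
    }

    vector unsigned long long msb64() /* vec_splats(1ULL << 63), i.e. -0.0 bits */
    {
        vector unsigned long long a = vec_splats(~0ULL);
        __asm__("vsld %0,%1,%1" : "=v"(a) : "v"(a));  /* shift by 63 (mod 64) */
        return a;
    }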
[Bug target/107949] New: PPC: Unnecessary rlwinm after lbzx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949 Bug ID: 107949 Summary: PPC: Unnecessary rlwinm after lbzx Product: gcc Version: 12.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

extern unsigned char magic1[256];

unsigned int hash(const unsigned char inp[4])
{
    const unsigned long long INIT = 0x1ULL;
    unsigned long long h1 = INIT;
    h1 = magic1[((unsigned long long)inp[0]) ^ h1];
    h1 = magic1[((unsigned long long)inp[1]) ^ h1];
    h1 = magic1[((unsigned long long)inp[2]) ^ h1];
    h1 = magic1[((unsigned long long)inp[3]) ^ h1];
    return h1;
}

#ifdef __powerpc__
#define lbzx(b,c) ({ unsigned long long r; __asm__("lbzx %0,%1,%2":"=r"(r):"b"(b),"r"(c)); r; })

unsigned int hash2(const unsigned char inp[4])
{
    const unsigned long long INIT = 0x1ULL;
    unsigned long long h1 = INIT;
    h1 = lbzx(magic1, inp[0] ^ h1);
    h1 = lbzx(magic1, inp[1] ^ h1);
    h1 = lbzx(magic1, inp[2] ^ h1);
    h1 = lbzx(magic1, inp[3] ^ h1);
    return h1;
}
#endif

Extra rlwinm instructions get added:

hash(unsigned char const*):
.LCF0:
        addi 2,2,.TOC.-.LCF0@l
        lbz 9,0(3)
        addis 10,2,.LC0@toc@ha
        ld 10,.LC0@toc@l(10)
        lbz 6,1(3)
        lbz 7,2(3)
        lbz 8,3(3)
        xori 9,9,0x1
        lbzx 9,10,9
        xor 9,9,6
        rlwinm 9,9,0,0xff    <= not necessary
        lbzx 9,10,9
        xor 9,9,7
        rlwinm 9,9,0,0xff    <= not necessary
        lbzx 9,10,9
        xor 9,9,8
        rlwinm 9,9,0,0xff    <= not necessary
        lbzx 3,10,9
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

hash2(unsigned char const*):
.LCF1:
        addi 2,2,.TOC.-.LCF1@l
        lbz 7,0(3)
        lbz 8,1(3)
        lbz 10,2(3)
        lbz 6,3(3)
        addis 9,2,.LC1@toc@ha
        ld 9,.LC1@toc@l(9)
        xori 7,7,0x1
        lbzx 7,9,7
        xor 8,8,7
        lbzx 8,9,8
        xor 10,10,8
        lbzx 10,9,10
        xor 10,6,10
        lbzx 3,9,10
        rldicl 3,3,0,32
        blr

Tiny sample:

unsigned long long tiny(const unsigned char *inp)
{
    return inp[0] ^ inp[1];
}

tiny(unsigned char const*):
        lbz 9,0(3)
        lbz 10,1(3)
        xor 3,9,10
        rlwinm 3,3,0,0xff
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

unsigned long long tiny2(const unsigned char *inp)
{
    unsigned long long a = inp[0];
    unsigned long long b = inp[1];
    return a ^ b;
}

tiny2(unsigned char const*):
        lbz 9,0(3)
        lbz 10,1(3)
        xor 3,9,10
        rlwinm 3,3,0,0xff
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

lbz/lbzx produce a value 0 <= x < 256, and the xor of two such values does not leave that range.
[Bug target/107949] PPC: Unnecessary rlwinm after lbzx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949 --- Comment #1 from Jens Seifert --- hash2 is only provided to show how the code should look (without rlwinm).
[Bug target/108048] New: PPCLE: gcc does not recognize that lbzx does zero extend
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108048 Bug ID: 108048 Summary: PPCLE: gcc does not recognize that lbzx does zero extend Product: gcc Version: 12.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

extern unsigned char magic1[256];

unsigned int hash(const unsigned char inp[4])
{
    const unsigned long long INIT = 0x1ULL;
    unsigned long long h1 = INIT;
    h1 = magic1[((unsigned long long)inp[0]) ^ h1];
    h1 = magic1[((unsigned long long)inp[1]) ^ h1];
    h1 = magic1[((unsigned long long)inp[2]) ^ h1];
    h1 = magic1[((unsigned long long)inp[3]) ^ h1];
    return h1;
}

Generates:

hash(unsigned char const*):
.LCF0:
        addi 2,2,.TOC.-.LCF0@l
        lbz 9,0(3)
        addis 10,2,.LC0@toc@ha
        ld 10,.LC0@toc@l(10)
        lbz 6,1(3)
        lbz 7,2(3)
        lbz 8,3(3)
        xori 9,9,0x1
        lbzx 9,10,9
        xor 9,9,6
        rlwinm 9,9,0,0xff    <= unnecessary
        lbzx 9,10,9
        xor 9,9,7
        rlwinm 9,9,0,0xff    <= unnecessary
        lbzx 9,10,9
        xor 9,9,8
        rlwinm 9,9,0,0xff    <= unnecessary
        lbzx 3,10,9
        blr

All XOR operations are done on unsigned long long (64-bit). gcc adds an unnecessary rlwinm; lbz and lbzx do zero extension (no cleanup of the upper bits is required).
[Bug rtl-optimization/107949] PPC: Unnecessary rlwinm after lbzx
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949 --- Comment #3 from Jens Seifert --- *** Bug 108048 has been marked as a duplicate of this bug. ***
[Bug target/108048] PPCLE: gcc does not recognize that lbzx does zero extend
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108048

Jens Seifert changed:

           What      |Removed     |Added
           --------------------------------------
           Status    |UNCONFIRMED |RESOLVED
           Resolution|---         |DUPLICATE

--- Comment #1 from Jens Seifert ---
duplicate of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949

*** This bug has been marked as a duplicate of bug 107949 ***
[Bug target/108049] New: s390: Compiler adds extra zero extend after xoring 2 zero extended values
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108049 Bug ID: 108049 Summary: s390: Compiler adds extra zero extend after xoring 2 zero extended values Product: gcc Version: 12.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Same issue for PPC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107949

extern unsigned char magic1[256];

unsigned int hash(const unsigned char inp[4])
{
    const unsigned long long INIT = 0x1ULL;
    unsigned long long h1 = INIT;
    h1 = magic1[((unsigned long long)inp[0]) ^ h1];
    h1 = magic1[((unsigned long long)inp[1]) ^ h1];
    h1 = magic1[((unsigned long long)inp[2]) ^ h1];
    h1 = magic1[((unsigned long long)inp[3]) ^ h1];
    return h1;
}

hash(unsigned char const*):
        llgc    %r4,1(%r2)     <= zero extends to 64-bit
        lgrl    %r1,.LC0
        llgc    %r3,0(%r2)     <= zero extends to 64-bit
        xilf    %r3,1
        llgc    %r3,0(%r3,%r1)
        xr      %r3,%r4        <= should be 64-bit xor
        llgc    %r4,2(%r2)     <= zero extends to 64-bit
        llgcr   %r3,%r3        <= unnecessary
        llgc    %r2,3(%r2)
        llgc    %r3,0(%r3,%r1)
        xr      %r3,%r4        <= should be 64-bit xor
        llgcr   %r3,%r3        <= unnecessary
        llgc    %r3,0(%r3,%r1) <= zero extends to 64-bit
        xrk     %r2,%r3,%r2    <= should be 64-bit xor
        llgcr   %r2,%r2        <= unnecessary
        llgc    %r2,0(%r2,%r1)
        br      %r14

Smaller sample:

unsigned long long tiny2(const unsigned char *inp)
{
    unsigned long long a = inp[0];
    unsigned long long b = inp[1];
    return a ^ b;
}

tiny2(unsigned char const*):
        llgc    %r1,0(%r2)
        llgc    %r2,1(%r2)
        xrk     %r2,%r1,%r2
        llgcr   %r2,%r2
        br      %r14
[Bug target/108049] s390: Compiler adds extra zero extend after xoring 2 zero extended values
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108049 --- Comment #1 from Jens Seifert --- The sample above was compiled with -march=z196.
[Bug c/106043] New: Power10: lacking vec_blendv builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106043 Bug ID: 106043 Summary: Power10: lacking vec_blendv builtins Product: gcc Version: 11.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Missing builtins for the vector instructions xxblendvb, xxblendvh, xxblendvw, xxblendvd.

#include <altivec.h>

vector int blendv(vector int a, vector int b, vector int c)
{
    return vec_blendv(a, b, c);
}
[Bug target/106043] Power10: lacking vec_blendv builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106043 --- Comment #1 from Jens Seifert --- Found in documentation: https://gcc.gnu.org/onlinedocs/gcc-11.3.0/gcc/PowerPC-AltiVec-Built-in-Functions-Available-on-ISA-3_002e1.html#PowerPC-AltiVec-Built-in-Functions-Available-on-ISA-3_002e1
[Bug target/106043] Power10: lacking vec_blendv builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106043

Jens Seifert changed:

           What      |Removed     |Added
           --------------------------------------
           Status    |UNCONFIRMED |RESOLVED
           Resolution|---         |INVALID

--- Comment #2 from Jens Seifert ---
Also found in altivec.h
[Bug target/114376] New: s390: Inefficient __builtin_bswap16
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114376 Bug ID: 114376 Summary: s390: Inefficient __builtin_bswap16 Product: gcc Version: 13.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

unsigned short swap16(unsigned short in)
{
    return __builtin_bswap16(in);
}

generates with -O3 -march=z196:

swap16(unsigned short):
        lrvr    %r2,%r2
        srl     %r2,16
        llghr   %r2,%r2
        br      %r14

More efficient for 64-bit is:

unsigned short swap16_2(unsigned short in)
{
    return __builtin_bswap64(in) >> 48;
}

which generates:

swap16_2(unsigned short):
        lrvgr   %r2,%r2
        srlg    %r2,%r2,48
        br      %r14

For 31-bit, lrvr should be used.
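A quick exhaustive check of the identity swap16_2 relies on (my illustration, not part of the report): bswap64 places the two low-order bytes, swapped, into the top 16 bits, so shifting right by 48 yields the zero-extended bswap16 result with one fewer instruction.

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        for (uint32_t v = 0; v <= 0xFFFF; ++v)
            assert(__builtin_bswap16((uint16_t)v) ==
                   (uint16_t)(__builtin_bswap64((uint16_t)v) >> 48));
        return 0;
    }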
[Bug target/93176] PPC: inefficient 64-bit constant consecutive ones
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93176 --- Comment #7 from Jens Seifert --- What happened? Still waiting for an improvement.
[Bug target/93176] PPC: inefficient 64-bit constant consecutive ones
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93176 --- Comment #10 from Jens Seifert ---

Looks like no patch in this area got delivered. I did a small test:

unsigned long long c()
{
    return 0x0000FFFF00000000ULL;
}

gcc 13.2.0:

        li 3,0
        ori 3,3,0xFFFF
        sldi 3,3,32

expected:

        li 3,-1
        rldic 3,3,32,16

All consecutive-ones constants can be created with li + rldic. The rotate eliminates the bits on the right and the clear mask eliminates the bits on the left, as described below (MB and ME are the big-endian indices of the first and last one bit):

        li t,-1
        rldic d,t,63-ME,MB
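A hypothetical helper (name and interface made up for illustration) showing how the SH and MB fields of the li + rldic pair can be derived from a non-wrapping consecutive-ones mask:

    #include <stdint.h>

    /* rldic d,t,SH,MB with t = -1 produces ones from big-endian bit MB
       through bit 63-SH.  For a mask with ones at BE bits MB..ME that
       means SH = 63 - ME = ctz(mask) and MB = clz(mask).
       Returns 0 on success, -1 if the ones are not consecutive. */
    int rldic_args(uint64_t mask, unsigned *sh, unsigned *mb)
    {
        if (mask == 0)
            return -1;
        unsigned tz = __builtin_ctzll(mask);
        uint64_t run = mask >> tz;
        if (run & (run + 1))      /* ones are not consecutive */
            return -1;
        *sh = tz;
        *mb = (unsigned)__builtin_clzll(mask);
        return 0;
    }

For 0x0000FFFF00000000 this gives sh = 32 and mb = 16, matching the expected rldic 3,3,32,16 above.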
[Bug target/115973] PPCLE: Inefficient code for __builtin_uaddll_overflow and __builtin_addcll
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115973 --- Comment #2 from Jens Seifert ---

Assembly that integrates better:

unsigned long long addc_opt(unsigned long long a, unsigned long long b, unsigned long long *res)
{
    unsigned long long rc;
    __asm__("addc %0,%2,%3;\n\tsubfe %1,%1,%1":"=r"(*res),"=r"(rc):"r"(a),"r"(b):"xer");
    return rc + 1;
}

(After the addc, subfe rc,rc,rc computes ~rc + rc + CA = CA - 1, so rc + 1 yields the carry.)

Output:

.L.addc_opt(unsigned long long, unsigned long long, unsigned long long*):
        addc 9,3,4; subfe 3,3,3
        std 9,0(5)
        addi 3,3,1
        blr

Power10 code for __builtin_uaddll_overflow is okay:

unsigned long long addc(unsigned long long a, unsigned long long b, unsigned long long *res)
{
    return __builtin_uaddll_overflow(a, b, res);
}

.L.addc(unsigned long long, unsigned long long, unsigned long long*):
        add 4,3,4
        cmpld 0,4,3
        std 4,0(5)
        setbc 3,0
        blr
[Bug target/116649] New: PPC: Suboptimal code for __builtin_bcdadd_ovf on Power10
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116649 Bug ID: 116649 Summary: PPC: Suboptimal code for __builtin_bcdadd_ovf on Power10 Product: gcc Version: 14.1.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

unsigned long long bcdadd(vector __int128 a, vector __int128 b, vector __int128 *c)
{
    return __builtin_bcdadd_ov(a, b, 0);
}

creates:

bcdadd(__int128 __vector(1), __int128 __vector(1), __int128 __vector(1)*):
        .quad .L.bcdadd(__int128 __vector(1), __int128 __vector(1), __int128 __vector(1)*),.TOC.@tocbase,0
.L.bcdadd(__int128 __vector(1), __int128 __vector(1), __int128 __vector(1)*):
        bcdadd. 2,2,3,0
        mfcr 3,2
        rlwinm 3,3,28,1
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0

while setbc is expected:

        bcdadd. 2,2,3,0
        setbc 3,27
        blr
[Bug target/115355] New: PPCLE: Auto-vectorization creates wrong code for Power9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115355 Bug ID: 115355 Summary: PPCLE: Auto-vectorization creates wrong code for Power9 Product: gcc Version: 12.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: ---

Input setToIdentity.C:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void setToIdentityGOOD(unsigned long long *mVec, unsigned int mLen)
{
    for (unsigned long long i = 0; i < mLen; i++)
    {
        mVec[i] = i;
    }
}

void setToIdentityBAD(unsigned long long *mVec, unsigned int mLen)
{
    for (unsigned int i = 0; i < mLen; i++)
    {
        mVec[i] = i;
    }
}

unsigned long long vec1[100];
unsigned long long vec2[100];

int main(int argc, char *argv[])
{
    unsigned int l = argc > 1 ? atoi(argv[1]) : 29;
    setToIdentityGOOD(vec1, l);
    setToIdentityBAD(vec2, l);
    if (memcmp(vec1, vec2, l*sizeof(vec1[0])) != 0)
    {
        for (unsigned int i = 0; i < l; i++)
        {
            printf("%llu %llu\n", vec1[i], vec2[i]);
        }
    }
    else
    {
        printf("match\n");
    }
    return 0;
}

Fails:
gcc -O3 -mcpu=power9 -m64 setToIdentity.C -save-temps -fverbose-asm -o pwr9.exe -mno-isel

Good:
gcc -O3 -mcpu=power8 -m64 setToIdentity.C -save-temps -fverbose-asm -o pwr8.exe -mno-isel

"-mno-isel" is only specified to reduce the diff.

Failing output of pwr9.exe:

0 0
1 1
2 0
3 4294967296
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
26 26
27 27
28 28

The 3rd and 4th elements (indices 2 and 3) contain wrong data.
[Bug target/115355] PPCLE: Auto-vectorization creates wrong code for Power9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115355 --- Comment #1 from Jens Seifert --- Same issue with gcc 13.2.1
[Bug target/115355] [12/13/14/15 Regression] vectorization exposes wrong code on P9 LE starting from r12-4496
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115355 --- Comment #10 from Jens Seifert --- Does this affect both loop vectorization and SLP vectorization? -fno-tree-loop-vectorize prevents loop vectorization from being performed and works around this issue. Does the same problem also affect SLP vectorization, which does not take place in this sample? In other words, do I need -fno-tree-loop-vectorize or -fno-tree-vectorize to work around this bug?