[Bug target/87599] New: Broadcasting scalar to vector uses stack unnecessarily on x86

2018-10-12 Thread vgatherps at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87599

            Bug ID: 87599
           Summary: Broadcasting scalar to vector uses stack unnecessarily
                    on x86
           Product: gcc
           Version: 8.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vgatherps at gmail dot com
  Target Milestone: ---

When compiled on GCC 8.2 with -O2, 

typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));

__m128i vectorize(long val) {
    __m128i rval = {val, val};
    return rval;
}

generates the following code:

        mov     QWORD PTR [rsp-16], rdi
        movq    xmm0, QWORD PTR [rsp-16]
        punpcklqdq xmm0, xmm0
        ret

This could be replaced with:

        movq    xmm0, rdi
        punpcklqdq xmm0, xmm0
        ret

Interestingly, according to godbolt, the current trunk makes this optimization
with -Os but not with -O2 or -O3.
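
For reference, here is the same splat written with the SSE2 intrinsic instead
of the raw vector typedef (my addition, not part of the original report);
_mm_set1_epi64x expresses the same {val, val} broadcast and would ideally
lower to the movq + punpcklqdq pair shown above:

#include <emmintrin.h>

/* Same broadcast via the SSE2 intrinsic; ideally this compiles to
   movq xmm0, rdi / punpcklqdq xmm0, xmm0 with no stack round-trip. */
__m128i vectorize_intrin(long long val) {
    return _mm_set1_epi64x(val);
}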

[Bug target/87599] Broadcasting scalar to vector uses stack unnecessarily on x86

2018-10-12 Thread vgatherps at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87599

--- Comment #2 from vgatherps at gmail dot com ---
Thanks! That fixes the missed optimization. However, using something like
-march=haswell or -march=corei7 does not enable it, even though, as far as I
know, -march= should imply -mtune=intel.

[Bug target/87601] New: Missed opportunity for flag reuse and macro-op fusion on x86

2018-10-12 Thread vgatherps at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87601

            Bug ID: 87601
           Summary: Missed opportunity for flag reuse and macro-op fusion
                    on x86
           Product: gcc
           Version: 8.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vgatherps at gmail dot com
  Target Milestone: ---

When I compile the following code with gcc 8.2 and options -O2 (or -Os) and
-mtune=intel (or -mtune=broadwell):

int sum(int *vals, int l) {
    int a = 0;
    if (l <= 0) {
        return 0;
    }
    for (int i = l; i != 0; i--) {
        a += vals[i-1];
    }
    return a;
}


The following code is generated:

sum(int*, int):
  xor eax, eax
  test esi, esi
  jle .L1
  movsx rsi, esi
.L3:
  add eax, DWORD PTR [rdi-4+rsi*4]
  sub rsi, 1
  test esi, esi
  jne .L3
.L1:
  ret


When passing -march=broadwell or -Os, the sub is replaced by dec, but
otherwise the code is the same.

Inside the loop, the sequence:
  sub rsi, 1
  test esi, esi
  jne .L3

can be replaced with:
  sub rsi, 1
  jne .L3

since sub rsi, 1 already sets the same zero flag that the test computes. This
would also enable macro-op fusion of the sub/jne pair on relatively recent
architectures.
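
As an illustration (my construction, not a reduced case from this report),
the redundancy is visible at the source level too: the loop condition already
reads the freshly decremented counter, which is exactly the dependence a
fused sub + jne encodes:

/* Sketch only: the back-edge condition reads the value the decrement
   just produced, mirroring "sub rsi, 1 / jne .L3" where the branch
   consumes the flags the sub set. Semantically equivalent to sum(). */
int sum_down(int *vals, int l) {
    int a = 0;
    if (l <= 0)
        return 0;
    long i = l;
    while (i != 0) {
        i -= 1;
        a += vals[i];
    }
    return a;
}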
Anecdotally, I've seen similar decisions being made along the lines of:

sub index, 1

// some more asm here not using index

test index, index
jne loop_start

but I don't yet have a clean test case for it. This suggests to me that flag
reuse and macro-op fusion could be improved in general; I'll work on
producing clean test cases for the other instances.
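
A hypothetical shape of that pattern (purely illustrative, not a reduced test
case from the codebase in question) would be:

/* Hypothetical sketch: unrelated work sits between the decrement and
   the loop branch, and the observed code re-tests the counter at the
   back edge (test/jne) instead of reusing the flags from the sub. */
long drain(long *work, long n) {
    long acc = 0;
    while (n != 0) {
        n -= 1;            /* sub n, 1 */
        acc += work[n];    /* additional work between sub and branch */
        acc ^= acc >> 7;
    }                      /* back edge: test n, n / jne observed */
    return acc;
}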