[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-15 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #14 from sergey.shalnov at intel dot com --- "we have a basic-block vectorizer. Do you propose to remove it?" Definitely not! The SLP vectorizer is very good to have! "What's the rationale for not using vector register

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-15 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #16 from sergey.shalnov at intel dot com --- «it's one vec_construct operation - it's the task of the target to turn this into a cost comparable to vector_store» I agree that vec_construct operation cost is based on the t

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-24 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #18 from sergey.shalnov at intel dot com --- Yes, I agree that the vector_store stage has its own vectorization cost. And each vector_store has a vector_construction stage. These stages are different in gcc slp (as you know). To better

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-02 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #20 from sergey.shalnov at intel dot com --- Richard, I did a quick static analysis of your latest patch. Using the command line “-g -Ofast -mfpmath=sse -funroll-loops -march=znver1”, your latest patch doesn’t affect the issue I discussed

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-10 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #21 from sergey.shalnov at intel dot com --- Thanks Richard for your comments. Based on our discussion I've produced the patch attached and run it on SPEC2017intrate/fprate on skylake server (with [-Ofast -flto -march=skylake-a

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-10 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #24 from sergey.shalnov at intel dot com --- Richard, The latest "SLP costing for constants/externs improvement" patch generates the same code as baseline for the test example. Are you sure that "num_vects_to_che

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-10 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #26 from sergey.shalnov at intel dot com --- Sorry, did you mean "arm_sve.h" on ARM? In this case we have machine-specific code in the common part of the gcc code. Should we make it a machine-dependent callback function beca

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-17 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #28 from sergey.shalnov at intel dot com --- Richard, Thank you for your comments. I see that TYPE_VECTOR_SUBPARTS is constant for the test case but multiple_p (group_size, const_nunits) returns 1 in the code: if

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-19 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #29 from sergey.shalnov at intel dot com --- Richard, Thank you for your latest patch. I would like to clarify the usage of the multiple_p() function in the if() clause. First of all, I assume that architectures with fixed size of HW

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-26 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #31 from sergey.shalnov at intel dot com --- Richard, Thank you for your latest patch. This patch is exactly what I’ve discussed in this issue request. I tested it with SPEC20[06|17] and see no performance/stability degradation

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-01-29 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #33 from sergey.shalnov at intel dot com --- Richard, I'm not sure whether it is a regression or not. I see the code has been visibly refactored in this commit https://github.com/gcc-mirror/gcc/commit/ee6e9ba576099aed29f1097195c649fc796ecf

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-02-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #36 from sergey.shalnov at intel dot com --- The patch that fixes the issue for SKX is in https://gcc.gnu.org/ml/gcc-patches/2018-02/msg00405.html I will close the PR after the patch has been merged. Thank you very much for all involved

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2018-02-09 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 sergey.shalnov at intel dot com changed: What|Removed |Added Status|NEW |RESOLVED

[Bug rtl-optimization/85017] New: Missed constant propagation to index of array reference

2018-03-21 Thread sergey.shalnov at intel dot com
Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: sergey.shalnov at intel dot com Target Milestone: --- I found an extra instruction generated on the x86_64 platform. I have no performance data to prove a performance gap for this but other

[Bug target/83008] New: [performance] Is it better to avoid extra instructions in data passing between loops?

2017-11-15 Thread sergey.shalnov at intel dot com
Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: sergey.shalnov at intel dot com Target Milestone: --- I found strange code generated by GCC-8.0/7.x with the following command line options: -g -Ofast -march=skylake

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-11-15 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #1 from sergey.shalnov at intel dot com --- Created attachment 42616 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42616&action=edit reproducer

[Bug tree-optimization/65930] Reduction with sign-change not handled

2017-12-05 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65930 sergey.shalnov at intel dot com changed: What|Removed |Added CC||sergey.shalnov at intel

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #5 from sergey.shalnov at intel dot com --- (In reply to Richard Biener from comment #2) > The strange code is because we perform basic-block vectorization resulting in > > vect_cst__249 = {_251, _251, _251, _251, _334, _

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #6 from sergey.shalnov at intel dot com --- I found the issue request related to the vectorization issues in the second loop (reduction uint->int). https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65930

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #8 from sergey.shalnov at intel dot com --- Richard, These are great changes and I see the first loop became vectorized for the test example I provided with gcc-8.0 main trunk. But I think the issue is a bit more complicated. Vectorization

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #9 from sergey.shalnov at intel dot com --- Created attachment 42813 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42813&action=edit New reproducer Slightly changed first loop

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #11 from sergey.shalnov at intel dot com --- Richard, “Is this about the "stupid" attempt to use as little AVX512 as possible” No, it is not. I provided the asm listing at the beginning with zmm only to illustrate the

[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?

2017-12-08 Thread sergey.shalnov at intel dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008 --- Comment #12 from sergey.shalnov at intel dot com --- Richard, Your last proposal changed the generated code a bit. Currently it shows: test_bugzilla1.c:6:5: note: Cost model analysis:. Vector inside of loop cost: 62576 Vector prologue