https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114435
--- Comment #2 from jchrist at linux dot ibm.com ---
I tried this, but it seems like pcom does not handle vectors at all:

In the gimple input I have

  vectp.5_32 = r_26(D);
  # VUSE <.MEM_52>
  vect__51.6_1 = MEM <vector(2) doubleD.32> [(doubleD.32 *)vectp.5_32];
  # PT = nonlocal null
  # ALIGN = 8, MISALIGN = 0
  vectp.5_2 = vectp.5_32 + 16;
  # VUSE <.MEM_52>
  vect__51.7_3 = MEM <vector(2) doubleD.32> [(doubleD.32 *)vectp.5_2];
  [...]
  vectp.15_12 = r_26(D);
  # .MEM_13 = VDEF <.MEM_52>
  MEM <vector(2) doubleD.32> [(doubleD.32 *)vectp.15_12] = vect__45.13_11;
  # PT = nonlocal null
  # ALIGN = 8, MISALIGN = 0
  vectp.15_14 = vectp.15_12 + 16;
  # .MEM_15 = VDEF <.MEM_13>
  MEM <vector(2) doubleD.32> [(doubleD.32 *)vectp.15_14] = vect__45.13_29;

But the analyzed data dependencies are:

(Data Dep:
#(Data Ref:
#  bb: 9
#  stmt: vect__51.6_1 = MEM <vector(2) double> [(double *)vectp.5_32];
#  ref: MEM <vector(2) double> [(double *)vectp.5_32];
#  base_object: MEM <vector(2) double> [(double *)vectp.5_32];
#)
#(Data Ref:
#  bb: 9
#  stmt: MEM <vector(2) double> [(double *)vectp.15_12] = vect__45.13_11;
#  ref: MEM <vector(2) double> [(double *)vectp.15_12];
#  base_object: MEM <vector(2) double> [(double *)vectp.15_12];
#)
    (don't know)
)
(Data Dep:
#(Data Ref:
#  bb: 9
#  stmt: vect__51.7_3 = MEM <vector(2) double> [(double *)vectp.5_2];
#  ref: MEM <vector(2) double> [(double *)vectp.5_2];
#  base_object: MEM <vector(2) double> [(double *)vectp.5_2];
#)
#(Data Ref:
#  bb: 9
#  stmt: MEM <vector(2) double> [(double *)vectp.15_14] = vect__45.13_29;
#  ref: MEM <vector(2) double> [(double *)vectp.15_14];
#  base_object: MEM <vector(2) double> [(double *)vectp.15_14];
#)
    (don't know)
)

Is this expected? Because I think this is the reason why the generated code is still not optimal. In every loop iteration, we still load the two accumulation vectors on s390x, just to use them for fma and then store them.
If I understand predictive commoning correctly, this is a case it should handle: the accumulator would be loaded once before the loop and stored once after it, instead of being reloaded and stored in every iteration.
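
To make the hoped-for transformation concrete, here is a minimal scalar sketch in C. It is not the test case from this bug: the function names, the four-element accumulator and the array layout are hypothetical, chosen only to mirror the two vector(2) double loads and stores shown in the dump. The first function keeps the accumulator in memory on every iteration; the second is a hand-written version of what I would expect after the load and store are moved out of the loop. The rewrite is only valid if r does not overlap a or b, which is an assumption here.

```c
#include <stddef.h>

/* Hypothetical kernel: r[0..3] is read and written back in every
   iteration, like the two vector loads/stores through r_26(D). */
void accumulate_in_memory(double *r, const double *a, const double *b,
                          size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (int k = 0; k < 4; k++)
            r[k] += a[4 * i + k] * b[i];
}

/* Hand-transformed version: the accumulator is loaded once before
   the loop, kept in locals, and stored once after the loop.
   Assumes r does not alias a or b. */
void accumulate_hoisted(double *r, const double *a, const double *b,
                        size_t n)
{
    double acc[4] = { r[0], r[1], r[2], r[3] };

    for (size_t i = 0; i < n; i++)
        for (int k = 0; k < 4; k++)
            acc[k] += a[4 * i + k] * b[i];

    for (int k = 0; k < 4; k++)
        r[k] = acc[k];
}
```

Both functions produce the same result when the arrays do not overlap; the second simply has no accumulator memory traffic inside the loop.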