------- Comment #43 from irar at il dot ibm dot com 2010-01-10 13:43 -------
Since -O2 -ftree-vectorize doesn't cause bad code, it has to be some other optimization on top of vectorized code that causes the problem.
Bad code is generated when the alignment of 'reduce' is forced and the reduction 'sum(reduce)' is vectorized. However, the result of the reduction is correct, and the vector store does not do any damage (as far as I can see in the debugger). So the vector stores don't corrupt anything.

The part that goes wrong is in the scalar code that implements the decision on whether to add the (correctly computed) reduction value to temp[9] and temp[10]. The code that sets the condition (which, by the way, is not using any vectorized code) is not using the values of reduce[9] and reduce[10], even though the value of the condition depends on them:

  reduce(1:3) = -1
  reduce(4:6) = 0
  reduce(7:8) = 5
  reduce(9:10) = 10
  ...
  WHERE (reduce > 6) temp = temp + sum(reduce)

Here is the code for adding the result of 'sum(reduce)' to temp[9]:

L29:
        lbz r11,152(r1)       # **
        cmpwi cr7,r11,0       # reduce > 6 ?
        beq cr7,L30
        lwz r11,240(r1)       # load temp[9]
        add r11,r11,r9        # temp[9] + sum(reduce)
        stw r11,240(r1)       # store temp[9]

** - The calculation of 152(r1) is based only on the value of reduce[8]! The values of reduce[9] and reduce[10] are only used in the reduction calculation and are not compared to 6 at all.

If we don't vectorize (but force the alignment), there is a 'cmpwi cr7,r29,6' instruction, where r29 is reduce[9], and the code is correct. The same happens when the alignment of reduce is not forced and the reduction is vectorized using peeling. I.e., as far as I can see, in the bad code the comparisons of reduce[9] and reduce[10] with 6 do not exist. I wonder which optimization can be responsible for that?

Also, some values of reduce are copied to a temporary array and are later compared with 6. In the version with peeling, the values that are copied are reduce[4:8]: there is no need to keep the first three, and the last two are kept in registers and compared to 6 (and also used in the reduction epilogue).
In the bad version, by contrast, the kept values are reduce[3:8], and reduce[8] is put before the values of reduce[3:7] (reduce[3:7] are in 276(r1) to 292(r1), and reduce[8] is in 272(r1)). Also, in the bad code the last two values, reduce[9] and reduce[10], are only used in the reduction epilogue.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41082
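
For reference, here is a minimal scalar sketch in C of what the statement 'WHERE (reduce > 6) temp = temp + sum(reduce)' is expected to compute. It is only a model of the semantics, not code from the testcase or from the compiler: the function name where_add_sum, the macro N and the 0-based indexing are assumptions made for illustration. The point is that every element of reduce, including the last two, has to be compared with 6 when the mask is built.

```c
#include <stddef.h>

#define N 10  /* array length used in the comment above */

/* Hypothetical scalar model of:
 *     WHERE (reduce > 6) temp = temp + sum(reduce)
 * Returns the reduction value so a caller can inspect it. */
static int where_add_sum(const int reduce[N], int temp[N])
{
    int sum = 0;

    /* The reduction: this is the part the vectorizer handles,
     * and according to the comment its result is correct. */
    for (size_t i = 0; i < N; i++)
        sum += reduce[i];

    /* The masked update: each element must be tested on its own.
     * The bad code described above derives the condition for the
     * last elements from reduce[8] instead of from their own values. */
    for (size_t i = 0; i < N; i++) {
        if (reduce[i] > 6)
            temp[i] += sum;
    }

    return sum;
}
```

With the values quoted in the comment (-1, -1, -1, 0, 0, 0, 5, 5, 10, 10), the reduction is 27 and only the last two elements satisfy the mask, so only the corresponding two elements of temp should change.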