https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96133
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed. The i == 1 lane is different. We're using standard interleaving
vectorization here, the innermost two loops are unrolled and rgb_cam is
elided.

Note eventually we optimize the whole loop at compile-time to

  <bb 2> [local count: 89478486]:
  MEM <vector(2) double> [(double *)&xyz_cam] = { 2.97789709999999985257090884260833263397216796875e+0, 3.94211709999999992959374139900319278240203857421875e+0 };
  MEM <vector(2) double> [(double *)&xyz_cam + 16B] = { 4.9063371000000000066165739553980529308319091796875e+0, 3.291832700000000055950977184693329036235809326171875e+0 };
  MEM <vector(2) double> [(double *)&xyz_cam + 32B] = { 4.06932820000000017301999832852743566036224365234375e+0, 4.8468236999999998459998096222989261150360107421875e+0 };
  MEM <vector(2) double> [(double *)&xyz_cam + 48B] = { 5.40156330000000028945805752300657331943511962890625e+0, 6.2267732999999996224005371914245188236236572265625e+0 };
  xyz_cam[2][2] = 7.051983299999999843521436559967696666717529296875e+0;
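
To make it easier to read the constant stores in that dump against source-level
indices, here is a small sketch of the mapping. It assumes xyz_cam is a
row-major double[3][3] with 8-byte doubles. That is inferred from the four
2-lane stores plus the trailing scalar store to xyz_cam[2][2]; the comment
does not state it. The helper names (flat_index, store_offset, store_lane)
are made up for illustration and are not part of the testcase or of GCC.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed shape: double xyz_cam[3][3], stored two doubles per vector. */
enum { ROWS = 3, COLS = 3, LANES = 2 };

/* Row-major flat index of xyz_cam[row][col]. */
static size_t flat_index(size_t row, size_t col)
{
    assert(row < ROWS && col < COLS);
    return row * COLS + col;
}

/* Byte offset from &xyz_cam of the store that writes xyz_cam[row][col].
   Offsets 0, 16, 32 and 48 are the vector stores; 64 is the scalar tail. */
static size_t store_offset(size_t row, size_t col)
{
    return (flat_index(row, col) / LANES) * LANES * sizeof(double);
}

/* Lane (0 or 1) that the element occupies within that store. */
static size_t store_lane(size_t row, size_t col)
{
    return flat_index(row, col) % LANES;
}
```

Under these assumptions, xyz_cam[1][0] is lane 1 of the store at + 16B and
xyz_cam[1][1] is lane 0 of the store at + 32B, so row 1 straddles two vector
stores.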