https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81616
--- Comment #26 from Jan Hubicka <hubicka at ucw dot cz> ---
On your matrix benchmarks I get:

  Vector inside of loop cost: 44
  Vector prologue cost: 12
  Vector epilogue cost: 0
  Scalar iteration cost: 40
  Scalar outside cost: 0
  Vector outside cost: 12
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
mult.c:15:7: note: Runtime profitability threshold = 4
mult.c:15:7: note: Static estimate profitability threshold = 4

  Vector inside of loop cost: 2428
  Vector prologue cost: 4
  Vector epilogue cost: 0
  Scalar iteration cost: 2428
  Scalar outside cost: 0
  Vector outside cost: 4
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
mult.c:30:7: note: Runtime profitability threshold = 4
mult.c:30:7: note: Static estimate profitability threshold = 4

for 128-bit vectorization, and for 256-bit:

  Vector inside of loop cost: 88
  Vector prologue cost: 24
  Vector epilogue cost: 0
  Scalar iteration cost: 40
  Scalar outside cost: 0
  Vector outside cost: 24
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
mult.c:15:7: note: Runtime profitability threshold = 8
mult.c:15:7: note: Static estimate profitability threshold = 8

  Vector inside of loop cost: 6472
  Vector prologue cost: 8
  Vector epilogue cost: 0
  Scalar iteration cost: 2428
  Scalar outside cost: 0
  Vector outside cost: 8
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
mult.c:30:7: note: Runtime profitability threshold = 8
mult.c:30:7: note: Static estimate profitability threshold = 8

So if the vectorizer knew to prefer bigger vector sizes when the cost is
about double, it would vectorize the first loop to 256 bits as expected.
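To make the "about double" comparison concrete, here is a minimal sketch in C of the kind of rule described above. It is not GCC code: the helper names are hypothetical, and the vectorization factors (4 for 128-bit, 8 for 256-bit) are an assumption read off the profitability thresholds in the dumps, not something the dumps state directly. The sketch also ignores the prologue/outside costs.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of GCC: cost of one vector iteration
   spread over the scalar iterations it covers.  vf is the ASSUMED
   vectorization factor (4 for 128-bit, 8 for 256-bit here).  */
static double
cost_per_scalar_iter (int vector_inside_cost, int vf)
{
  return (double) vector_inside_cost / vf;
}

/* Hypothetical tie-break rule: prefer the wider vector size when its
   per-scalar-iteration cost is no worse than the narrower one.
   Cross-multiplying avoids floating point:
     wide_cost / wide_vf <= narrow_cost / narrow_vf  */
static bool
prefer_wider (int narrow_cost, int narrow_vf, int wide_cost, int wide_vf)
{
  return (long) wide_cost * narrow_vf <= (long) narrow_cost * wide_vf;
}
```

Under these assumptions, the loop at mult.c:15 costs 44/4 = 11 per scalar iteration at 128 bits and 88/8 = 11 at 256 bits, so `prefer_wider (44, 4, 88, 8)` holds and the wider size would be chosen. For the loop at mult.c:30 the costs are 2428/4 = 607 versus 6472/8 = 809, so `prefer_wider (2428, 4, 6472, 8)` is false and 128 bits stays the better choice.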