https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117323
--- Comment #4 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
Another missed optimization is that GCC fails to recognize a MAX_EXPR for
sum1, which generates a lot of pack/unpack code in the vectorizer:

  prephitmp_66 = (int) _8;
  # DEBUG a => NULL
  # DEBUG b => NULL
  # DEBUG a => NULL
  # DEBUG b => NULL
  # DEBUG INLINE_ENTRY max
  _35 = (unsigned int) prephitmp_65;
  _9 = (unsigned int) _8;
  _10 = _35 * _9;
  _72 = (int) _10;
  _74 = _72 / 128;
  _76 = (char) _74;
  _42 = prephitmp_66 > 0;
  prephitmp_77 = _42 ? _76 : 0;

Yes, swapping the operand order generates much better code for x86, but
makes arm generate worse code.
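For readers following the dump, here is a rough C sketch of the two shapes involved. It is only an illustration: the function names, the int parameter types, and the use of signed char for the narrowed result are assumptions, not taken from the testcase in the PR. select_form follows the GIMPLE as dumped (multiply, divide, narrow, then select on b > 0); max_form is one guess at the form a recognized MAX_EXPR would correspond to, where the operand is clamped with max (b, 0) before the multiply. The two agree because a zero operand makes the product, and hence the result, zero.

```c
#include <assert.h>

/* Hypothetical stand-in for the inlined max seen in the dump
   (# DEBUG INLINE_ENTRY max).  */
static inline int max_int(int a, int b) { return a > b ? a : b; }

/* Shape of the dumped GIMPLE: the product is computed in unsigned
   arithmetic, divided by 128, narrowed, and only then selected
   against 0 depending on b > 0.  */
static signed char select_form(int a, int b)
{
    int prod = (int)((unsigned int)a * (unsigned int)b);   /* _10, _72 */
    signed char narrowed = (signed char)(prod / 128);      /* _74, _76 */
    return b > 0 ? narrowed : 0;                           /* _42, prephitmp_77 */
}

/* Assumed MAX_EXPR shape: clamp b at 0 first, so no select is
   needed after the narrowing.  */
static signed char max_form(int a, int b)
{
    int clamped = max_int(b, 0);
    int prod = (int)((unsigned int)a * (unsigned int)clamped);
    return (signed char)(prod / 128);
}

/* Checks that both forms agree for one pair of inputs.  */
static void check_equivalent(int a, int b)
{
    assert(select_form(a, b) == max_form(a, b));
}
```

Calling check_equivalent (100, 64) or check_equivalent (100, -5) exercises both branches of the select. Whether the vectorizer actually avoids the pack/unpack sequence with the second form is the question raised in this comment, not something the sketch demonstrates.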