http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56993
--- Comment #1 from Carrot <carrot at google dot com> ---
I experimented further with this benchmark. -O0 and -O1 generate correct results, but -O2 and -Os generate wrong results, so the problem is probably in an optimization pass that is enabled at -O2/-Os but disabled at -O1.
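One way to narrow this down further is to keep -O2 and switch off one candidate pass at a time with its -fno- form, checking whether the result becomes correct. The sketch below only illustrates that search and is not taken from this bug report. The flag list is a partial sample of options that -O2 adds over -O1 (the full set for a given compiler can be listed with `gcc -Q -O2 --help=optimizers`), and `result_is_correct` is a hypothetical callback that would compile and run the benchmark with the given options.

```python
from typing import Callable, Iterable, List

# Partial, illustrative sample of flags that -O2 enables on top of -O1.
# Assumption: the real list depends on the GCC version and target.
O2_ONLY_FLAGS = [
    "-fgcse",
    "-fstrict-aliasing",
    "-ftree-vrp",
    "-fschedule-insns2",
    "-fpeephole2",
    "-fcrossjumping",
    "-fexpensive-optimizations",
]


def negate(flag: str) -> str:
    """Turn an enabling flag such as '-fgcse' into '-fno-gcse'."""
    if not flag.startswith("-f"):
        raise ValueError(f"not an -f option: {flag}")
    return "-fno-" + flag[2:]


def find_suspect_flags(
    flags: Iterable[str],
    result_is_correct: Callable[[List[str]], bool],
) -> List[str]:
    """Return the flags whose removal from -O2 makes the result correct.

    result_is_correct is a hypothetical, caller-supplied check: it should
    build the test case with the given options, run it, and report whether
    the output matches the known-good (-O0/-O1) output.
    """
    suspects = []
    for flag in flags:
        options = ["-O2", negate(flag)]
        if result_is_correct(options):
            suspects.append(flag)
    return suspects
```

An empty result would not clear the optimizer: some transformations are controlled by the -O level itself and have no individual -f switch, and a miscompile can also depend on several passes interacting.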