http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50713
--- Comment #8 from Marc Glisse <glisse at gcc dot gnu.org> 2012-12-01 16:54:08 UTC ---
(In reply to comment #5)

We seem to do better now. I see essentially the same code for the vector and
loop versions. The main issue left is for dfma8*, copying the result to the
output "register". This looks similar to PR55266.