https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68707

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #36951|0                           |1
        is obsolete|                            |

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
Created attachment 36982
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36982&action=edit
patch for testing

Ok, so it seems general interleaving isn't going to be more profitable
unless the permutes done by SLP are way more expensive than those done by
interleaving (which can happen on x86_64...). That's because with SLP we
never need to permute the stores, only the loads. OTOH, SLP may need a
different permutation for each use of an input vector, while interleaving
will do (number of input vectors) times log (group-size) permutes. It's
hard to compute a good estimate with the current SLP data structures, so
I'm leaving interleaving as-is.

For load-lanes/store-lanes I have attached an updated patch. It still
doesn't include a check against statically determined very low iteration
counts (or factor in the peeling for gaps that load/store-lanes may
require), because at this point it would be a heuristic as well (alignment
peeling isn't computed yet).

Thus this would be my "final" patch. Can you test it please and report
back?
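
As a rough illustration of the permute counts compared in the comment, here is a toy sketch in C. It is not the vectorizer's cost model and is not taken from the patch. The function names are hypothetical, the log is assumed to be base 2 with a power-of-two group size, and SLP is modeled as one permute per permuted use of a load vector, with none for stores.

```c
/* Toy model only: names and formulas are illustrative assumptions,
   not GCC internals.  */

/* Integer log2, assuming x is a power of two and x >= 1.  */
unsigned ilog2(unsigned x)
{
    unsigned r = 0;
    while (x > 1) {
        x >>= 1;
        r++;
    }
    return r;
}

/* Interleaving: each input vector passes through log2(group_size)
   permute stages (assumed base 2).  */
unsigned interleave_permutes(unsigned n_input_vectors, unsigned group_size)
{
    return n_input_vectors * ilog2(group_size);
}

/* SLP: stores are never permuted; assume one permute for each use of
   a load vector that needs its own permutation.  */
unsigned slp_permutes(unsigned n_permuted_load_uses)
{
    return n_permuted_load_uses;
}
```

Under these assumptions, 4 input vectors with a group size of 4 give 4 * 2 = 8 permutes for interleaving, so SLP needs fewer permutes whenever fewer than 8 load-vector uses need one. Which scheme is cheaper overall still depends on the per-permute cost on the target, which this sketch ignores.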