https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104723
--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to H.J. Lu from comment #8)
>
> DSE can remove redundant load/store for TI, but not OI/XI.

DSE can remove redundant load/store for OI/XI just fine, just remove the last 7 from the string so that it is 48 bytes instead of 49 and all of a sudden it works fine.
It is indeed due to:

> It is due to overlapping store.

this.
Wonder if we couldn't special case overlapping stores if they are loaded from constant pool and the overlapping bytes have the same values.

And for the backend, the question is how big the penalty for the overlapping store is compared to doing multiple non-overlapping stores.
Say for those 49 bytes one could do one OI, one TI/V1TI and one QI load/store as opposed to one aligned and one misaligned OI load/store.

For say:

void
foo (void *p, void *q)
{
  __builtin_memcpy (p, q, 49);
}

we emit the 2 overlapping loads/stores for -mavx512f and 4 non-overlapping loads/stores with say -mavx2.
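
To make the two 49-byte strategies concrete, here is a rough sketch in plain C of the byte ranges involved. It only models which bytes each move touches; it is not what GCC emits, and the helper names (copy49_overlapping, copy49_split, check_copies) are made up for illustration. It assumes the source and destination buffers do not overlap each other.

```c
#include <assert.h>
#include <string.h>

/* Sizes in bytes: the whole copy, and the OI/TI/QI mode widths. */
enum { N = 49, OI = 32, TI = 16, QI = 1 };

/* Two overlapping 32-byte moves: bytes 0..31 and bytes 17..48.
   Bytes 17..31 (15 bytes) are stored twice with the same values. */
static void copy49_overlapping(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, OI);
    memcpy(dst + (N - OI), src + (N - OI), OI);
}

/* Three non-overlapping moves: 32 + 16 + 1 bytes. */
static void copy49_split(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, OI);
    memcpy(dst + OI, src + OI, TI);
    memcpy(dst + OI + TI, src + OI + TI, QI);
}

/* Both strategies must produce the same 49 bytes. */
static void check_copies(void)
{
    unsigned char src[N], a[N], b[N];
    for (int i = 0; i < N; i++)
        src[i] = (unsigned char)(i * 7 + 1);
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    copy49_overlapping(a, src);
    copy49_split(b, src);
    assert(memcmp(a, src, N) == 0);
    assert(memcmp(b, src, N) == 0);
}
```

The overlapping variant needs fewer moves but writes 15 bytes twice, which is the case DSE currently does not see through; the split variant never writes a byte twice at the cost of one extra load/store pair.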