On Sat, Sep 28, 2013 at 3:05 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> > Nice extension. Test cases would be great to have.
>> For those you need the i386 changes to actually use the info.  I will post
>> that after some cleanup and additional testing.
>
> Hi,
> since I already caught your attention, here is the target specific part for
> comments.
>
> This patch implements memcpy/memset prologues and epilogues as suggested by
> Ondrej Bilka.  His glibc implementation uses an IMO very smart trick with a
> single misaligned move to copy the first N and last N bytes of the block.
> The remainder of the block is then copied by the usual loop that gets
> aligned to the proper address.
>
> This leads to a partial memory stall, but that is handled well by modern x86
> chips.
>
> For example, in the following testcase:
> char *a;
> char *b;
> t1()
> {
>   memcpy (a,b,140);
> }
>
> We now produce:
>         movq    b(%rip), %rsi
>         movq    a(%rip), %rcx
>         movq    (%rsi), %rax       <- first 8 bytes are moved
>         leaq    8(%rcx), %rdi
>         andq    $-8, %rdi          <- dest is aligned
>         movq    %rax, (%rcx)
>         movq    132(%rsi), %rax    <- last 8 bytes are moved
>         movq    %rax, 132(%rcx)
>         subq    %rdi, %rcx         <- alignment is subtracted from count
>         subq    %rcx, %rsi         <- source is aligned

This (source aligned) is not always true, but nevertheless the sequence is
very tight.

>         addl    $140, %ecx         <- normal copying of 8 byte chunks
>         shrl    $3, %ecx
>         rep; movsq
>         ret
>
> Of course it is quite common to know only an upper bound on the block.  In
> this case we need to generate a prologue for the first few bytes:
> char *p,*q;
> t(unsigned int a)
> {
>   if (a<100)
>     memcpy(q,p,a);
>
> }
> t:
> .LFB0:
>         .cfi_startproc
>         cmpl    $99, %edi
>         jbe     .L15
> .L7:
>         rep; ret
>         .p2align 4,,10
>         .p2align 3
> .L15:
>         cmpl    $8, %edi
>         movq    q(%rip), %rdx
>         movq    p(%rip), %rsi
>         jae     .L3
>         testb   $4, %dil
>         jne     .L16
>         testl   %edi, %edi
>         je      .L7
>         movzbl  (%rsi), %eax
>         testb   $2, %dil
>         movb    %al, (%rdx)
>         je      .L7
>         movl    %edi, %edi
>         movzwl  -2(%rsi,%rdi), %eax
>         movw    %ax, -2(%rdx,%rdi)
>         ret
>         .p2align 4,,10
>         .p2align 3
> .L3:
>         movq    (%rsi), %rax
>         movq    %rax, (%rdx)
>         movl    %edi, %eax
>         movq    -8(%rsi,%rax), %rcx
>         movq    %rcx, -8(%rdx,%rax)
>         leaq    8(%rdx), %rax
>         andq    $-8, %rax
>         subq    %rax, %rdx
>         addl    %edx, %edi
>         subq    %rdx, %rsi
>         shrl    $3, %edi
>         movl    %edi, %ecx
>         movq    %rax, %rdi
>         rep; movsq
>         ret
>         .p2align 4,,10
>         .p2align 3
> .L16:
>         movl    (%rsi), %eax
>         movl    %edi, %edi
>         movl    %eax, (%rdx)
>         movl    -4(%rsi,%rdi), %eax
>         movl    %eax, -4(%rdx,%rdi)
>         ret
>         .cfi_endproc
> .LFE0:
>
> Mainline would output a libcall here (because the size is unknown to it),
> and with inlining of all stringops it winds up at 210 bytes of code instead
> of the 142 bytes above.
>
> Unfortunately, the following testcase:
> char *p,*q;
> t(int a)
> {
>   if (a<100)
>     memcpy(q,p,a);
>
> }
> won't get inlined.  This is because a is known to be smaller than 100, which
> results in an anti-range after the conversion to size_t.  This anti-range
> allows very large values (above INT_MAX) and thus we do not know the block
> size.
> I am not sure if the sane range can be recovered somehow.  If not, maybe
> this is common enough to add support for a "probable" upper bound parameter
> to the template.

Do we know if there is real code that intentionally does that, other than
security flaws resulting from an improperly done range check?  I think by
default GCC should assume the memcpy size range is (0, 100) here, with
perhaps an option to override it.

thanks,

David

> Use of value ranges makes it harder to choose the proper algorithm since the
> average size is no longer known.  For the moment I take a simple average of
> the lower and upper bound, but this is wrong.
>
> A libcall starts to win only for pretty large blocks (over 4GB definitely),
> so it makes sense to inline functions with range 0...4096 even though the
> cost tables tell us to expand a libcall for everything bigger than 140
> bytes: if blocks are small we will get a noticeable win, and if blocks are
> big, we won't lose much.
>
> I am considering assigning value ranges to the algorithms, too, for more
> sane choices in decide_alg.
>
> I also think the misaligned move trick can/should be performed by
> move_by_pieces, and we ought to consider sane use of SSE - the current
> vector_loop with an unrolling factor of 4 seems a bit extreme.  At least
> Bulldozer is happy with 2, and I would expect SSE moves to be especially
> useful for moving blocks with known size, where they are not used at all.
>
> Currently I have disabled the misaligned move prologues/epilogues for
> Michael's vector loop path since they end up longer than the traditional
> code (which uses a loop for the epilogue).
>
> I will deal with that incrementally.
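To make the trick concrete, the 140-byte sequence quoted at the top
corresponds roughly to the following C model.  This is only a sketch, not
code from the patch; the fixed-size memcpy calls stand in for single,
possibly misaligned 8-byte loads and stores, and copy140_model is an
illustrative name.

#include <stdint.h>
#include <string.h>

/* Sketch of what the generated code for memcpy (a, b, 140) does,
   assuming 8-byte chunks and 8-byte desired alignment.  */
static void
copy140_model (char *dst, const char *src)
{
  uint64_t head, tail;
  char *adst;
  size_t skew;

  memcpy (&head, src, 8);               /* load first 8 bytes, misaligned OK */
  memcpy (dst, &head, 8);               /* store first 8 bytes */
  memcpy (&tail, src + 140 - 8, 8);     /* load last 8 bytes */
  memcpy (dst + 140 - 8, &tail, 8);     /* store last 8 bytes */

  /* Align the destination up; the head move above already covered the
     bytes skipped here.  */
  adst = (char *) (((uintptr_t) dst + 8) & ~(uintptr_t) 7);
  skew = (size_t) (adst - dst);

  /* Aligned bulk copy of whole 8-byte chunks (the "rep; movsq" part);
     the tail move above covers the remaining few bytes.  */
  memcpy (adst, src + skew, (140 - skew) & ~(size_t) 7);
}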
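To put the range problem in source terms: the unsigned variant above hands
the expander the range [0, 99], while the signed one keeps only the upper
bound, because negative values become huge after the implicit conversion to
size_t.  Guarding both bounds may recover the range, but that is an
assumption here, not something verified in the thread; the function names
below are purely illustrative.

#include <string.h>

char *p, *q;

/* Gets the inline misaligned-prologue expansion: a has range [0, 99].  */
void t_unsigned (unsigned int a)
{
  if (a < 100)
    memcpy (q, p, a);
}

/* Currently expands to a libcall: a may be negative, so after the
   conversion to size_t the size can be huge (anti-range).  */
void t_signed (int a)
{
  if (a < 100)
    memcpy (q, p, a);
}

/* Possible workaround (untested assumption): testing both bounds should
   give VRP the range [0, 99] even for the signed parameter.  */
void t_guarded (int a)
{
  if (a >= 0 && a < 100)
    memcpy (q, p, a);
}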
> > Bootstrapped/regtested x86_64-linux with and without misaligned move prologues > and also with -minline-all-stringop. > > I plan to do some furhter testing tomorrow and add testcases while waiting for > the generic part to be reviewed. > Comments are welcome. > > * i386.h (TARGET_MISALIGNED_MOVE_STRING_PROLOGUES_EPILOGUES): New > macro. > * i386.md (movmem, setmem insn patterms): Add new parameters for value > range. > * i386-protos.h (ix86_expand_movmem, ix86_expand_setmem): Add new > parameters for value range. > * i386.c (expand_small_movmem): New function. > (expand_movmem_prologue_epilogue_by_misaligned_moves): New function. > (expand_small_setmem): New function. > (expand_setmem_prologue_epilogue_by_misaligned_moves): New funcion. > (alg_usable_p): New function. > (decide_alg): Add zero_memset parameter; cleanup; consider value > ranges. > (ix86_expand_movmem): Add value range parameters; update block > comment; > add support for value ranges; add support for misaligned move > prologues. > (ix86_expand_setmem): Likewise. > Index: config/i386/i386.h > =================================================================== > --- config/i386/i386.h (revision 203004) > +++ config/i386/i386.h (working copy) > @@ -378,6 +378,8 @@ extern unsigned char ix86_tune_features[ > ix86_tune_features[X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE] > #define TARGET_SPLIT_MEM_OPND_FOR_FP_CONVERTS \ > ix86_tune_features[X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS] > +#define TARGET_MISALIGNED_MOVE_STRING_PROLOGUES_EPILOGUES \ > + > ix86_tune_features[X86_TUNE_MISALIGNED_MOVE_STRING_PROLOGUES_EPILOGUES] > > /* Feature tests against the various architecture variations. */ > enum ix86_arch_indices { > Index: config/i386/i386.md > =================================================================== > --- config/i386/i386.md (revision 203004) > +++ config/i386/i386.md (working copy) > @@ -15380,11 +15380,13 @@ > (use (match_operand:SWI48 2 "nonmemory_operand")) > (use (match_operand:SWI48 3 "const_int_operand")) > (use (match_operand:SI 4 "const_int_operand")) > - (use (match_operand:SI 5 "const_int_operand"))] > + (use (match_operand:SI 5 "const_int_operand")) > + (use (match_operand:SI 6 "const_int_operand")) > + (use (match_operand:SI 7 ""))] > "" > { > if (ix86_expand_movmem (operands[0], operands[1], operands[2], operands[3], > - operands[4], operands[5])) > + operands[4], operands[5], operands[6], operands[7])) > DONE; > else > FAIL; > @@ -15572,12 +15574,15 @@ > (use (match_operand:QI 2 "nonmemory_operand")) > (use (match_operand 3 "const_int_operand")) > (use (match_operand:SI 4 "const_int_operand")) > - (use (match_operand:SI 5 "const_int_operand"))] > + (use (match_operand:SI 5 "const_int_operand")) > + (use (match_operand:SI 6 "const_int_operand")) > + (use (match_operand:SI 7 ""))] > "" > { > if (ix86_expand_setmem (operands[0], operands[1], > operands[2], operands[3], > - operands[4], operands[5])) > + operands[4], operands[5], > + operands[6], operands[7])) > DONE; > else > FAIL; > Index: config/i386/x86-tune.def > =================================================================== > --- config/i386/x86-tune.def (revision 203004) > +++ config/i386/x86-tune.def (working copy) > @@ -230,3 +230,7 @@ DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CM > fp converts to destination register. */ > DEF_TUNE (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS, > "split_mem_opnd_for_fp_converts", > m_SLM) > +/* Use misaligned move to avoid need for conditionals for string operations. 
> + This is a win on targets that resulve resonably well partial memory > stalls. */ > +DEF_TUNE (X86_TUNE_MISALIGNED_MOVE_STRING_PROLOGUES_EPILOGUES, > "misaligned_move_string_prologues_epilogues", > + m_GENERIC | m_CORE_ALL | m_AMD_MULTIPLE | m_SLM | m_ATOM) > Index: config/i386/i386-protos.h > =================================================================== > --- config/i386/i386-protos.h (revision 203004) > +++ config/i386/i386-protos.h (working copy) > @@ -58,8 +58,8 @@ extern enum machine_mode ix86_cc_mode (e > extern int avx_vpermilp_parallel (rtx par, enum machine_mode mode); > extern int avx_vperm2f128_parallel (rtx par, enum machine_mode mode); > > -extern bool ix86_expand_movmem (rtx, rtx, rtx, rtx, rtx, rtx); > -extern bool ix86_expand_setmem (rtx, rtx, rtx, rtx, rtx, rtx); > +extern bool ix86_expand_movmem (rtx, rtx, rtx, rtx, rtx, rtx, rtx, rtx); > +extern bool ix86_expand_setmem (rtx, rtx, rtx, rtx, rtx, rtx, rtx, rtx); > extern bool ix86_expand_strlen (rtx, rtx, rtx, rtx); > > extern bool constant_address_p (rtx); > Index: config/i386/i386.c > =================================================================== > --- config/i386/i386.c (revision 203004) > +++ config/i386/i386.c (working copy) > @@ -22661,6 +22661,250 @@ expand_movmem_prologue (rtx destmem, rtx > return destmem; > } > > +/* Test if COUNT&SIZE is nonzero and if so, expand memmov > + sequence that is valid for SIZE..2*SIZE-1 bytes > + and jump to DONE_LABEL. */ > +static void > +expand_small_movmem (rtx destmem, rtx srcmem, > + rtx destptr, rtx srcptr, rtx count, > + int size, rtx done_label) > +{ > + rtx label = ix86_expand_aligntest (count, size, false); > + enum machine_mode mode = mode_for_size (size * BITS_PER_UNIT, MODE_INT, 1); > + rtx modesize; > + int n; > + > + if (size >= 32) > + mode = TARGET_AVX ? V32QImode : TARGET_SSE ? V16QImode : DImode; > + else if (size >= 16) > + mode = TARGET_SSE ? V16QImode : DImode; > + > + srcmem = change_address (srcmem, mode, srcptr); > + destmem = change_address (destmem, mode, destptr); > + modesize = GEN_INT (GET_MODE_SIZE (mode)); > + for (n = 0; n * GET_MODE_SIZE (mode) < size; n++) > + { > + emit_move_insn (destmem, srcmem); > + srcmem = offset_address (srcmem, modesize, GET_MODE_SIZE (mode)); > + destmem = offset_address (destmem, modesize, GET_MODE_SIZE (mode)); > + } > + > + srcmem = offset_address (srcmem, count, 1); > + destmem = offset_address (destmem, count, 1); > + srcmem = offset_address (srcmem, GEN_INT (-size - GET_MODE_SIZE (mode)), > + GET_MODE_SIZE (mode)); > + destmem = offset_address (destmem, GEN_INT (-size - GET_MODE_SIZE (mode)), > + GET_MODE_SIZE (mode)); > + emit_move_insn (destmem, srcmem); > + emit_jump_insn (gen_jump (done_label)); > + emit_barrier (); > + > + emit_label (label); > + LABEL_NUSES (label) = 1; > +} > + > +/* Handle small memcpy (up to SIZE that is supposed to be small power of 2. > + and get ready for the main memcpy loop by copying iniital > DESIRED_ALIGN-ALIGN > + bytes and last SIZE bytes adjusitng DESTPTR/SRCPTR/COUNT in a way we can > + proceed with an loop copying SIZE bytes at once. > + DONE_LABEL is a label after the whole copying sequence. The label is > created > + on demand if *DONE_LABEL is NULL. > + MIN_SIZE is minimal size of block copied. This value gets adjusted for > new > + bounds after the initial copies. > + > + DESTMEM/SRCMEM are memory expressions pointing to the copies block, > + DESTPTR/SRCPTR are pointers to the block. 
DYNAMIC_CHECK indicate whether > + we will dispatch to a library call for large blocks. > + > + In pseudocode we do: > + > + if (COUNT < SIZE) > + { > + Assume that SIZE is 4. Bigger sizes are handled analogously > + if (COUNT & 4) > + { > + copy 4 bytes from SRCPTR to DESTPTR > + copy 4 bytes from SRCPTR + COUNT - 4 to DESTPTR + COUNT - 4 > + goto done_label > + } > + if (!COUNT) > + goto done_label; > + copy 1 byte from SRCPTR to DESTPTR > + if (COUNT & 2) > + { > + copy 2 bytes from SRCPTR to DESTPTR > + copy 2 bytes from SRCPTR + COUNT - 2 to DESTPTR + COUNT - 2 > + } > + } > + else > + { > + copy at least DESIRED_ALIGN-ALIGN bytes from SRCPTR to DESTPTR > + copy SIZE bytes from SRCPTR + COUNT - SIZE to DESTPTR + COUNT -SIZE > + > + OLD_DESPTR = DESTPTR; > + Align DESTPTR up to DESIRED_ALIGN > + SRCPTR += DESTPTR - OLD_DESTPTR > + COUNT -= DEST_PTR - OLD_DESTPTR > + if (DYNAMIC_CHECK) > + Round COUNT down to multiple of SIZE > + << optional caller supplied zero size guard is here >> > + << optional caller suppplied dynamic check is here >> > + << caller supplied main copy loop is here >> > + } > + done_label: > + */ > +static void > +expand_movmem_prologue_epilogue_by_misaligned_moves (rtx destmem, rtx srcmem, > + rtx *destptr, rtx > *srcptr, > + rtx *count, > + rtx *done_label, > + int size, > + int desired_align, > + int align, > + unsigned HOST_WIDE_INT > *min_size, > + bool dynamic_check) > +{ > + rtx loop_label = NULL, label; > + enum machine_mode mode = mode_for_size (size * BITS_PER_UNIT, MODE_INT, 2); > + int n; > + rtx modesize; > + int prolog_size = 0; > + > + if (size >= 32) > + mode = TARGET_AVX ? V32QImode : TARGET_SSE ? V16QImode : DImode; > + else if (size >= 16) > + mode = TARGET_SSE ? V16QImode : DImode; > + > + /* See if block is big or small, handle small blocks. */ > + if (!CONST_INT_P (*count) && *min_size < (unsigned HOST_WIDE_INT)size) > + { > + int size2 = size; > + loop_label = gen_label_rtx (); > + > + if (!*done_label) > + *done_label = gen_label_rtx (); > + > + emit_cmp_and_jump_insns (*count, GEN_INT (size2), GE, 0, GET_MODE > (*count), > + 1, loop_label); > + size2 >>= 1; > + > + /* Handle sizes > 3. */ > + for (;size2 > 2; size2 >>= 1) > + expand_small_movmem (destmem, srcmem, *destptr, *srcptr, *count, > + size2, *done_label); > + /* Nothing to copy? Jump to DONE_LABEL if so */ > + emit_cmp_and_jump_insns (*count, const0_rtx, EQ, 0, GET_MODE (*count), > + 1, *done_label); > + > + /* Do a byte copy. */ > + srcmem = change_address (srcmem, QImode, *srcptr); > + destmem = change_address (destmem, QImode, *destptr); > + emit_move_insn (destmem, srcmem); > + > + /* Handle sizes 2 and 3. */ > + label = ix86_expand_aligntest (*count, 2, false); > + srcmem = change_address (srcmem, HImode, *srcptr); > + destmem = change_address (destmem, HImode, *destptr); > + srcmem = offset_address (srcmem, *count, 1); > + destmem = offset_address (destmem, *count, 1); > + srcmem = offset_address (srcmem, GEN_INT (-2), 2); > + destmem = offset_address (destmem, GEN_INT (-2), 2); > + emit_move_insn (destmem, srcmem); > + > + emit_label (label); > + LABEL_NUSES (label) = 1; > + emit_jump_insn (gen_jump (*done_label)); > + emit_barrier (); > + } > + else > + gcc_assert (*min_size >= (unsigned HOST_WIDE_INT)size > + || UINTVAL (*count) >= (unsigned HOST_WIDE_INT)size); > + > + /* Start memcpy for COUNT >= SIZE. */ > + if (loop_label) > + { > + emit_label (loop_label); > + LABEL_NUSES (loop_label) = 1; > + } > + > + /* Copy first desired_align bytes. 
*/ > + srcmem = change_address (srcmem, mode, *srcptr); > + destmem = change_address (destmem, mode, *destptr); > + modesize = GEN_INT (GET_MODE_SIZE (mode)); > + for (n = 0; prolog_size < desired_align - align; n++) > + { > + emit_move_insn (destmem, srcmem); > + srcmem = offset_address (srcmem, modesize, GET_MODE_SIZE (mode)); > + destmem = offset_address (destmem, modesize, GET_MODE_SIZE (mode)); > + prolog_size += GET_MODE_SIZE (mode); > + } > + > + > + /* Copy last SIZE bytes. */ > + srcmem = offset_address (srcmem, *count, 1); > + destmem = offset_address (destmem, *count, 1); > + srcmem = offset_address (srcmem, > + GEN_INT (-size - prolog_size), > + 1); > + destmem = offset_address (destmem, > + GEN_INT (-size - prolog_size), > + 1); > + emit_move_insn (destmem, srcmem); > + for (n = 1; n * GET_MODE_SIZE (mode) < size; n++) > + { > + srcmem = offset_address (srcmem, modesize, 1); > + destmem = offset_address (destmem, modesize, 1); > + emit_move_insn (destmem, srcmem); > + } > + > + /* Align destination. */ > + if (desired_align > 1 && desired_align > align) > + { > + rtx saveddest = *destptr; > + > + gcc_assert (desired_align <= size); > + /* Align destptr up, place it to new register. */ > + *destptr = expand_simple_binop (GET_MODE (*destptr), PLUS, *destptr, > + GEN_INT (prolog_size), > + NULL_RTX, 1, OPTAB_DIRECT); > + *destptr = expand_simple_binop (GET_MODE (*destptr), AND, *destptr, > + GEN_INT (-desired_align), > + *destptr, 1, OPTAB_DIRECT); > + /* See how many bytes we skipped. */ > + saveddest = expand_simple_binop (GET_MODE (*destptr), MINUS, saveddest, > + *destptr, > + saveddest, 1, OPTAB_DIRECT); > + /* Adjust srcptr and count. */ > + *srcptr = expand_simple_binop (GET_MODE (*srcptr), MINUS, *srcptr, > saveddest, > + *srcptr, 1, OPTAB_DIRECT); > + *count = expand_simple_binop (GET_MODE (*count), PLUS, *count, > + saveddest, *count, 1, OPTAB_DIRECT); > + /* We copied at most size + prolog_size. */ > + if (*min_size > (unsigned HOST_WIDE_INT)(size + prolog_size)) > + *min_size = (*min_size - size) & ~(unsigned HOST_WIDE_INT)(size - 1); > + else > + *min_size = 0; > + > + /* Our loops always round down the bock size, but for dispatch to > library > + we need precise value. */ > + if (dynamic_check) > + *count = expand_simple_binop (GET_MODE (*count), AND, *count, > + GEN_INT (-size), *count, 1, > OPTAB_DIRECT); > + } > + else > + { > + gcc_assert (prolog_size == 0); > + /* Decrease count, so we won't end up copying last word twice. */ > + if (!CONST_INT_P (*count)) > + *count = expand_simple_binop (GET_MODE (*count), PLUS, *count, > + constm1_rtx, *count, 1, OPTAB_DIRECT); > + else > + *count = GEN_INT ((UINTVAL (*count) - 1) & ~(unsigned > HOST_WIDE_INT)(size - 1)); > + if (*min_size) > + *min_size = (*min_size - 1) & ~(unsigned HOST_WIDE_INT)(size - 1); > + } > +} > + > /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN. > ALIGN_BYTES is how many bytes need to be copied. > The function updates DST and SRC, namely, it sets proper alignment. > @@ -22749,6 +22993,190 @@ expand_setmem_prologue (rtx destmem, rtx > gcc_assert (desired_alignment <= 8); > } > > +/* Test if COUNT&SIZE is nonzero and if so, expand setmem > + sequence that is valid for SIZE..2*SIZE-1 bytes > + and jump to DONE_LABEL. 
*/ > + > +static void > +expand_small_setmem (rtx destmem, > + rtx destptr, rtx count, > + rtx value, > + int size, rtx done_label) > +{ > + rtx label = ix86_expand_aligntest (count, size, false); > + enum machine_mode mode = mode_for_size (size * BITS_PER_UNIT, MODE_INT, 1); > + rtx modesize; > + int n; > + > + if (size >= 32) > + mode = TARGET_AVX ? V32QImode : TARGET_SSE ? V16QImode : DImode; > + else if (size >= 16) > + mode = TARGET_SSE ? V16QImode : DImode; > + > + destmem = change_address (destmem, mode, destptr); > + modesize = GEN_INT (GET_MODE_SIZE (mode)); > + for (n = 0; n * GET_MODE_SIZE (mode) < size; n++) > + { > + emit_move_insn (destmem, gen_lowpart (mode, value)); > + destmem = offset_address (destmem, modesize, GET_MODE_SIZE (mode)); > + } > + > + destmem = offset_address (destmem, count, 1); > + destmem = offset_address (destmem, GEN_INT (-size - GET_MODE_SIZE (mode)), > + GET_MODE_SIZE (mode)); > + emit_move_insn (destmem, gen_lowpart (mode, value)); > + emit_jump_insn (gen_jump (done_label)); > + emit_barrier (); > + > + emit_label (label); > + LABEL_NUSES (label) = 1; > +} > + > +/* Handle small memcpy (up to SIZE that is supposed to be small power of 2) > + and get ready for the main memset loop. > + See expand_movmem_prologue_epilogue_by_misaligned_moves for detailed > + description. */ > + > +static void > +expand_setmem_prologue_epilogue_by_misaligned_moves (rtx destmem, > + rtx *destptr, > + rtx *count, > + rtx value, > + rtx *done_label, > + int size, > + int desired_align, > + int align, > + unsigned HOST_WIDE_INT > *min_size, > + bool dynamic_check) > +{ > + rtx loop_label = NULL, label; > + enum machine_mode mode = mode_for_size (size * BITS_PER_UNIT, MODE_INT, 2); > + int n; > + rtx modesize; > + int prolog_size = 0; > + > + if (size >= 32) > + mode = TARGET_AVX ? V32QImode : TARGET_SSE ? V16QImode : DImode; > + else if (size >= 16) > + mode = TARGET_SSE ? V16QImode : DImode; > + > + /* See if block is big or small, handle small blocks. */ > + if (!CONST_INT_P (*count) && *min_size < (unsigned HOST_WIDE_INT)size) > + { > + int size2 = size; > + loop_label = gen_label_rtx (); > + > + if (!*done_label) > + *done_label = gen_label_rtx (); > + > + emit_cmp_and_jump_insns (*count, GEN_INT (size2), GE, 0, GET_MODE > (*count), > + 1, loop_label); > + size2 >>= 1; > + > + /* Handle sizes > 3. */ > + for (;size2 > 2; size2 >>= 1) > + expand_small_setmem (destmem, *destptr, *count, value, > + size2, *done_label); > + /* Nothing to copy? Jump to DONE_LABEL if so */ > + emit_cmp_and_jump_insns (*count, const0_rtx, EQ, 0, GET_MODE (*count), > + 1, *done_label); > + > + /* Do a byte copy. */ > + destmem = change_address (destmem, QImode, *destptr); > + emit_move_insn (destmem, gen_lowpart (QImode, value)); > + > + /* Handle sizes 2 and 3. */ > + label = ix86_expand_aligntest (*count, 2, false); > + destmem = change_address (destmem, HImode, *destptr); > + destmem = offset_address (destmem, *count, 1); > + destmem = offset_address (destmem, GEN_INT (-2), 2); > + emit_move_insn (destmem, gen_lowpart (HImode, value)); > + > + emit_label (label); > + LABEL_NUSES (label) = 1; > + emit_jump_insn (gen_jump (*done_label)); > + emit_barrier (); > + } > + else > + gcc_assert (*min_size >= (unsigned HOST_WIDE_INT)size > + || UINTVAL (*count) >= (unsigned HOST_WIDE_INT)size); > + > + /* Start memcpy for COUNT >= SIZE. */ > + if (loop_label) > + { > + emit_label (loop_label); > + LABEL_NUSES (loop_label) = 1; > + } > + > + /* Copy first desired_align bytes. 
*/ > + destmem = change_address (destmem, mode, *destptr); > + modesize = GEN_INT (GET_MODE_SIZE (mode)); > + for (n = 0; prolog_size < desired_align - align; n++) > + { > + emit_move_insn (destmem, gen_lowpart (mode, value)); > + destmem = offset_address (destmem, modesize, GET_MODE_SIZE (mode)); > + prolog_size += GET_MODE_SIZE (mode); > + } > + > + > + /* Copy last SIZE bytes. */ > + destmem = offset_address (destmem, *count, 1); > + destmem = offset_address (destmem, > + GEN_INT (-size - prolog_size), > + 1); > + emit_move_insn (destmem, gen_lowpart (mode, value)); > + for (n = 1; n * GET_MODE_SIZE (mode) < size; n++) > + { > + destmem = offset_address (destmem, modesize, 1); > + emit_move_insn (destmem, gen_lowpart (mode, value)); > + } > + > + /* Align destination. */ > + if (desired_align > 1 && desired_align > align) > + { > + rtx saveddest = *destptr; > + > + gcc_assert (desired_align <= size); > + /* Align destptr up, place it to new register. */ > + *destptr = expand_simple_binop (GET_MODE (*destptr), PLUS, *destptr, > + GEN_INT (prolog_size), > + NULL_RTX, 1, OPTAB_DIRECT); > + *destptr = expand_simple_binop (GET_MODE (*destptr), AND, *destptr, > + GEN_INT (-desired_align), > + *destptr, 1, OPTAB_DIRECT); > + /* See how many bytes we skipped. */ > + saveddest = expand_simple_binop (GET_MODE (*destptr), MINUS, saveddest, > + *destptr, > + saveddest, 1, OPTAB_DIRECT); > + /* Adjust count. */ > + *count = expand_simple_binop (GET_MODE (*count), PLUS, *count, > + saveddest, *count, 1, OPTAB_DIRECT); > + /* We copied at most size + prolog_size. */ > + if (*min_size > (unsigned HOST_WIDE_INT)(size + prolog_size)) > + *min_size = (*min_size - size) & ~(unsigned HOST_WIDE_INT)(size - 1); > + else > + *min_size = 0; > + > + /* Our loops always round down the bock size, but for dispatch to > library > + we need precise value. */ > + if (dynamic_check) > + *count = expand_simple_binop (GET_MODE (*count), AND, *count, > + GEN_INT (-size), *count, 1, > OPTAB_DIRECT); > + } > + else > + { > + gcc_assert (prolog_size == 0); > + /* Decrease count, so we won't end up copying last word twice. */ > + if (!CONST_INT_P (*count)) > + *count = expand_simple_binop (GET_MODE (*count), PLUS, *count, > + constm1_rtx, *count, 1, OPTAB_DIRECT); > + else > + *count = GEN_INT ((UINTVAL (*count) - 1) & ~(unsigned > HOST_WIDE_INT)(size - 1)); > + if (*min_size) > + *min_size = (*min_size - 1) & ~(unsigned HOST_WIDE_INT)(size - 1); > + } > +} > + > /* Set enough from DST to align DST known to by aligned by ALIGN to > DESIRED_ALIGN. ALIGN_BYTES is how many bytes need to be stored. */ > static rtx > @@ -22790,61 +23218,98 @@ expand_constant_setmem_prologue (rtx dst > return dst; > } > > -/* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation. */ > -static enum stringop_alg > -decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset, > - int *dynamic_check, bool *noalign) > +/* Return true if ALG can be used in current context. > + Assume we expand memset if MEMSET is true. */ > +static bool > +alg_usable_p (enum stringop_alg alg, bool memset) > { > - const struct stringop_algs * algs; > - bool optimize_for_speed; > + if (alg == no_stringop) > + return false; > + if (alg == vector_loop) > + return TARGET_SSE || TARGET_AVX; > /* Algorithms using the rep prefix want at least edi and ecx; > additionally, memset wants eax and memcpy wants esi. Don't > consider such algorithms if the user has appropriated those > registers for their own purposes. 
*/ > - bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG] > - || (memset > - ? fixed_regs[AX_REG] : fixed_regs[SI_REG])); > - *noalign = false; > + if (alg == rep_prefix_1_byte > + || alg == rep_prefix_4_byte > + || alg == rep_prefix_8_byte) > + return !(fixed_regs[CX_REG] || fixed_regs[DI_REG] > + || (memset ? fixed_regs[AX_REG] : fixed_regs[SI_REG])); > + return true; > +} > + > + > > -#define ALG_USABLE_P(alg) (rep_prefix_usable \ > - || (alg != rep_prefix_1_byte \ > - && alg != rep_prefix_4_byte \ > - && alg != rep_prefix_8_byte)) > +/* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation. */ > +static enum stringop_alg > +decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, > + unsigned HOST_WIDE_INT min_size, unsigned HOST_WIDE_INT max_size, > + bool memset, bool zero_memset, int *dynamic_check, bool *noalign) > +{ > + const struct stringop_algs * algs; > + bool optimize_for_speed; > + int max = -1; > const struct processor_costs *cost; > + int i; > + bool any_alg_usable_p = false; > + > + *noalign = false; > + *dynamic_check = -1; > > /* Even if the string operation call is cold, we still might spend a lot > of time processing large blocks. */ > if (optimize_function_for_size_p (cfun) > || (optimize_insn_for_size_p () > - && expected_size != -1 && expected_size < 256)) > + && (max_size < 256 > + || (expected_size != -1 && expected_size < 256)))) > optimize_for_speed = false; > else > optimize_for_speed = true; > > cost = optimize_for_speed ? ix86_cost : &ix86_size_cost; > - > - *dynamic_check = -1; > if (memset) > algs = &cost->memset[TARGET_64BIT != 0]; > else > algs = &cost->memcpy[TARGET_64BIT != 0]; > - if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg)) > + > + /* See maximal size for user defined algorithm. */ > + for (i = 0; i < MAX_STRINGOP_ALGS; i++) > + { > + enum stringop_alg candidate = algs->size[i].alg; > + bool usable = alg_usable_p (candidate, memset); > + any_alg_usable_p |= usable; > + > + if (candidate != libcall && candidate && usable) > + max = algs->size[i].max; > + } > + > + /* If expected size is not known but max size is small enough > + so inline version is a win, set expected size into > + the range. */ > + if (max > 1 && (unsigned HOST_WIDE_INT)max >= max_size && expected_size == > -1) > + expected_size = min_size / 2 + max_size / 2; > + > + /* If user specified the algorithm, honnor it if possible. */ > + if (ix86_stringop_alg != no_stringop > + && alg_usable_p (ix86_stringop_alg, memset)) > return ix86_stringop_alg; > /* rep; movq or rep; movl is the smallest variant. */ > else if (!optimize_for_speed) > { > - if (!count || (count & 3)) > - return rep_prefix_usable ? rep_prefix_1_byte : loop_1_byte; > + if (!count || (count & 3) || (memset && !zero_memset)) > + return alg_usable_p (rep_prefix_1_byte, memset) > + ? rep_prefix_1_byte : loop_1_byte; > else > - return rep_prefix_usable ? rep_prefix_4_byte : loop; > + return alg_usable_p (rep_prefix_4_byte, memset) > + ? rep_prefix_4_byte : loop; > } > - /* Very tiny blocks are best handled via the loop, REP is expensive to > setup. > - */ > + /* Very tiny blocks are best handled via the loop, REP is expensive to > + setup. 
*/ > else if (expected_size != -1 && expected_size < 4) > return loop_1_byte; > else if (expected_size != -1) > { > - unsigned int i; > enum stringop_alg alg = libcall; > for (i = 0; i < MAX_STRINGOP_ALGS; i++) > { > @@ -22858,7 +23323,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WI > { > enum stringop_alg candidate = algs->size[i].alg; > > - if (candidate != libcall && ALG_USABLE_P (candidate)) > + if (candidate != libcall && alg_usable_p (candidate, memset)) > alg = candidate; > /* Honor TARGET_INLINE_ALL_STRINGOPS by picking > last non-libcall inline algorithm. */ > @@ -22871,14 +23336,13 @@ decide_alg (HOST_WIDE_INT count, HOST_WI > return alg; > break; > } > - else if (ALG_USABLE_P (candidate)) > + else if (alg_usable_p (candidate, memset)) > { > *noalign = algs->size[i].noalign; > return candidate; > } > } > } > - gcc_assert (TARGET_INLINE_ALL_STRINGOPS || !rep_prefix_usable); > } > /* When asked to inline the call anyway, try to pick meaningful choice. > We look for maximal size of block that is faster to copy by hand and > @@ -22888,22 +23352,11 @@ decide_alg (HOST_WIDE_INT count, HOST_WI > If this turns out to be bad, we might simply specify the preferred > choice in ix86_costs. */ > if ((TARGET_INLINE_ALL_STRINGOPS || TARGET_INLINE_STRINGOPS_DYNAMICALLY) > - && (algs->unknown_size == libcall || !ALG_USABLE_P > (algs->unknown_size))) > + && (algs->unknown_size == libcall > + || !alg_usable_p (algs->unknown_size, memset))) > { > - int max = -1; > enum stringop_alg alg; > - int i; > - bool any_alg_usable_p = true; > - > - for (i = 0; i < MAX_STRINGOP_ALGS; i++) > - { > - enum stringop_alg candidate = algs->size[i].alg; > - any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate); > > - if (candidate != libcall && candidate > - && ALG_USABLE_P (candidate)) > - max = algs->size[i].max; > - } > /* If there aren't any usable algorithms, then recursing on > smaller sizes isn't going to find anything. Just return the > simple byte-at-a-time copy loop. */ > @@ -22916,15 +23369,16 @@ decide_alg (HOST_WIDE_INT count, HOST_WI > } > if (max == -1) > max = 4096; > - alg = decide_alg (count, max / 2, memset, dynamic_check, noalign); > + alg = decide_alg (count, max / 2, min_size, max_size, memset, > + zero_memset, dynamic_check, noalign); > gcc_assert (*dynamic_check == -1); > gcc_assert (alg != libcall); > if (TARGET_INLINE_STRINGOPS_DYNAMICALLY) > *dynamic_check = max; > return alg; > } > - return ALG_USABLE_P (algs->unknown_size) ? algs->unknown_size : libcall; > -#undef ALG_USABLE_P > + return (alg_usable_p (algs->unknown_size, memset) > + ? algs->unknown_size : libcall); > } > > /* Decide on alignment. We know that the operand is already aligned to ALIGN > @@ -22964,31 +23418,51 @@ decide_alignment (int align, > /* Expand string move (memcpy) operation. Use i386 string operations > when profitable. expand_setmem contains similar code. The code > depends upon architecture, block size and alignment, but always has > - the same overall structure: > + one of the following overall structures: > + > + Aligned move sequence: > + > + 1) Prologue guard: Conditional that jumps up to epilogues for small > + blocks that can be handled by epilogue alone. This is faster > + but also needed for correctness, since prologue assume the block > + is larger than the desired alignment. > + > + Optional dynamic check for size and libcall for large > + blocks is emitted here too, with -minline-stringops-dynamically. 
> + > + 2) Prologue: copy first few bytes in order to get destination > + aligned to DESIRED_ALIGN. It is emitted only when ALIGN is less > + than DESIRED_ALIGN and up to DESIRED_ALIGN - ALIGN bytes can be > + copied. We emit either a jump tree on power of two sized > + blocks, or a byte loop. > + > + 3) Main body: the copying loop itself, copying in SIZE_NEEDED chunks > + with specified algorithm. > > - 1) Prologue guard: Conditional that jumps up to epilogues for small > - blocks that can be handled by epilogue alone. This is faster > - but also needed for correctness, since prologue assume the block > - is larger than the desired alignment. > - > - Optional dynamic check for size and libcall for large > - blocks is emitted here too, with -minline-stringops-dynamically. > - > - 2) Prologue: copy first few bytes in order to get destination > - aligned to DESIRED_ALIGN. It is emitted only when ALIGN is less > - than DESIRED_ALIGN and up to DESIRED_ALIGN - ALIGN bytes can be > - copied. We emit either a jump tree on power of two sized > - blocks, or a byte loop. > + 4) Epilogue: code copying tail of the block that is too small to be > + handled by main body (or up to size guarded by prologue guard). > > - 3) Main body: the copying loop itself, copying in SIZE_NEEDED chunks > - with specified algorithm. > + Misaligned move sequence > > - 4) Epilogue: code copying tail of the block that is too small to be > - handled by main body (or up to size guarded by prologue guard). */ > + 1) missaligned move prologue/epilogue containing: > + a) Prologue handling small memory blocks and jumping to done_label > + (skipped if blocks are known to be large enough) > + b) Signle move copying first DESIRED_ALIGN-ALIGN bytes if alignment is > + needed by single possibly misaligned move > + (skipped if alignment is not needed) > + c) Copy of last SIZE_NEEDED bytes by possibly misaligned moves > + > + 2) Zero size guard dispatching to done_label, if needed > + > + 3) dispatch to library call, if needed, > + > + 3) Main body: the copying loop itself, copying in SIZE_NEEDED chunks > + with specified algorithm. */ > > bool > ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp, > - rtx expected_align_exp, rtx expected_size_exp) > + rtx expected_align_exp, rtx expected_size_exp, > + rtx min_size_exp, rtx max_size_exp) > { > rtx destreg; > rtx srcreg; > @@ -23006,6 +23480,9 @@ ix86_expand_movmem (rtx dst, rtx src, rt > bool noalign; > enum machine_mode move_mode = VOIDmode; > int unroll_factor = 1; > + unsigned HOST_WIDE_INT min_size = UINTVAL (min_size_exp); > + unsigned HOST_WIDE_INT max_size = max_size_exp ? UINTVAL (max_size_exp) : > -1; > + bool misaligned_prologue_used = false; > > if (CONST_INT_P (align_exp)) > align = INTVAL (align_exp); > @@ -23029,7 +23506,8 @@ ix86_expand_movmem (rtx dst, rtx src, rt > > /* Step 0: Decide on preferred algorithm, desired alignment and > size of chunks to be copied by main loop. */ > - alg = decide_alg (count, expected_size, false, &dynamic_check, &noalign); > + alg = decide_alg (count, expected_size, min_size, max_size, > + false, false, &dynamic_check, &noalign); > if (alg == libcall) > return false; > gcc_assert (alg != no_stringop); > @@ -23115,8 +23593,44 @@ ix86_expand_movmem (rtx dst, rtx src, rt > } > gcc_assert (desired_align >= 1 && align >= 1); > > + /* Misaligned move sequences handles both prologues and epilogues at once. 
> + Default code generation results in smaller code for large alignments and > + also avoids redundant job when sizes are known precisely. */ > + if (TARGET_MISALIGNED_MOVE_STRING_PROLOGUES_EPILOGUES > + && MAX (desired_align, epilogue_size_needed) <= 32 > + && ((desired_align > align && !align_bytes) > + || (!count && epilogue_size_needed > 1))) > + { > + /* Misaligned move prologue handled small blocks by itself. */ > + misaligned_prologue_used = true; > + expand_movmem_prologue_epilogue_by_misaligned_moves > + (dst, src, &destreg, &srcreg, &count_exp, &jump_around_label, > + desired_align < align > + ? MAX (desired_align, epilogue_size_needed) : > epilogue_size_needed, > + desired_align, align, &min_size, dynamic_check); > + src = change_address (src, BLKmode, srcreg); > + dst = change_address (dst, BLKmode, destreg); > + set_mem_align (dst, desired_align * BITS_PER_UNIT); > + epilogue_size_needed = 0; > + if (need_zero_guard && !min_size) > + { > + /* It is possible that we copied enough so the main loop will not > + execute. */ > + gcc_assert (size_needed > 1); > + if (jump_around_label == NULL_RTX) > + jump_around_label = gen_label_rtx (); > + emit_cmp_and_jump_insns (count_exp, > + GEN_INT (size_needed), > + LTU, 0, counter_mode (count_exp), 1, > jump_around_label); > + if (expected_size == -1 > + || expected_size < (desired_align - align) / 2 + size_needed) > + predict_jump (REG_BR_PROB_BASE * 20 / 100); > + else > + predict_jump (REG_BR_PROB_BASE * 60 / 100); > + } > + } > /* Ensure that alignment prologue won't copy past end of block. */ > - if (size_needed > 1 || (desired_align > 1 && desired_align > align)) > + else if (size_needed > 1 || (desired_align > 1 && desired_align > align)) > { > epilogue_size_needed = MAX (size_needed - 1, desired_align - align); > /* Epilogue always copies COUNT_EXP & EPILOGUE_SIZE_NEEDED bytes. > @@ -23135,8 +23649,9 @@ ix86_expand_movmem (rtx dst, rtx src, rt > goto epilogue; > } > } > - else > + else if (min_size < (unsigned HOST_WIDE_INT)epilogue_size_needed) > { > + gcc_assert (max_size >= (unsigned > HOST_WIDE_INT)epilogue_size_needed); > label = gen_label_rtx (); > emit_cmp_and_jump_insns (count_exp, > GEN_INT (epilogue_size_needed), > @@ -23176,18 +23691,23 @@ ix86_expand_movmem (rtx dst, rtx src, rt > > /* Step 2: Alignment prologue. */ > > - if (desired_align > align) > + if (desired_align > align && !misaligned_prologue_used) > { > if (align_bytes == 0) > { > - /* Except for the first move in epilogue, we no longer know > + /* Except for the first move in prologue, we no longer know > constant offset in aliasing info. It don't seems to worth > the pain to maintain it for the first move, so throw away > the info early. */ > src = change_address (src, BLKmode, srcreg); > dst = change_address (dst, BLKmode, destreg); > - dst = expand_movmem_prologue (dst, src, destreg, srcreg, count_exp, > align, > - desired_align); > + dst = expand_movmem_prologue (dst, src, destreg, srcreg, count_exp, > + align, desired_align); > + /* At most desired_align - align bytes are copied. 
*/ > + if (min_size < (unsigned)(desired_align - align)) > + min_size = 0; > + else > + min_size -= desired_align; > } > else > { > @@ -23200,6 +23720,7 @@ ix86_expand_movmem (rtx dst, rtx src, rt > count -= align_bytes; > } > if (need_zero_guard > + && !min_size > && (count < (unsigned HOST_WIDE_INT) size_needed > || (align_bytes == 0 > && count < ((unsigned HOST_WIDE_INT) size_needed > @@ -23227,7 +23748,7 @@ ix86_expand_movmem (rtx dst, rtx src, rt > label = NULL; > epilogue_size_needed = 1; > } > - else if (label == NULL_RTX) > + else if (label == NULL_RTX && !misaligned_prologue_used) > epilogue_size_needed = size_needed; > > /* Step 3: Main loop. */ > @@ -23395,7 +23916,8 @@ promote_duplicated_reg_to_size (rtx val, > steps performed. */ > bool > ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp, > - rtx expected_align_exp, rtx expected_size_exp) > + rtx expected_align_exp, rtx expected_size_exp, > + rtx min_size_exp, rtx max_size_exp) > { > rtx destreg; > rtx label = NULL; > @@ -23414,6 +23936,9 @@ ix86_expand_setmem (rtx dst, rtx count_e > bool noalign; > enum machine_mode move_mode = VOIDmode; > int unroll_factor; > + unsigned HOST_WIDE_INT min_size = UINTVAL (min_size_exp); > + unsigned HOST_WIDE_INT max_size = max_size_exp ? UINTVAL (max_size_exp) : > -1; > + bool misaligned_prologue_used = false; > > if (CONST_INT_P (align_exp)) > align = INTVAL (align_exp); > @@ -23433,7 +23958,8 @@ ix86_expand_setmem (rtx dst, rtx count_e > /* Step 0: Decide on preferred algorithm, desired alignment and > size of chunks to be copied by main loop. */ > > - alg = decide_alg (count, expected_size, true, &dynamic_check, &noalign); > + alg = decide_alg (count, expected_size, min_size, max_size, > + true, val_exp == const0_rtx, &dynamic_check, &noalign); > if (alg == libcall) > return false; > gcc_assert (alg != no_stringop); > @@ -23508,8 +24034,47 @@ ix86_expand_setmem (rtx dst, rtx count_e > if (CONST_INT_P (val_exp)) > promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed, > desired_align, align); > + /* Misaligned move sequences handles both prologues and epilogues at once. > + Default code generation results in smaller code for large alignments and > + also avoids redundant job when sizes are known precisely. */ > + if (TARGET_MISALIGNED_MOVE_STRING_PROLOGUES_EPILOGUES > + && MAX (desired_align, epilogue_size_needed) <= 32 > + && ((desired_align > align && !align_bytes) > + || (!count && epilogue_size_needed > 1))) > + { > + /* We always need promoted value. */ > + if (!promoted_val) > + promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed, > + desired_align, align); > + /* Misaligned move prologue handled small blocks by itself. */ > + misaligned_prologue_used = true; > + expand_setmem_prologue_epilogue_by_misaligned_moves > + (dst, &destreg, &count_exp, promoted_val, &jump_around_label, > + desired_align < align > + ? MAX (desired_align, epilogue_size_needed) : epilogue_size_needed, > + desired_align, align, &min_size, dynamic_check); > + dst = change_address (dst, BLKmode, destreg); > + set_mem_align (dst, desired_align * BITS_PER_UNIT); > + epilogue_size_needed = 0; > + if (need_zero_guard && !min_size) > + { > + /* It is possible that we copied enough so the main loop will not > + execute. 
*/ > + gcc_assert (size_needed > 1); > + if (jump_around_label == NULL_RTX) > + jump_around_label = gen_label_rtx (); > + emit_cmp_and_jump_insns (count_exp, > + GEN_INT (size_needed), > + LTU, 0, counter_mode (count_exp), 1, > jump_around_label); > + if (expected_size == -1 > + || expected_size < (desired_align - align) / 2 + size_needed) > + predict_jump (REG_BR_PROB_BASE * 20 / 100); > + else > + predict_jump (REG_BR_PROB_BASE * 60 / 100); > + } > + } > /* Ensure that alignment prologue won't copy past end of block. */ > - if (size_needed > 1 || (desired_align > 1 && desired_align > align)) > + else if (size_needed > 1 || (desired_align > 1 && desired_align > align)) > { > epilogue_size_needed = MAX (size_needed - 1, desired_align - align); > /* Epilogue always copies COUNT_EXP & (EPILOGUE_SIZE_NEEDED - 1) bytes. > @@ -23534,8 +24099,9 @@ ix86_expand_setmem (rtx dst, rtx count_e > goto epilogue; > } > } > - else > + else if (min_size < (unsigned HOST_WIDE_INT)epilogue_size_needed) > { > + gcc_assert (max_size >= (unsigned > HOST_WIDE_INT)epilogue_size_needed); > label = gen_label_rtx (); > emit_cmp_and_jump_insns (count_exp, > GEN_INT (epilogue_size_needed), > @@ -23566,7 +24132,7 @@ ix86_expand_setmem (rtx dst, rtx count_e > desired_align, align); > gcc_assert (desired_align >= 1 && align >= 1); > > - if (desired_align > align) > + if (desired_align > align && !misaligned_prologue_used) > { > if (align_bytes == 0) > { > @@ -23577,6 +24143,11 @@ ix86_expand_setmem (rtx dst, rtx count_e > dst = change_address (dst, BLKmode, destreg); > expand_setmem_prologue (dst, destreg, promoted_val, count_exp, > align, > desired_align); > + /* At most desired_align - align bytes are copied. */ > + if (min_size < (unsigned)(desired_align - align)) > + min_size = 0; > + else > + min_size -= desired_align; > } > else > { > @@ -23589,6 +24160,7 @@ ix86_expand_setmem (rtx dst, rtx count_e > count -= align_bytes; > } > if (need_zero_guard > + && !min_size > && (count < (unsigned HOST_WIDE_INT) size_needed > || (align_bytes == 0 > && count < ((unsigned HOST_WIDE_INT) size_needed > @@ -23617,7 +24189,7 @@ ix86_expand_setmem (rtx dst, rtx count_e > promoted_val = val_exp; > epilogue_size_needed = 1; > } > - else if (label == NULL_RTX) > + else if (label == NULL_RTX && !misaligned_prologue_used) > epilogue_size_needed = size_needed; > > /* Step 3: Main loop. */
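
For reference, the small-block path described by the pseudocode comment in
expand_movmem_prologue_epilogue_by_misaligned_moves (and visible as the
.L15/.L16/.L7 branches in the t(unsigned int a) assembly quoted earlier)
behaves roughly like the following C model for SIZE == 8.  This is a sketch
only: small_copy_model is an illustrative name, and the fixed-size memcpy
calls stand in for single, possibly misaligned moves.

#include <string.h>

/* Model of the generated small-block path for SIZE == 8: every branch
   finishes the whole copy with at most two possibly overlapping moves,
   so no byte loop is needed.  Caller guarantees n < 8; larger blocks
   take the main aligned loop.  */
static void
small_copy_model (char *dst, const char *src, unsigned int n)
{
  if (n & 4)
    {
      memcpy (dst, src, 4);                  /* first 4 bytes */
      memcpy (dst + n - 4, src + n - 4, 4);  /* last 4 bytes, may overlap */
      return;
    }
  if (n == 0)
    return;
  dst[0] = src[0];                           /* first byte */
  if (n & 2)
    memcpy (dst + n - 2, src + n - 2, 2);    /* last 2 bytes, may overlap */
}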