Hi Jan,

> I also think the misaligned move trick can/should be performed by
> move_by_pieces and we ought to consider sane use of SSE - current vector_loop
> with unrolling factor of 4 seems bit extreme. At least buldozer is happy with
> 2 and I would expect SSE moves to be especially useful for moving blocks with
> known size where they are not used at all.
>
> Currently I disabled misaligned move prologues/epilogues for Michael's vector
> loop path since they ends up longer than the traditional code (that use loop
> for epilogue)

Prologues could use this technique even with vector_loop, as they don't
actually have a loop.
As for epilogues - have you tried using misaligned vector moves (movdqu)?  It
looks to me that we need approximately the same number of instructions in the
vector-loop and usual-loop epilogues if we use vector instructions in the
vector-loop epilogue.
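
To illustrate the misaligned-move idea for an epilogue, here is a minimal
sketch in C, assuming the remaining tail is between 16 and 32 bytes.  The
helper names (chunk16, load16, store16, copy_16_to_32) are made up for
illustration and are not from i386.c; the fixed-size memcpy calls stand in for
movdqu, which is what a compiler would typically emit for them on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 16-byte chunk, standing in for an XMM register.  */
typedef struct { unsigned char b[16]; } chunk16;

/* One unaligned 16-byte load (models movdqu mem -> reg).  */
static chunk16 load16(const unsigned char *p)
{
    chunk16 c;
    memcpy(c.b, p, 16);
    return c;
}

/* One unaligned 16-byte store (models movdqu reg -> mem).  */
static void store16(unsigned char *p, chunk16 c)
{
    memcpy(p, c.b, 16);
}

/* Copy n bytes, 16 <= n <= 32, without a loop: one move covers the
   first 16 bytes, the other covers the last 16 bytes.  The two moves
   overlap whenever n < 32, which is harmless because both loads are
   done before either store.  */
static void copy_16_to_32(unsigned char *dst, const unsigned char *src,
                          size_t n)
{
    assert(n >= 16 && n <= 32);
    chunk16 head = load16(src);
    chunk16 tail = load16(src + n - 16);
    store16(dst, head);
    store16(dst + n - 16, tail);
}
```

With this shape the epilogue is four moves regardless of the exact tail size,
which is the kind of count I had in mind when comparing against the loop-based
epilogue.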

> Comments are welcome.

BTW, maybe we could generalize expand_small_movmem a bit and make a separate
expanding strategy out of it?  It will expand a memmov with no loop (and
probably no alignment prologue) - just with the biggest available moves.  Also,
a cost model could be added here to decide when we actually want to align the
moves.  Here are a couple of examples of that:

memcpy (a, b, 32); // alignment is unknown
will expand to
  movdqu a, %xmm0
  movdqu a+16, %xmm1
  movdqu %xmm0, b
  movdqu %xmm1, b+16

memcpy (a, b, 32); // alignment is known and equals 64bit
will expand to
a)
  movdqu a, %xmm0
  movdqu a+16, %xmm1
  movdqu %xmm0, b
  movdqu %xmm1, b+16
or
b)
  movq   a, %xmm0
  movdqa a+8, %xmm1
  movq   a+24,%xmm2
  movq   %xmm0, b
  movdqa %xmm1, b+8
  movq   %xmm2, b+24

We would compute the total cost of both variants and choose the best - for the
computation we just need the costs of aligned and misaligned moves.

This strategy is actually pretty similar to move_by_pieces, but as it has much
more target-specific information, it would be able to produce much more
efficient code.

And one more question - while working on vector_loop for memset I tried to
merge many of the movmem and setmem routines (e.g. instead of
expand_movmem_prologue and expand_setmem_prologue I made a single routine
expand_movmem_or_setmem_prologue).  What do you think, is it a good idea?  It
reduces code size in i386.c, but slightly complicates the code.  I'll send a
patch shortly, as soon as the testing completes.

Thanks, Michael
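
P.S. A rough sketch in C of the cost comparison described above, only to make
the idea concrete.  Everything here is hypothetical: the names (move_kind,
move_costs, sequence_cost, choose_variant) do not exist in GCC, and it assumes
a single cost per aligned or misaligned move regardless of move size, which a
real cost table would probably refine.

```c
#include <stddef.h>

/* Hypothetical classification of a single move in an expansion.  */
enum move_kind { MOVE_ALIGNED, MOVE_MISALIGNED };

/* Hypothetical per-target costs; real values would come from the
   target's cost tables.  */
struct move_costs {
    int aligned;
    int misaligned;
};

/* Sum the cost of every move in a candidate sequence.  */
static int sequence_cost(const enum move_kind *seq, size_t n,
                         const struct move_costs *costs)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += (seq[i] == MOVE_ALIGNED) ? costs->aligned
                                          : costs->misaligned;
    return total;
}

/* Return 0 to pick variant a, 1 to pick variant b.  Ties go to a.  */
static int choose_variant(const enum move_kind *a, size_t na,
                          const enum move_kind *b, size_t nb,
                          const struct move_costs *costs)
{
    return sequence_cost(b, nb, costs) < sequence_cost(a, na, costs);
}

/* Variant a from the example: four misaligned 16-byte moves.  */
static const enum move_kind variant_a[4] = {
    MOVE_MISALIGNED, MOVE_MISALIGNED, MOVE_MISALIGNED, MOVE_MISALIGNED
};

/* Variant b from the example: six moves, all treated as aligned.  */
static const enum move_kind variant_b[6] = {
    MOVE_ALIGNED, MOVE_ALIGNED, MOVE_ALIGNED,
    MOVE_ALIGNED, MOVE_ALIGNED, MOVE_ALIGNED
};
```

So with these assumptions variant b only wins when a misaligned move costs
more than 1.5x an aligned one (4 misaligned moves against 6 aligned ones).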