Hi Jan,

> I also think the misaligned move trick can/should be performed by
> move_by_pieces and we ought to consider sane use of SSE - current vector_loop
> with unrolling factor of 4 seems bit extreme.  At least buldozer is happy with
> 2 and I would expect SSE moves to be especially useful for moving blocks with
> known size where they are not used at all.
> 
> Currently I disabled misaligned move prologues/epilogues for Michael's vector
> loop path since they ends up longer than the traditional code (that use loop
> for epilogue)
Prologues could use this techniques even with vector_loop, as they actually
don't have a loop.
As for epilogues - have you tried to use misaligned vector_moves (movdqu)?  It
looks to me that we need approx. the same amount of instructions in vector-loop
and in usual-loop epilogues, if we use vector-instructions in vector-loop
epilogue.

> Comments are welcome.
BTW, maybe we could generalize expand_small_movmem a bit and make a separate
expanding strategy out of it?  It will expand a memmov with no loop (and
probably no alignment prologue) - just with the biggest available moves.  Also,
a cost model could be added here to make decisions on when we actually want to
align the moves.  Here is a couple of examples of that:

memcpy (a, b, 32); // alignment is unknown
will expand to
  movdqu a, %xmm0
  movdqu a+16, %xmm1
  movdqu %xmm0, b
  movdqu %xmm1, b+16
memcpy (a, b, 32); // alignment is known and equals 64bit
will expand to
a)
  movdqu a, %xmm0
  movdqu a+16, %xmm1
  movdqu %xmm0, b
  movdqu %xmm1, b+16
or b)
  movq    a,   %xmm0
  movdqa  a+8, %xmm1
  movq    a+24,%xmm2
  movq    %xmm0, b
  movdqa  %xmm1, b+8
  movq    %xmm2, b+24

We would compute the total cost of both variant and choose the best - for
computation we need just a cost of aligned and misaligned moves.

This strategy is actually pretty similar to move_by_pieces, but as it have much
more target-specific information, it would be able to produce much more
effective code.

And one more question - in a work on vector_loop for memset I tried to merge
many of movmem and setmem routines (e.g. instead of expand_movmem_prologue and
expand_setmem_prologue I made a single routine
expand_movmem_or_setmem_prologue).  What do you think, is it a good idea?  It
reduces code size in i386.c, but slightly complicates the code.  I'll send a
patch shortly, as soon as the testing completes.

Thanks, Michael

Reply via email to