Hi,

I relied on gcc's optimization passes to move things around for me. This is
what _mesa_streaming_clamp_float_rgba really looks like when I compile it:

Dump of assembler code for function _mesa_streaming_clamp_float_rgba:
   0x00007ffff401a0a0 <+0>:     test   %edi,%edi
   0x00007ffff401a0a2 <+2>:     je     0x7ffff401a0d7 <_mesa_streaming_clamp_float_rgba+55>
   0x00007ffff401a0a4 <+4>:     sub    $0x1,%edi
   0x00007ffff401a0a7 <+7>:     shufps $0x0,%xmm0,%xmm0
   0x00007ffff401a0ab <+11>:    shufps $0x0,%xmm1,%xmm1
   0x00007ffff401a0af <+15>:    add    $0x1,%rdi
   0x00007ffff401a0b3 <+19>:    shl    $0x4,%rdi
   0x00007ffff401a0b7 <+23>:    xor    %eax,%eax
   0x00007ffff401a0b9 <+25>:    nopl   0x0(%rax)
   0x00007ffff401a0c0 <+32>:    movups (%rsi,%rax,1),%xmm2
   0x00007ffff401a0c4 <+36>:    maxps  %xmm0,%xmm2
   0x00007ffff401a0c7 <+39>:    minps  %xmm1,%xmm2
   0x00007ffff401a0ca <+42>:    movups %xmm2,(%rdx,%rax,1)
   0x00007ffff401a0ce <+46>:    add    $0x10,%rax
   0x00007ffff401a0d2 <+50>:    cmp    %rdi,%rax
   0x00007ffff401a0d5 <+53>:    jne    0x7ffff401a0c0 <_mesa_streaming_clamp_float_rgba+32>
   0x00007ffff401a0d7 <+55>:    repz retq
End of assembler dump.

After inlining, gcc has moved all the loop-invariant work outside the loop,
so I can still keep the _mesa_clamp_float_rgba function available for generic
use at the source level. I trusted gcc with the unrolling here as well.
Looking at the loop, unrolling would save three instructions per round, but I
suspect add/cmp/jne are not the expensive instructions here (I didn't check).
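
To illustrate the structure I mean, here is a minimal sketch. It is not the
actual patch: the sketch_* names and the flat float-array signatures are made
up for this example. The broadcast sits inside the generic helper at source
level, and the disassembly above shows gcc hoisting it (the two shufps) out of
the loop once the helper is inlined.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical generic helper: clamps one RGBA float pixel to [min, max].
 * The _mm_set1_ps broadcasts are loop-invariant from the caller's view. */
static inline void
sketch_clamp_float_rgba(const float *src, float *dst, float min, float max)
{
   __m128 vmin = _mm_set1_ps(min);
   __m128 vmax = _mm_set1_ps(max);
   __m128 v = _mm_loadu_ps(src);     /* unaligned load, like movups */

   v = _mm_max_ps(v, vmin);          /* raise values below min */
   v = _mm_min_ps(v, vmax);          /* lower values above max */
   _mm_storeu_ps(dst, v);            /* unaligned store */
}

/* Hypothetical streaming wrapper: n is the number of RGBA pixels, so src
 * and dst each hold 4 * n floats. */
static void
sketch_streaming_clamp_float_rgba(size_t n, const float *src, float *dst,
                                  float min, float max)
{
   for (size_t i = 0; i < n; i++)
      sketch_clamp_float_rgba(src + 4 * i, dst + 4 * i, min, max);
}
```

Whether the compiler really hoists the broadcasts depends on the helper being
inlined, so it is worth checking the disassembly for your own build.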

Out-of-order execution might be interesting to try here, though. I need
to check whether I can get gcc to behave properly; I have never attempted
that with intrinsics on gcc before :)
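
For reference, a manually unrolled variant could look roughly like the sketch
below. This is only an assumption of how it might be written, with a made-up
function name and signature; I have not measured it. It takes the __m128
min/max directly and processes two pixels per round, so the two max/min
chains are independent of each other.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical two-way unrolled clamp. n is the number of RGBA pixels;
 * vmin and vmax are already broadcast by the caller. */
static void
sketch_clamp_float_rgba_unrolled(size_t n, const float *src, float *dst,
                                 __m128 vmin, __m128 vmax)
{
   size_t i = 0;

   /* Two pixels per round; a and b do not depend on each other. */
   for (; i + 2 <= n; i += 2) {
      __m128 a = _mm_loadu_ps(src + 4 * i);
      __m128 b = _mm_loadu_ps(src + 4 * i + 4);

      a = _mm_min_ps(_mm_max_ps(a, vmin), vmax);
      b = _mm_min_ps(_mm_max_ps(b, vmin), vmax);

      _mm_storeu_ps(dst + 4 * i, a);
      _mm_storeu_ps(dst + 4 * i + 4, b);
   }

   /* Leftover pixel when n is odd. */
   if (i < n) {
      __m128 a = _mm_loadu_ps(src + 4 * i);

      a = _mm_min_ps(_mm_max_ps(a, vmin), vmax);
      _mm_storeu_ps(dst + 4 * i, a);
   }
}
```

A caller would pass _mm_set1_ps(min) and _mm_set1_ps(max) once, outside any
loop of its own.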

/Juha-Pekka



On 04.11.2014 19:35, Siavash Eliasi wrote:
> Hello. I'd get rid of "_mm_set1_ps" inside "_mesa_clamp_float_rgba" by
> passing _m128 version of min/max directly, so "_mm_set1_ps" will be
> moved out of the for loop.
> 
> I'd also unroll the "_mesa_streaming_clamp_float_rgba" loop to minimize
> the loop overhead (and utilize out of order execution as a bonus),
> because nothing compute intensive is happening there. You can also use
> prefetching (_mm_prefetch) there to improve performance by reading data
> ahead from memory.
> 
> Best regards,
> Siavash Eliasi.
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
