Hi,

I did rely on gcc's optimizer to move things around for me. What _mesa_streaming_clamp_float_rgba really looks like when I compile it is this:

Dump of assembler code for function _mesa_streaming_clamp_float_rgba:
   0x00007ffff401a0a0 <+0>:     test   %edi,%edi
   0x00007ffff401a0a2 <+2>:     je     0x7ffff401a0d7 <_mesa_streaming_clamp_float_rgba+55>
   0x00007ffff401a0a4 <+4>:     sub    $0x1,%edi
   0x00007ffff401a0a7 <+7>:     shufps $0x0,%xmm0,%xmm0
   0x00007ffff401a0ab <+11>:    shufps $0x0,%xmm1,%xmm1
   0x00007ffff401a0af <+15>:    add    $0x1,%rdi
   0x00007ffff401a0b3 <+19>:    shl    $0x4,%rdi
   0x00007ffff401a0b7 <+23>:    xor    %eax,%eax
   0x00007ffff401a0b9 <+25>:    nopl   0x0(%rax)
   0x00007ffff401a0c0 <+32>:    movups (%rsi,%rax,1),%xmm2
   0x00007ffff401a0c4 <+36>:    maxps  %xmm0,%xmm2
   0x00007ffff401a0c7 <+39>:    minps  %xmm1,%xmm2
   0x00007ffff401a0ca <+42>:    movups %xmm2,(%rdx,%rax,1)
   0x00007ffff401a0ce <+46>:    add    $0x10,%rax
   0x00007ffff401a0d2 <+50>:    cmp    %rdi,%rax
   0x00007ffff401a0d5 <+53>:    jne    0x7ffff401a0c0 <_mesa_streaming_clamp_float_rgba+32>
   0x00007ffff401a0d7 <+55>:    repz retq
End of assembler dump.

After inlining, gcc has moved all the unnecessary stuff (the two shufps broadcasts of min and max) outside the loop, but I can still keep the _mesa_clamp_float_rgba function ready for generic use at source level.

I also trusted gcc here with the unrolling. Unrolling the loop would save three instructions per round, but I suspect add/cmp/jne are not the expensive instructions here (I didn't check). Out-of-order execution might be interesting to try here, though. I need to check whether I can get gcc to behave properly; I have never attempted that with intrinsics on gcc before :)

/Juha-Pekka

On 04.11.2014 19:35, Siavash Eliasi wrote:
> Hello. I'd get rid of "_mm_set1_ps" inside "_mesa_clamp_float_rgba" by
> passing _m128 version of min/max directly, so "_mm_set1_ps" will be
> moved out of the for loop.
>
> I'd also unroll the "_mesa_streaming_clamp_float_rgba" loop to minimize
> the loop overhead (and utilize out of order execution as a bonus),
> because nothing compute intensive is happening there. You can also use
> prefetching (_mm_prefetch) there to improve performance by reading data
> ahead from memory.
>
> Best regards,
> Siavash Eliasi.
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
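
For reference, the suggestion discussed in this thread (broadcast min/max once before the loop, then unroll the loop by hand) could look roughly like the sketch below. It is only an illustration: the function name, its signature and the 2x unroll factor are made up for this example and are not the Mesa code. Whether it beats the loop gcc already generates has not been measured.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics: _mm_set1_ps, _mm_max_ps, ... */

/* Hypothetical helper, NOT the actual Mesa function.
 * Clamps n RGBA pixels (4 floats each) from src into dst. */
static void
clamp_rgba_unrolled_sketch(size_t n, const float *src, float *dst,
                           float min, float max)
{
   /* Broadcast the scalars once, outside the loop. */
   const __m128 vmin = _mm_set1_ps(min);
   const __m128 vmax = _mm_set1_ps(max);
   size_t i = 0;

   /* Two pixels per iteration: the two load/clamp/store chains are
    * independent, so the CPU is free to overlap them. */
   for (; i + 2 <= n; i += 2) {
      __m128 a = _mm_loadu_ps(src + 4 * i);
      __m128 b = _mm_loadu_ps(src + 4 * (i + 1));
      a = _mm_min_ps(_mm_max_ps(a, vmin), vmax);
      b = _mm_min_ps(_mm_max_ps(b, vmin), vmax);
      _mm_storeu_ps(dst + 4 * i, a);
      _mm_storeu_ps(dst + 4 * (i + 1), b);
   }

   /* Leftover pixel when n is odd. */
   if (i < n) {
      __m128 a = _mm_loadu_ps(src + 4 * i);
      _mm_storeu_ps(dst + 4 * i, _mm_min_ps(_mm_max_ps(a, vmin), vmax));
   }
}
```

The unaligned load/store variants are used because the disassembly above uses movups; if the buffers were known to be 16-byte aligned, the aligned variants would be an option.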