------- Comment #6 from d at teklibre dot com  2009-04-22 15:40 -------
Pinska: Actually, no. I started with the intrinsics and looked hard at what the
code scheduler was doing before settling on rewriting this in inline assembly. 

The intrinsics have several problems that affect the code quality in this case.

1) They don't load directly from memory for many instructions, such as
cvtps2pd. Doing one-liners for stuff like that is feasible but even harder to
understand and debug than pure assembly.  Gcc also seems to have a misguided
sense of how many clocks cvtX2Y instructions take.

2) For the combination of intrinsics, C, and assembly, gcc was generating a
lot of extra instructions: promoting ints to longs, leas, etc.

3) The optimizer tends to push prefetches to the end of the loop when they
really need to happen as early as possible. This particular bit of code *might*
benefit from prefetching (it is not a very predictable access pattern) but at
the end of the loop prefetches hurt more than they help.

4) This code is right up against the edge of the x86_64 register set: all the
xmm registers (for 8 channel resampling) and 7 integer registers.

5) You can't use push/pop across multiple bits of inline assembly.
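
To illustrate point 1, here is a minimal sketch (not the actual resampler
code) of cvtps2pd reading its source straight from memory via an "m"
constraint in gcc extended asm. The function name widen_floats and the
operand names are made up for this example, and the non-x86_64 branch is only
a plain C fallback so the snippet builds elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: widen n floats to doubles, two at a time.
   n must be even. */
static void widen_floats(const float *src, double *dst, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
#if defined(__x86_64__) && defined(__GNUC__)
        /* cvtps2pd takes its 64-bit source directly from memory (the "m"
           operand), so no separate load instruction is emitted. */
        __asm__("cvtps2pd %[in], %%xmm0\n\t"
                "movupd   %%xmm0, %[out]"
                : [out] "=m" (*(double (*)[2])(dst + i))
                : [in] "m" (*(const float (*)[2])(src + i))
                : "xmm0");
#else
        /* Portable fallback with the same result. */
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
#endif
    }
}
```

Calling widen_floats(in, out, 4) on four floats fills out with the same four
values as doubles.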

Yes, it would be nice if gcc did a better job on it...

I can show you oprofiles of the gcc generated code, but the larger point
remains that doing complex vectorized operations tends to use up a lot of
registers, and doing it well requires hand-optimized assembly... and to do that
well, it would be helpful to have as many named parameters available as there
are registers in the register set.
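
For anyone unfamiliar with what "named parameters" means here, this is a
small sketch of gcc's symbolic asm operands (the %[name] syntax). The
function mul_add and its operand names are invented for illustration; the
real code would need far more operands than this, which is the point of the
request.

```c
#include <stdint.h>

/* Hypothetical example: returns acc + a * b, with each asm operand
   referred to by name instead of by position (%0, %1, ...). */
static uint64_t mul_add(uint64_t a, uint64_t b, uint64_t acc)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__("imulq %[b], %[a]\n\t"   /* a *= b   */
            "addq  %[a], %[acc]"     /* acc += a */
            : [acc] "+r" (acc), [a] "+r" (a)
            : [b] "r" (b)
            : "cc");
    return acc;
#else
    /* Portable fallback with the same result. */
    return acc + a * b;
#endif
}
```

Each named operand ties up a register (or memory slot) for the duration of
the asm statement, so a routine that uses every xmm register plus several
integer registers needs a correspondingly long operand list.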


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39847
