------- Comment #6 from d at teklibre dot com 2009-04-22 15:40 -------

Pinska: Actually, no. I started with the intrinsics and looked hard at what the code scheduler was doing before settling on rewriting this in inline assembly.

The intrinsics have several problems that affect the code quality in this case.

1) They don't fetch the operand directly from memory for many instructions, such as cvtps2pd. Doing one-liners for stuff like that is feasible, but even harder to understand and debug than pure assembly. Gcc also seems to have a misguided sense of how many clocks the cvtX2Y instructions take.

2) The combination of intrinsics, C, and assembly that gcc was generating included a lot of extra instructions: promoting ints to longs, leas, etc.

3) The optimizer tends to push prefetches to the end of the loop when they really need to happen as early as possible. This particular bit of code *might* benefit from prefetching (it is not a very predictable access pattern), but at the end of the loop prefetches hurt more than they help.

4) This code is right up against the edge of the x86_64 register set: all the xmm registers (for 8 channel resampling) and 7 integer registers.

5) You can't use push/pop across multiple bits of inline assembly.

Yes, it would be nice if gcc did a better job on it... I can show you oprofiles of the gcc generated code, but the larger point remains that doing complex vectorized operations tends to use up a lot of registers, and doing it well requires hand optimized assembly... and to do that well, it would be helpful to have as many named parameters available as there are registers in the register set.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39847
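
To illustrate point 1 above, here is a minimal sketch, not the resampler code under discussion. It assumes gcc on x86_64 with SSE2; the names float2, widen_intrinsic and widen_asm are made up for this example. With the intrinsic, the compiler decides whether the load gets folded into cvtps2pd or issued as a separate instruction. With a one-line inline asm and an "m" constraint, the instruction is forced to read its 64-bit source straight from memory.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_cvtps_pd, _mm_setr_ps, __m128d */

/* Hypothetical helper type: two packed floats, so the asm operand
   below describes exactly the 8 bytes that cvtps2pd reads. */
typedef struct { float v[2]; } float2;

/* Intrinsic version: how the two floats reach an xmm register
   (folded memory operand or separate loads) is up to the compiler. */
static __m128d widen_intrinsic(const float2 *src)
{
    __m128 lo = _mm_setr_ps(src->v[0], src->v[1], 0.0f, 0.0f);
    return _mm_cvtps_pd(lo);  /* converts the two low floats to doubles */
}

/* Inline assembly version (AT&T syntax): the "m" constraint makes
   cvtps2pd take its source operand directly from memory. */
static __m128d widen_asm(const float2 *src)
{
    __m128d out;
    __asm__ ("cvtps2pd %1, %0"
             : "=x" (out)    /* any xmm register */
             : "m" (*src));  /* 64-bit memory source */
    return out;
}
```

Both functions return the same two doubles for the same input; comparing the output of gcc -O2 -S for each shows which form the compiler picked for the intrinsic version.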