------- Comment #7 from pinskia at gcc dot gnu dot org  2009-04-22 15:45 -------
(In reply to comment #6)
> Pinska: Actually, no. I started with the intrinsics and looked hard at what
> the code scheduler was doing before settling on rewriting this in inline
> assembly.
>
> The intrinsics have several problems that affect the code quality in this
> case.
>
> 1) They don't issue a request from memory for many instructions, such as
> cvtps2pd. Doing oneliners for stuff like that is feasible but even harder to
> understand and debug than pure assembly. Gcc also seems to have a misguided
> sense for how many clocks cvtX2Y instructions take.

Are you using the correct -mtune= value for the processor you are tuning for?
Different processors take a different number of clock cycles for these
instructions. If you have an issue with the optimizers, I would rather see
bugs filed against them than have you work around them with inline-asm.

> 2) The combination of intrinsics, C, and assembly gcc was generating included
> a lot of extra instructions, promoting ints to longs, leas, etc.

Promoting int to long is normal and a different issue, and really you should
have filed a bug for this one.

> 3) The optimizer tends to push prefetches to the end of the loop when it
> really needs to happen as early as possible. This particular bit of code
> *might* benefit from prefetching (it is not a very predictable access
> pattern) but at the end of the loop prefetches hurt more than they help.

File a bug.

> 4) this code is right up against the edge of the x86_64 register set (all the
> xmm registers (for 8 channel resampling) and 7 integer registers)

Try 4.4.0, which was just released; it has a better register allocator.

> I can show you oprofiles of the gcc generated code, but the larger point
> remains that doing complex vectorized operations tends to use up a lot of
> registers and doing it well requires hand optimized assembly... and to do that
> well, it would be helpful to have as many named parameters available as in the
> register set.

No, GCC should be doing a better job with the intrinsics, which is much better
than you doing it manually in inline-asm. Inline-asm should be used when there
is no intrinsic for the instruction, or for something you really cannot do
using intrinsics.


-- 

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39847
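
For illustration only, here is a minimal sketch of the intrinsics route for the cvtps2pd case mentioned in point 1. The function name widen_floats and the loop shape are hypothetical and are not taken from the code in this report. It assumes SSE2 (the default on x86_64) and uses only the standard _mm_loadu_ps, _mm_movehl_ps, _mm_cvtps_pd and _mm_storeu_pd intrinsics. Whether the compiler folds the load into the conversion, and how it schedules the conversions, depends on the GCC version and the -mtune= setting.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE ones */
#include <stddef.h>

/* Hypothetical helper: widen n floats to doubles using cvtps2pd. */
static void widen_floats(const float *src, double *dst, size_t n)
{
    size_t i = 0;

    /* Main loop: four floats in, four doubles out per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 f = _mm_loadu_ps(src + i);                /* f0 f1 f2 f3 */
        __m128d lo = _mm_cvtps_pd(f);                    /* f0 f1 as doubles */
        __m128d hi = _mm_cvtps_pd(_mm_movehl_ps(f, f));  /* f2 f3 as doubles */
        _mm_storeu_pd(dst + i, lo);
        _mm_storeu_pd(dst + i + 2, hi);
    }

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; ++i)
        dst[i] = (double)src[i];
}
```

Comparing the assembly from something like this (gcc -O2 -S, with the -mtune= value for the target processor) against the hand-written version is the kind of test case that is useful to attach to an optimizer bug.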