------- Comment #7 from pinskia at gcc dot gnu dot org  2009-04-22 15:45 -------
(In reply to comment #6)
> Pinski: Actually, no. I started with the intrinsics and looked hard at what
> the code scheduler was doing before settling on rewriting this in inline
> assembly.
> 
> The intrinsics have several problems that affect the code quality in this
> case.
> 
> 1) They don't issue a request from memory for many instructions, such as
> cvtps2pd. Doing one-liners for stuff like that is feasible but even harder to
> understand and debug than pure assembly.  Gcc also seems to have a misguided
> sense for how many clocks cvtX2Y instructions take.

Are you using the correct -mtune= value for the processor you are tuning for?
Different processors take different numbers of clock cycles for the same
instruction.  If you have an issue with the optimizers, I would rather see bugs
filed against them than see you work around the problem with inline-asm.
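
For reference, a minimal sketch of the intrinsics form of the cvtps2pd case, as
I read point 1. The helper name cvt2_ps_pd is hypothetical and not taken from
the code under discussion. Whether GCC folds the load into a memory operand of
cvtps2pd depends on the compiler version and on flags such as -O2 and -mtune=,
so check the generated assembly (-S) rather than assume.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtps_pd, _mm_storeu_pd (pulls in SSE too) */

/* Hypothetical helper: widen the two floats at src into two doubles at dst. */
static void cvt2_ps_pd(const float *src, double *dst)
{
    /* Load only the low 64 bits (two floats); the upper two lanes are zero. */
    __m128 f = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)src);

    /* cvtps2pd: convert the two low floats to two doubles. */
    __m128d d = _mm_cvtps_pd(f);

    /* Unaligned store, so dst needs no particular alignment. */
    _mm_storeu_pd(dst, d);
}
```

If the output for a function like this still looks wrong with a suitable
-mtune= value, it makes a good reduced test case for a bug report.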

> 
> 2) The combination of intrinsics, C, and assembly gcc was generating included
> a lot of extra instructions, promoting ints to longs, leas, etc.

Int-to-long promotion is normal and is a different issue, and really you should
have filed a bug for this one.
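
A small sketch of where such promotions usually come from on x86_64, assuming
the indices in the original code are plain int (I have not seen that code; both
function names below are hypothetical). With int offsets the 32-bit sum
typically has to be sign-extended to 64 bits before it can be used in an
address; with pointer-width offsets that step is not needed.

```c
#include <stddef.h>  /* ptrdiff_t */

/* int offsets: pos + off is computed in 32 bits, then typically
 * sign-extended to 64 bits before indexing. */
static float tap_int(const float *buf, int pos, int off)
{
    return buf[pos + off];
}

/* Pointer-width offsets: the arithmetic is already 64-bit on x86_64,
 * so no extension is required. */
static float tap_ptrdiff(const float *buf, ptrdiff_t pos, ptrdiff_t off)
{
    return buf[pos + off];
}
```

Both functions return the same value; only the generated address arithmetic is
expected to differ.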

> 
> 3) The optimizer tends to push prefetches to the end of the loop when it
> really needs to happen as early as possible. This particular bit of code
> *might* benefit from prefetching (it is not a very predictable access
> pattern) but at the end of the loop prefetches hurt more than they help.

File a bug.
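
A reduced test case helps a lot there. Something along these lines would do as
a starting point; it is only a sketch, the function name and the ahead
parameter are hypothetical, and the loop body stands in for the real resampling
code. It requests the prefetch at the top of each iteration with GCC's
__builtin_prefetch, so the assembly output (-O2 -S) shows where the scheduler
actually places it.

```c
#include <stddef.h>  /* size_t */

/* Hypothetical reduced test case: sum n floats, asking for a prefetch
 * `ahead` elements in front of the current one at the top of the loop. */
static float sum_with_prefetch(const float *src, size_t n, size_t ahead)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* Only form addresses inside the array.
         * Arguments: address, 0 = read access, 3 = high temporal locality. */
        if (i + ahead < n)
            __builtin_prefetch(&src[i + ahead], 0, 3);
        acc += src[i];
    }
    return acc;
}
```

The prefetch is only a hint, so the result is the same with or without it;
what matters for the report is the position of the prefetch instruction in the
generated loop.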

> 
> 4) this code is right up against the edge of the x86_64 register set (all the
> xmm registers (for 8 channel resampling) and 7 integer registers) 

Try 4.4.0, which was just released; it has a better register allocator.

> I can show you oprofiles of the gcc generated code, but the larger point
> remains that doing complex vectorized operations tends to use up a lot of
> registers and doing it well requires hand optimized assembly... and to do that
> well, it would be helpful to have as many named parameters available as in the
> register set.

No, GCC should be doing a better job with the intrinsics, which is much better
than you doing it manually in inline-asm.  Inline-asm should be used when
there are no intrinsics for the instruction, or for something which you really
cannot do using intrinsics.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39847
