------- Comment #9 from d at teklibre dot com 2009-04-22 17:24 -------
Pinskia:
It is going to take me a long time to address these issues piecemeal, so...

0) I will build gcc-4.4 and try that. I will also make the one-line patch to it to increase the number of asm params, and try that too. I would prefer that someone more at home in the guts of gcc do the latter; I fear I would rapidly end up over my head. Is the limit a magic number or just a stupid default?

re 1) I am using -mtune=core2 -O3, which is correct.

I note, in looking at the generated code today, that without that and with -O2, using the non-SSE version (just doubles), -O2 generates the following code sequence for

    left [0] += icoeff * filter->buffer [data_index];
    left [1] += icoeff * filter->buffer [data_index+1];

where left[0] and icoeff are doubles and filter->buffer[data_index] is a float:

    movss (%r11),%xmm0
    cvtps2pd %xmm0,%xmm0
    ... mult and add elided, second line elided ...

cvtss2sd would be more correct, and faster on most x86_64 arches prior to the k10 and core2. (-O3 -mtune will do a cvtss2sd (%r9), %xmm0, which is better.)

Converting this into the SSE2 equivalent can't be expressed directly in the intrinsics (it requires an explicit, separate, load & cast). Doing it as inline assembly ended up generating extra leas, would not get scheduled well, and stuff like that.

... like I said, it will take me a while to discuss this piecemeal, and going to 0) is the right thing.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39847
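
For reference, here is a minimal sketch of what the "explicit, separate, load & cast" form of the two-line update above might look like with SSE2 intrinsics. This is not the code from the bug report: the function name accumulate_pair and its parameters are hypothetical, and buf merely stands in for filter->buffer. It assumes, as stated above, that left and icoeff are doubles and the buffer holds floats. A scalar fallback is included so the sketch also builds where SSE2 is unavailable.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not from the bug report):
 *   left[0] += icoeff * buf[data_index];
 *   left[1] += icoeff * buf[data_index + 1];
 */
void accumulate_pair(double left[2], double icoeff,
                     const float *buf, size_t data_index)
{
#if defined(__SSE2__)
    /* Explicit load: fetch two adjacent floats (64 bits) into the low
     * half of an integer vector; the upper half is zeroed. */
    __m128i raw = _mm_loadl_epi64((const __m128i *)(buf + data_index));

    /* Explicit cast: reinterpret the bits as packed floats.
     * This generates no instruction, it only changes the type. */
    __m128 f2 = _mm_castsi128_ps(raw);

    /* cvtps2pd: widen the two low floats to two doubles. */
    __m128d d2 = _mm_cvtps_pd(f2);

    /* Multiply by the broadcast coefficient and accumulate. */
    __m128d acc = _mm_loadu_pd(left);
    acc = _mm_add_pd(acc, _mm_mul_pd(_mm_set1_pd(icoeff), d2));
    _mm_storeu_pd(left, acc);
#else
    /* Scalar fallback, equivalent to the two source lines quoted above. */
    left[0] += icoeff * buf[data_index];
    left[1] += icoeff * buf[data_index + 1];
#endif
}
```

Whether the compiler folds the load into the conversion, or schedules this any better than the inline assembly did, depends on the gcc version and flags; the sketch only shows how the operation can be spelled, not what code it produces.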