Hi,

2016-02-22 22:43 GMT+01:00 James Almer <jamr...@gmail.com>:
> +.loop:
> +%if cpuflag(avx)
> +    cvtdq2ps  m4, [lfeq]
> +    shufps    m5, m4, m4, q0123
> +%elif cpuflag(sse2)
> +    movu      m4, [lfeq]
> +    cvtdq2ps  m4, m4
> +    pshufd    m5, m4, q0123
> +%endif
> +
> +.inner_loop:
> +%if ARCH_X86_64
> +    movaps    m6, [coeffq+cnt1q*4   ]
> +    movaps    m7, [coeffq+cnt1q*4+16]
> +    movaps    m8, [coeffq+cnt1q*4+32]
> +    movaps    m9, [coeffq+cnt1q*4+48]
> +    mulps     m0, m5, m6
> +    mulps     m1, m5, m7
> +    mulps     m2, m5, m8
> +    mulps     m3, m5, m9
> +%else
> +    movaps    m6, [coeffq+cnt1q*4   ]
> +    movaps    m7, [coeffq+cnt1q*4+16]
> +    mulps     m0, m5, m6
> +    mulps     m1, m5, m7
> +    mulps     m2, m5, [coeffq+cnt1q*4+32]
> +    mulps     m3, m5, [coeffq+cnt1q*4+48]
> +%endif

Is out-of-order execution (OOE) the reason why you don't move the common code out of those conditional blocks? Otherwise, it looks cleaner to me to do:

    movaps    m6, [coeffq+cnt1q*4   ]
    movaps    m7, [coeffq+cnt1q*4+16]
    mulps     m0, m5, m6
    mulps     m1, m5, m7
%if ARCH_X86_64
    movaps    m8, [coeffq+cnt1q*4+32]
    movaps    m9, [coeffq+cnt1q*4+48]
    mulps     m2, m5, m8
    mulps     m3, m5, m9
%else
    mulps     m2, m5, [coeffq+cnt1q*4+32]
    mulps     m3, m5, [coeffq+cnt1q*4+48]
%endif

and let OOE do its job.

Secondly, m5 is not reused afterwards, so maybe replace m5 by m3 for all code up to this point, and load something else into m5 instead?

> +    haddps    m0, m1
> +    haddps    m2, m3
> +    haddps    m0, m2
> +    movaps    [samplesq+cnt1q], m0

I suppose you've already looked at most arrangements that would need fewer shuffles. I don't see any obvious one either.

-- 
Christophe