On 2/22/2016 7:44 PM, Christophe Gisquet wrote:
> Hi,
>
> 2016-02-22 22:43 GMT+01:00 James Almer <jamr...@gmail.com>:
>> +.loop:
>> +%if cpuflag(avx)
>> +    cvtdq2ps  m4, [lfeq]
>> +    shufps    m5, m4, m4, q0123
>> +%elif cpuflag(sse2)
>> +    movu      m4, [lfeq]
>> +    cvtdq2ps  m4, m4
>> +    pshufd    m5, m4, q0123
>> +%endif
>> +
>> +.inner_loop:
>> +%if ARCH_X86_64
>> +    movaps    m6, [coeffq+cnt1q*4   ]
>> +    movaps    m7, [coeffq+cnt1q*4+16]
>> +    movaps    m8, [coeffq+cnt1q*4+32]
>> +    movaps    m9, [coeffq+cnt1q*4+48]
>> +    mulps     m0, m5, m6
>> +    mulps     m1, m5, m7
>> +    mulps     m2, m5, m8
>> +    mulps     m3, m5, m9
>> +%else
>> +    movaps    m6, [coeffq+cnt1q*4   ]
>> +    movaps    m7, [coeffq+cnt1q*4+16]
>> +    mulps     m0, m5, m6
>> +    mulps     m1, m5, m7
>> +    mulps     m2, m5, [coeffq+cnt1q*4+32]
>> +    mulps     m3, m5, [coeffq+cnt1q*4+48]
>> +%endif
>
> Is OOE the reason why you don't move the common code out of those
> conditional blocks? Otherwise, it looks cleaner to me to do:

Not really. I just thought having x86_64 and X86_32 clearly separated was
easier to read.

>     movaps    m6, [coeffq+cnt1q*4   ]
>     movaps    m7, [coeffq+cnt1q*4+16]
>     mulps     m0, m3, m6
>     mulps     m1, m3, m7
> %if ARCH_X86_64
>     movaps    m8, [coeffq+cnt1q*4+32]
>     movaps    m9, [coeffq+cnt1q*4+48]
>     mulps     m2, m5, m8
>     mulps     m3, m5, m9
> %else
>     mulps     m2, m5, [coeffq+cnt1q*4+32]
>     mulps     m3, m5, [coeffq+cnt1q*4+48]
> %endif
> and let OOE do its job.
>
> Secondly, m5 is not reused afterwards, so maybe replace m5 by m3 for
> all code up to this, and load something into m5 instead?

m5 and m4 contain the lfe samples. I can't reuse them inside the inner loop.

>
>> +    haddps    m0, m1
>> +    haddps    m2, m3
>> +    haddps    m0, m2
>> +    movaps    [samplesq+cnt1q], m0
>
> I suppose you've already looked at most arrangements that would help
> doing fewer shuffles. And I don't see any obvious one either.
>
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
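
For anyone following the reduction at the end of the inner loop, here is a
rough scalar C sketch of what the four mulps and three haddps appear to
compute. It is only a model for illustration, not code from the patch:
haddps_model, reverse_lanes and inner_step are hypothetical names, and the
q0123 lane order assumes the usual x86inc convention in which q3210 is the
identity shuffle.

```c
#include <stddef.h>

/* Scalar model of SSE3 haddps: dst = {a0+a1, a2+a3, b0+b1, b2+b3}.
 * The result is built in a temporary so dst may alias a or b. */
static void haddps_model(float dst[4], const float a[4], const float b[4])
{
    float r[4] = { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
    for (size_t i = 0; i < 4; i++)
        dst[i] = r[i];
}

/* Lane reversal, assumed to match pshufd/shufps with the q0123 immediate. */
static void reverse_lanes(float dst[4], const float src[4])
{
    float r[4] = { src[3], src[2], src[1], src[0] };
    for (size_t i = 0; i < 4; i++)
        dst[i] = r[i];
}

/* One inner-loop step: lfe_rev plays the role of m5, coeff holds the
 * 16 floats read from coeffq, and out is what movaps stores to samplesq. */
static void inner_step(float out[4], const float lfe_rev[4],
                       const float coeff[16])
{
    float m[4][4];

    /* Four mulps: multiply the same lfe vector by four coefficient rows. */
    for (size_t row = 0; row < 4; row++)
        for (size_t i = 0; i < 4; i++)
            m[row][i] = lfe_rev[i] * coeff[4 * row + i];

    /* Three haddps: out[k] ends up as the horizontal sum of row k. */
    haddps_model(m[0], m[0], m[1]);
    haddps_model(m[2], m[2], m[3]);
    haddps_model(out, m[0], m[2]);
}
```

Seen this way, each output lane is a 4-tap dot product of the reversed lfe
samples with one coefficient row, which is why the lfe registers have to
stay live across the whole inner loop.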