x86: move simple_idct to external assembly

James Darnley Tue, 23 May 2017 17:59:54 -0700

On 2017-05-18 19:13, Ronald S. Bultje wrote:
> - do you think a checkasm test makes sense? That would also make
> performance measuring easier.


The (I)DCT code seems to have its own test program in the fate-idct8x8
test.  That is built from libavcodec/tests/dct.c.  It even includes its
own benchmarking.

>> +; Simple IDCT MMX
>> +;
>> +; Copyright (c) 2001, 2002 Michael Niedermayer <michae...@gmx.at>
>>
> 
> Please add a line that you converted it from inline asm to x264asm syntax,
> we've done that in other places also.

Done.  I found a line in 2 other files and copied their wording.

>> +%macro DC_COND_IDCT 7
>> +movq           mm0, [blockq + %1]             ; R4     R0      r4      r0
>> +movq           mm1, [blockq + %2]             ; R6     R2      r6      r2
>> +movq           mm2, [blockq + %3]             ; R3     R1      r3      r1
>> +movq           mm3, [blockq + %4]             ; R7     R5      r7      r5
>>
> 
> Please use 4-space indentation.

I will change this along with fixing the alignment in places.

>> +    %%9:
>> +; :: "r" (block), "r" (temp), "r" (coeffs)
>> +;    NAMED_CONSTRAINTS_ADD(wm1010,d40000)
>> +; : "%eax"
>> +%endmacro
>>
> 
> The inline asm bits (middle 3 lines) can be removed (yay!).

Thanks for reminding me about this left-over from writing.

> Rest is fine. I am assuming that the binary size will grow slightly because
> the idct is now inlined in the put/add (and duplicated between mmx/sse2),
> but no performance implication (or possibly slight improvement because of
> the inline). If that's correct, I'm OK with this.

About the size.  It shouldn't change that much.  The idct() function was
marked as inline so it should have been duplicated into the 5 other
functions.  I have copied the {ADD,PUT}_PIXELS_CLAMPED macros whereas
they used to call the functions in another asm file.  These are fairly
small, about 20 instructions or so, each.

> Are you planning to write an actual SSE2 version of the IDCT? (No pressure,
> just wondering.)

I do have that plan.  I first wrote a simple and naive (and slightly
pointless) sse2 version which just uses half the width of the xmm
registers.  Similar to what I've done elsewhere in h264.  A quick
benchmark by the program mentioned above did show that it was faster
despite needing a few pshufd to put things in the right places.  I'll
show that alongside an updated version of this patch, Coming Soon(TM).

However I am trying to work on a better sse2 version using the full
width of xmm registers (and maybe all 16 regs on x86-64 if I need to).
It is slow work and I'll probably have still more in-depth questions for
you (or anyone who can answer them) later on IRC.

P.S.  My apologies for sending this directly to you Ronald.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

Re: [FFmpeg-devel] [PATCH/WIP] avcodec/x86: move simple_idct to external assembly

Reply via email to