On 2017-05-18 19:13, Ronald S. Bultje wrote: > - do you think a checkasm test makes sense? That would also make > performance measuring easier.
The (I)DCT code seems to have its own test program in the fate-idct8x8 test. That is built from libavcodec/tests/dct.c. It even includes its own benchmarking. >> +; Simple IDCT MMX >> +; >> +; Copyright (c) 2001, 2002 Michael Niedermayer <michae...@gmx.at> >> > > Please add a line that you converted it from inline asm to x264asm syntax, > we've done that in other places also. Done. I found a line in 2 other files and copied their wording. >> +%macro DC_COND_IDCT 7 >> +movq mm0, [blockq + %1] ; R4 R0 r4 r0 >> +movq mm1, [blockq + %2] ; R6 R2 r6 r2 >> +movq mm2, [blockq + %3] ; R3 R1 r3 r1 >> +movq mm3, [blockq + %4] ; R7 R5 r7 r5 >> > > Please use 4-space indentation. I will change this along with fixing the alignment in places. >> + %%9: >> +; :: "r" (block), "r" (temp), "r" (coeffs) >> +; NAMED_CONSTRAINTS_ADD(wm1010,d40000) >> +; : "%eax" >> +%endmacro >> > > The inline asm bits (middle 3 lines) can be removed (yay!). Thanks for reminding me about this left-over from writing. > Rest is fine. I am assuming that the binary size will grow slightly because > the idct is now inlined in the put/add (and duplicated between mmx/sse2), > but no performance implication (or possibly slight improvement because of > the inline). If that's correct, I'm OK with this. About the size. It shouldn't change that much. The idct() function was marked as inline so it should have been duplicated into the 5 other functions. I have copied the {ADD,PUT}_PIXELS_CLAMPED macros whereas they used to call the functions in another asm file. These are fairly small, about 20 instructions or so, each. > Are you planning to write an actual SSE2 version of the IDCT? (No pressure, > just wondering.) I do have that plan. I first wrote a simple and naive (and slightly pointless) sse2 version which just uses half the width of the xmm registers. Similar to what I've done elsewhere in h264. A quick benchmark by the program mentioned above did show that it was faster despite needing a few pshufd to put things in the right places. I'll show that alongside an updated version of this patch, Coming Soon(TM). However I am trying to work on a better sse2 version using the full width of xmm registers (and maybe all 16 regs on x86-64 if I need to). It is slow work and I'll probably have still more in-depth questions for you (or anyone who can answer them) later on IRC. P.S. My apologies for sending this directly to you Ronald. _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel