On 2015-07-14 23:23, James Almer wrote:
> On 14/07/15 3:54 PM, James Darnley wrote:
>> On 2015-07-11 18:34, James Almer wrote:
>>> On 11/07/15 10:40 AM, James Darnley wrote:
>>>> new file mode 100644
>>>> index 0000000..5e1feea
>>>> --- /dev/null
>>>> +++ b/libavfilter/x86/vf_removegrain.asm
>>>> @@ -0,0 +1,1215 @@
>>>> +;*****************************************************************************
>>>> +;* x86-optimized functions for yadif filter
>>>> +;*
>>>> +;* Copyright (C) 2015 James Darnley
>>>> +;*
>>>> +;* This file is part of FFmpeg.
>>>> +;*
>>>> +;* TODO: gpl text goes here.
>>>> +;*****************************************************************************
>>>> +
>>>> +; column: -1  0 +1
>>>> +; row -1: a1 a2 a3
>>>> +; row  0: a4  c a5
>>>> +; row +1: a6 a7 a8
>>>> +
>>>> +%include "libavutil/x86/x86util.asm"
>>>> +
>>>> +SECTION_RODATA
>>>
>>> Declare it SECTION_RODATA 32 now so you don't forget when you add AVX2
>>> versions.
>>
>> Okay, I'm confused by this. It doesn't align the constants you define
>> on 32 byte boundaries if you happen to define the data incorrectly so...
>> Why does this matter? Or what does it do?
>>
>
> SECTION_RODATA 32 translates into "SECTION .rodata align=32", which
> according to the documentation makes sure the constants are aligned on a
> 32-byte boundary.
> I noticed it doesn't do much if the constants are "out of order" if you
> will, but if you're defining them right and you're going to use 32-byte
> movas with them, then i don't see a reason to not declare the section
> properly.

I will change it. Perhaps I misunderstood its use. My impression now is
that it instructs the linker to maintain the alignment of the section, not
that the assembler should align the labels in a particular fashion. If the
user (me) defines some data to be 10 bytes long, it is the user's fault
that the next constant begins at 0xA.
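
If I have understood that correctly, then something along these lines
should be all the data section needs once the AVX2 versions exist
(untested sketch, and the constant names here are only placeholders):

    SECTION_RODATA 32

    ; each constant is padded to 32 bytes so the following label stays aligned
    pw_4: times 16 dw 4
    pw_8: times 16 dw 8

That way 32-byte movas such as "mova m0, [pw_4]" in the ymm code paths keep
their alignment guarantee.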

>>>> +
>>>> +; The loop doesn't need to do all the iterations. It could stop when the right
>>>> +; pixels are in the right registers.
>>>> +%macro SORT_SQUARE 0
>>>> +    %assign k 7
>>>> +    %rep 7
>>>> +        %assign i 1
>>>> +        %assign j 2
>>>> +        %rep k
>>>> +            SORT_PAIR m %+ i , m %+ j , m9
>>>> +            %assign i i+1
>>>> +            %assign j j+1
>>>> +        %endrep
>>>> +        %assign k k-1
>>>> +    %endrep
>>>> +%endmacro
>>>> +
>>>> +; %1 dest simd register
>>>> +; %2 source (simd register/memory)
>>>> +; %3 temp simd register
>>>> +%macro ABS_DIFF 3
>>>> +    mova %3, %2
>>>> +    psubusb %3, %1
>>>> +    psubusb %1, %2
>>>> +    por %1, %3
>>>> +%endmacro
>>>> +
>>>> +; %1 dest simd register
>>>> +; %2 source (simd register/memory)
>>>> +; %3 temp simd register
>>>> +%macro ABS_DIFF_W 3
>>>> +    mova %3, %2
>>>> +    psubusw %3, %1
>>>> +    psubusw %1, %2
>>>> +    por %1, %3
>>>> +%endmacro
>>>
>>> No way to achieve this using the pabs* ssse3 instructions?
>>
>> Maybe. I looked them up and see that I would need a subtraction anyway.
>> I wonder if I could measure a speed change. I need to check the
>> behaviour of them together with the subtract instructions.
>
> For that matter, since in many cases you're calling two or more ABS_DIFF_W
> one after the other, you could combine the instructions to minimize
> dependencies regardless of using pabs or psubs + por like you're doing
> right now.
> It may end up being faster that way.

You're probably right. I will try them and send a separate patch later.

>>>> +; %1 simd register that hold the mask and will hold the result
>>>> +; %2 simd register that holds the "true" values
>>>> +; %3 location of the "false" values (simd register/memory)
>>>> +%macro BLEND 3 ; mask, true, false
>>>> +    pand %2, %1
>>>> +    pandn %1, %3
>>>> +    por %1, %2
>>>> +%endmacro
>>>
>>> This doesn't follow the sse4 blend instructions' semantic, so it will
>>> make adding AVX2 versions harder.
>>>
>>> Try instead
>>>
>>> %macro BLEND 3 ; false, true, mask
>>>     pand %2, %3
>>>     pandn %3, %1
>>>     por %3, %2
>>>     SWAP %1, %3
>>> %endmacro
>>>
>>> Which will let you use "vpblendvb %1, %1, %2, %3" for the avx2 version.
>>> SSE4's pblendvb is kinda crappy, requiring the mask to be in xmm0, so
>>> adding an SSE4 version may not be worth it.
>>
>> I applied your patches that you sent on IRC (this and the sort macro
>> one). Do you want me to put your name in the copyright header?
>>
>
> Not really. small changes like those are IMO not enough to get credit in
> the header.
> If you want you can add a "With contributions by" line in the commit
> message.

Okay.

>>>> +av_cold void ff_removegrain_init_x86(RemoveGrainContext *rg)
>>>> +{
>>>> +    int cpu_flags = av_get_cpu_flags();
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < rg->nb_planes; i++) {
>>>> +        switch (rg->mode[i]) {
>>>> +        case 1:
>>>> +            if (EXTERNAL_SSE2(cpu_flags))
>>>
>>> I think it will be much cleaner and simpler if you put the switch inside
>>> a single EXTERNAL_SSE2 branch, rather than checking it for every mode.
>>
>> You're probably right. I was going to see what it looks like after AVX2
>> and clean it up then but I can squash some changes into this patch if
>> people prefer it.

Done anyway.
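
Regarding the pabs* suggestion above, the SSSE3 variant I have in mind for
the word version would be roughly the following (untested sketch; it
assumes both inputs fit in the signed 16-bit range, which they should
after unpacking 8-bit pixels to words, and it no longer needs a temporary
register):

    ; %1 dest simd register (also first source)
    ; %2 source (simd register/memory)
    %macro ABS_DIFF_W_SSSE3 2
        psubw %1, %2 ; signed difference
        pabsw %1, %1 ; absolute value of the difference
    %endmacro

Whether that actually measures faster than the psubusw/por pair is what I
want to benchmark before sending the follow-up patch.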