On 14/07/15 3:54 PM, James Darnley wrote: > On 2015-07-11 18:34, James Almer wrote: >> On 11/07/15 10:40 AM, James Darnley wrote: >>> @@ -566,7 +554,19 @@ static int filter_slice(AVFilterContext *ctx, void >>> *arg, int jobnr, int nb_jobs) >>> } >>> >>> *dst++ = *src++; >>> - for (x = 1; x < s->planewidth[i] - 1; x++) { >>> + >>> + if (s->fl[i]) { >>> + int w_asm = (s->planewidth[i] - 2) & ~15; >> >> I don't thinks this belongs here. If you add an AVX2 version as you intended, >> wouldn't this need to be 31? >> >> A wrapper function to have this code in the x86 folder may be the best >> option. > > I was considering having the init_x86 function set a variable in the > context to do this math based on how many pixels are actually done per > iteration, meaning: > - 8 for the unpacked words and sse2 > - 16 for the packed bytes and see2 or future unpacked words and avx2 > - 32 for the future packed bytes and avx2 > > That would let it be simply re-used if someone wanted to try writing a > line function in C or other platform assembly. > >>> new file mode 100644 >>> index 0000000..5e1feea >>> --- /dev/null >>> +++ b/libavfilter/x86/vf_removegrain.asm >>> @@ -0,0 +1,1215 @@ >>> +;***************************************************************************** >>> +;* x86-optimized functions for yadif filter >>> +;* >>> +;* Copyright (C) 2015 James Darnley >>> +;* >>> +;* This file is part of FFmpeg. >>> +;* >>> +;* TODO: gpl text goes here. >>> +;***************************************************************************** >>> + >>> +; column: -1 0 +1 >>> +; row -1: a1 a2 a3 >>> +; row 0: a4 c a5 >>> +; row +1: a6 a7 a8 >>> + >>> +%include "libavutil/x86/x86util.asm" >>> + >>> +SECTION_RODATA >> >> Declare it SECTION_RODATA 32 now so you don't forget when you add AVX2 >> versions. > > Okay, I'm confused by this. It doesn't align the constants you define > on 32 byte boundaries if you happen to define the data incorrectly so... > Why does this matter? Or what does it do? >
SECTION_RODATA 32 translates into "SECTION .rodata align=32", which according to the documentation makes sure the constants are aligned on a 32-byte boundary. I noticed it doesn't do much if the constants are "out of order" if you will, but if you're defining them right and you're going to use 32-byte movas with them, then i don't see a reason to not declare the section properly. >>> + >>> +; The loop doesn't need to do all the iterations. It could stop when the >>> right >>> +; pixels are in the right registers. >>> +%macro SORT_SQUARE 0 >>> + %assign k 7 >>> + %rep 7 >>> + %assign i 1 >>> + %assign j 2 >>> + %rep k >>> + SORT_PAIR m %+ i , m %+ j , m9 >>> + %assign i i+1 >>> + %assign j j+1 >>> + %endrep >>> + %assign k k-1 >>> + %endrep >>> +%endmacro >>> + >>> +; %1 dest simd register >>> +; %2 source (simd register/memory) >>> +; %3 temp simd register >>> +%macro ABS_DIFF 3 >>> + mova %3, %2 >>> + psubusb %3, %1 >>> + psubusb %1, %2 >>> + por %1, %3 >>> +%endmacro >>> + >>> +; %1 dest simd register >>> +; %2 source (simd register/memory) >>> +; %3 temp simd register >>> +%macro ABS_DIFF_W 3 >>> + mova %3, %2 >>> + psubusw %3, %1 >>> + psubusw %1, %2 >>> + por %1, %3 >>> +%endmacro >> >> No way to achieve this using the pabs* ssse3 instructions? > > Maybe. I looked them up and see that I would need a subtraction anyway. > I wonder if I could measure a speed change. I need to check the > behaviour of them together with the subtract instructions. For that matter, since in many cases you're calling two or more ABS_DIFF_W one after the other, you could combine the instructions to minimize dependencies regardless of using pabs or psubs + por like you're doing right now. It may end up being faster that way. > >>> +; %1 simd register that hold the mask and will hold the result >>> +; %2 simd register that holds the "true" values >>> +; %3 location of the "false" values (simd register/memory) >>> +%macro BLEND 3 ; mask, true, false >>> + pand %2, %1 >>> + pandn %1, %3 >>> + por %1, %2 >>> +%endmacro >> >> This doesn't follow the sse4 blend instructions' semantic, so it will make >> adding AVX2 >> versions harder. >> >> Try instead >> >> %macro BLEND 3 ; false, true, mask >> pand %2, %3 >> pandn %3, %1 >> por %3, %2 >> SWAP %1, %3 >> %endmacro >> >> Which will let you use "vpblendvb %1, %1, %2, %3" for the avx2 version. >> SSE4's pblendvb is kinda crappy, requiring the mask to be in xmm0, so >> adding an SSE4 version may not be worth it. > > I applied your patches that you sent on IRC (this and the sort macro > one). Do you want me to put your name in the copyright header? > Not really. small changes like those are IMO not enough to get credit in the header. If you want you can add a "With contributions by" line in the commit message. >>> diff --git a/libavfilter/x86/vf_removegrain_init.c >>> b/libavfilter/x86/vf_removegrain_init.c >>> new file mode 100644 >>> index 0000000..450185f >>> --- /dev/null >>> +++ b/libavfilter/x86/vf_removegrain_init.c >>> @@ -0,0 +1,128 @@ >>> +#include "libavutil/attributes.h" >>> +#include "libavutil/cpu.h" >>> +#include "libavutil/x86/cpu.h" >>> +#include "libavfilter/removegrain.h" >>> + >>> +void ff_rg_fl_mode_1_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_10_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_11_12_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t >>> stride, int pixels); >>> +void ff_rg_fl_mode_13_14_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t >>> stride, int pixels); >>> +void ff_rg_fl_mode_19_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_20_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_21_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_22_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +#if ARCH_X86_64 >>> +void ff_rg_fl_mode_2_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_3_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_4_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_5_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_6_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_7_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_8_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_9_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_15_16_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t >>> stride, int pixels); >>> +void ff_rg_fl_mode_17_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_18_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_23_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +void ff_rg_fl_mode_24_sse2(uint8_t *dst, uint8_t *src, ptrdiff_t stride, >>> int pixels); >>> +#endif > > A general question. Is it preferred to have the function declarations > and assignments out of order to reduce the number of #if ARCH_X86_64 > conditions? IMO yes. The cleaner the better. > >>> +av_cold void ff_removegrain_init_x86(RemoveGrainContext *rg) >>> +{ >>> + int cpu_flags = av_get_cpu_flags(); >>> + int i; >>> + >>> + for (i = 0; i < rg->nb_planes; i++) { >>> + switch (rg->mode[i]) { >>> + case 1: >>> + if (EXTERNAL_SSE2(cpu_flags)) >> >> I think it will be much cleaner and simpler if you put the switch inside a >> single >> EXTERNAL_SSE2 branch, rather than checking it for every mode. > > You're probably right. I was going to see what it looks like after AVX2 > and clean it up then but I can squash some changes into this patch if > people prefer it. > > > > > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel > _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel