On 22 July 2017 at 12:18, Ivan Kalvachev <ikalvac...@gmail.com> wrote:
> This patch is ready for review and inclusion. > > > >+%macro emu_blendvps 3 ; dst/src_a, src_b, mask >+%macro lea_pic_base 2; reg, const_label Capitalize macro names. >+ %error sse41 blendvps uses xmm0 as default 3d operand, you used %3 Remove this error >+ %error "blendvps" emulation needs SSE Definitely remove this error too. >+ %error AVX1 does not support integer 256bit ymm operations And this one is particularly pointless... >+ %error sse41 blendvps uses xmm0 as default 3 operand, you used %3 Same... >+ %error "broadcastss" emulation needs SSE Same... >+ %error "pbroadcastd" emulation needs SSE2 Yep, the same... >+ %error pick HSUMPS implementation Again... All of these are pointless and unnecessary. Always assume at least SSE2. >+ >+; 256bit version seems slower than the 128bit AVX on all current CPU's >+; So disable it for now and keep the code in case something changes in future. >+%if HAVE_AVX2_EXTERNAL > 0 && 0 >+INIT_YMM avx2 >+PVQ_FAST_SEARCH >+%endif Remove this altogether. >+%if 1 >+ emu_pbroadcastd m1, xmm1 ... >+%else >+ ; Ryzen is the only CPU at the moment that doesn't have >+ ; write forwarding stall and where this code is faster >+ ; with about 7 cycles on avarage. >+ %{1}ps m5, mm_const_float_1 >+ movss [tmpY + r4q], xmm5 ; Y[max_idx] +-= 1.0; Remove the %else and always use the first bit of code. >+%if cpuflag(avx2) && 01 >+%elif cpuflag(avx) && 01 >+%if cpuflag(avx2) && 01 Remove the && 01 in these 3 places. >+; vperm2f128 %1, %1, %1, 0 ; ymm, ymm, ymm ; 2-3c 1/1 Remove this commented out code. >+cglobal pvq_search, 4, 5+num_pic_regs, 11, 256*4, inX, outY, K, N >+%if ARCH_X86_64 == 0 ;sbrdsp You're using more than 11 xmm regs, so the entire code is always going to need a 64 bit system. So wrap the entire file, from top to bottom after the %include in %if ARCH_X86_64 <everything> %endif >+SECTION_RODATA 64 >+ >+const_float_abs_mask: times 8 dd 0x7fffffff >+const_align_abs_edge: times 8 dd 0 >+ >+const_float_0_5: times 8 dd 0.5 >+const_float_1: times 8 dd 1.0 >+const_float_sign_mask: times 8 dd 0x80000000 >+ >+const_int32_offsets: >+ %rep 8 >+ dd $-const_int32_offsets >+ %endrep >+SECTION .text This whole thing needs to go right at the very top after the %include. All macros must then follow it. Having read only sections in the middle of the file is very confusing and not at all what all of our asm code follows. _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel