On Fri, Jun 09, 2017 at 01:36:07AM +0300, Ivan Kalvachev wrote:
> On request by Rostislav Pehlivanov (atomnuker), I've been working on
> an SSE/AVX accelerated version of pvq_search().
>
> The attached patch is my work so far.
>
> At the moment, for me, at the default bitrate
> the function is 2.5 times faster than the current C version.
> (My CPU is Intel Westmere class.)
> The total encoding time is about 5% faster.
>
> I'd like some more benchmarks on different CPUs
> and maybe some advice on how to improve it.
> (I've left some alternative methods in the code
> that could be easily switched with defines, as they
> may be faster on other CPUs).
> (I'm also quite afraid of how fast it would run on pre-Ryzen AMD CPUs)
>
>
> The code generates 4 variants: SSE2, SSE4.2, AVX and 256bit AVX2.
> I haven't tested the AVX variants myself on a real CPU,
> I used Intel SDE to develop and test them.
> Rostislav (atomnuker) reported some crashes with the 256bit AVX2,
> that however might be related to clang tools.
>
>
> Below are some broad descriptions:
>
> The typical use of the function for the default bitrate (96kbps) is
> K<=36 and N<=32.
> N is the size of the input array (called vector), K is the number of
> pulses that should be present in the output (the sum of the absolute
> values of the output elements is K).
>
> In synthetic tests, the SIMD function could be 8-12 times faster with
> the maximum N=176.
> I've been told that bigger sizes are more common at low bitrate encodes and
> will be more common with the upcoming RDO improvements.
>
> A short description of how the function works:
> 1. Loop that calculates the sum (Sx) of the input elements (inX[]). The
> loop is used to fill a stack allocated temp buffer (tmpX) that is
> aligned to mmsize and contains the absolute values of inX[i].
>
> 2. Pre-Search loop. It uses K/Sx as an approximation for the vector gain
> and fills the output vector outY[] based on it. The output is in integers,
> but we use outY[] to temporarily store the doubled Y values as floats.
> (We need 2*Y for calculations). This loop also calculates a few
> parameters that are needed for the distortion calculations later
> (Syy = Sum of outY[i]^2 ; Sxy = Sum of inX[i]*outY[i]).
>
> 3. Adding of missing pulses or elimination of extra ones.
> The C function uses the variable "phase" to signal if pulses should be
> added or removed; I've split this into separate cases. The code is
> shared through a macro PULSES_SEARCH.
> Each case is formed by 2 loops. The outer loop is executed until we
> have K pulses in the output.
> The inner one calculates the distortion parameter for each element and
> picks the best one.
> (parallel search, combination of parallel results, update of variables).
>
> 4. When we are done we do one more loop, to convert outY[] to single
> integers and to restore the signs (by using the original inX[]).
>
> 5. There is a special case when Sx==0, which happens if all elements of
> the input are zeroes (in theory the input should be normalized, that
> means Sum of X[i]^2 == 1.0). In this case return zero output and 1.0
> as gain.
>
> ---
> Now, I've left some defines that change the generated code.
>
> HADDPS_IS_FAST
> PHADDD_IS_FAST
> I've implemented my own horizontal sum macros, and while doing it, I
> have discovered that on my CPU (Westmere class) the use of the "new"
> SSE4.2 instructions is not faster than using my own code for doing
> the same.
> It's not speed critical, since horizontal sums are used 3-4 times per
> function call.
>
> BLENDVPS_IS_FAST
> PBLENDVB_IS_FAST
> I think that blend on my CPU is faster than the alternative version
> that I've implemented. However I'm not sure this is true for all
> CPUs, since a number of modern CPUs have latency=3 and
> inv_throughput=2 (that's 2 clocks until another blend could start).
>
> CONST_IN_X64_REG_IS_FASTER
> The function is implemented so only 8 registers are used. With this
> define, constants used during PULSES_SEARCH are loaded in the high
> registers available on X64.
> I could not determine if it is faster to
> do so... it should be, but sometimes I got the opposite result.
> I'd probably enable it in the final version.
>
> STALL_WRITE_FORWARDING
> After the inner search finds the maximum, we add/remove a pulse in
> outY[i]. Writing a single element (sizeof(float)=4) however could block
> the long load done in the inner loop (mmsize=16). This hurts a lot
> more on small vector sizes.
> On Skylake the penalty is only 11 cycles, while Ryzen should have no
> penalty at all. Older CPUs can have a penalty of up to 200 cycles.
>
> SHORT_SYY_UPDATE
> This define has meaning only when the STALL* is 0 (aka have the longer
> code to avoid stalls).
> It saves a few instructions by loading the old outY[] value by scalar load,
> instead of using HSUMPS and some 'haddps' to calculate them.
> So far it looks like the short update is always faster, but I've left
> it just in case...
>
> USE_APPROXIMATION
> This controls the method used for calculation of the distortion parameter.
> "0" means using 1 multiplication and 1 division, that could be a lot
> slower (14;14 cycles on my CPU, 11;7 on Skylake)
> "1" uses 2 multiplications and 1 reciprocal op that is a lot faster
> than real division, but gives half precision.
> "2" uses 1 multiplication and 1 reciprocal square root op, that is
> literally 1 cycle, but again gives half precision.
>
> PRESEARCH_ROUNDING
> This controls the rounding of the gain used for the guess.
> "0" means using truncf(), which makes sure that the pulses would never
> be more than K.
> It gives results identical to the original celt_* functions.
> "1" means using lrintf(), this is basically the improvement of the
> current C code over the celt_ one.
>
>
> ALL_FLOAT_PRESEARCH
> The presearch filling of outY[] could be done entirely with float ops
> (using SSE4.2 'roundps' instead of two cvt*). It is mostly useful if
> you want to try YMM on AVX1 (AVX1 lacks 256-bit integer ops).
> For some reason enabling this makes the whole function 4 times slower
> on my CPU. ^_^
>
> I've left some commented out code. I'll remove it for sure in the final
> version.
>
> I just hope I haven't done some lame mistake in the last minute...
>  opus_pvq.c              |    9
>  opus_pvq.h              |    5
>  x86/Makefile            |    1
>  x86/opus_dsp_init.c     |   47 +++
>  x86/opus_pvq_search.asm |  597 ++++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 657 insertions(+), 2 deletions(-)
> 3b9648bea3f01dad2cf159382f0ffc2d992c84b2  0001-SIMD-opus-pvq_search-implementation.patch
> From 06dc798c302e90aa5b45bec5d8fbcd64ba4af076 Mon Sep 17 00:00:00 2001
> From: Ivan Kalvachev <ikalvac...@gmail.com>
> Date: Thu, 8 Jun 2017 22:24:33 +0300
> Subject: [PATCH 1/3] SIMD opus pvq_search implementation.

This seems to break the build with mingw64. I didn't investigate, but it
fails with these errors:

libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x2d): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x3fd): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x7a1): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0xb48): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x2d): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x3fd): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0x7a1): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
libavcodec/libavcodec.a(opus_pvq_search.o):src/libavcodec/x86/opus_pvq_search.asm:(.text+0xb48): relocation truncated to fit: R_X86_64_32 against `const_align_abs_edge'
collect2: error: ld returned 1 exit status
collect2: error: ld returned 1 exit status
make: *** [ffmpeg_g.exe] Error 1
make: *** Waiting for unfinished jobs....
make: *** [ffprobe_g.exe] Error 1

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Democracy is the form of government in which you can choose your dictator
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel