On 3/6/2016 4:14 PM, Reimar Döffinger wrote:
> On Sun, Mar 06, 2016 at 03:49:00PM -0300, James Almer wrote:
>> On 3/6/2016 3:35 PM, Reimar Döffinger wrote:
>>> Approximately 10% faster transcode from mp3 to aac
>>> with default settings.
>>>
>>> Signed-off-by: Reimar Döffinger <reimar.doeffin...@gmx.de>
>>> ---
>>>  libavcodec/aacenc_utils.h | 47 ++++++++++++++++++++++++++++++++++++++---------
>>>  1 file changed, 38 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/libavcodec/aacenc_utils.h b/libavcodec/aacenc_utils.h
>>> index b9bd6bf..1639021 100644
>>> --- a/libavcodec/aacenc_utils.h
>>> +++ b/libavcodec/aacenc_utils.h
>>> @@ -36,15 +36,29 @@
>>>  #define ROUND_TO_ZERO 0.1054f
>>>  #define C_QUANT 0.4054f
>>>
>>> +#define ABSPOW(inv, outv) \
>>> +do { \
>>> +    float a = (inv); \
>>> +    a = fabsf(a); \
>>> +    (outv) = sqrtf(a * sqrtf(a)); \
>>> +} while(0)
>>> +
>>>  static inline void abs_pow34_v(float *out, const float *in, const int size)
>>>  {
>>>      int i;
>>> -    for (i = 0; i < size; i++) {
>>> -        float a = fabsf(in[i]);
>>> -        out[i] = sqrtf(a * sqrtf(a));
>>> +    for (i = 0; i < size - 3; i += 4) {
>>> +        ABSPOW(in[i], out[i]);
>>> +        ABSPOW(in[i+1], out[i+1]);
>>> +        ABSPOW(in[i+2], out[i+2]);
>>> +        ABSPOW(in[i+3], out[i+3]);
>>> +    }
>>
>> Are you sure this wasn't vectorized already? I remember i checked and it
>> mostly was, at least on gcc 5.3 mingw-w64 with default settings.
>
> Then it would hardly get 10% faster, would it (though
> I admit I didn't test the two parts separately)?
> But I am fairly sure that before the patch it only
> used sqrtss instructions and not sqrtps.

Without your patch, GCC 5.3 mingw-w64 x86_64 default settings.

$ make libavcodec/aacenc_ltp.o && objdump -d -M intel libavcodec/aacenc_ltp.o | grep sqrtps
CC      libavcodec/aacenc_ltp.o
    1029:       0f 51 c8                sqrtps xmm1,xmm0
    102f:       0f 51 c0                sqrtps xmm0,xmm0
    161d:       0f 51 c8                sqrtps xmm1,xmm0
    1623:       0f 51 c0                sqrtps xmm0,xmm0
    1ccf:       0f 51 c8                sqrtps xmm1,xmm0
    1cd5:       0f 51 c0                sqrtps xmm0,xmm0
    2745:       0f 51 c8                sqrtps xmm1,xmm0
    274b:       0f 51 c0                sqrtps xmm0,xmm0
    34e4:       0f 51 c8                sqrtps xmm1,xmm0
    34ea:       0f 51 c0                sqrtps xmm0,xmm0
    42f7:       0f 51 c8                sqrtps xmm1,xmm0
    42fd:       0f 51 c0                sqrtps xmm0,xmm0
    44ac:       0f 51 c8                sqrtps xmm1,xmm0
    44b2:       0f 51 c0                sqrtps xmm0,xmm0

With your patch

    11fd:       0f 51 c8                sqrtps xmm1,xmm0
    1203:       0f 51 c0                sqrtps xmm0,xmm0
    12cb:       0f 51 c8                sqrtps xmm1,xmm0
    12d1:       0f 51 c0                sqrtps xmm0,xmm0
    1d43:       0f 51 c8                sqrtps xmm1,xmm0
    1d49:       0f 51 c0                sqrtps xmm0,xmm0
    1e21:       0f 51 c8                sqrtps xmm1,xmm0
    1e27:       0f 51 c0                sqrtps xmm0,xmm0
    2964:       0f 51 c8                sqrtps xmm1,xmm0
    296a:       0f 51 c0                sqrtps xmm0,xmm0
    2a3c:       0f 51 c8                sqrtps xmm1,xmm0
    2a42:       0f 51 c0                sqrtps xmm0,xmm0
    35f3:       0f 51 c8                sqrtps xmm1,xmm0
    35f9:       0f 51 c0                sqrtps xmm0,xmm0
    36bc:       0f 51 c8                sqrtps xmm1,xmm0
    36c2:       0f 51 c0                sqrtps xmm0,xmm0
    457b:       0f 51 c8                sqrtps xmm1,xmm0
    4581:       0f 51 c0                sqrtps xmm0,xmm0
    464c:       0f 51 c8                sqrtps xmm1,xmm0
    4652:       0f 51 c0                sqrtps xmm0,xmm0
    54b3:       0f 51 c8                sqrtps xmm1,xmm0
    54b9:       0f 51 c0                sqrtps xmm0,xmm0
    558f:       0f 51 c8                sqrtps xmm1,xmm0
    5595:       0f 51 c0                sqrtps xmm0,xmm0
    56e4:       0f 51 c8                sqrtps xmm1,xmm0
    56ea:       0f 51 c0                sqrtps xmm0,xmm0

Didn't bench but it seems to help GCC vectorize more efficiently so this
patch is probably ok, especially if in your case it made your compiler
actually be able to vectorize at all.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel