On Fri, Mar 18, 2016 at 10:12:14PM -0700, Ganesh Ajjanagadde wrote:
> It seems like in all usages, size is a multiple of 4. This is documented
> as an assert.
>
> Yields speedup in this function, and small speedup for aac encoding overall.
>
> Sample benchmark (Haswell, -march=native + GCC):
> old:
> [...]
> 1390 decicycles in abs_pow34_v, 127138 runs, 3934 skips63.1x
> 1385 decicycles in abs_pow34_v, 254191 runs, 7953 skips64.4x
> 1383 decicycles in abs_pow34_v, 508305 runs, 15983 skips65.3x
>
> new:
> [...]
> 1109 decicycles in abs_pow34_v, 127122 runs, 3950 skips61.2x
> 1107 decicycles in abs_pow34_v, 254177 runs, 7967 skips63.5x
> 1106 decicycles in abs_pow34_v, 508292 runs, 15996 skips65.3x
>
> old:
> ffmpeg -f lavfi -i anoisesrc -t 300 -y sin_new.aac 4.55s user 0.03s system 99% cpu 4.581 total
> new:
> ffmpeg -f lavfi -i anoisesrc -t 300 -y sin_new.aac 4.50s user 0.04s system 99% cpu 4.537 total
>
> Signed-off-by: Ganesh Ajjanagadde <gajja...@gmail.com>
> ---
>  libavcodec/aacenc_utils.h | 24 +++++++++++++++---------
>  1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/libavcodec/aacenc_utils.h b/libavcodec/aacenc_utils.h
> index 0203b6e..800b78f 100644
> --- a/libavcodec/aacenc_utils.h
> +++ b/libavcodec/aacenc_utils.h
> @@ -37,20 +37,26 @@
>  #define ROUND_TO_ZERO 0.1054f
>  #define C_QUANT 0.4054f
>
> -static inline void abs_pow34_v(float *av_restrict out, const float *av_restrict in, const int size)
> -{
> -    int i;
> -    for (i = 0; i < size; i++) {
> -        float a = fabsf(in[i]);
> -        out[i] = sqrtf(a * sqrtf(a));
> -    }
> -}
> -
>  static inline float pos_pow34(float a)
>  {
>      return sqrtf(a * sqrtf(a));
>  }
>
> +static inline void abs_pow34_v(float *av_restrict out, const float *av_restrict in, const int size)
> +{
> +    av_assert2(!(size % 4));
> +    for (int i = 0; i < size; i+=4) {
> +        float a0 = fabsf(in[i]);
> +        float a1 = fabsf(in[i+1]);
> +        float a2 = fabsf(in[i+2]);
> +        float a3 = fabsf(in[i+3]);
> +        out[i  ] = pos_pow34(a0);
> +        out[i+1] = pos_pow34(a1);
> +        out[i+2] = pos_pow34(a2);
> +        out[i+3] = pos_pow34(a3);
> +    }
> +}
> +

I'm curious (and lazy): is GCC able to unroll by itself if you hint it with
a loop such as:

    int i;
    for (i = 0; i < (size & ~3); i++) {
        float a = fabsf(in[i]);
        out[i] = sqrtf(a * sqrtf(a));
    }

-- 
Clément B.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel