On 11.05.2016, at 20:37, Michael Niedermayer <mich...@niedermayer.cc> wrote:

> On Wed, May 11, 2016 at 06:39:20PM +0200, Matthieu Bouron wrote:
>> From: Matthieu Bouron <matthieu.bou...@stupeflix.com>
>> 
>> ---
>> 
>> Hello,
>> 
>> Here are some benchmarks on a rpi2 of the attached patch.
>> 
>> ./ffmpeg -f lavfi -i sine=440,aformat=sample_fmts=fltp,asetnsamples=4096,abench=start,aresample=48000,abench=stop -t 1000 -f null -
>> 
>> With patch:    avg=0.001159 speed=44,1x
>> Without patch: avg=0.001297 speed=40,8x
>> 
>> ./ffmpeg -f lavfi -i sine=440,aformat=sample_fmts=s16p,asetnsamples=4096,abench=start,aresample=48000,abench=stop -t 1000 -f null -
>> 
> 
>> With patch:    avg=0.001374 speed=45,6x
>> Without patch: avg=0.000782 speed=64,6x
> 
> so its slower ? or am i misreading this ?


Yes, that seems weird.
Also, what are common filter lengths?
For a length of 4, 8 or 16 I'd expect a fully unrolled loop to be much better,
and for longer ones at least a partially unrolled one.
Also, having the filter-length "if" inside the outer loop in the C code does
not seem ideal either, even if the compiler might manage to hoist it.
There's also the problem that on simple CPUs like most ARM cores, the jump
overhead is likely significant, so this might be a case where inline assembly
(or writing the whole function in assembly) would provide significant benefits.
Otherwise there's a risk that enabling the recently discussed -ftree-vectorize
for that file specifically would give better results.
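
To illustrate what I mean by unrolling and by moving the length check out of
the outer loop, here is a minimal plain-C sketch. It is not the code from the
patch: fir_apply, dot_generic and dot_4 are hypothetical names, and the
4-tap special case is only an assumed example of a "common" filter length.

```c
#include <stddef.h>

/* Generic inner product: one loop iteration (and one branch) per tap. */
static inline float dot_generic(const float *src, const float *filter, int len)
{
    float acc = 0.0f;
    for (int i = 0; i < len; i++)
        acc += src[i] * filter[i];
    return acc;
}

/* Fully unrolled variant for exactly 4 taps: no inner loop, no jumps. */
static inline float dot_4(const float *src, const float *filter)
{
    return src[0] * filter[0] + src[1] * filter[1]
         + src[2] * filter[2] + src[3] * filter[3];
}

/*
 * Hypothetical driver: dst[n] = sum over i of src[n + i] * filter[i].
 * The filter-length test is done once, outside the per-sample loop,
 * instead of being re-evaluated for every output sample.
 * src must hold at least n_out + filter_len - 1 samples.
 */
void fir_apply(float *dst, const float *src, const float *filter,
               int filter_len, int n_out)
{
    if (filter_len == 4) {
        for (int n = 0; n < n_out; n++)
            dst[n] = dot_4(src + n, filter);
    } else {
        for (int n = 0; n < n_out; n++)
            dst[n] = dot_generic(src + n, filter, filter_len);
    }
}
```

Whether this actually beats the NEON intrinsics version on a rpi2 would have
to be measured with the same abench setup as above.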

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
