James Almer<mailto:jamr...@gmail.com>: >On 11/20/2021 5:42 PM, Wu Jianhua wrote: >> James Almer<mailto:jamr...@gmail.com>: >> On 11/4/2021 1:18 AM, Wu Jianhua wrote: >>>> Performance data(Less is better): >>>> exposure_sse: 500491 >> >>> You reported a better result in the first patch. >> >> For they are tested on different baseline, I think it might be better to >> only compare these two values. >> >>>> exposure_avx2: 449122 >> >>> This looks like a really low speed up for a function that processes >>> twice the amount of floats per loop. >> >>>> >>>> Signed-off-by: Wu Jianhua <jianhua...@intel.com> >>>> --- >>>> libavfilter/x86/vf_exposure.asm | 15 +++++++++++++++ >>>> libavfilter/x86/vf_exposure_init.c | 6 ++++++ >>>> 2 files changed, 21 insertions(+) >>>> >>>> diff --git a/libavfilter/x86/vf_exposure.asm >>>> b/libavfilter/x86/vf_exposure.asm >>>> index 3351c6fb3b..f271167805 100644 >>>> --- a/libavfilter/x86/vf_exposure.asm >>>> +++ b/libavfilter/x86/vf_exposure.asm >>>> @@ -36,11 +36,21 @@ cglobal exposure, 2, 2, 4, ptr, length, black, scale >>>> VBROADCASTSS m1, xmm1 >>>> %endif >>>> >>>> +%if cpuflag(fma3) || cpuflag(fma4) >> >>> Remove the fma4 check if you're not using it. >> >> No problem. Avx2 flag is only initialized with fma3, so the fma4 is >> redundant indeed. >> >>>> + mulps m0, m0, m1 ; black * scale >>>> +%endif >>>> + >>>> .loop: >>>> +%if cpuflag(fma3) || cpuflag(fma4) >>>> + mova m2, m0 >>>> + vfmsub231ps m2, m1, [ptrq] >>>> + movu [ptrq], m2 > > >>> Have you tried to not use FMA for this and just kept the sub + mul even >>> for AVX2 and see how it performs? >> >> Yeah. Definitely. I have had sufficient tests before. The first version is >> kept sub + mul >> for AVX2. After that, I keep trying to find a way out to speed up it >> further. Using FMA >> here would be faster than sub + mul indeed, precisely, improving by 4%-10% >> approximately. >> Not that much better, but still an optimal way I found at the present.
> I tried the checkasm test you wrote and when i made the AVX2 version use > sub + mul instead of vfmsub231ps i noticed that i could change the > epsilon value to FLT_EPSILON instead of 0.01f and the test would still > succeed, meaning the output of the version using vfmsub231ps deviates a > bit from the normal sub + mul one. > The speed up is pretty small, so it may be worth just using the sub + > mul version instead. Yeah. Small, but it’s not called just one time. Many a little makes a mickle, isn’t it? I might be more prefer to keep this. _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".