Aug 13, 2020, 18:23 by one...@gmail.com:

> Hi,
>
> patch attached.
>
> Please review and/or benchmark, especially .asm file.

I took a look. It's just the horizontal pass of an inverse 2-6 IDWT with clipping.

The code is so simple I wasn't able to find any obvious ways to improve it, except perhaps replacing the "mov xq, 0" with "xor xq, xq", since I think xor is more universally recognized by x86 CPUs as "zeroing a register", so it'll just allocate a pre-zeroed one. I could be wrong, though; it's just what everyone uses.

Maybe call it idwt_26_horiz instead of the vague horiz_filter, since that's what it is?

It's also called on a per-line basis in a loop, with 1 call and 3 adds per iteration, everywhere it's used. You could easily incorporate the loop into the function to reduce call overhead if you want to (and I think you should look into it, but I won't block the patch just for that). Registers might be a tight fit on 32-bit systems then, but even using the stack should be faster than a hot function call.

Aside from those nitpicks, LGTM.

SIMDing the remaining DSP function (interlaced_vertical_filter) should help a lot too, though that function is pretty much trivial, since it's just an average + deinterleave. That function should 100% have its 3-line loop incorporated into it, however, as you'll definitely have no shortage of registers, even on 32-bit systems.
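
For what it's worth, here is a rough C sketch of what I mean by folding the per-line loop into the function. Everything in it is made up for illustration: the names (row_kernel, filter_rows), the convention that strides are counted in elements, and the kernel body, which is a trivial sum/difference stand-in and not the actual 2-6 IDWT or the asm from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-row kernel. This is only a placeholder so the sketch
 * compiles; it is NOT the real 2-6 inverse DWT with clipping. */
static void row_kernel(int16_t *out, const int16_t *low,
                       const int16_t *high, int width)
{
    for (int x = 0; x < width; x++) {
        out[2 * x]     = (int16_t)((low[x] + high[x]) >> 1);
        out[2 * x + 1] = (int16_t)((low[x] - high[x]) >> 1);
    }
}

/* Loop folded into the function: the caller makes one call per plane
 * instead of one call plus three pointer adds per line. Strides are in
 * int16_t elements (an assumption of this sketch). */
static void filter_rows(int16_t *out, ptrdiff_t out_stride,
                        const int16_t *low, ptrdiff_t low_stride,
                        const int16_t *high, ptrdiff_t high_stride,
                        int width, int height)
{
    for (int y = 0; y < height; y++) {
        row_kernel(out, low, high, width);
        out  += out_stride;   /* these three adds used to live in */
        low  += low_stride;   /* every caller's loop              */
        high += high_stride;
    }
}
```

In the asm version the row loop would simply wrap the existing x loop, with the three strides passed as extra arguments, so each call site shrinks to a single call.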