The assembly-optimized half-pel interpolation in some cases rounds the interpolated value when no rounding is requested. The result is an off-by-one error when one of the pixel values is zero.
Signed-off-by: Jerome Borsboom <jerome.borsb...@carpalis.nl>
---
In the put_no_rnd_pixels functions, the psubusb instruction subtracts one from each unsigned byte to correct for the rounding that the PAVGB instruction performs. The psubusb instruction, however, saturates when the result does not fit in the operand type, i.e. an unsigned byte. In this particular case, when the value of a pixel is 0, psubusb returns 0 instead of -1, because -1 does not fit in an unsigned byte and is saturated to 0. As a result, the interpolated value is not corrected for the rounding that PAVGB performs and is off by one.

The corrections below solved the issue for me, but I do not have a lot of experience in optimizing assembly, so a careful check of the solution's correctness is advisable. Furthermore, I have not checked the other assembly, but there may be more cases where psubusb does not give the desired result. A review by the owner/maintainer of the assembly code might be appropriate.
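The saturation problem described above can be reproduced with a scalar model of a single byte lane. The sketch below is not taken from the FFmpeg sources. All function names are hypothetical, PAVGB is assumed to compute the rounded-up average (a + b + 1) >> 1, and m6 is assumed to hold 0x01 in every byte, as the description implies.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PAVGB on one byte lane: average, rounded up (assumed). */
static uint8_t pavgb(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Scalar model of PSUBUSB on one byte lane: unsigned subtract, saturating at 0. */
static uint8_t psubusb(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Reference "no rounding" average: truncating (a + b) >> 1. */
static uint8_t avg_ref(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b) >> 1);
}

/* Old sequence: psubusb a, 1 ; PAVGB a, b */
static uint8_t avg_no_rnd_old(uint8_t a, uint8_t b)
{
    return pavgb(psubusb(a, 1), b);
}

/* Patched sequence: t = (a ^ b) & 1 ; PAVGB a, b ; psubb a, t */
static uint8_t avg_no_rnd_new(uint8_t a, uint8_t b)
{
    return (uint8_t)(pavgb(a, b) - ((a ^ b) & 1));
}

/* Count the byte pairs for which f disagrees with the truncating reference. */
static int count_mismatches(uint8_t (*f)(uint8_t, uint8_t))
{
    int n = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (f((uint8_t)a, (uint8_t)b) != avg_ref((uint8_t)a, (uint8_t)b))
                n++;
    return n;
}

/* Exhaustive self-check of the patched sequence over all 65536 byte pairs. */
static void check_patched_sequence(void)
{
    assert(count_mismatches(avg_no_rnd_new) == 0);
}
```

Under these assumptions, count_mismatches(avg_no_rnd_old) reports 128 mismatches: exactly the pairs where the decremented pixel is 0 and the other pixel is odd. The patched sequence matches the reference for every pair, because the rounded-up average exceeds the truncating one only when a + b is odd, which is when (a ^ b) & 1 is 1.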
 libavcodec/x86/hpeldsp.asm | 38 ++++++++++++++++++++++++++++++++------
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/libavcodec/x86/hpeldsp.asm b/libavcodec/x86/hpeldsp.asm
index ce5d7a4e28..bae2ba9880 100644
--- a/libavcodec/x86/hpeldsp.asm
+++ b/libavcodec/x86/hpeldsp.asm
@@ -145,10 +145,16 @@ cglobal put_no_rnd_pixels8_x2, 4,5
     mova m1, [r1+1]
     mova m3, [r1+r2+1]
     add r1, r4
-    psubusb m0, m6
-    psubusb m2, m6
+    mova m4, m0
+    pxor m4, m1
+    pand m4, m6
     PAVGB m0, m1
+    psubb m0, m4
+    mova m4, m2
+    pxor m4, m3
+    pand m4, m6
     PAVGB m2, m3
+    psubb m2, m4
     mova [r0], m0
     mova [r0+r2], m2
     mova m0, [r1]
@@ -157,10 +163,16 @@ cglobal put_no_rnd_pixels8_x2, 4,5
     mova m3, [r1+r2+1]
     add r0, r4
     add r1, r4
-    psubusb m0, m6
-    psubusb m2, m6
+    mova m4, m0
+    pxor m4, m1
+    pand m4, m6
     PAVGB m0, m1
+    psubb m0, m4
+    mova m4, m2
+    pxor m4, m3
+    pand m4, m6
     PAVGB m2, m3
+    psubb m2, m4
     mova [r0], m0
     mova [r0+r2], m2
     add r0, r4
@@ -227,18 +239,32 @@ cglobal put_no_rnd_pixels8_y2, 4,5
     mova m1, [r1+r2]
     mova m2, [r1+r4]
     add r1, r4
-    psubusb m1, m6
+    mova m3, m0
+    pxor m3, m1
+    pand m3, m6
     PAVGB m0, m1
+    psubb m0, m3
+    mova m3, m1
+    pxor m3, m2
+    pand m3, m6
     PAVGB m1, m2
+    psubb m1, m3
     mova [r0+r2], m0
     mova [r0+r4], m1
     mova m1, [r1+r2]
     mova m0, [r1+r4]
     add r0, r4
     add r1, r4
-    psubusb m1, m6
+    mova m3, m2
+    pxor m3, m1
+    pand m3, m6
     PAVGB m2, m1
+    psubb m2, m3
+    mova m3, m1
+    pxor m3, m0
+    pand m3, m6
     PAVGB m1, m0
+    psubb m1, m3
     mova [r0+r2], m2
     mova [r0+r4], m1
     add r0, r4
-- 
2.13.6