On 12/07/15 8:33 PM, Ronald S. Bultje wrote:
> +INIT_XMM sse4
> +cglobal ssim_end_line, 3, 3, 6, sum0, sum1, w
> +    pxor              m0, m0
> +.loop:
> +    mova              m1, [sum0q+mmsize*0]
> +    mova              m2, [sum0q+mmsize*1]
> +    mova              m3, [sum0q+mmsize*2]
> +    mova              m4, [sum0q+mmsize*3]
> +    paddd             m1, [sum1q+mmsize*0]
> +    paddd             m2, [sum1q+mmsize*1]
> +    paddd             m3, [sum1q+mmsize*2]
> +    paddd             m4, [sum1q+mmsize*3]
> +    paddd             m1, m2
> +    paddd             m2, m3
> +    paddd             m3, m4
> +    paddd             m4, [sum0q+mmsize*4]
> +    paddd             m4, [sum1q+mmsize*4]
> +    TRANSPOSE4x4D      1, 2, 3, 4, 5
> +
> +    ; m1 = fs1, m2 = fs2, m3 = fss, m4 = fs12
> +    pslld             m3, 6
> +    pslld             m4, 6
> +    pmulld            m5, m1, m2 ; fs1 * fs2
> +    pmulld            m1, m1 ; fs1 * fs1
> +    pmulld            m2, m2 ; fs2 * fs2

If these values are guaranteed to always be positive, this could also be implemented with pmuludq to get an SSE2 version working, although I'm not sure it's worth doing. It would take six pmuludq plus an awful lot of shuffling and unpacking, and the SSE4 version is already only ~2x as fast as the C version.

This was already OK'd (same with the PSNR SSE2 code), so it should be pushed already.
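
For reference, here is a rough sketch of what the pmuludq route would involve, written with C intrinsics rather than x86inc asm so it can be compiled standalone. The helper names (mullo_epi32_sse2, mul4_sse2) are made up for illustration and are not part of the patch. Each emulated pmulld costs two pmuludq (_mm_mul_epu32) plus shifts, shuffles and an unpack, which is where the "six pmuludq" for the three multiplies above comes from. As far as I can tell, the low 32 bits of the product are the same for signed and unsigned inputs, so the sign question only matters if the full 64-bit product is wanted.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: SSE2 stand-in for pmulld (_mm_mullo_epi32).
 * pmuludq only multiplies the even 32-bit lanes (0 and 2), so the odd
 * lanes are shifted down and multiplied separately, then the low halves
 * of the four 64-bit products are gathered back into lane order. */
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    /* 64-bit products of lanes 0 and 2 */
    __m128i even = _mm_mul_epu32(a, b);
    /* 64-bit products of lanes 1 and 3 (moved into lanes 0 and 2) */
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));

    /* Pack the low 32 bits of each product into lanes 0 and 1 */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));

    /* Interleave: [p0, p1, p2, p3] */
    return _mm_unpacklo_epi32(even, odd);
}

/* Hypothetical convenience wrapper for trying the helper on plain arrays. */
static void mul4_sse2(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, mullo_epi32_sse2(va, vb));
}
```

That is two shifts, two multiplies, two shuffles and an unpack per pmulld, so the overhead adds up quickly against a ~2x baseline.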