Hi,

2015-10-18 2:47 GMT+02:00 Timothy Gu <timothyg...@gmail.com>:
> This function is only used within other inline asm functions, hence the
> HAVE_MMX_INLINE guard. Per recent discussions, we should not worry about
> the performance of inline asm-only builds.

On a quick glance, looks good.

> The conversion process has to start _somewhere_...

True.

> +.loop:
> +    movh               m2, [srcq]
> +    add              srcq, strideq
> +    movh               m3, [srcq]
> +    punpcklbw          m2, m0
> +    punpcklbw          m3, m0
> +    SHIFT2_LINE         0, 1, 2, 3, 4
> +    SHIFT2_LINE        24, 2, 3, 4, 1
> +    SHIFT2_LINE        48, 3, 4, 1, 2
> +    SHIFT2_LINE        72, 4, 1, 2, 3
> +    SHIFT2_LINE        96, 1, 2, 3, 4
> +    SHIFT2_LINE       120, 2, 3, 4, 1
> +    SHIFT2_LINE       144, 3, 4, 1, 2
> +    SHIFT2_LINE       168, 4, 1, 2, 3
> +    sub              srcq, stride_9minus4
> +    add              dstq, 8
> +    dec                 i
> +    jnz             .loop

The following remarks are for potential later work and food for thought.

I'm the first offender, but that loop expands to ~100 instructions. I
don't know what others may have as an opinion on this, but that might be
a tad much. So maybe specializing for particular shift and round values
(if possible, I don't remember) would be better.

Then there's the fact that the 16-wide blocks are currently handled as
2x8 (iirc), which would suggest doing part of this in C.

On the other hand, the idcts are not yet implemented, and there are h/w
decoders doing a better job of decoding vc1, so it may be a waste of
time (hence why I myself never did all of this).

-- 
Christophe
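
For reference, a scalar C sketch of the specialization idea mentioned above. It is not taken from the patch: it assumes the vertical shift2 pass applies the VC-1 half-pel taps (-1, 9, 9, -1), adds a rounder and shifts right, and all function names, the macro and the (rnd = 8, shift = 4) pair are hypothetical, chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed taps: VC-1 half-pel bicubic filter (-1, 9, 9, -1), applied
 * vertically. src must have one valid row above and two rows below. */
static inline int ver_shift2_tap(const uint8_t *src, ptrdiff_t stride)
{
    return -src[-stride] + 9 * src[0] + 9 * src[stride] - src[2 * stride];
}

/* Generic version: rnd and shift are runtime values.
 * dst_stride is counted in int16_t elements, src_stride in bytes.
 * Like FFmpeg code in general, this relies on >> being an arithmetic
 * shift for negative intermediate values. */
static inline void ver_shift2_generic(int16_t *dst, ptrdiff_t dst_stride,
                                      const uint8_t *src, ptrdiff_t src_stride,
                                      int w, int h, int rnd, int shift)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            dst[x] = (int16_t)((ver_shift2_tap(src + x, src_stride) + rnd) >> shift);
        src += src_stride;
        dst += dst_stride;
    }
}

/* Hypothetical specialization: with rnd and shift fixed at compile time,
 * the compiler can fold them into the inlined loop body. */
#define DEFINE_VER_SHIFT2(name, RND, SHIFT)                                 \
static void name(int16_t *dst, ptrdiff_t dst_stride,                        \
                 const uint8_t *src, ptrdiff_t src_stride, int w, int h)    \
{                                                                           \
    ver_shift2_generic(dst, dst_stride, src, src_stride, w, h, RND, SHIFT); \
}

/* Example instance; the (8, 4) pair is illustrative only. */
DEFINE_VER_SHIFT2(ver_shift2_r8_s4, 8, 4)
```

An asm counterpart would presumably do the same thing at macro-expansion time, instantiating the loop once per (rnd, shift) pair instead of loading both from registers; whether the set of pairs actually used is small enough to make that worthwhile is left open here, as it is in the remarks above.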