Hi,

On Sun, Oct 4, 2015 at 3:46 PM, Paul B Mahol <one...@gmail.com> wrote:
> + .loop:
> +     movd       m10, [ana_matrix_rq+ 0]
> +     movd       m11, [ana_matrix_rq+ 4]
> +     movd       m12, [ana_matrix_rq+ 8]
> +     movd       m13, [ana_matrix_rq+12]
> +     movd       m14, [ana_matrix_rq+16]
> +     movd       m15, [ana_matrix_rq+20]
> +     pshufd     m10, m10, q0000
> +     pshufd     m11, m11, q0000
> +     pshufd     m12, m12, q0000
> +     pshufd     m13, m13, q0000
> +     pshufd     m14, m14, q0000
> +     pshufd     m15, m15, q0000
> [..]
> +     movd       m10, [ana_matrix_bq+ 0]
> +     movd       m11, [ana_matrix_bq+ 4]
> +     movd       m12, [ana_matrix_bq+ 8]
> +     movd       m13, [ana_matrix_bq+12]
> +     movd       m14, [ana_matrix_bq+16]
> +     movd       m15, [ana_matrix_bq+20]
> +     pshufd     m10, m10, q0000
> +     pshufd     m11, m11, q0000
> +     pshufd     m12, m12, q0000
> +     pshufd     m13, m13, q0000
> +     pshufd     m14, m14, q0000
> +     pshufd     m15, m15, q0000

So, you want more registers, right? :-D. OK, so let's talk stack usage. You want aligned stack space here to hold all these constants, so that you don't need to recreate them in each loop iteration. Change:

    cglobal name, n_args, n_gprs, n_xmms, arg1, arg2, arg3

to:

    cglobal name, n_args, n_gprs, n_xmms, aligned_memory_in_bytes, arg1, arg2, arg3

In your case, add memory of 6*mmsize*3. Now, in the function, prepare the stack space first:

    movd   m10, [ana_matrix_rq+0]
    [etc for the other r args]
    pshufd m10, m10, q0000
    [etc for the other r args]
    mova   [rsp+mmsize*0], m10
    [etc for the others into rsp+mmsize*1-5]

Now do the same for g/b in mmsize*6-11 and 12-17.

Now as pshufb argument, use [rsp+mmsize*0-17].

> +     packusdw   m1, m1
> +     packuswb   m1, m1
> +     pshufb     m7, m1, [rshuf]

Try to do r/g/b all at the same time (especially now that you have more registers available, since m10-15 are free), and packusdw r/g together, and then packuswb r/g and b/nothing together, so that you have a single output register instead of 3. That also saves you the pors at the end.

Ronald
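
For illustration, a minimal sketch of the stack setup described above, reduced to a single set of six coefficients. The function name splat_demo, the arguments dst, coeffs and n, and the use of pmulld as the consumer are assumptions made for the example and are not taken from the patch. It also assumes it is built inside the FFmpeg tree (for x86util.asm) and that n is a positive multiple of 4.

```asm
%include "libavutil/x86/x86util.asm"

SECTION .text

INIT_XMM sse4
; Hypothetical function: 3 args, 3 GPRs, 2 XMM registers, and
; 6*mmsize bytes of aligned stack requested through cglobal.
cglobal splat_demo, 3, 3, 2, 6*mmsize, dst, coeffs, n
    ; One-time setup, outside the loop: broadcast each 32-bit
    ; coefficient to all four lanes and park it on the stack.
%assign i 0
%rep 6
    movd       m0, [coeffsq+4*i]
    pshufd     m0, m0, q0000
    mova       [rsp+mmsize*i], m0
%assign i i+1
%endrep

.loop:
    movu       m1, [dstq]
    ; The constant is read back as an aligned memory operand, so it
    ; does not occupy an XMM register across iterations.
    pmulld     m1, [rsp+mmsize*0]
    movu       [dstq], m1
    add        dstq, mmsize
    sub        nd, 4
    jg .loop
    RET
```

Only slot 0 is consumed here; in the real function each of the 18 slots would feed the corresponding multiply, leaving m10-m15 free for other work.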