On Mon, Jun 19, 2017 at 5:11 PM, James Darnley <jdarn...@obe.tv> wrote:
> +    por     m1, m8, m13
> +    por     m1, m12
> +    por     m1, [blockq+ 16] ; { row[1] }[0-7]
> +    por     m1, [blockq+ 48] ; { row[3] }[0-7]
> +    por     m1, [blockq+ 80] ; { row[5] }[0-7]
> +    por     m1, [blockq+112] ; { row[7] }[0-7]
Using a single register (m1) as the destination for every por here creates a serial dependency chain: each instruction needs the previous one's result, so only one of them can execute per cycle. Splitting the accumulation across two destination registers (and ORing them together at the end) would double the (local) IPC. Out-of-order execution might alleviate this, but there is no reason to rely on it unnecessarily.

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
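A minimal C sketch of the suggested split, using SSE2 intrinsics (`_mm_or_si128` compiles to `por`). It is not code from the patch: the function names `or_rows_serial`, `or_rows_split` and `load_row` are hypothetical, and it assumes `block` holds 8 rows of 8 int16_t coefficients (16 bytes per row), which is inferred from the `[blockq+16*r]` / `row[r]` comments in the quoted asm. The serial version has a chain of 6 dependent ORs; the split version has two independent chains that merge at the end, giving a longest chain of 4.

```c
#include <emmintrin.h> /* SSE2 intrinsics; _mm_or_si128 compiles to por */
#include <stdint.h>

/* Row r of the (assumed) 8x8 int16_t block starts at byte offset 16*r,
 * matching the [blockq+16*r] operands quoted above. Unaligned loads keep
 * the sketch safe for any buffer; the asm's memory operands instead
 * require 16-byte alignment. */
static __m128i load_row(const int16_t *block, int r)
{
    return _mm_loadu_si128((const __m128i *)(block + 8 * r));
}

/* One accumulator, as in the patch: each OR waits for the previous one,
 * so the 6 ORs form a single dependency chain. */
static __m128i or_rows_serial(__m128i m8, __m128i m12, __m128i m13,
                              const int16_t *block)
{
    __m128i m1 = _mm_or_si128(m8, m13);
    m1 = _mm_or_si128(m1, m12);
    m1 = _mm_or_si128(m1, load_row(block, 1));
    m1 = _mm_or_si128(m1, load_row(block, 3));
    m1 = _mm_or_si128(m1, load_row(block, 5));
    m1 = _mm_or_si128(m1, load_row(block, 7));
    return m1;
}

/* Two accumulators: the m1 and m2 chains are independent until the final
 * merge, so two ORs can be in flight per cycle and the longest chain is
 * 4 ORs instead of 6. OR is associative and commutative, so the result is
 * bit-identical to the serial version. */
static __m128i or_rows_split(__m128i m8, __m128i m12, __m128i m13,
                             const int16_t *block)
{
    __m128i m1 = _mm_or_si128(m8, m13);
    __m128i m2 = _mm_or_si128(m12, load_row(block, 1));
    m1 = _mm_or_si128(m1, load_row(block, 3));
    m2 = _mm_or_si128(m2, load_row(block, 5));
    m1 = _mm_or_si128(m1, load_row(block, 7));
    return _mm_or_si128(m1, m2);
}
```

A C compiler may reassociate these ORs on its own, so the sketch only illustrates the shape of the change. In the asm itself this would mean accumulating into a second free register (m2 is assumed to be available here) and finishing with a single `por m1, m2`.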