On Sat, Dec 9, 2017 at 1:11 PM, Martin Vignali <martin.vign...@gmail.com> wrote:
> the idea in AVX2 is to load 128 bits of data (2x 64 bits)
> then shuffle across lanes, the two 64 bits in the low part of each lane, to
> keep the rest of the process similar
> to the sse version

What about using pmovzxbw instead of movu + vpermq + punpcklbw?

> for the store, the idea is similar in the opposite way (shuffle before
> store)

You could also do vextracti128 + 128-bit packuswb instead of 256-bit
packuswb + vpermq.
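
For illustration, here is a rough sketch of both suggestions written with C
intrinsics rather than the asm from the patch. The function names, the 0xD8
permute constant and the test vector are made up for this example and are not
taken from the patch. It only checks that the shorter sequences give the same
result as the ones described in the quoted text, and it assumes GCC or Clang
on x86.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

#define AVX2 __attribute__((target("avx2")))

/* Load as described in the quoted text: movu + vpermq + punpcklbw. */
AVX2 static __m256i widen_permute_unpack(const uint8_t *src)
{
    __m256i v = _mm256_castsi128_si256(_mm_loadu_si128((const __m128i *)src));
    v = _mm256_permute4x64_epi64(v, 0xD8); /* qword order 0,2,1,3: one source qword per lane */
    return _mm256_unpacklo_epi8(v, _mm256_setzero_si256());
}

/* Suggested load: a single pmovzxbw (16 bytes -> 16 words). */
AVX2 static __m256i widen_pmovzxbw(const uint8_t *src)
{
    return _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)src));
}

/* Store as described: 256-bit packuswb + vpermq. */
AVX2 static void narrow_pack_permute(uint8_t *dst, __m256i w)
{
    __m256i p = _mm256_packus_epi16(w, w);
    p = _mm256_permute4x64_epi64(p, 0xD8); /* bring qwords 0 and 2 into the low 128 bits */
    _mm_storeu_si128((__m128i *)dst, _mm256_castsi256_si128(p));
}

/* Suggested store: vextracti128 + 128-bit packuswb. */
AVX2 static void narrow_extract_pack(uint8_t *dst, __m256i w)
{
    __m128i lo = _mm256_castsi256_si128(w);
    __m128i hi = _mm256_extracti128_si256(w, 1);
    _mm_storeu_si128((__m128i *)dst, _mm_packus_epi16(lo, hi));
}

/* Returns 1 if both load variants and both store variants agree. */
AVX2 static int compare_variants(void)
{
    uint8_t src[16], a[16], b[16];
    uint16_t ha[16], hb[16];

    for (int i = 0; i < 16; i++)
        src[i] = (uint8_t)(i * 17); /* 0, 17, ..., 255 */

    __m256i wa = widen_permute_unpack(src);
    __m256i wb = widen_pmovzxbw(src);
    _mm256_storeu_si256((__m256i *)ha, wa);
    _mm256_storeu_si256((__m256i *)hb, wb);

    narrow_pack_permute(a, wa);
    narrow_extract_pack(b, wb);

    return !memcmp(ha, hb, sizeof(ha)) &&
           !memcmp(a, b, sizeof(a)) &&
           !memcmp(a, src, sizeof(a));
}

/* Returns -1 when the CPU has no AVX2, otherwise the comparison result. */
int variants_agree(void)
{
    if (!__builtin_cpu_supports("avx2"))
        return -1;
    return compare_variants();
}
```

Calling variants_agree() from a small test program should return 1 on an
AVX2 machine. Whether the shorter sequences are actually faster in the real
function would still need to be measured.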