Thanks for the feedback. You are right, I can use VPERMQ to free up a
register. I can also remove the PAND mask by doing PSLLD/PSRLD. That
eliminates the need for an x86-64 block.
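
For readers following along, the mask-free extraction can be sketched in scalar C. This is only an illustration, not the patch code: it assumes the standard v210 layout (three 10-bit components in bits 0-9, 10-19 and 20-29 of each 32-bit little-endian word), and the function names are hypothetical. The left shift plays the role of PSLLD and the logical right shift the role of PSRLD.

```c
#include <stdint.h>

/* Hypothetical helper: middle 10-bit component (bits 10..19) of a v210 word,
 * extracted with a shift followed by an AND mask (the PSRLD + PAND form). */
static uint32_t v210_mid_mask(uint32_t w)
{
    return (w >> 10) & 0x3FFu;
}

/* Hypothetical helper: the same component with no mask constant.
 * The left shift by 12 discards bits 20..31, then the logical right
 * shift by 22 discards bits 0..9 (the PSLLD + PSRLD form). */
static uint32_t v210_mid_shift(uint32_t w)
{
    return (uint32_t)(w << 12) >> 22;
}
```

Both helpers return the same value for any 32-bit input; the shift-only form trades the mask constant (and the register holding it) for one extra shift.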
I tried the naive 'unrolled' version with no permute, and it was much slower,
about the same as the AVX/SSSE3 co
On 2019-03-01 18:41, Michael Stoner wrote:
> The AVX2 code leverages VPERMD to process 12 pixels/iteration. This is my
> first patch submission so any comments are greatly appreciated.
>
> -Mike
>
> Tested on Skylake (Win32 & Win64)
> 1920x1080 input frame
> =
> C code - 440
On 2019-03-03 15:44, Martin Vignali wrote:
> Hello,
>
> ...
>
> Not directly related to this patch, but it could be useful for testing
> purposes to write a checkasm test for the v210 decoding function.
> That makes it easier to check the performance for each CPU flag, and to be sure the
> various width cas
Hello,
A few comments.
You can use the VBROADCASTI128 macro instead of changing the size of the
constants. (VBROADCASTI128 loads 128 bits when using XMM, and broadcasts
the 128 bits to both lanes when using YMM.)
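
As a rough model of that behaviour (a plain C sketch, not the macro itself; the function name and the byte-array view of a register are assumptions made for illustration), a single 16-byte constant ends up in every 128-bit lane of the destination, so the constant table can stay 128 bits wide:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the broadcast: copy one 16-byte constant into each
 * 128-bit lane of a register that is reg_bytes wide (16 for XMM, 32 for YMM).
 * With reg_bytes == 16 this degenerates to a plain 128-bit load. */
static void broadcast128_model(uint8_t *reg, size_t reg_bytes,
                               const uint8_t src[16])
{
    for (size_t off = 0; off + 16 <= reg_bytes; off += 16)
        memcpy(reg + off, src, 16);
}
```

With a 32-byte destination, both halves hold identical copies of the constant, which is what in-lane instructions need when the same shuffle or multiplier pattern applies to each lane.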
The %if ARCH_X86_64 part seems strange.
It seems to be useful only for AVX2, not for SSE/AVX.
N
The AVX2 code leverages VPERMD to process 12 pixels/iteration. This is my
first patch submission so any comments are greatly appreciated.
-Mike
Tested on Skylake (Win32 & Win64)
1920x1080 input frame
=
C code - 440 fps
SSSE3 - 920 fps
AVX - 930 fps
AVX2 - 1040 fps
Reg