On Fri, Jul 27, 2018 at 4:03 PM, James Darnley wrote:
> On 2018-07-27 15:05, Henrik Gramner wrote:
>> Can't you just use 7 GPR:s on x86-32 as well?
>
> I'm sure I've done that in the past and at least 1 platform has always
> complained due to PIE or stack alignment or whatever, I think. I went
>
On 27 July 2018 at 12:47, James Darnley wrote:
> On 2018-07-26 17:29, Rostislav Pehlivanov wrote:
> > On 26 July 2018 at 12:28, James Darnley wrote:
>> +cglobal vertical_compose_haar_10bit, 3, 6, 4, b0, b1, w
> >> +DECLARE_REG_TMP 4,5
> >> +
> >> +mova m2, [pd_1]
> >> +mov r3d, wd
On 2018-07-27 15:05, Henrik Gramner wrote:
> On Fri, Jul 27, 2018 at 1:47 PM, James Darnley wrote:
>> On 2018-07-26 17:29, Rostislav Pehlivanov wrote:
+cglobal horizontal_compose_haar_10bit, 3, 6+ARCH_X86_64, 4, b, temp_, w, x, b2
+DECLARE_REG_TMP 2,5
+%if ARCH_X86_64
>
On Fri, Jul 27, 2018 at 1:47 PM, James Darnley wrote:
> On 2018-07-26 17:29, Rostislav Pehlivanov wrote:
>>> +cglobal horizontal_compose_haar_10bit, 3, 6+ARCH_X86_64, 4, b, temp_, w, x, b2
>>> +DECLARE_REG_TMP 2,5
>>> +%if ARCH_X86_64
>>> +%define tail r6d
>>> +%else
>>> +
On 2018-07-26 17:29, Rostislav Pehlivanov wrote:
> On 26 July 2018 at 12:28, James Darnley wrote:
>> +cglobal vertical_compose_haar_10bit, 3, 6, 4, b0, b1, w
>> +DECLARE_REG_TMP 4,5
>> +
>> +mova m2, [pd_1]
>> +mov r3d, wd
>> +and wd, ~(mmsize/4 - 1)
>> +shl wd, 2
>> +
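For readers following the width setup quoted above, here is a rough C sketch of what `and wd, ~(mmsize/4 - 1)` followed by `shl wd, 2` computes. It assumes 32-bit coefficients, which `pd_1` and the `mmsize/4` term suggest but the excerpt does not state outright. The function names are hypothetical, and the idea that the copy saved in `r3d` is later used for a scalar tail is my guess, not something shown in the quoted lines.

```c
/* Sketch only: helper names are hypothetical, not taken from the patch.
 * mmsize is the SIMD register size in bytes (16 for xmm, 32 for ymm). */

/* Mirrors `and wd, ~(mmsize/4 - 1)`: round the width down to a whole
 * number of vectors, assuming 4-byte coefficients (mmsize/4 per vector). */
static int simd_width(int w, int mmsize)
{
    return w & ~(mmsize / 4 - 1);
}

/* Mirrors `shl wd, 2`: turn the rounded coefficient count into a byte
 * count, again assuming 4 bytes per coefficient. */
static int simd_bytes(int w, int mmsize)
{
    return simd_width(w, mmsize) << 2;
}

/* Assumption: the original width kept in r3d lets the remaining
 * coefficients be handled separately after the vector loop. */
static int tail_count(int w, int mmsize)
{
    return w - simd_width(w, mmsize);
}
```

With a width of 1283 and 16-byte registers this gives 1280 coefficients (5120 bytes) for the vector loop and 3 left over. The mask trick only works because `mmsize/4` is a power of two.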
On 26 July 2018 at 12:28, James Darnley wrote:
> +
> +%macro HAAR_HORIZONTAL 0
> +
> +cglobal horizontal_compose_haar_10bit, 3, 6+ARCH_X86_64, 4, b, temp_, w, x, b2
> +DECLARE_REG_TMP 2,5
> +%if ARCH_X86_64
> +%define tail r6d
> +%else
> +%define tail dword wm
> +
On 26 July 2018 at 12:28, James Darnley wrote:
> Speed of ffmpeg when decoding a 720p yuv422p10 file encoded with the
> relevant transform.
> C: 119fps
> SSE2: 204fps
> AVX: 206fps
> AVX2: 221fps
>
> timer measurements, haar horizontal compose:
> sse2: 3.68x faster (45143 vs. 12279 decicy
Speed of ffmpeg when decoding a 720p yuv422p10 file encoded with the
relevant transform.
C: 119fps
SSE2: 204fps
AVX: 206fps
AVX2: 221fps
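
To put the decode figures above on one scale, they can be expressed as speedups over the C build. This is a small sketch using only the fps numbers listed; the helper name is hypothetical.

```c
/* Sketch only: fps_speedup is a hypothetical helper, not part of the patch.
 * Higher fps is better, so the speedup is optimised fps over baseline fps. */
static double fps_speedup(double simd_fps, double c_fps)
{
    return simd_fps / c_fps;
}
```

Calling it with 204, 206 and 221 against the 119fps baseline gives roughly 1.71x, 1.73x and 1.86x for SSE2, AVX and AVX2 respectively. These are whole-decode numbers, so they are expected to be smaller than per-function timer results.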
timer measurements, haar horizontal compose:
sse2: 3.68x faster (45143 vs. 12279 decicycles) compared with C
avx: 3.68x faster (45143 vs. 12275 deci