Hi,
On Mon, Jul 11, 2016 at 6:15 PM, Henrik Gramner wrote:
> On Mon, Jul 11, 2016 at 11:48 PM, Carl Eugen Hoyos
> wrote:
> > Ronald S. Bultje writes:
> >
> >> +%if ARCH_X86_64
> >
> > Just curious: Why does this not work on x86-32?
> > Isn't there some asm magic that moves some
> >
On Mon, Jul 11, 2016 at 11:48 PM, Carl Eugen Hoyos wrote:
> Ronald S. Bultje writes:
>
>> +%if ARCH_X86_64
>
> Just curious: Why does this not work on x86-32?
> Isn't there some asm magic that moves some
> parameters to the stack if necessary?
>
> Carl Eugen
Uses more than 8 vector re
Ronald S. Bultje writes:
> +%if ARCH_X86_64
Just curious: Why does this not work on x86-32?
Isn't there some asm magic that moves some
parameters to the stack if necessary?
Carl Eugen
___
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
Hi,
On Sat, Jul 9, 2016 at 11:12 AM, James Almer wrote:
> On 7/8/2016 6:59 PM, Ronald S. Bultje wrote:
> > +%if ARCH_X86_64
> > +INIT_YMM avx2
>
> Add an %if HAVE_AVX2_EXTERNAL check here, because yasm 1.1.0 and older
> don't support avx2.
>
> lgtm aside from that.
Changed, and pushed.
Ronald
On 7/8/2016 6:59 PM, Ronald S. Bultje wrote:
> +%if ARCH_X86_64
> +INIT_YMM avx2
Add an %if HAVE_AVX2_EXTERNAL check here, because yasm 1.1.0 and older
don't support avx2.
lgtm aside from that.
> +cglobal vp9_idct_idct_16x16_add, 4, 4, 16, dst, stride, block, eob
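A note for readers following the ARCH_X86_64 guard: in x86inc's cglobal, the three numbers after the function name are the argument count, the general-purpose register count, and the vector register count. The quoted declaration therefore asks for 16 vector registers, while 32-bit x86 exposes only 8. The Python sketch below is not part of the patch; it only reads those fields out of the quoted line and compares them with the architectural limits. The names parse_cglobal and VECTOR_REGS are hypothetical, chosen for illustration.

```python
# Illustrative only: parse_cglobal and VECTOR_REGS are hypothetical names,
# not part of FFmpeg or x86inc.

# Architectural SIMD register counts without AVX-512.
VECTOR_REGS = {"x86_32": 8, "x86_64": 16}


def parse_cglobal(line):
    """Split an x86inc-style cglobal line into its name, counts and argument names."""
    body = line.split("cglobal", 1)[1]
    fields = [f.strip() for f in body.split(",")]
    name = fields[0]
    # Order in x86inc: number of arguments, number of GPRs, number of vector registers.
    num_args, num_gprs, num_vec = (int(f) for f in fields[1:4])
    return {
        "name": name,
        "args": num_args,
        "gprs": num_gprs,
        "vec": num_vec,
        "arg_names": fields[4:],
    }


decl = parse_cglobal(
    "cglobal vp9_idct_idct_16x16_add, 4, 4, 16, dst, stride, block, eob"
)

# Does the requested vector register count fit on each architecture?
fits = {arch: decl["vec"] <= limit for arch, limit in VECTOR_REGS.items()}
print(fits)  # {'x86_32': False, 'x86_64': True}
```

Under these assumptions the declaration as written only fits on x86-64, which is consistent with the guard in the patch.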
___
checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt}, shows
that it's about 1.65x as fast as the AVX version for the full IDCT, and
similar speedups for the sub-IDCTs:
nop: 24.6
vp9_inv_dct_dct_16x16_add_8_1_c: 6444.8
vp9_inv_dct_dct_16x16_add_8_1_sse2: 638.6
vp9_inv_dct_dct_16x16_add_8
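The remaining rows of this listing are cut off in the archive, so the 1.65x AVX-to-AVX2 figure cannot be recomputed from what is shown here. As a rough illustration of how such ratios are read off checkasm output, the Python sketch below uses only the two complete rows above. It assumes the reported values can be divided directly, without adjusting for the nop line; the speedup helper is a hypothetical name.

```python
# Figures copied from the benchmark lines visible above; the other rows are
# truncated in the archive and are deliberately not filled in.
timings = {
    "nop": 24.6,
    "vp9_inv_dct_dct_16x16_add_8_1_c": 6444.8,
    "vp9_inv_dct_dct_16x16_add_8_1_sse2": 638.6,
}


def speedup(baseline, optimized):
    """How many times faster the optimized version is than the baseline."""
    return baseline / optimized


# Assumption: values are compared as reported, with no nop adjustment.
ratio = speedup(
    timings["vp9_inv_dct_dct_16x16_add_8_1_c"],
    timings["vp9_inv_dct_dct_16x16_add_8_1_sse2"],
)
print(f"{ratio:.2f}x")  # 10.09x
```

The same division applied to the AVX and AVX2 rows of the full listing would give the figure quoted in the message.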
On Fri, Jul 08, 2016 at 04:40:28PM -0400, Ronald S. Bultje wrote:
> checkasm --bench, 10k runs, for *_add_${bpc}_${sub_idct}_${opt}, shows
> that it's about 1.65x as fast as the AVX version for the full IDCT, and
> similar speedups for the sub-IDCTs:
>
> nop: 24.6
> vp9_inv_dct_dct_16x16_add_8_1_c