Re: [FFmpeg-devel] [PATCH 1/3] diracdec: add 10-bit Haar SIMD functions

2018-07-27 Thread Henrik Gramner
On Fri, Jul 27, 2018 at 4:03 PM, James Darnley wrote:
> On 2018-07-27 15:05, Henrik Gramner wrote:
>> Can't you just use 7 GPR:s on x86-32 as well?
>
> I'm sure I've done that in the past and at least 1 platform has always
> complained due to PIE or stack alignment or whatever, I think. I went
>

Re: [FFmpeg-devel] [PATCH 1/3] diracdec: add 10-bit Haar SIMD functions

2018-07-27 Thread Rostislav Pehlivanov
On 27 July 2018 at 12:47, James Darnley wrote:
> On 2018-07-26 17:29, Rostislav Pehlivanov wrote:
> > On 26 July 2018 at 12:28, James Darnley wrote:
> > +cglobal vertical_compose_haar_10bit, 3, 6, 4, b0, b1, w
> >> +DECLARE_REG_TMP 4,5
> >> +
> >> +mova m2, [pd_1]
> >> +mov r3d, wd

Re: [FFmpeg-devel] [PATCH 1/3] diracdec: add 10-bit Haar SIMD functions

2018-07-27 Thread James Darnley
On 2018-07-27 15:05, Henrik Gramner wrote:
> On Fri, Jul 27, 2018 at 1:47 PM, James Darnley wrote:
>> On 2018-07-26 17:29, Rostislav Pehlivanov wrote:
+cglobal horizontal_compose_haar_10bit, 3, 6+ARCH_X86_64, 4, b, temp_, w, x, b2
+DECLARE_REG_TMP 2,5
+%if ARCH_X86_64
>

Re: [FFmpeg-devel] [PATCH 1/3] diracdec: add 10-bit Haar SIMD functions

2018-07-27 Thread Henrik Gramner
On Fri, Jul 27, 2018 at 1:47 PM, James Darnley wrote:
> On 2018-07-26 17:29, Rostislav Pehlivanov wrote:
>>> +cglobal horizontal_compose_haar_10bit, 3, 6+ARCH_X86_64, 4, b, temp_, w, x, b2
>>> +DECLARE_REG_TMP 2,5
>>> +%if ARCH_X86_64
>>> +%define tail r6d
>>> +%else
>>> +

Re: [FFmpeg-devel] [PATCH 1/3] diracdec: add 10-bit Haar SIMD functions

2018-07-27 Thread James Darnley
On 2018-07-26 17:29, Rostislav Pehlivanov wrote:
> On 26 July 2018 at 12:28, James Darnley wrote:
> +cglobal vertical_compose_haar_10bit, 3, 6, 4, b0, b1, w
>> +DECLARE_REG_TMP 4,5
>> +
>> +mova m2, [pd_1]
>> +mov r3d, wd
>> +and wd, ~(mmsize/4 - 1)
>> +shl wd, 2
>> +
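For readers less familiar with the assembly quoted above, the `and wd, ~(mmsize/4 - 1)` / `shl wd, 2` pair can be read as rounding the width down to a whole number of SIMD registers and then converting it to a byte count. The C sketch below mirrors only that arithmetic. The function names are hypothetical and not from the patch, and it assumes 32-bit coefficients (4 bytes each), which is what the `mmsize/4` and the shift by 2 suggest. The original width is saved in `r3d` first, presumably so the leftover columns can be handled separately.

```c
/* Hypothetical helpers (not part of the patch) mirroring the width setup.
 * mmsize is the SIMD register size in bytes: 16 for SSE2, 32 for AVX2. */

/* Width rounded down to a multiple of the number of 32-bit lanes. */
static int simd_aligned_width(int w, int mmsize)
{
    int lanes = mmsize / 4;      /* 32-bit coefficients per register */
    return w & ~(lanes - 1);     /* and wd, ~(mmsize/4 - 1) */
}

/* The same width expressed as a byte count. */
static int simd_aligned_bytes(int w, int mmsize)
{
    return simd_aligned_width(w, mmsize) << 2;   /* shl wd, 2 */
}

/* Columns left over for a non-vector tail (an assumption about how the
 * saved original width is used later in the function). */
static int simd_tail_width(int w, int mmsize)
{
    return w - simd_aligned_width(w, mmsize);
}
```

For example, with a width of 13 and 16-byte registers, 12 coefficients (48 bytes) fall in the vector part and 1 is left for the tail.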

Re: [FFmpeg-devel] [PATCH 1/3] diracdec: add 10-bit Haar SIMD functions

2018-07-26 Thread Rostislav Pehlivanov
On 26 July 2018 at 12:28, James Darnley wrote:
> +
> +%macro HAAR_HORIZONTAL 0
> +
> +cglobal horizontal_compose_haar_10bit, 3, 6+ARCH_X86_64, 4, b, temp_, w, x, b2
> +DECLARE_REG_TMP 2,5
> +%if ARCH_X86_64
> +%define tail r6d
> +%else
> +%define tail dword wm
> +

Re: [FFmpeg-devel] [PATCH 1/3] diracdec: add 10-bit Haar SIMD functions

2018-07-26 Thread Rostislav Pehlivanov
On 26 July 2018 at 12:28, James Darnley wrote:
> Speed of ffmpeg when decoding a 720p yuv422p10 file encoded with the
> relevant transform.
> C:    119fps
> SSE2: 204fps
> AVX:  206fps
> AVX2: 221fps
>
> timer measurements, haar horizontal compose:
> sse2: 3.68x faster (45143 vs. 12279 decicy

[FFmpeg-devel] [PATCH 1/3] diracdec: add 10-bit Haar SIMD functions

2018-07-26 Thread James Darnley
Speed of ffmpeg when decoding a 720p yuv422p10 file encoded with the relevant transform.
C:    119fps
SSE2: 204fps
AVX:  206fps
AVX2: 221fps

timer measurements, haar horizontal compose:
sse2: 3.68x faster (45143 vs. 12279 decicycles) compared with C
avx: 3.68x faster (45143 vs. 12275 deci
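As a quick check on how the figures above relate, the "x faster" numbers appear to be plain ratios of the C count to the SIMD count. The sketch below reproduces that arithmetic. The function name is hypothetical, and reading the figure as reference divided by optimised is an assumption, although it matches the quoted values.

```c
/* Hypothetical helper: speedup as the ratio of a reference measurement
 * to an optimised one. For timer counts (lower is better) pass
 * (reference, optimised); for fps (higher is better) pass
 * (optimised, reference). */
static double speedup_ratio(double numerator, double denominator)
{
    return numerator / denominator;
}
```

With the quoted counts, `speedup_ratio(45143, 12279)` rounds to 3.68 for the horizontal compose function on its own, while the whole-decoder figure `speedup_ratio(204, 119)` is roughly 1.71. The gap is expected, since the transform is only one part of the decoding time.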