On Sat, Nov 18, 2017 at 06:35:48PM +0100, Rafal Dabrowa wrote:
>
> This is a proposal of performance optimizations for 8-bit
> hevc video decoding on aarch64 platform with neon (simd) extension.
>
> I'm testing my optimizations on NanoPi M3 device. I'm using
> mainly "Big Buck Bunny" video file in format 1280x720 for testing.
> The video file was pulled from libde265.org page, see
> http://www.libde265.org/hevc-bitstreams/bbb-1280x720-cfg06.mkv
> The movie duration is 00:10:34.53.
>
> Overall performance gain is about 2x. Without optimizations the movie
> playback stops in practice after a few seconds. With
> optimizations the file is played smoothly 99% of the time.
>
> For performance testing the following command was used:
>
>     time ./ffmpeg -hide_banner -i ~/bbb-1280x720-cfg06.mkv -f yuv4mpegpipe -
> >/dev/null
>
> The video file was pre-read before test to minimize disk reads during testing.
> Program execution time without optimization was as follows:
>
> real 11m48.576s
> user 43m8.111s
> sys 0m12.469s
>
> Execution time with optimizations:
>
> real 6m17.046s
> user 21m19.792s
> sys 0m14.724s
>

Can you post the results of checkasm --bench for hevc? Did you run it to
check for any calling convention violation?

>
> The patch contains optimizations for most heavily used qpel, epel, sao and
> idct
> functions. Among the functions provided for optimization there are two
> intensively used, but not optimized in this patch: hevc_v_loop_filter_luma_8
> and hevc_h_loop_filter_luma_8. I have no idea how they could be optimized
> hence I leaved them without optimizations.
>

You may want to check x86/hevc_deblock.asm then (no idea if these are
implemented).

[...]

> +function ff_hevc_put_hevc_pel_pixels4_8_neon, export=1
> +    mov x7, 128
> +1:  ld1 { v0.s }[0], [x1], x2
> +    ushll v4.8h, v0.8b, 6
> +    st1 { v4.d }[0], [x0], x7

Is using #128 not possible?

> +    subs x3, x3, 1
> +    b.ne 1b
> +    ret

Here and below: no use of the x6 register?

A few comments on the style:
- please use consistent spacing (the current function mismatches the later
  code), preferably using a relatively large number of spaces as common
  ground (check the other sources)
- we use capitalized size suffixes (B, H, ...); and IIRC the lower-case
  forms are problematic with some assemblers, but don't quote me on that.
- we don't use spaces between {}

> +endfunc
> +
> +function ff_hevc_put_hevc_pel_pixels6_8_neon, export=1
> +    mov x7, 120
> +1:  ld1 { v0.8b }, [x1], x2
> +    ushll v4.8h, v0.8b, 6
> +    st1 { v4.d }[0], [x0], 8

I think you need to use # as a prefix for the immediates.

> +    st1 { v4.s }[2], [x0], x7

I assume you can't use #120? Have you checked whether using #128 here and
decrementing x0 afterward isn't faster?

[...]

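For readers following the stride questions above, here is a minimal scalar
sketch in C of what the quoted pel_pixels6 loop appears to compute. It is
only a model under stated assumptions: the function name, the signature and
the DST_PITCH_BYTES constant are hypothetical, and the 128-byte row pitch is
inferred from the quoted code (mov x7, 128 for width 4, and 8 + 120 for
width 6), not taken from the patch description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed row pitch of the 16-bit intermediate buffer, in bytes.
 * Inferred from the quoted asm: 128 for width 4, 8 + 120 for width 6. */
#define DST_PITCH_BYTES 128

/* Hypothetical scalar model of the quoted pel_pixels6 loop; the name and
 * signature are illustrative only, not FFmpeg's. */
static void pel_pixels6_model(int16_t *dst, const uint8_t *src,
                              ptrdiff_t srcstride, int height)
{
    assert(height >= 0);
    for (int y = 0; y < height; y++) {
        /* ushll v4.8h, v0.8b, 6: widen each pixel to 16 bits and shift by 6.
         * The two st1 stores write 8 + 4 bytes, i.e. six int16_t values. */
        for (int x = 0; x < 6; x++)
            dst[x] = (int16_t)(src[x] << 6);
        src += srcstride;                          /* ld1 ..., [x1], x2 */
        dst += DST_PITCH_BYTES / sizeof(int16_t);  /* net advance: 8 + 120 */
    }
}
```

In this model the two post-increments (8, then 120) simply add up to one
128-byte row step. As far as I know, the immediate post-index form of a
single-lane st1 only accepts an immediate equal to the number of bytes
stored, which would explain why the step is held in a register rather than
written as #120 or #128.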
> +function ff_hevc_put_hevc_pel_bi_pixels32_8_neon, export=1
> +    mov x10, 128
> +1:  ld1 { v0.16b, v1.16b }, [x2], x3 // src
> +    ushll v16.8h, v0.8b, 6
> +    ushll2 v17.8h, v0.16b, 6
> +    ushll v18.8h, v1.8b, 6
> +    ushll2 v19.8h, v1.16b, 6
> +    ld1 { v20.8h, v21.8h, v22.8h, v23.8h }, [x4], x10 // src2
> +    sqadd v16.8h, v16.8h, v20.8h
> +    sqadd v17.8h, v17.8h, v21.8h
> +    sqadd v18.8h, v18.8h, v22.8h
> +    sqadd v19.8h, v19.8h, v23.8h
> +    sqrshrun v0.8b, v16.8h, 7
> +    sqrshrun2 v0.16b, v17.8h, 7
> +    sqrshrun v1.8b, v18.8h, 7
> +    sqrshrun2 v1.16b, v19.8h, 7

Does pairing help here?

    sqrshrun v0.8b, v16.8h, 7
    sqrshrun v1.8b, v18.8h, 7
    sqrshrun2 v0.16b, v17.8h, 7
    sqrshrun2 v1.16b, v19.8h, 7

[...]

> +    sqrshrun v0.8b, v16.8h, 7
> +    sqrshrun2 v0.16b, v17.8h, 7
> +    sqrshrun v1.8b, v18.8h, 7
> +    sqrshrun2 v1.16b, v19.8h, 7
> +    sqrshrun v2.8b, v20.8h, 7
> +    sqrshrun2 v2.16b, v21.8h, 7
> +    sqrshrun v3.8b, v22.8h, 7
> +    sqrshrun2 v3.16b, v23.8h, 7

Again, this might be a good candidate for attempting to shuffle the
instructions and see if it helps (there are many other places, I picked one
randomly).

> +.Lepel_filters:

const/endconst + align might be better for all these labels.

[...]

> +function ff_hevc_put_hevc_epel_hv12_8_neon, export=1
> +    add x10, x3, 3
> +    lsl x10, x10, 7
> +    sub sp, sp, x10 // tmp_array
> +    stp x0, x3, [sp, -16]!
> +    stp x5, x30, [sp, -16]!
> +    add x0, sp, 32
> +    sub x1, x1, x2
> +    add x3, x3, 3
> +    bl ff_hevc_put_hevc_epel_h12_8_neon
> +    ldp x5, x30, [sp], 16
> +    ldp x0, x3, [sp], 16
> +    load_epel_filterh x5, x4
> +    mov x5, 112
> +    mov x10, 128
> +    ld1 { v16.8h, v17.8h }, [sp], x10
> +    ld1 { v18.8h, v19.8h }, [sp], x10
> +    ld1 { v20.8h, v21.8h }, [sp], x10
> +1:  ld1 { v22.8h, v23.8h }, [sp], x10
> +    calc_epelh v4, v16, v18, v20, v22
> +    calc_epelh2 v4, v5, v16, v18, v20, v22
> +    calc_epelh v5, v17, v19, v21, v23
> +    st1 { v4.8h }, [x0], 16
> +    st1 { v5.4h }, [x0], x5
> +    subs x3, x3, 1
> +    b.eq 2f
> +
> +    ld1 { v16.8h, v17.8h }, [sp], x10
> +    calc_epelh v4, v18, v20, v22, v16
> +    calc_epelh2 v4, v5, v18, v20, v22, v16
> +    calc_epelh v5, v19, v21, v23, v17
> +    st1 { v4.8h }, [x0], 16
> +    st1 { v5.4h }, [x0], x5
> +    subs x3, x3, 1
> +    b.eq 2f
> +
> +    ld1 { v18.8h, v19.8h }, [sp], x10
> +    calc_epelh v4, v20, v22, v16, v18
> +    calc_epelh2 v4, v5, v20, v22, v16, v18
> +    calc_epelh v5, v21, v23, v17, v19
> +    st1 { v4.8h }, [x0], 16
> +    st1 { v5.4h }, [x0], x5
> +    subs x3, x3, 1
> +    b.eq 2f
> +
> +    ld1 { v20.8h, v21.8h }, [sp], x10
> +    calc_epelh v4, v22, v16, v18, v20
> +    calc_epelh2 v4, v5, v22, v16, v18, v20
> +    calc_epelh v5, v23, v17, v19, v21
> +    st1 { v4.8h }, [x0], 16
> +    st1 { v5.4h }, [x0], x5
> +    subs x3, x3, 1
> +    b.ne 1b

Introducing macros probably makes sense in these functions

[...]

> +8:  b 9f // 0
> +    nop
> +    nop
> +    nop
> +    st1 { v29.b }[0], [x7] // 1
> +    b 9f
> +    nop
> +    nop
> +    st1 { v29.h }[0], [x7] // 2
> +    b 9f
> +    nop
> +    nop
> +    st1 { v29.h }[0], [x7], 2 // 3
> +    st1 { v29.b }[2], [x7]
> +    b 9f
> +    nop
> +    st1 { v29.s }[0], [x7] // 4
> +    b 9f
> +    nop
> +    nop
> +    st1 { v29.s }[0], [x7], 4 // 5
> +    st1 { v29.b }[4], [x7]
> +    b 9f
> +    nop
> +    st1 { v29.s }[0], [x7], 4 // 6
> +    st1 { v29.h }[2], [x7]
> +    b 9f
> +    nop
> +    st1 { v29.s }[0], [x7], 4 // 7
> +    st1 { v29.h }[2], [x7], 2
> +    st1 { v29.b }[6], [x7]

What are these nops for? align?

[...]

Anyway, can you split your patch?
It's really a lot of code and there is no way anyone can review it properly
quickly. I also think macros would be welcome in many places to reduce the
size of the patch(es).

Regards,

-- 
Clément B.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel