On 11/18/2017 07:41 PM, James Almer wrote:
On 11/18/2017 3:31 PM, Rostislav Pehlivanov wrote:
On 18 November 2017 at 17:35, Rafal Dabrowa <fatwild...@gmail.com> wrote:
This is a proposal for performance optimizations of 8-bit
HEVC video decoding on the aarch64 platform with the NEON (SIMD) extension.
I'm testing my optimizations on a NanoPi M3 device, mainly using
the "Big Buck Bunny" video file in 1280x720 format.
The video file was pulled from the libde265.org page, see
http://www.libde265.org/hevc-bitstreams/bbb-1280x720-cfg06.mkv
The movie duration is 00:10:34.53.
The overall performance gain is about 2x. Without the optimizations,
playback of the movie practically stalls after a few seconds. With the
optimizations the file plays smoothly 99% of the time.
For performance testing the following command was used:
time ./ffmpeg -hide_banner -i ~/bbb-1280x720-cfg06.mkv -f yuv4mpegpipe - >/dev/null
The video file was pre-read before the test to minimize disk reads during
testing.
Program execution time without optimization was as follows:
real 11m48.576s
user 43m8.111s
sys 0m12.469s
Execution time with optimizations:
real 6m17.046s
user 21m19.792s
sys 0m14.724s
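As a quick check of the "about 2x" figure, the timings above can be converted to seconds and compared. This is only a sketch; the helper names `to_seconds` and `speedup` are made up for illustration and are not part of the patch or of FFmpeg.

```python
import re

def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '11m48.576s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

def speedup(before: str, after: str) -> float:
    """Ratio of the unoptimized time to the optimized time."""
    return to_seconds(before) / to_seconds(after)

# Figures copied from the two runs reported above.
real_gain = speedup("11m48.576s", "6m17.046s")
user_gain = speedup("43m8.111s", "21m19.792s")
print(f"real: {real_gain:.2f}x, user: {user_gain:.2f}x")
```

By this arithmetic the wall-clock gain is roughly 1.9x and the CPU-time gain roughly 2.0x.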
The patch contains optimizations for the most heavily used qpel, epel, sao
and idct functions. Two of the functions available for optimization are
used intensively but are not optimized in this patch:
hevc_v_loop_filter_luma_8 and hevc_h_loop_filter_luma_8.
I have no idea how they could be optimized, so I left them unoptimized.
Signed-off-by: Rafal Dabrowa <fatwild...@gmail.com>
---
libavcodec/aarch64/Makefile | 5 +
libavcodec/aarch64/hevcdsp_epel_8.S | 3949 ++++++++++++++++++++
libavcodec/aarch64/hevcdsp_idct_8.S | 1980 ++++++++++
libavcodec/aarch64/hevcdsp_init_aarch64.c | 170 +
libavcodec/aarch64/hevcdsp_qpel_8.S | 5666 +++++++++++++++++++++++++++++
libavcodec/aarch64/hevcdsp_sao_8.S | 166 +
libavcodec/hevcdsp.c | 2 +
libavcodec/hevcdsp.h | 1 +
8 files changed, 11939 insertions(+)
create mode 100644 libavcodec/aarch64/hevcdsp_epel_8.S
create mode 100644 libavcodec/aarch64/hevcdsp_idct_8.S
create mode 100644 libavcodec/aarch64/hevcdsp_init_aarch64.c
create mode 100644 libavcodec/aarch64/hevcdsp_qpel_8.S
create mode 100644 libavcodec/aarch64/hevcdsp_sao_8.S
Very nice.
The way we test SIMD is to put START_TIMER; and
STOP_TIMER("function_name"); (they're located in libavutil/timer.h) around where the
function gets called in the C code, then we do a run with the C code (no
SIMD) and a separate run with whatever SIMD optimizations we're
implementing. We take the last printed value of both runs and that's what's
used to measure speedup.
I don't think there's a need yet to split the patch into multiple patches,
one for each individual version; that's usually only done if some
function's C implementation is faster than the SIMD code.
It would be nice however to at least split it into two patches, one for
MC and one for SAO.
Could you explain which functions are MC?
I can split the patch into a few, but a dependency between the patches
is unavoidable because the non-optimized function pointers are
replaced with the optimized ones all together, in one function body.
One of the patches must add that function and the call to it.
Also, no way to use macros in aarch64 asm files? ~11k lines of code is a
lot to add, and I'm sure a sizable portion is duplicated with only some
small differences between functions.
I used macros sparingly because code without macros is
easier to understand and to improve. Sometimes even the order
of assembly instructions is important. But, of course, I can reduce
the code size using macros if the patch is going to be accepted. I didn't
know whether you would be interested in the patch at all.
Regarding performance testing: I wrapped every function in another
one which calls START_TIMER and STOP_TIMER. It looks like these macros
aren't reentrant, so I needed to force the program to run in a single
thread. Without this I got strange results that differed greatly between
runs, for example:
22190 UNITS in put_hevc_qpel_uni_h12_8, 16232 runs, 152 skips
1126 UNITS in put_hevc_qpel_uni_h12_8, 12001 runs, 4383 skips
Forcing single-threaded mode was not easy; the -filter_threads
option didn't help.
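When comparing runs, the last line printed for each function is the one that counts, and the number of calls is runs plus skips. A minimal sketch of pulling those values out of a log follows. It assumes the line format of the two sample lines above, and `parse_timer_log` is a hypothetical helper, not something in FFmpeg.

```python
import re

# Matches lines such as:
#   22190 UNITS in put_hevc_qpel_uni_h12_8, 16232 runs, 152 skips
LINE_RE = re.compile(r"(\d+) UNITS in (\w+),\s*(\d+) runs,\s*(\d+) skips")

def parse_timer_log(text: str) -> dict:
    """Return {function: (units, calls, skips)} from the last line per function."""
    results = {}
    for match in LINE_RE.finditer(text):
        units, name, runs, skips = match.groups()
        # Later lines overwrite earlier ones, so the last printed value wins.
        results[name] = (int(units), int(runs) + int(skips), int(skips))
    return results

sample = """\
22190 UNITS in put_hevc_qpel_uni_h12_8, 16232 runs, 152 skips
"""
print(parse_timer_log(sample))
```

Running the parser over the log of the unoptimized build and again over the log of the optimized build gives the two values to compare for each function.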
Below is the outcome. Meaning of the columns:
FUNCTION - the function to optimize
UNITS_NOOPT - last UNITS result in run without optimization
OPT - last UNITS result in run with optimization
CALLS - sum of runs and skips
NSKIPS - number of skips in non-optimized version
OSKIPS - number of skips in optimized version
FUNCTION UNITS_NOOPT OPT CALLS NSKIPS OSKIPS
-------------------------------------------------------------------------
idct_16x16_8 113074 24079 2097152 0 0
idct_32x32_8 587447 100434 524288 0 0
put_hevc_epel_bi_h4_8 7651 3654 524288 177 1857
put_hevc_epel_bi_h6_8 18377 6668 32768 0 0
put_hevc_epel_bi_h8_8 20644 6698 1048576 34 1298
put_hevc_epel_bi_h12_8 62927 18968 16384 0 0
put_hevc_epel_bi_h16_8 78601 21254 524288 0 4
put_hevc_epel_bi_h24_8 231004 53800 4096 0 0
put_hevc_epel_bi_h32_8 294058 63302 524288 0 0
put_hevc_epel_bi_hv4_8 13183 6264 2097152 67 3057
put_hevc_epel_bi_hv6_8 27672 12706 131072 0 0
put_hevc_epel_bi_hv8_8 31908 11184 2097152 4 1688
put_hevc_epel_bi_hv12_8 86370 29497 65536 0 0
put_hevc_epel_bi_hv16_8 104623 30717 1048576 0 3
put_hevc_epel_bi_hv24_8 302361 80610 8192 0 0
put_hevc_epel_bi_hv32_8 376614 92475 1048576 0 0
put_hevc_epel_bi_v4_8 7290 3368 2097152 338 4444
put_hevc_epel_bi_v6_8 19306 8423 65536 0 0
put_hevc_epel_bi_v8_8 20431 5795 2097152 12 2252
put_hevc_epel_bi_v12_8 61368 21050 16384 0 0
put_hevc_epel_bi_v16_8 74351 17655 1048576 0 9
put_hevc_epel_bi_v24_8 226914 51601 4096 0 0
put_hevc_epel_bi_v32_8 285476 55184 1048576 0 0
put_hevc_epel_h4_8 5826 3362 524288 667 2619
put_hevc_epel_h6_8 12852 5912 32768 0 0
put_hevc_epel_h8_8 13847 6009 1048576 237 1504
put_hevc_epel_h12_8 44210 17185 16384 0 0
put_hevc_epel_h16_8 53502 18642 524288 0 5
put_hevc_epel_h24_8 157030 48086 4096 0 0
put_hevc_epel_h32_8 193877 54837 524288 0 0
put_hevc_epel_hv4_8 11031 6379 2097152 316 1886
put_hevc_epel_hv6_8 23233 12730 131072 0 0
put_hevc_epel_hv8_8 25406 10989 2097152 21 1471
put_hevc_epel_hv12_8 70139 28821 65536 0 0
put_hevc_epel_hv16_8 81318 30190 1048576 0 4
put_hevc_epel_hv24_8 230829 75079 16384 0 0
put_hevc_epel_hv32_8 285945 92143 1048576 0 0
put_hevc_epel_uni_hv4_8 13255 7571 2097152 142 582
put_hevc_epel_uni_hv6_8 29279 14637 131072 0 0
put_hevc_epel_uni_hv8_8 31783 14114 1048576 0 26
put_hevc_epel_uni_hv12_8 85576 31757 32768 0 0
put_hevc_epel_uni_hv16_8 90346 29886 524288 0 0
put_hevc_epel_uni_hv24_8 281864 76862 1024 0 0
put_hevc_epel_uni_hv32_8 322135 91541 65536 0 0
put_hevc_epel_uni_v4_8 6826 3785 2097152 494 3496
put_hevc_epel_uni_v6_8 20113 10093 32768 0 0
put_hevc_epel_uni_v8_8 18883 6444 1048576 7 448
put_hevc_epel_uni_v12_8 59989 23523 8192 0 0
put_hevc_epel_uni_v16_8 63740 18096 262144 0 0
put_hevc_epel_uni_v24_8 208109 48880 512 0 0
put_hevc_epel_uni_v32_8 249717 50660 262144 0 0
put_hevc_epel_v4_8 5834 3056 2097152 970 5422
put_hevc_epel_v6_8 15541 8900 65536 0 0
put_hevc_epel_v8_8 14549 5476 2097152 296 3129
put_hevc_epel_v12_8 48518 22362 32768 0 0
put_hevc_epel_v16_8 53909 16483 1048576 0 23
put_hevc_epel_v24_8 166783 43662 4096 0 0
put_hevc_epel_v32_8 210650 47112 1048576 0 0
put_hevc_pel_bi_pixels4_8 4751 2923 2097152 7381 9232
put_hevc_pel_bi_pixels6_8 11774 5689 65536 0 0
put_hevc_pel_bi_pixels8_8 12269 4165 4194304 2298 12731
put_hevc_pel_bi_pixels12_8 36260 14031 65536 0 0
put_hevc_pel_bi_pixels16_8 42718 10421 4194304 21 3881
put_hevc_pel_bi_pixels24_8 137480 38423 32768 0 0
put_hevc_pel_bi_pixels32_8 172166 43996 8388608 0 3
put_hevc_pel_bi_pixels48_8 520118 133238 4096 0 0
put_hevc_pel_bi_pixels64_8 671892 173615 4194304 0 0
put_hevc_pel_pixels4_8 3859 3139 1048576 8926 9478
put_hevc_pel_pixels6_8 8453 6566 32768 0 0
put_hevc_pel_pixels8_8 7144 3093 4194304 4802 30239
put_hevc_pel_pixels12_8 25096 16648 65536 0 0
put_hevc_pel_pixels16_8 25472 9538 2097152 790 3094
put_hevc_pel_pixels24_8 93108 42948 32768 0 0
put_hevc_pel_pixels32_8 100331 37550 8388608 0 2
put_hevc_pel_pixels48_8 321258 137835 4096 0 0
put_hevc_pel_pixels64_8 387236 152538 4194304 0 0
put_hevc_qpel_bi_h4_8 34054 20498 16384 0 0
put_hevc_qpel_bi_h8_8 34264 10873 524288 0 801
put_hevc_qpel_bi_h12_8 85199 22938 16384 0 0
put_hevc_qpel_bi_h16_8 107035 20526 524288 0 488
put_hevc_qpel_bi_h24_8 323233 66440 16384 0 0
put_hevc_qpel_bi_h32_8 415699 76073 262144 0 0
put_hevc_qpel_bi_h48_8 1282990 246145 2048 0 0
put_hevc_qpel_bi_h64_8 1664853 260382 262144 0 0
put_hevc_qpel_bi_hv4_8 56239 31221 32768 0 0
put_hevc_qpel_bi_hv8_8 63859 21595 1048576 0 63
put_hevc_qpel_bi_hv12_8 143173 58139 65536 0 0
put_hevc_qpel_bi_hv16_8 184410 40468 1048576 0 15
put_hevc_qpel_bi_hv24_8 509364 134833 32768 0 0
put_hevc_qpel_bi_hv32_8 647015 125581 524288 0 0
put_hevc_qpel_bi_hv48_8 1929283 385204 4096 0 0
put_hevc_qpel_bi_hv64_8 2416442 430161 524288 0 0
put_hevc_qpel_bi_v4_8 37454 22461 32768 0 0
put_hevc_qpel_bi_v8_8 34500 9218 1048576 0 1291
put_hevc_qpel_bi_v12_8 87403 31659 32768 0 0
put_hevc_qpel_bi_v16_8 106589 19326 1048576 0 971
put_hevc_qpel_bi_v24_8 332644 78044 16384 0 0
put_hevc_qpel_bi_v32_8 405835 73886 524288 0 0
put_hevc_qpel_bi_v48_8 1266494 217496 2048 0 0
put_hevc_qpel_bi_v64_8 1677771 259481 524288 0 0
put_hevc_qpel_h4_8 29542 16982 16384 0 0
put_hevc_qpel_h8_8 26710 10452 524288 5 558
put_hevc_qpel_h12_8 67708 22021 16384 0 0
put_hevc_qpel_h16_8 81849 18637 524288 0 560
put_hevc_qpel_h24_8 258384 62392 16384 0 0
put_hevc_qpel_h32_8 321281 68451 262144 0 0
put_hevc_qpel_h48_8 984759 219657 2048 0 0
put_hevc_qpel_h64_8 1224717 227914 262144 0 0
put_hevc_qpel_hv4_8 51764 32150 32768 0 0
put_hevc_qpel_hv8_8 56369 21627 1048576 0 73
put_hevc_qpel_hv12_8 125191 48671 65536 0 0
put_hevc_qpel_hv16_8 159288 40749 1048576 0 10
put_hevc_qpel_hv24_8 438656 131331 32768 0 0
put_hevc_qpel_hv32_8 551607 121954 524288 0 0
put_hevc_qpel_hv48_8 1627266 397656 4096 0 0
put_hevc_qpel_hv64_8 2016176 414765 524288 0 0
put_hevc_qpel_uni_h4_8 21301 13384 131072 0 0
put_hevc_qpel_uni_h8_8 30057 11010 524288 7 486
put_hevc_qpel_uni_h12_8 84804 25790 16384 0 0
put_hevc_qpel_uni_h16_8 95333 24267 262144 0 17
put_hevc_qpel_uni_h24_8 318029 76951 4096 0 0
put_hevc_qpel_uni_h32_8 356799 72279 65536 0 0
put_hevc_qpel_uni_h48_8 1181308 237731 128 0 0
put_hevc_qpel_uni_h64_8 1401262 231221 16384 0 0
put_hevc_qpel_uni_hv4_8 39439 22837 262144 0 1
put_hevc_qpel_uni_hv8_8 60380 23283 1048576 0 77
put_hevc_qpel_uni_hv12_8 146759 56280 32768 0 0
put_hevc_qpel_uni_hv16_8 173329 45131 524288 0 2
put_hevc_qpel_uni_hv24_8 505434 139999 16384 0 0
put_hevc_qpel_uni_hv32_8 561402 120361 131072 0 0
put_hevc_qpel_uni_hv48_8 1854753 361780 256 0 0
put_hevc_qpel_uni_hv64_8 2142627 404073 32768 0 0
put_hevc_qpel_uni_v4_8 23081 12550 262144 0 0
put_hevc_qpel_uni_v8_8 30075 9971 1048576 5 511
put_hevc_qpel_uni_v12_8 89427 38025 16384 0 0
put_hevc_qpel_uni_v16_8 96131 21727 524288 0 23
put_hevc_qpel_uni_v24_8 328019 90689 8192 0 0
put_hevc_qpel_uni_v32_8 358340 71396 131072 0 0
put_hevc_qpel_uni_v48_8 1164812 176367 256 0 0
put_hevc_qpel_uni_v64_8 1464856 232866 32768 0 0
put_hevc_qpel_v4_8 31732 19999 32768 0 0
put_hevc_qpel_v8_8 25311 8967 1048576 10 1142
put_hevc_qpel_v12_8 67764 29917 32768 0 0
put_hevc_qpel_v16_8 78023 18260 1048576 0 819
put_hevc_qpel_v24_8 254724 75185 16384 0 0
put_hevc_qpel_v32_8 305639 69130 524288 0 0
put_hevc_qpel_v48_8 892900 240703 2048 0 0
put_hevc_qpel_v64_8 1149597 221632 524288 0 0
sao_edge_filter_8 600074 91811 524288 0 0
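The per-function gain is UNITS_NOOPT divided by OPT. A small sketch of that calculation for a few rows copied from the table (the `rows` list and the `gain` helper are illustrative only):

```python
# (function, UNITS_NOOPT, OPT) copied from rows of the table above.
rows = [
    ("idct_16x16_8", 113074, 24079),
    ("put_hevc_pel_pixels4_8", 3859, 3139),
    ("put_hevc_qpel_bi_hv64_8", 2416442, 430161),
    ("sao_edge_filter_8", 600074, 91811),
]

def gain(noopt_units: int, opt_units: int) -> float:
    """Speedup factor: unoptimized units divided by optimized units."""
    return noopt_units / opt_units

# Print the selected rows from the smallest gain to the largest.
for name, noopt, opt in sorted(rows, key=lambda r: gain(r[1], r[2])):
    print(f"{name:28s} {gain(noopt, opt):5.2f}x")
```

On these rows the gain ranges from roughly 1.2x for put_hevc_pel_pixels4_8 to roughly 6.5x for sao_edge_filter_8.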
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel