On 11/18/2017 07:41 PM, James Almer wrote:
On 11/18/2017 3:31 PM, Rostislav Pehlivanov wrote:


On 18 November 2017 at 17:35, Rafal Dabrowa <fatwild...@gmail.com> wrote:

This patch proposes performance optimizations for 8-bit
HEVC video decoding on the aarch64 platform with the NEON (SIMD) extension.

I'm testing my optimizations on a NanoPi M3 device. For testing I mainly
use the "Big Buck Bunny" video file in 1280x720 format.
The video file was pulled from the libde265.org page, see
http://www.libde265.org/hevc-bitstreams/bbb-1280x720-cfg06.mkv
The movie duration is 00:10:34.53.

Overall performance gain is about 2x. Without the optimizations, movie
playback effectively stalls after a few seconds. With the
optimizations, the file plays smoothly 99% of the time.

For performance testing the following command was used:

     time ./ffmpeg -hide_banner -i ~/bbb-1280x720-cfg06.mkv -f yuv4mpegpipe - >/dev/null

The video file was pre-read before the test to minimize disk reads during
testing.
Program execution time without optimizations was as follows:

real    11m48.576s
user    43m8.111s
sys     0m12.469s

Execution time with optimizations:

real    6m17.046s
user    21m19.792s
sys     0m14.724s
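
As a quick cross-check of the "about 2x" figure, the two `time` outputs above can be converted to seconds and divided. This is only an illustrative sketch; the helper name to_seconds is made up here and is not part of the patch.

```python
import re

def to_seconds(t: str) -> float:
    """Convert a `time`-style duration such as '11m48.576s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", t)
    if m is None:
        raise ValueError(f"unrecognized duration: {t!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

# Values copied from the two runs above.
before = {"real": "11m48.576s", "user": "43m8.111s"}
after = {"real": "6m17.046s", "user": "21m19.792s"}

# Speedup = time without optimizations / time with optimizations.
speedup = {k: to_seconds(before[k]) / to_seconds(after[k]) for k in before}
for k, v in speedup.items():
    print(f"{k}: {v:.2f}x")
```

With these numbers the ratio comes out at roughly 1.88x in wall-clock time and 2.02x in user CPU time.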


The patch contains optimizations for the most heavily used qpel, epel, sao
and idct functions. Two of the functions available for optimization are
used intensively but are not optimized in this patch:
hevc_v_loop_filter_luma_8 and hevc_h_loop_filter_luma_8.
I have no idea how they could be optimized,
so I left them unoptimized.



Signed-off-by: Rafal Dabrowa <fatwild...@gmail.com>
---
  libavcodec/aarch64/Makefile               |    5 +
  libavcodec/aarch64/hevcdsp_epel_8.S       | 3949 ++++++++++++++++++++
  libavcodec/aarch64/hevcdsp_idct_8.S       | 1980 ++++++++++
  libavcodec/aarch64/hevcdsp_init_aarch64.c |  170 +
  libavcodec/aarch64/hevcdsp_qpel_8.S       | 5666 +++++++++++++++++++++++++++++
  libavcodec/aarch64/hevcdsp_sao_8.S        |  166 +
  libavcodec/hevcdsp.c                      |    2 +
  libavcodec/hevcdsp.h                      |    1 +
  8 files changed, 11939 insertions(+)
  create mode 100644 libavcodec/aarch64/hevcdsp_epel_8.S
  create mode 100644 libavcodec/aarch64/hevcdsp_idct_8.S
  create mode 100644 libavcodec/aarch64/hevcdsp_init_aarch64.c
  create mode 100644 libavcodec/aarch64/hevcdsp_qpel_8.S
  create mode 100644 libavcodec/aarch64/hevcdsp_sao_8.S


Very nice.
The way we test SIMD is to put START_TIMER("function_name"); and
STOP_TIMER; (they're located in libavutil/timer.h) around where the
function gets called in the C code, then we do a run with the C code (no
SIMD) and a separate run with whatever SIMD optimizations we're
implementing. We take the last printed value of both runs and that's what's
used to measure speedup.

I don't think there's a need yet to split the patch into multiple patches for
each individual version, though; that's usually only done if some
function's C implementation is faster than the SIMD code.
It would be nice however to at least split it into two patches, one for
MC and one for SAO.

Could you explain which functions are MC?

I can split the patch into a few, but a dependency between the patches
is unavoidable because the non-optimized function pointers are
replaced with the optimized ones all together, in one function body.
One of the patches must add the function and must add the function call.

Also, no way to use macros in aarch64 asm files? ~11k lines of code is a
lot to add, and I'm sure a sizable portion is duplicated with only some
small differences between functions.

I used macros sparingly because code without macros is
easier to understand and to improve. Sometimes even the order
of assembly instructions matters. But, of course, I can reduce
the code size using macros if the patch is going to be accepted. I didn't
know whether you would be interested in the patch at all.


Regarding performance testing: I wrapped every function in another
one, which calls START_TIMER and STOP_TIMER. It looks like these macros
aren't reentrant, so I needed to force the program to run in a single thread.
Without this I got strange results that differed widely between runs, for example:

22190 UNITS in put_hevc_qpel_uni_h12_8,   16232 runs,    152 skips
1126 UNITS in put_hevc_qpel_uni_h12_8,   12001 runs,   4383 skips

Forcing single-threaded mode was not easy; the -filter_threads
option didn't help.

Below is the outcome. Meaning of the columns:

FUNCTION - the function to optimize
UNITS_NOOPT - last UNITS result in run without optimization
OPT - last UNITS result in run with optimization
CALLS - sum of runs and skips
NSKIPS - number of skips in non-optimized version
OSKIPS - number of skips in optimized version
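
For illustration, here is a rough sketch of how the CALLS column can be derived from the timer output. The regular expression is written against the two example lines quoted earlier, and the function name parse_timer_line is hypothetical; this is not necessarily the script that was used to build the table.

```python
import re

# Pattern for lines such as:
#   "22190 UNITS in put_hevc_qpel_uni_h12_8,   16232 runs,    152 skips"
LINE_RE = re.compile(r"\s*(\d+) UNITS in (\w+),\s*(\d+) runs,\s*(\d+) skips")

def parse_timer_line(line: str) -> dict:
    """Split one timer line into its fields and add CALLS = runs + skips."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"not a timer line: {line!r}")
    units, name, runs, skips = m.groups()
    return {
        "function": name,
        "units": int(units),
        "runs": int(runs),
        "skips": int(skips),
        "calls": int(runs) + int(skips),
    }

# The two example lines from the multi-threaded runs quoted earlier.
samples = [
    "22190 UNITS in put_hevc_qpel_uni_h12_8,   16232 runs,    152 skips",
    "1126 UNITS in put_hevc_qpel_uni_h12_8,   12001 runs,   4383 skips",
]
records = [parse_timer_line(s) for s in samples]
for rec in records:
    print(rec["function"], rec["units"], rec["calls"])
```

Both example lines add up to the same number of calls even though their UNITS values differ a lot.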


FUNCTION                 UNITS_NOOPT      OPT     CALLS   NSKIPS OSKIPS
-----------------------------------------------------------------------
idct_16x16_8                  113074    24079   2097152        0      0
idct_32x32_8                  587447   100434    524288        0      0
put_hevc_epel_bi_h4_8           7651     3654    524288      177   1857
put_hevc_epel_bi_h6_8          18377     6668     32768        0      0
put_hevc_epel_bi_h8_8          20644     6698   1048576       34   1298
put_hevc_epel_bi_h12_8         62927    18968     16384        0      0
put_hevc_epel_bi_h16_8         78601    21254    524288        0      4
put_hevc_epel_bi_h24_8        231004    53800      4096        0      0
put_hevc_epel_bi_h32_8        294058    63302    524288        0      0
put_hevc_epel_bi_hv4_8         13183     6264   2097152       67   3057
put_hevc_epel_bi_hv6_8         27672    12706    131072        0      0
put_hevc_epel_bi_hv8_8         31908    11184   2097152        4   1688
put_hevc_epel_bi_hv12_8        86370    29497     65536        0      0
put_hevc_epel_bi_hv16_8       104623    30717   1048576        0      3
put_hevc_epel_bi_hv24_8       302361    80610      8192        0      0
put_hevc_epel_bi_hv32_8       376614    92475   1048576        0      0
put_hevc_epel_bi_v4_8           7290     3368   2097152      338   4444
put_hevc_epel_bi_v6_8          19306     8423     65536        0      0
put_hevc_epel_bi_v8_8          20431     5795   2097152       12   2252
put_hevc_epel_bi_v12_8         61368    21050     16384        0      0
put_hevc_epel_bi_v16_8         74351    17655   1048576        0      9
put_hevc_epel_bi_v24_8        226914    51601      4096        0      0
put_hevc_epel_bi_v32_8        285476    55184   1048576        0      0
put_hevc_epel_h4_8              5826     3362    524288      667   2619
put_hevc_epel_h6_8             12852     5912     32768        0      0
put_hevc_epel_h8_8             13847     6009   1048576      237   1504
put_hevc_epel_h12_8            44210    17185     16384        0      0
put_hevc_epel_h16_8            53502    18642    524288        0      5
put_hevc_epel_h24_8           157030    48086      4096        0      0
put_hevc_epel_h32_8           193877    54837    524288        0      0
put_hevc_epel_hv4_8            11031     6379   2097152      316   1886
put_hevc_epel_hv6_8            23233    12730    131072        0      0
put_hevc_epel_hv8_8            25406    10989   2097152       21   1471
put_hevc_epel_hv12_8           70139    28821     65536        0      0
put_hevc_epel_hv16_8           81318    30190   1048576        0      4
put_hevc_epel_hv24_8          230829    75079     16384        0      0
put_hevc_epel_hv32_8          285945    92143   1048576        0      0
put_hevc_epel_uni_hv4_8        13255     7571   2097152      142    582
put_hevc_epel_uni_hv6_8        29279    14637    131072        0      0
put_hevc_epel_uni_hv8_8        31783    14114   1048576        0     26
put_hevc_epel_uni_hv12_8       85576    31757     32768        0      0
put_hevc_epel_uni_hv16_8       90346    29886    524288        0      0
put_hevc_epel_uni_hv24_8      281864    76862      1024        0      0
put_hevc_epel_uni_hv32_8      322135    91541     65536        0      0
put_hevc_epel_uni_v4_8          6826     3785   2097152      494   3496
put_hevc_epel_uni_v6_8         20113    10093     32768        0      0
put_hevc_epel_uni_v8_8         18883     6444   1048576        7    448
put_hevc_epel_uni_v12_8        59989    23523      8192        0      0
put_hevc_epel_uni_v16_8        63740    18096    262144        0      0
put_hevc_epel_uni_v24_8       208109    48880       512        0      0
put_hevc_epel_uni_v32_8       249717    50660    262144        0      0
put_hevc_epel_v4_8              5834     3056   2097152      970   5422
put_hevc_epel_v6_8             15541     8900     65536        0      0
put_hevc_epel_v8_8             14549     5476   2097152      296   3129
put_hevc_epel_v12_8            48518    22362     32768        0      0
put_hevc_epel_v16_8            53909    16483   1048576        0     23
put_hevc_epel_v24_8           166783    43662      4096        0      0
put_hevc_epel_v32_8           210650    47112   1048576        0      0
put_hevc_pel_bi_pixels4_8       4751     2923   2097152     7381   9232
put_hevc_pel_bi_pixels6_8      11774     5689     65536        0      0
put_hevc_pel_bi_pixels8_8      12269     4165   4194304     2298  12731
put_hevc_pel_bi_pixels12_8     36260    14031     65536        0      0
put_hevc_pel_bi_pixels16_8     42718    10421   4194304       21   3881
put_hevc_pel_bi_pixels24_8    137480    38423     32768        0      0
put_hevc_pel_bi_pixels32_8    172166    43996   8388608        0      3
put_hevc_pel_bi_pixels48_8    520118   133238      4096        0      0
put_hevc_pel_bi_pixels64_8    671892   173615   4194304        0      0
put_hevc_pel_pixels4_8          3859     3139   1048576     8926   9478
put_hevc_pel_pixels6_8          8453     6566     32768        0      0
put_hevc_pel_pixels8_8          7144     3093   4194304     4802  30239
put_hevc_pel_pixels12_8        25096    16648     65536        0      0
put_hevc_pel_pixels16_8        25472     9538   2097152      790   3094
put_hevc_pel_pixels24_8        93108    42948     32768        0      0
put_hevc_pel_pixels32_8       100331    37550   8388608        0      2
put_hevc_pel_pixels48_8       321258   137835      4096        0      0
put_hevc_pel_pixels64_8       387236   152538   4194304        0      0
put_hevc_qpel_bi_h4_8          34054    20498     16384        0      0
put_hevc_qpel_bi_h8_8          34264    10873    524288        0    801
put_hevc_qpel_bi_h12_8         85199    22938     16384        0      0
put_hevc_qpel_bi_h16_8        107035    20526    524288        0    488
put_hevc_qpel_bi_h24_8        323233    66440     16384        0      0
put_hevc_qpel_bi_h32_8        415699    76073    262144        0      0
put_hevc_qpel_bi_h48_8       1282990   246145      2048        0      0
put_hevc_qpel_bi_h64_8       1664853   260382    262144        0      0
put_hevc_qpel_bi_hv4_8         56239    31221     32768        0      0
put_hevc_qpel_bi_hv8_8         63859    21595   1048576        0     63
put_hevc_qpel_bi_hv12_8       143173    58139     65536        0      0
put_hevc_qpel_bi_hv16_8       184410    40468   1048576        0     15
put_hevc_qpel_bi_hv24_8       509364   134833     32768        0      0
put_hevc_qpel_bi_hv32_8       647015   125581    524288        0      0
put_hevc_qpel_bi_hv48_8      1929283   385204      4096        0      0
put_hevc_qpel_bi_hv64_8      2416442   430161    524288        0      0
put_hevc_qpel_bi_v4_8          37454    22461     32768        0      0
put_hevc_qpel_bi_v8_8          34500     9218   1048576        0   1291
put_hevc_qpel_bi_v12_8         87403    31659     32768        0      0
put_hevc_qpel_bi_v16_8        106589    19326   1048576        0    971
put_hevc_qpel_bi_v24_8        332644    78044     16384        0      0
put_hevc_qpel_bi_v32_8        405835    73886    524288        0      0
put_hevc_qpel_bi_v48_8       1266494   217496      2048        0      0
put_hevc_qpel_bi_v64_8       1677771   259481    524288        0      0
put_hevc_qpel_h4_8             29542    16982     16384        0      0
put_hevc_qpel_h8_8             26710    10452    524288        5    558
put_hevc_qpel_h12_8            67708    22021     16384        0      0
put_hevc_qpel_h16_8            81849    18637    524288        0    560
put_hevc_qpel_h24_8           258384    62392     16384        0      0
put_hevc_qpel_h32_8           321281    68451    262144        0      0
put_hevc_qpel_h48_8           984759   219657      2048        0      0
put_hevc_qpel_h64_8          1224717   227914    262144        0      0
put_hevc_qpel_hv4_8            51764    32150     32768        0      0
put_hevc_qpel_hv8_8            56369    21627   1048576        0     73
put_hevc_qpel_hv12_8          125191    48671     65536        0      0
put_hevc_qpel_hv16_8          159288    40749   1048576        0     10
put_hevc_qpel_hv24_8          438656   131331     32768        0      0
put_hevc_qpel_hv32_8          551607   121954    524288        0      0
put_hevc_qpel_hv48_8         1627266   397656      4096        0      0
put_hevc_qpel_hv64_8         2016176   414765    524288        0      0
put_hevc_qpel_uni_h4_8         21301    13384    131072        0      0
put_hevc_qpel_uni_h8_8         30057    11010    524288        7    486
put_hevc_qpel_uni_h12_8        84804    25790     16384        0      0
put_hevc_qpel_uni_h16_8        95333    24267    262144        0     17
put_hevc_qpel_uni_h24_8       318029    76951      4096        0      0
put_hevc_qpel_uni_h32_8       356799    72279     65536        0      0
put_hevc_qpel_uni_h48_8      1181308   237731       128        0      0
put_hevc_qpel_uni_h64_8      1401262   231221     16384        0      0
put_hevc_qpel_uni_hv4_8        39439    22837    262144        0      1
put_hevc_qpel_uni_hv8_8        60380    23283   1048576        0     77
put_hevc_qpel_uni_hv12_8      146759    56280     32768        0      0
put_hevc_qpel_uni_hv16_8      173329    45131    524288        0      2
put_hevc_qpel_uni_hv24_8      505434   139999     16384        0      0
put_hevc_qpel_uni_hv32_8      561402   120361    131072        0      0
put_hevc_qpel_uni_hv48_8     1854753   361780       256        0      0
put_hevc_qpel_uni_hv64_8     2142627   404073     32768        0      0
put_hevc_qpel_uni_v4_8         23081    12550    262144        0      0
put_hevc_qpel_uni_v8_8         30075     9971   1048576        5    511
put_hevc_qpel_uni_v12_8        89427    38025     16384        0      0
put_hevc_qpel_uni_v16_8        96131    21727    524288        0     23
put_hevc_qpel_uni_v24_8       328019    90689      8192        0      0
put_hevc_qpel_uni_v32_8       358340    71396    131072        0      0
put_hevc_qpel_uni_v48_8      1164812   176367       256        0      0
put_hevc_qpel_uni_v64_8      1464856   232866     32768        0      0
put_hevc_qpel_v4_8             31732    19999     32768        0      0
put_hevc_qpel_v8_8             25311     8967   1048576       10   1142
put_hevc_qpel_v12_8            67764    29917     32768        0      0
put_hevc_qpel_v16_8            78023    18260   1048576        0    819
put_hevc_qpel_v24_8           254724    75185     16384        0      0
put_hevc_qpel_v32_8           305639    69130    524288        0      0
put_hevc_qpel_v48_8           892900   240703      2048        0      0
put_hevc_qpel_v64_8          1149597   221632    524288        0      0
sao_edge_filter_8             600074    91811    524288        0      0
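
As a sketch of how these numbers translate into per-function speedup (the last UNITS value without optimization divided by the last value with optimization, as described earlier in the thread), here is a small example using a handful of rows copied from the table above. The selection of rows and the helper name speedup are arbitrary choices made for illustration.

```python
# A few (function, UNITS_NOOPT, OPT) triples copied from the table above.
rows = [
    ("idct_16x16_8", 113074, 24079),
    ("put_hevc_pel_pixels4_8", 3859, 3139),
    ("put_hevc_qpel_bi_hv64_8", 2416442, 430161),
    ("sao_edge_filter_8", 600074, 91811),
]

def speedup(units_noopt: int, units_opt: int) -> float:
    """Ratio of the non-optimized UNITS value to the optimized one."""
    return units_noopt / units_opt

# Print the selected functions, largest speedup first.
for name, noopt, opt in sorted(rows, key=lambda r: speedup(r[1], r[2]), reverse=True):
    print(f"{name:<26} {speedup(noopt, opt):5.2f}x")
```

The same calculation can be run over every row to see which functions gain the most and which ones (for example the narrow 4-pixel variants) gain the least.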
