On Tue, Nov 24, 2015 at 08:32:05PM -0500, Ganesh Ajjanagadde wrote: > On Tue, Nov 24, 2015 at 8:19 PM, Michael Niedermayer <michae...@gmx.at> wrote: > > On Mon, Nov 23, 2015 at 08:00:12PM -0500, Ganesh Ajjanagadde wrote: > >> This is a trivial rewrite of the loops that results in better cache > >> efficiency. Essentially, iterating backwards over an array is a bad > >> idea, since it leads to many cache misses: forward iteration fetches a > >> new cache line that gets subsequently used, while backwards iteration > >> fetches the > >> cache line thinking that the indexing would move forwards (the common > >> behavior), > >> resulting in many cache misses. > >> > >> Speedup is nontrivial. Benchmarks obtained by 10^6 iterations within > >> solve_lls, with START/STOP_TIMER. File is > >> tests/data/fate/flac-16-lpc-cholesky.err. > >> Hardware: x86-64, Haswell, GNU/Linux. > >> > >> new: > >> 12598 decicycles in solve_lls, 2096871 runs, 281 skips > >> 12393 decicycles in solve_lls, 4193889 runs, 415 skips > >> 12250 decicycles in solve_lls, 8387253 runs, 1355 skips > >> 12585 decicycles in solve_lls,16775089 runs, 2127 skips > >> 12785 decicycles in solve_lls,33550272 runs, 4160 skips > >> 12483 decicycles in solve_lls,67101734 runs, 7130 skips > >> 12610 decicycles in solve_lls,134202614 runs, 15114 skips > >> > >> old: > >> 18101 decicycles in solve_lls, 2096659 runs, 493 skips > >> 17850 decicycles in solve_lls, 4193609 runs, 695 skips > >> 17757 decicycles in solve_lls, 8387458 runs, 1150 skips > >> 17746 decicycles in solve_lls,16775207 runs, 2009 skips > >> 17906 decicycles in solve_lls,33550820 runs, 3612 skips > >> 17931 decicycles in solve_lls,67102891 runs, 5973 skips > >> 17907 decicycles in solve_lls,134206693 runs, 11035 skips > >> > >> ------------------------------------------------------------------------------- > >> Barring asm for this function, there are two (more interesting) potential > >> optimizations for the Cholesky decomposition here: > >> 1. Notice that update_lls is doing a rank one update of the matrix via the > >> var > >> vector. Rank one update of the Cholesky decomposition can be done much more > >> efficiently than a full blown decomposition, see e.g Wikipedia. This will > >> almost > >> certainly require some API thought. > >> > >> 2. LDL' form of the Cholesky factorization offers 2 main advantages: > >> a) Avoiding sqrt and its associated cost, trading it off with extra > >> multiplications. > >> This may or may not be worth the computational cost, though it seems like > >> in > >> FFmpeg, the Cholesky operates on small matrices of dim ~ 3-50, resulting in > >> not too many extra multiplies. > >> b) It provides benefits especially in the poorly conditioned > >> case since one does not have to worry about nan's and associated tainting > >> due > >> to the sqrt. > >> This may or may not require API thought. > >> > >> I personally view 1 as being more unambiguously worthy of exploration at > >> this stage. > >> > >> Signed-off-by: Ganesh Ajjanagadde <gajjanaga...@gmail.com> > >> --- > >> libavutil/lls.c | 6 +++--- > >> 1 file changed, 3 insertions(+), 3 deletions(-) > > > > this significantly changes the output for > > libavutil/lls-test > > > > is that intended ? > > Wait, I thought there was only one line changed in a diff between the > old and new output, and it was trivial involving a nan. Maybe I posted > the wrong patch, in which case sorry.
--- old 2015-11-25 02:15:20.396932045 +0100 +++ new 2015-11-25 02:15:26.620932176 +0100 @@ -2,299 +2,299 @@ real:-0.828987 order:1 pred:-0.828987 var:0.000000 coeffs:2.166433 0.000000 0.000000 real:-0.828987 order:2 pred:-0.828987 var:0.000000 coeffs:2.166433 0.000000 0.000000 real:-0.326534 order:0 pred:-0.588792 var:0.272632 coeffs:1.427836 -0.385559 0.000000 -real:-0.326534 order:1 pred:-0.326534 var:-nan coeffs:3.274787 -1.436430 0.000000 -real:-0.326534 order:2 pred:-0.326534 var:-nan coeffs:3.274787 -1.436430 0.000000 +real:-0.326534 order:1 pred: 0.435087 var:0.734692 coeffs:1.427836 -1.436430 0.000000 +real:-0.326534 order:2 pred: 0.435087 var:0.734692 coeffs:1.427836 -1.436430 0.000000 real:-0.496309 order:0 pred:-0.560478 var:0.227332 coeffs:1.343229 -0.128675 -0.372131 -real:-0.496309 order:1 pred:-0.649268 var:0.214850 coeffs:1.541607 -0.245332 0.000000 -real:-0.496309 order:2 pred:-0.496309 var:0.000000 coeffs:1.629421 0.626407 -0.697480 +real:-0.496309 order:1 pred:-1.051144 var:0.444802 coeffs:2.504735 -0.245332 0.000000 +real:-0.496309 order:2 pred:-0.882916 var:0.272824 coeffs:2.504735 -0.245332 -0.697480 real: 0.386201 order:0 pred: 0.644642 var:0.263532 coeffs:1.005471 -0.232581 -0.189366 -real: 0.386201 order:1 pred: 0.590550 var:0.236486 coeffs:1.416443 -0.420822 0.000000 -real: 0.386201 order:2 pred: 0.653364 var:0.216704 coeffs:1.417320 -0.084129 -0.312548 +real: 0.386201 order:1 pred: 0.480672 var:0.250092 coeffs:1.245061 -0.420822 0.000000 +real: 0.386201 order:2 pred: 1.327563 var:0.521050 coeffs:1.980792 0.330539 -0.312548 real:-0.193176 order:0 pred:-0.419259 var:0.261372 coeffs:0.886941 -0.255165 -0.256307 -real:-0.193176 order:1 pred:-0.397912 var:0.235146 coeffs:1.347037 -0.459663 0.000000 -real:-0.193176 order:2 pred:-0.320068 var:0.205317 coeffs:1.374542 -0.016455 -0.397716 +real:-0.193176 order:1 pred:-0.379534 var:0.235868 coeffs:1.308159 -0.459663 0.000000 +real:-0.193176 order:2 pred:-0.058885 var:0.330313 coeffs:0.926332 -0.111365 -0.397716 [...and lots more...] [...] -- Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB No great genius has ever existed without some touch of madness. -- Aristotle
signature.asc
Description: Digital signature
_______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel