On Wed, 20 Jan 2016 13:26:05 +0100, you wrote:
>Hi,
>
>2016-01-19 13:46 GMT+01:00 John Cox <j...@kynesim.co.uk>:
>> I've just done a fair bit of work on hevc_cabac decode for the Raspberry
>> Pi2 and I think that the patch is generally applicable. Patch is
>> attached but you may prefer to take it from git:
>
>This work is certainly impressive, and most people would have come
>only with some of the "tricks" you used.
>Although it already represents quite a bit of work, I echo others'
>suggestions to have more incremental changes.
>
>> I have not yet run fate over it as I haven't yet finished downloading
>> the samples (the internet connection here isn't wildly fast), but I have
>> run it against the H265.1 conformance streams on both x86 and ARM and it
>> causes no regressions.
>
>Your patch fails on the later fate tests linked to range extensions
>(RExt sequences) on Win64. I didn't investigate why. Random thoughts:
>transform_skip, cross-channel residual, some bypass-coded elements (eg
>SAO).

Yeah - that does fail (and I'm not sure why either at the moment) - I
only tested against the published H.265.1 conformance suite and that
doesn't contain the RExt tests. Do you believe that master ffmpeg
produces the right answer for these tests? I didn't spot any RExt logic
in the scale code when I rewrote it (RExt does affect how numbers are
processed there) and ffmpeg warns at run time that it isn't supported.
Having said that, I would still have expected my code to produce the
same result as the old code, so I'll look into it.

>> 3) Uses clz which doesn't seem to exist in the ffmpeg int libs (though
>> ctz does)
>
>That could be a patch in and by itself.

Apparently ff_clz is now on master - but it wasn't in 2.8 (which is what
the RPi needs).

>So, referring to your changes, it would be nice to have the following
>changes split in their own patches:
>1) significant coeff flag decoding, which probably is the largest gain
>(and therefore would be even nicer if further sliced):
> a) for instance, you avoid an indirection by flattening/merging
>context tables;
> b) other parts, which I fear may not translate that well for other
>platforms (at least without matching x86 code), or sequences
>2) you use native sized integers in some places (not sure if that can
>cause issues);
>3) bypass-coded stuff is a fairly large change (both in terms of code,
>review and impacting the cabac struct also used by h264); it would be
>nice knowing how much you gain here
>4) the replacing of !!something by something when the flag is already 0/1
>5) coefficient saturation

I don't have formal numbers for everything, but from the profiling I did
in development:

The by22 code gained me an overall factor of two in the abs level
decode - the gains do depend a lot on the quantity of residual - you
gain a lot more on I-frames than you do otherwise as they tend to have
much longer residuals. The higher the bitrate the more useful this code is.
But as you note, it didn't use vast amounts of time relative to
everything else anyway.

The reworking / simplification of the loop(s) around the abs level
decode and the scaling gave me the biggest single improvement.

After that, the reworking of get_sig_ceoff_flag_idxs was a useful gain.

Special-casing the single coeff path gave a similar gain.

After that, the scale rework - now probably 75% faster than it was
previously, but it wasn't taking a huge amount of time.

And after that, all the other bits - my experience with optimising this
sort of code (I did a lot of work on a TI H.264 implementation in the
past) is that no single change is going to do everything; you just have
to polish everything until it goes fast enough.

>3) is indeed the largest chunk. I don't know what your profiling
>indicated, but the original code didn't seem that high-profile. But I
>haven't split it to see what it actually provided, but overall numbers
>look good:
>
>I quickly hacked (quickly being the keyword as it also means poor and
>potentially resulting in faulty conclusion) something that is close to
>2) + 4) for reference.
>Benching REF+1)a) vs REF+1), it did seem slower on Win64/Haswell for
>significant flag decoding by a few cycles (around 1% of the codeblock)
>Benching REF+1)a) vs your patch, I see around 3% improvement with
>something that is fairly more optimized overall than ffmpeg's master,
>ie ff_hevc_hls_residual_coding is a lot more prevalent, which is
>probably also the case in your rpi2 benchmarks.

Sorry - I don't quite understand what you've said here.

>Note: I don't think I'll review next iterations of the patch(set) with
>any shape of diligence, but some of the above parts (1.a, 4 and 5) are
>ok if not the cause of the fate issues.
>
>Best regards,

Thanks

JC
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
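
On the clz point in 3) above: for a tree that lacks ff_clz (the thread
says 2.8 does), a count-leading-zeros helper could look roughly like the
sketch below. This is only an illustration, not the code from the patch
and not FFmpeg's ff_clz. The name clz32_sketch is hypothetical, and the
fast path assumes a GCC- or Clang-style compiler that provides
__builtin_clz; other compilers fall back to a plain loop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not FFmpeg's ff_clz): number of leading zero
 * bits in a non-zero 32-bit value. */
static inline int clz32_sketch(uint32_t x)
{
    /* As with the compiler builtin, the result for 0 is undefined,
     * so the caller must rule that case out first. */
    assert(x != 0);
#if defined(__GNUC__) || defined(__clang__)
    /* Assumption: GCC/Clang builtin, normally a single instruction. */
    return __builtin_clz(x);
#else
    /* Portable fallback: shift left until the top bit is set. */
    int n = 0;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
#endif
}
```

Keeping the zero check in the caller matches how the builtin behaves,
so the helper adds no extra branch in release builds (NDEBUG removes
the assert).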