> -----Original Message-----
> From: ffmpeg-devel [mailto:ffmpeg-devel-boun...@ffmpeg.org] On Behalf Of Niklas Haas
> Sent: Tuesday, May 22, 2018 10:28 AM
> To: ffmpeg-devel@ffmpeg.org
> Cc: Mark Thompson <s...@jkqxz.net>
> Subject: Re: [FFmpeg-devel] [PATCH] lavfi: add opencl tonemap filter.
>
> On Tue, 22 May 2018 01:18:30 +0100, Mark Thompson <s...@jkqxz.net> wrote:
> > On 21/05/18 07:50, Ruiling Song wrote:
> > > This filter does HDR (HDR10/HLG) to SDR conversion with tone-mapping.
> > >
> > > An example command to use this filter with vaapi codecs:
> > > ffmpeg -init_hw_device vaapi=va:/dev/dri/renderD128 -init_hw_device \
> > > opencl=ocl@va -hwaccel vaapi -hwaccel_device va -hwaccel_output_format \
> > > vaapi -i INPUT -filter_hw_device ocl -filter_complex \
> > > '[0:v]hwmap,tonemap_opencl=t=bt2020:tonemap=linear:format=p010[x1]; \
> > > [x1]hwmap=derive_device=vaapi:reverse=1' -c:v hevc_vaapi -profile 2 OUTPUT
> > >
> > > Signed-off-by: Ruiling Song <ruiling.s...@intel.com>
> > > ---
> >
> > I assume you're testing with Beignet for this sort of mapping to work? I tried it with Beignet on Coffee Lake with 10-bit videos and it looks sensible, though it is rather hard to tell whether it is in some sense "correct".
>
> It's also rather hard to define whether it is in some sense "correct". The methodology employed here is generally based on ITU-R recommendations, but the ITU-R advises multiple possible ways of doing tone-mapping. They also highlight their own curve function, which we don't use (for performance/simplicity reasons - IIRC I gave it a try and the result was not visually distinct enough from the hable function to be worth it, but my memory could be wrong). There is nothing resembling an official "standard" way to tone-map defined by the ITU-R.
>
> This algorithm is also generally based on the results obtained from the "official" ACES implementation of HDR->SDR tone mapping (obtainable here: https://github.com/ampas/aces-dev), with the key difference that we do chroma-invariant tone mapping, whereas Hollywood tends to use channel-independent tone mapping. I think the latter distorts the colors too much for my taste and generally results in a worse-looking image. The only important requirement for making chroma-invariant tone mapping work well, however, is a good desaturation algorithm. This one is based on original research and experimentation. The desaturation strength with a parameter of 1.0 is comparable to that achieved by the ACES algorithm, although I pick a lower strength by default (0.5), because I found it too damaging for some types of sources (particularly bright skies) as a result of the chroma-invariant nature.
>
> In addition to the desaturation step, the other key innovation, which I cannot find mentioned in ITU-R literature, is the importance of adjusting the overall average brightness before tone mapping. I suspect the reason this isn't considered by the ITU-R is that the ITU-R assumes HDR sources actually follow its standards, which in practice none seem to do. In theory, HDR material isn't supposed to have a significantly higher average brightness than SDR material; apart from the addition of isolated super-highlights, nothing should change about the image's appearance. In practice, HDR directors like to point their expensive cameras at very bright objects (e.g. the sun) and blind viewers' eyes by failing to adjust the brightness during the mastering step. Our algorithm compensates for this by essentially "correcting" the bad mastering in real time. [1] Of course, the result is not as good as doing it ahead of time by a human, but unfortunately we don't have a say in this matter.
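> To illustrate, the correction itself is cheap; the interesting part is measuring the average. A rough sketch (not the patch's exact code; `frame_avg` stands for whatever the measured scene average ends up being, in the same normalized units as the signal):
>
>     // Darken the signal so the measured scene average lands near the
>     // SDR reference average (sdr_avg); never brighten a dark scene.
>     float slope = fmin(1.0f, sdr_avg / frame_avg);
>     signal *= slope;
>     peak   *= slope;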
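> Similarly, the desaturation step mentioned above boils down to blending towards grey based on how far the tone-mapped signal overshoots a knee. A sketch from memory (the exact knee and exponent may differ from what the patch ends up using):
>
>     float luma  = get_luma_dst(rgb);
>     float base  = 0.18f * dst_peak;  // knee below which colors are untouched
>     // overshoot past the knee, as a fraction of the signal itself
>     float coeff = fmax(sig - base, 1e-6f) / fmax(sig, 1e-6f);
>     coeff = pow(coeff, 10.0f / desat_param);  // higher desat_param = stronger effect
>     rgb = mix(rgb, (float3)luma, coeff);      // blend towards grey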
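> And for reference, since the ITU-R doesn't define it: the hable curve in question is the usual filmic curve, with the tone-mapped signal being hable(sig) / hable(peak). Sketched here as it appears in mpv:
>
>     float hable(float in) {
>         float a = 0.15f, b = 0.50f, c = 0.10f, d = 0.20f, e = 0.02f, f = 0.30f;
>         return (in * (in * a + b * c) + d * e) / (in * (in * a + b) + d * f) - e / f;
>     }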
> As long as the implementation is correct, I'd be confident in assuming that this produces pleasurable results for all the sources I've thrown at it, often even exceeding in quality the "official" SDR-mapped Blu-ray versions of the same sources on the same scenes (partially due to the preserved higher color gamut).
>
> In order to ascertain whether the implementation is correct, you could compare it to results obtained by the latest `mpv` (might need git master) or `libplacebo`, both of which implement the same algorithm.
>
> [1] The algorithm I use in mpv and libplacebo does this with one frame of latency, because I don't want to round-trip through an intermediate buffer in my processing chain, and there's no other way to communicate the measured frame statistics back to the rest of the kernels in OpenGL/Vulkan land. I do this because of my realtime requirements as well as the structure of my processing chain.
>
> Since you are implementing an offline filter and neither of these restrictions applies to you, I would recommend changing the implementation to separate the peak measurement step from the tone mapping step, so that the former completes first and the latter then runs from scratch and can use the results computed by the former on the same frame.

Yes, your idea sounds reasonable, but it would take considerable effort to restructure the code (that would mean launching two kernels, and we may need a wait between them) and to evaluate the performance. Although we are developing an offline filter, I think performance is still very important, as is quality. Given that the current implementation works well for video transcoding, I would like to leave this on my TODO list. Sounds OK?
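To make sure I understand the suggestion, the restructured version would look roughly like this on the host side (a sketch only; the kernel names are made up):

    cl_int err;
    /* pass 1: only measure the frame peak/average into util_buf;
     * global_work/local_work are the NDRange sizes the filter already uses */
    err = clEnqueueNDRangeKernel(queue, kernel_measure, 2, NULL,
                                 global_work, local_work, 0, NULL, NULL);
    /* an in-order command queue already serializes the two kernels, so
     * the second pass sees this frame's statistics; an explicit event
     * or clFinish() would only be needed on an out-of-order queue */
    err |= clEnqueueNDRangeKernel(queue, kernel_tonemap, 2, NULL,
                                  global_work, local_work, 0, NULL, NULL);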
> If you don't do this, you run the risk of failing to tone-map single-frame data (e.g. screenshots), because no data about the previous frame is available at the time.
>
> > > +// detect peak/average signal of a frame, the algorithm was ported from:
> > > +// libplacebo (https://github.com/haasn/libplacebo)
> > > +struct detection_result
> > > +detect_peak_avg(global uint *util_buf, __local uint *sum_wg,
> > > +                float signal, float peak) {
> > > +    global uint *avg_buf = util_buf;
> > > +    global uint *peak_buf = avg_buf + DETECTION_FRAMES + 1;
> > > +    global uint *counter_wg_p = peak_buf + DETECTION_FRAMES + 1;
> > > +    global uint *max_total_p = counter_wg_p + 1;
> > > +    global uint *avg_total_p = max_total_p + 1;
> > > +    global uint *frame_idx_p = avg_total_p + 1;
> > > +    global uint *scene_frame_num_p = frame_idx_p + 1;
> > > +
> > > +    uint frame_idx = *frame_idx_p;
> > > +    uint scene_frame_num = *scene_frame_num_p;
> > > +
> > > +    size_t lidx = get_local_id(0);
> > > +    size_t lidy = get_local_id(1);
> > > +    size_t lsizex = get_local_size(0);
> > > +    size_t lsizey = get_local_size(1);
> > > +    uint num_wg = get_num_groups(0) * get_num_groups(1);
> > > +    size_t group_idx = get_group_id(0);
> > > +    size_t group_idy = get_group_id(1);
> > > +    struct detection_result r = {peak, sdr_avg};
> > > +    *sum_wg = 0;
> >
> > This is technically a data race - maybe set it in only the first work-item?
>
> I'm not sure where the data race is here. There's a barrier immediately below it, which ensures that all of the *sum_wg writes must complete before progressing further, no? So even though all of the threads conflict in their write to *sum_wg, they all write the same thing and wait for each other before continuing.
>
> > > +    barrier(CLK_LOCAL_MEM_FENCE);
> > > +
> > > +    // update workgroup sum
> > > +    atomic_add(sum_wg, (uint)(signal * REFERENCE_WHITE));
> >
> > I think the numbers you're adding together here sum to at most something like 16 * 16 * 100 * 1023? Can you make sure this can't overflow and add a comment on that.
>
> It's not * 1023; the highest possible peak in practice is 100.0 (PQ's peak brightness). So the limit per workgroup is 16 * 16 * 10000, requiring 22 bits to not overflow on a pathological input.
>
> > > +    barrier(CLK_LOCAL_MEM_FENCE);
> > > +
> > > +    // update frame peak/avg using work-group-average.
> > > +    if (lidx == 0 && lidy == 0) {
> > > +        uint avg_wg = *sum_wg / (lsizex * lsizey);
> > > +        atomic_max(&peak_buf[frame_idx], avg_wg);
> > > +        atomic_add(&avg_buf[frame_idx], avg_wg);
> >
> > Similarly this one? (width/16 * height/16 * 100 * 1023, I think, which might overflow for 8K?)
>
> For 8K it's 8192/16 * 4320/16 * 10000, requiring 31 bits to store without theoretical risk of overflow.
>
> And actually, there is a third source of overflow worth investigating, namely the *avg_total_p variable, since this accumulates across frames. It stores a value of up to 10000 * (DETECTION_FRAMES + 1). In practice, however, this shouldn't cause any issues for typical buffer sizes (needing 20 bits for a buffer size of 100).
>
> Note: In practice, none of these considerations are really worth worrying about, since the average illumination of a scene is generally at most around 50, so it's more like 23 bits needed to store a typical scene rather than the 31-bit worst case I calculated above. The only scenario in which I could imagine such a worst case occurring in normal content is if some mastering engineer mistakenly implemented a "fade to white" by fading to the highest possible HDR peak, and this somehow survived review by other humans, who presumably have functioning retinas that would be screaming in pain as their displays blasted 10000 cd/m² during the fade.
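> To spell the bounds out (assuming 16x16 work-groups and PQ's effective peak of 100.0, i.e. a per-pixel maximum of 10000 after the REFERENCE_WHITE scaling):
>
>     per work-group:  16 * 16 * 10000                =     2,560,000  < 2^22
>     per 8K frame:    (8192/16) * (4320/16) * 10000  = 1,382,400,000  < 2^31
>     avg accumulator: 10000 * (DETECTION_FRAMES + 1) =     1,010,000  < 2^20  (buffer size 100)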
> > > +    // de-saturate
> > > +    if (desat_param > 0.0f) {
> > > +        float luma = get_luma_dst(rgb);
> > > +        float base = 0.18f * dst_peak;
> >
> > Magic number might want some explanation.
>
> It is derived from experimentation and visual comparisons with e.g. the ACES algorithm. There is no theoretical basis for it.
>
> > +float3 ootf_hlg(float3 c) {
> > +    float luma = get_luma_src(c);
> > +    // assume a reference display with 1000 nits peak
> > +    float factor = 1000.0f / REFERENCE_WHITE * pow(luma, 0.2f) / pow(12.0f, 1.2f);
> > +    return c * factor;
> > +}
> > +
> > +float3 inverse_ootf_hlg(float3 c) {
> > +    // assume a reference display with 1000 nits peak
> > +    c *= pow(12.0f, 1.2f) / (1000.0f / REFERENCE_WHITE);
> > +    c /= pow(get_luma_dst(c), 0.2f / 1.2f);
> > +    return c;
> > +}
>
> I would recommend parametrizing these by the peak variable. When you tone-map from HLG to HLG at a lower peak, the inverse OOTF call needs to use the new peak. (You should also update the peak tagging in the frame's side-channel data; I am not sure if you do.)

Are you talking about display-referred HLG? I didn't update the frame side-channel data; I am not sure when I need to update it. I thought all HLG should be scene-referred - seems not? Could you tell me more about display-referred HLG? I cannot find anything about it. What metadata in HEVC indicates display-referred? Do you know of any display-referred HLG video sample?

Thanks for your comment.
Ruiling

> Ditto for the forward OOTF: the `peak` needs to match the value you assume for the source signal peak down below. You have it hard-coded as 12.0 for HLG, which is the correct peak in scene-referred space, but that doesn't necessarily match the display-referred case, which is what's relevant for tone mapping. If you tune the OOTF for a 1000 nits peak display, the source peak after applying the OOTF would be 10.0, not 12.0. Alternatively, you could tune the OOTF for 1200 nits instead. (This is what libplacebo does, although I think not intentionally. I'll change it to use 1000 nits as well.)
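> Concretely, the parametrization I have in mind would be along these lines (a sketch of the quoted functions only; `peak` here is the display peak in units of the reference white, so 10.0 for a 1000 nits display):
>
>     float3 ootf_hlg(float3 c, float peak) {
>         float luma = get_luma_src(c);
>         float factor = peak * pow(luma, 0.2f) / pow(12.0f, 1.2f);
>         return c * factor;
>     }
>
>     float3 inverse_ootf_hlg(float3 c, float peak) {
>         c *= pow(12.0f, 1.2f) / peak;
>         c /= pow(get_luma_dst(c), 0.2f / 1.2f);
>         return c;
>     }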