On 02/09/2016 04:03 PM, Daniel Vetter wrote: > On Tue, Feb 09, 2016 at 04:11:49PM +0200, Ville Syrjälä wrote: >> On Tue, Feb 09, 2016 at 02:53:30PM +0100, Mario Kleiner wrote: >>> On 02/09/2016 11:09 AM, Daniel Vetter wrote: >>>> On Mon, Feb 08, 2016 at 02:13:28AM +0100, Mario Kleiner wrote: >>>>> The changes to drm_update_vblank_count() in Linux 4.4 added a >>>>> method to emulate a hardware vblank counter by use of high >>>>> precision vblank timestamps if a kms driver supports those, >>>>> but doesn't suppport hw vblank counters. >>>>> >>>>> That method assumes that the old timestamp from a previous >>>>> invocation is valid, but that is not always the case. E.g., >>>>> if drm_reset_vblank_timestamp() gets called during drm_vblank_on() >>>>> or drm_update_vblank_count() gets called outside vblank irq and >>>>> the high precision timestamping can't deliver a precise timestamp, >>>>> ie. drm_get_last_vbltimestamp() delivers a return value of false, >>>>> then those functions will initialize the old timestamp to zero >>>>> to mark it as invalid. >>>>> >>>>> A following call to drm_update_vblank_count() would then calculate >>>>> elapsed time with vblank irqs off as current vblank timestamp minus >>>>> the zero old timestamp and compute a software vblank counter increment >>>>> that corresponds to system uptime, causing a large forward jump of the >>>>> software vblank counter. That jump in turn can cause too long waits >>>>> in drmWaitVblank and very long delays in delivery of vblank events, >>>>> resulting in hangs of userspace clients. >>>>> >>>>> This problem can be observed on nouveau-kms during machine >>>>> suspend->resume cycles, where drm_vblank_off is called during >>>>> suspend, drm_vblank_on is called during resume and the first >>>>> queries to drm_get_last_vbltimestamp() don't deliver high >>>>> precision timestamps, resulting in a large harmful counter jump. >>>> >>>> Why does nouveau enable vblank interrupts before it can get valid >>>> timestamps? That sounds like the core bug here, and papering over that in >>>> the vblank code feels very wrong to me. Maybe we should instead just not >>>> sample the vblank at all if the timestamp fails and splat a big WARN_ON >>>> about this, to give driver writers a chance to fix up their mess? >>>> -Daniel >>>> >>> >>> The high precision timestamping is allowed to fail for a kms driver >>> under some conditions which are not implementation errors of the driver, >>> or getting disabled by user override, so we should handle that robustly. >>> We handle it robustly everywhere else in the drm, so we should do it >>> here as well. >>> >>> E.g., nouveau generally supports timestamping/scanout position queries, >>> but can't support it on old pre NV-50 hardware with some output type >>> (either on analog VGA, or DVI-D, can't remember atm. which one is >>> unsupported). >> >> I think the surprising thing here is that it fails first and then >> succeeds on the same crtc/output combo presumably. > > Yeah exactly this. Failing consistently is ok imo (and probably should be > handled). Failing first and then later on working (without changing the > mode config in between) seems suspicous. That shouldn't ever happen really > and seems like a driver bug (like enabling vblanks too early, before the > pipe is fully up&running). > -Daniel > >> >> I guess in theory the driver could fail during random times for whatever >> reason, though I tend to think that the driver should either make it >> robust or not even pretend to support it. >> >> I suppose one failure mode the driver can't really do anything about is >> some random SMI crap or something stalling the timestamp queries enough >> that we fail the precisions tests and exhaust the retry limits. So yeah, >> making it robust against that kind of nastyness sounds line it might be >> a good idea. Though perhaps it should be something a bit more severe >> than a DRM_DEBUG since I think it really shouldn't happen when the >> driver and system are otherwise sane. Of course if it routinely fails >> with some driver then simply making it spew errors into dmesg isn't >> so nice, unless the driver also gets fixed. >>
I think i have an idea what might go wrong with nouveau, so i'll see if i can add a fixup patch. There's another scenario where this zero-ts case can be hit. If the driver drm_vblank_init()'s - setting all timestamps to zero - and then code starts using vblanks (drm_vblank_get()) without drm_vblank_on beforehand, which is afaics the case with nouveau. Unless that's considered an error as well, we'd need to init the timestamps to something non-zero but harmless like 1 usecs at drm_vblank_init() time? What makes sense as output here? DRM_WARN_ONCE? -mario >>> >>> There are also new Soc drivers showing up which support those methods, >>> but at least i didn't verify or test if their implementations are good >>> enough for the needs of the new drm_udpate_vblank_count() which is more >>> sensitive to kms drivers being even slightly off. >>> >>> -mario >>> >>>>> >>>>> Fix this by checking if the old timestamp used for this calculations >>>>> is zero == invalid. If so, perform a counter increment of +1 to >>>>> prevent large counter jumps and reinitialize the timestamps to >>>>> sane values. >>>>> >>>>> Signed-off-by: Mario Kleiner <mario.kleiner.de at gmail.com> >>>>> Cc: <stable at vger.kernel.org> # 4.4+ >>>>> Cc: michel at daenzer.net >>>>> Cc: vbabka at suse.cz >>>>> Cc: ville.syrjala at linux.intel.com >>>>> Cc: daniel.vetter at ffwll.ch >>>>> Cc: dri-devel at lists.freedesktop.org >>>>> Cc: alexander.deucher at amd.com >>>>> Cc: christian.koenig at amd.com >>>>> --- >>>>> drivers/gpu/drm/drm_irq.c | 7 +++++++ >>>>> 1 file changed, 7 insertions(+) >>>>> >>>>> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c >>>>> index fb17c45..88bdf19 100644 >>>>> --- a/drivers/gpu/drm/drm_irq.c >>>>> +++ b/drivers/gpu/drm/drm_irq.c >>>>> @@ -216,6 +216,13 @@ static void drm_update_vblank_count(struct >>>>> drm_device *dev, unsigned int pipe, >>>>> DRM_DEBUG_VBL("crtc %u: Redundant vblirq >>>>> ignored." >>>>> " diff_ns = %lld, framedur_ns = >>>>> %d)\n", >>>>> pipe, (long long) diff_ns, >>>>> framedur_ns); >>>>> + >>>>> + /* No valid t_old to calculate diff? Bump +1 to force reinit. */ >>>>> + if (t_old->tv_sec == 0 && t_old->tv_usec == 0) { >>>>> + DRM_DEBUG_VBL("crtc %u: No baseline ts. Bump +1.\n", >>>>> + pipe); >>>>> + diff = 1; >>>>> + } >>>>> } else { >>>>> /* some kind of default for drivers w/o accurate vbl >>>>> timestamping */ >>>>> diff = (flags & DRM_CALLED_FROM_VBLIRQ) != 0; >>>>> -- >>>>> 1.9.1 >>>>> >>>> >> >> -- >> Ville Syrjälä >> Intel OTC >