On Mon, Feb 05, 2018 at 11:54:04PM +0000, Andy Lutomirski wrote: > > > > On Feb 5, 2018, at 2:50 PM, Rodrigo Vivi <rodrigo.v...@intel.com> wrote: > > > >> On Sat, Feb 03, 2018 at 05:33:08PM +0000, Andy Lutomirski wrote: > >>> On Fri, Feb 2, 2018 at 7:18 PM, Andy Lutomirski <l...@kernel.org> wrote: > >>>> On Fri, Feb 2, 2018 at 1:24 AM, Andy Lutomirski <l...@kernel.org> wrote: > >>>>> On Thu, Feb 1, 2018 at 9:20 PM, Chris Wilson <ch...@chris-wilson.co.uk> > >>>>> wrote: > >>>>> Quoting Andy Lutomirski (2018-02-01 21:04:30) > >>>>>> I got this after a recent suspend/resume: > >>>>>> > >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Lid closed. > >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: scan > >>>>>> all dirs > >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: > >>>>>> scanning /sys/bus > >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: > >>>>>> scanning /sys/class > >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Failed to open > >>>>>> configuration file '/etc/systemd/sleep.conf': No such file or > >>>>>> directory > >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Suspending... > >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal > >>>>>> sender=n/a destination=n/a object=/org/freedesktop/login1 > >>>>>> interface=org.freedesktop.login1.Manager member=PrepareForSleep > >>>>>> cookie=570 reply > >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Got message > >>>>>> type=method_call sender=:1.46 destination=:1.1 > >>>>>> object=/org/freedesktop/login1/session/_32 > >>>>>> interface=org.freedesktop.login1.Session member=ReleaseDevice > >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal > >>>>>> sender=n/a destination=:1.46 > >>>>>> object=/org/freedesktop/login1/session/_32 > >>>>>> interface=org.freedesktop.login1.Session member=PauseDevice cookie > >>>>>> Feb 01 09:44:34 laptop gnome-shell[2630]: Failed to apply DRM plane > >>>>>> transform 0: Permission denied > >>>>>> Feb 01 09:44:34 laptop gnome-shell[2630]: drmModeSetCursor2 failed > >>>>>> with (Permission denied), drawing cursor with OpenGL from now on > >>>>>> > >>>>>> But I don't see the word "cursor" in my system logs before the first > >>>>>> suspend. What am I looking for? This is Fedora 27 running a Gnome > >>>>>> Wayland session, but it hasn't been reinstalled in some time, so it's > >>>>>> possible that there are some weird settings sitting around. But I did > >>>>>> check and I have no weird i915 parameters. > >>>>> > >>>>> You are using gnome-shell as the display server. From that it appears to > >>>>> have started off with a HW cursor and switched to a SW cursor after > >>>>> suspend. Did you notice a change in behaviour? After rebooting or just > >>>>> restarting gnome-shell? > >>>> > >>>> I think it's less consistently bad after a reboot before suspending. > >>>> > >>>>> > >>>>>> Also, are these things potentially related: > >>>>>> > >>>>>> [ 3067.702527] [drm:intel_pipe_update_start [i915]] *ERROR* Potential > >>>>>> atomic update failure on pipe A > >>>>> > >>>>> They are just "missed the immediate vblank for the screen update" > >>>>> messages. Should not be related to PSR, but may cause jitter by delaying > >>>>> the odd screen update. > >>>> > >>>> I just got this one, and the timestamp is at least reasonably close to > >>>> a giant latency spike: > >>>> > >>>> [ 288.799654] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic > >>>> update failure on pipe A (start=31 end=32) time 15 us, min 1073, max > >>>> 1079, scanline start 1087, end 1088 > >>>> > >>>>> > >>>>>> As I'm typing this, I've seen a couple instances of what seems like a > >>>>>> full *second* of cursor latency, but I've only gotten the potential > >>>>>> atomic update failure once. > >>>>>> > >>>>>> And is there any straightforward tracing to do to distinguish between > >>>>>> PSR exit latency and other potential sources of latency? > >>>>> > >>>>> It looks plausible that we could at least report how long it takes the > >>>>> registers to reflect the change in state (but we don't). The best source > >>>>> of information atm is /sys/kernel/debug/dri/0/i915_edp_psr_status. > >>>> > >>>> Hmm. > >>>> > >>>> I went and looked at the code, and I noticed what could be bugs or > >>>> could (more likely) be my confusion since I don't know this code at > >>>> all: > >>>> > >>>> intel_single_frame_update() does something inscrutable to me, but I > >>>> imagine it does something that causes the next page flip to get > >>>> noticed by the panel even with PSR on. But how does the code that > >>>> calls it know that anything happened? (Looking at the commit history, > >>>> maybe this is something special that's only needed on some platforms > >>>> but doesn't replace the normal PSR exit sequence.) > >>>> > >>>> Perhaps more interestingly, intel_psr_flush() does this: > >>>> > >>>> /* By definition flush = invalidate + flush */ > >>>> if (frontbuffer_bits) > >>>> intel_psr_exit(dev_priv); > >>>> > >>>> if (!dev_priv->psr.active && !dev_priv->psr.busy_frontbuffer_bits) > >>>> if (!work_busy(&dev_priv->psr.work.work)) > >>>> schedule_delayed_work(&dev_priv->psr.work, > >>>> msecs_to_jiffies(100)); > >>>> > >>>> I'm guessing that the idea is that we're turning off PSR because we > >>>> want the panel to update and we expect that, in 100ms, the update will > >>>> have hit the panel and we'll have been idle long enough for it to make > >>>> sense to re-enter PSR. IOW, the code wants PSR to be off for at least > >>>> 100ms and then to turn back on. But this code actually says "turn PSR > >>>> back on in at *most* 100ms". What happens if there are two screen > >>>> updates 99ms apart? The first one should work fine, but the next one > >>>> will hit with 1ms left on the delayed work, and intel_psr_work() will > >>>> get called in 1ms. There's some magic with busy_frontbuffer_bits, but > >>>> it seems questionable to me that intel_psr_flush() clears > >>>> busy_frontbuffer_bits and *then* calls intel_psr_exit(). > >>>> > >>>> Naively, I would expect that PSR needs to be kept off until the vblank > >>>> following the page flip. > >>>> > >>>> Also, in intel_psr_work(), shouldn't this code: > >>>> > >>>> /* > >>>> * The delayed work can race with an invalidate hence we need to > >>>> * recheck. Since psr_flush first clears this and then reschedules we > >>>> * won't ever miss a flush when bailing out here. > >>>> */ > >>>> if (dev_priv->psr.busy_frontbuffer_bits) > >>>> goto unlock; > >>>> > >>>> re-arm the delayed work? > >>>> > >>>> Anyway, this is all on a 4.14 kernel. I should update to 4.16 and see > >>>> what happens. > >>> > >>> I updated to 4.15, and the situation is much worse. With > >>> enable_psr=1, the system survives for several seconds and then the > >>> screen stops updating entirely. If I boot with i915.enable_psr=1, I > >>> get to the Fedora login screen and then the system dies. If I set > >>> enable_psr=1 using sysfs, it does a bit after the next resume. It > >>> seems like it also sometimes hangs even worse a bit after the screen > >>> stops updating, but it's hard to tell. > >>> > >>> I see this in my logs: > >>> > >>> [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* > >>> [CRTC:37:pipe A] flip_done timed out > >>> > >>> Sometimes I see this a bit later: > >>> > >>> [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* > >>> [CRTC:37:pipe A] flip_done timed out > >>> > >>> I'm able to get some debugging out before the system dies. I see > >>> intel_psr_flush() getting called a bunch, and I don't see > >>> intel_psr_invalidate() being called at all. I also see > >>> intel_psr_work() activating psr as little as 2ms after > >>> intel_psr_flush() finishes. So I think the code is indeed buggy or at > >>> least questionable. I'd try to fix it (at least as well as I can > >>> without knowing anything about how the PSR state machine actually > >>> works), but the fact that the system hangs would make it very hard to > >>> test. > >> > >> I filed https://bugs.freedesktop.org/show_bug.cgi?id=104931 for the > >> questionable reactivation logic issue. If nothing else, I doubt that > >> the reactivation timeout logic does what its author intended for it to > >> do. > > > > Thanks for that. I will check the logic and your patch here. > > > > One thing to consider is that when activating PSR HW engine will > > consider the number of static frames before transitioning it to > > active state. So, in general the immediate return to active status > > shouldn't be a problem. > > At least not for core platforms. maybe for vlv/chv. > > > > Hmm. Is there some register that can be poked to tell the hardware "hey, I > updated the image even if your dirty tracking didn't notice, so turn off PSR > if it's on and reset your idle frame counter"? If so, the logic could be > dramatically simplified.
that would be my dream ;) But the only thing we can do is to disable psr and re-enable it. So the frame counter will restart. _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel