On 20/04/16 17:23, Robert Bragg wrote:
> Gen graphics hardware can be set up to periodically write snapshots of
> performance counters into a circular buffer via its Observation
> Architecture and this patch exposes that capability to userspace via the
> i915 perf interface.
>
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Signed-off-by: Robert Bragg <robert at sixbynine.org>
> Signed-off-by: Zhenyu Wang <zhenyuw at linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         |  56 +-
>  drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
>  drivers/gpu/drm/i915/i915_perf.c        | 940 +++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
>  include/uapi/drm/i915_drm.h             |  70 ++-
>  5 files changed, 1408 insertions(+), 20 deletions(-)
>
> +
> +
> +	/* It takes a fairly long time for a new MUX configuration to
> +	 * be be applied after these register writes. This delay
> +	 * duration was derived empirically based on the render_basic
> +	 * config but hopefully it covers the maximum configuration
> +	 * latency...
> +	 */
> +	mdelay(100);

With such a HW and SW design, how can we ever hope to get any kind of performance when we are trying to monitor different metrics on each draw call? This may be acceptable for system monitoring, but it is problematic for the GL extensions :s

Since it seems like we are going for a perf API, it means that for every change of metrics, we need to flush the commands, wait for the GPU to be done, then program the new set of metrics via an IOCTL, wait 100 ms, and then we may resume rendering ... until the next change. We are talking about a latency of 6-7 frames at 60 Hz here... this is non-negligible...

I understand that we have a ton of counters and we may hide latency by not allowing the use of more than half of the counters for every draw call or frame, but even then, this 100 ms delay is killing this approach altogether.

To be honest, if it indeed is a HW bug, then the approach that Samuel Pitoiset and I used for Nouveau would be useless here. That approach pushes a handle representing a pre-computed configuration to the command buffer, so that a software method can ask the kernel to reprogram the counters with as little idle time as possible. But waiting for the GPU to be idle would usually not take more than a few ms... which is nothing compared to waiting 100 ms.

So, now, the elephant in the room: how can it take that long to apply the change? Are the OA registers double buffered (NVIDIA's are, so that we can reconfigure and start monitoring multiple counters at the same time)?

Maybe this 100 ms is the polling period and the HW does not allow changing the configuration in the middle of a polling session. In this case, this delay should depend on the polling frequency. But even then, I would really hope that the HW would allow us to tear down everything, reconfigure and start polling again without waiting for the next tick. If that is not possible, maybe we can change the frequency of the polling clock to make the polling event happen sooner.

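To put rough numbers on the switch cost described above, here is a back-of-the-envelope sketch. The helper name frames_lost() and the 5 ms idle-wait figure are assumptions for illustration only; the 100 ms value is the mdelay() from the quoted patch.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch): number of display frames
 * that elapse while one metric-set switch is in progress.
 *
 *   idle_ms      - time spent flushing and waiting for the GPU to go idle
 *   mux_delay_ms - fixed delay after the MUX register writes
 *                  (100 ms in the quoted patch)
 *   refresh_hz   - display refresh rate
 */
static double frames_lost(double idle_ms, double mux_delay_ms,
			  double refresh_hz)
{
	assert(idle_ms >= 0.0 && mux_delay_ms >= 0.0 && refresh_hz > 0.0);

	/* One frame lasts 1000 / refresh_hz milliseconds. */
	return (idle_ms + mux_delay_ms) * refresh_hz / 1000.0;
}
```

With frames_lost(0.0, 100.0, 60.0) the delay alone accounts for 6 frames; adding an assumed 5 ms of idle wait, frames_lost(5.0, 100.0, 60.0), gives about 6.3 frames. In both cases the fixed delay dominates the total.
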
HW delays are usually a few microseconds, not milliseconds. That really suggests that something funny is happening and that the HW design is not properly understood. If the documentation has nothing on this and the HW teams cannot help, then I suggest a little REing session.

I really want to see this work land, but the way I see it right now is that we cannot rely on it because of this bug. Maybe fixing this bug would require changing the architecture, so better to address it before landing the patches.

Worst case scenario, do not hesitate to contact me if none of the proposed explanations pan out; I will take the time to read through the OA material and try my REing skills on it. As I said, I really want to see this upstream!

Sorry...
Martin