On 13/02/2021 03:38, Rob Clark wrote:
On Fri, Feb 12, 2021 at 5:08 PM Lionel Landwerlin
<lionel.g.landwer...@intel.com> wrote:
We're kind of in the same boat for Intel.

Access to GPU perf counters is exclusive to a single process if you want
to build a timeline of the work (because preemption etc...).
ugg, does that mean extensions like AMD_performance_monitor don't
actually work on intel?


It works, but only a single app can use it at a time.



The best information we could add from mesa would be a timestamp of when a
particular drawcall started.
But that's pretty much what timestamp queries are.

Were you thinking of particular GPU generated data you don't get from
gfx-pps?
From the looks of it, currently I don't get *any* GPU generated data
from gfx-pps ;-)


Maybe file a bug? https://gitlab.freedesktop.org/Fahien/gfx-pps/-/blob/master/src/gpu/intel/intel_driver.cc



We can ofc sample counters from a separate process as well... I have a
curses tool (fdperf) which does this.. but running outside of gpu
cmdstream plus counters losing context across suspend/resume makes it
less than perfect.


Our counters are global, so to give per-application values we need to post-process a stream of HW counter snapshots.
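
To illustrate the idea (names and layout here are made up, not our actual
interface): each global snapshot gets tagged with the HW context that was
active, and the delta since the previous snapshot is attributed to that
context.

  #include <cstddef>
  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  // Hypothetical global counter snapshot: one value per counter plus the
  // HW context that was running when the snapshot was taken.
  struct Snapshot {
    uint64_t timestamp_ns;
    uint32_t hw_context_id;
    std::vector<uint64_t> counters;
  };

  // Attribute the delta between consecutive global snapshots to the context
  // that was active over that interval (counter wraparound and context
  // switches within an interval are ignored for brevity).
  std::unordered_map<uint32_t, std::vector<uint64_t>>
  accumulate_per_context(const std::vector<Snapshot>& stream) {
    std::unordered_map<uint32_t, std::vector<uint64_t>> per_ctx;
    for (size_t i = 1; i < stream.size(); i++) {
      const Snapshot& prev = stream[i - 1];
      const Snapshot& cur = stream[i];
      std::vector<uint64_t>& acc = per_ctx[prev.hw_context_id];
      acc.resize(cur.counters.size(), 0);
      for (size_t c = 0; c < cur.counters.size(); c++)
        acc[c] += cur.counters[c] - prev.counters[c];
    }
    return per_ctx;
  }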


   And something that works the same way as
AMD_performance_monitor under the hood gives a more precise look at
which shaders (for example) are consuming the most cycles.


In our implementation that precision (in particular, knowing when a drawcall ends) unfortunately comes at a stalling cost.


   For cases where
we can profile a trace, frameretrace and related tools are pretty
great, but it would be nice to have similar visibility for actual
games (which for me mostly means android games, since so far there is
no aarch64 steam store), and also to give game developers good tools (or at
least the same tools that they get with other closed-source drivers on
android).


Sure, but frame analysis is different from live monitoring of the system.

On Intel's HW you don't get the same level of detail in both cases, and apart from a few timestamps, I think gfx-pps is as good as you're going to get for live stuff.


-Lionel



BR,
-R

Thanks,

-Lionel


On 13/02/2021 00:12, Alyssa Rosenzweig wrote:
My 2c for Mali/Panfrost --

For us, capturing GPU perf counters is orthogonal to rendering. It's
expected (e.g. with Arm's tools) to do this from a separate process.
Neither Mesa nor the DDK should require custom instrumentation for the
low-level data. Fahien's gfx-pps handles this correctly for Panfrost +
Perfetto as it is. So for us I don't see the value in modifying Mesa for
tracing.

On Fri, Feb 12, 2021 at 01:34:51PM -0800, John Bates wrote:
(responding from correct address this time)

On Fri, Feb 12, 2021 at 12:03 PM Mark Janes <mark.a.ja...@intel.com> wrote:

I've recently been using GPUVis to look at trace events.  On Intel
platforms, GPUVis incorporates ftrace events from the i915 driver,
performance metrics from igt-gpu-tools, and userspace ftrace markers
that I locally hack up in Mesa.

GPUVis is great. I would love to see that data combined with
userspace events without any need for local hacks. Perfetto provides
on-demand trace events with lower overhead compared to ftrace, so for
example it is acceptable to have production trace instrumentation that can
be captured without dev builds. To do that with ftrace it may require a way
to enable and disable the ftrace file writes to avoid the overhead when
tracing is not in use. This is what Android does with systrace/atrace; for
example, it uses Binder to notify processes about trace sessions. Perfetto
does that in a more portable way.
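
As a rough sketch of that gating (purely illustrative, not any particular
library's API): the write is guarded by a relaxed atomic flag that gets
flipped when a trace session starts or stops, so the disabled path is a
single predicted branch and no syscall.

  #include <atomic>
  #include <cstdio>

  // Flipped by whatever mechanism notifies the process that a trace
  // session started or stopped (Binder on Android, traced's IPC for
  // perfetto, ...).  Hypothetical name.
  static std::atomic<bool> g_tracing_enabled{false};

  static void emit_marker(const char* name) {
    // Disabled path: one predicted branch, nothing written.
    if (!g_tracing_enabled.load(std::memory_order_relaxed))
      return;
    // Enabled path: actually emit the event (ftrace marker write,
    // perfetto track event, ...); a placeholder here.
    std::fprintf(stderr, "trace event: %s\n", name);
  }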


It is very easy to compile the GPUVis UI.  Userspace instrumentation
requires a single C/C++ header.  You don't have to access an external
web service to analyze trace data (a big no-no for devs working on
preproduction hardware).

Is it possible to build and run the Perfetto UI locally?
Yes, local UI builds are possible
<https://github.com/google/perfetto/blob/5ff758df67da94d17734c2e70eb6738c4902953e/ui/README.md>.
Also confirmed with the perfetto team <https://discord.gg/35ShE3A> that
trace data is not uploaded unless you use the 'share' feature.


    Can it display
arbitrary trace events that are written to
/sys/kernel/tracing/trace_marker ?
Yes, I believe it does support that via the linux.ftrace data source
<https://perfetto.dev/docs/quickstart/linux-tracing>. We use that, for
example, to overlay CPU sched data to show what process is on each core
throughout the timeline. There are many ftrace event types
<https://github.com/google/perfetto/tree/5ff758df67da94d17734c2e70eb6738c4902953e/protos/perfetto/trace/ftrace>
in the perfetto protos.
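
For reference, this is roughly how such markers are emitted from userspace,
using the B|pid|name / E|pid begin/end convention that systrace/atrace use
(error handling omitted; the file lives under /sys/kernel/debug/tracing on
older setups):

  #include <cstdio>
  #include <fcntl.h>
  #include <unistd.h>

  static int open_trace_marker() {
    return open("/sys/kernel/tracing/trace_marker", O_WRONLY);
  }

  // Begin/end slice markers; these show up as ftrace "print" events,
  // which the linux.ftrace data source can record.
  static void trace_begin(int fd, const char* name) {
    char buf[256];
    int len = std::snprintf(buf, sizeof(buf), "B|%d|%s", getpid(), name);
    if (len > 0)
      (void)write(fd, buf, len);
  }

  static void trace_end(int fd) {
    char buf[32];
    int len = std::snprintf(buf, sizeof(buf), "E|%d", getpid());
    if (len > 0)
      (void)write(fd, buf, len);
  }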


Can it be extended to show i915 and
i915-perf-recorder events?

It can be extended to consume custom data sources. One way this is done is
via a bridge daemon, such as traced_probes, which is responsible for
capturing data from ftrace and /proc during a trace session and sending it
to traced. traced is the main perfetto tracing daemon that notifies all
trace data sources to start/stop tracing and handles user tracing
requests via the 'perfetto' command.
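
For anyone curious, a custom data source with the C++ SDK looks roughly like
this (adapted from the perfetto docs; the source name and payload are
illustrative and details may have drifted): the process connects to the
system traced via the system backend, registers a named data source, and
traced calls back when a session that enables it starts.

  #include <perfetto.h>

  class CustomDataSource : public perfetto::DataSource<CustomDataSource> {
   public:
    void OnSetup(const SetupArgs&) override {}
    void OnStart(const StartArgs&) override {}
    void OnStop(const StopArgs&) override {}
  };

  PERFETTO_DECLARE_DATA_SOURCE_STATIC_MEMBERS(CustomDataSource);
  PERFETTO_DEFINE_DATA_SOURCE_STATIC_MEMBERS(CustomDataSource);

  int main() {
    perfetto::TracingInitArgs args;
    args.backends = perfetto::kSystemBackend;  // connect to the system traced
    perfetto::Tracing::Initialize(args);

    perfetto::DataSourceDescriptor dsd;
    dsd.set_name("com.example.custom_source");  // illustrative name
    CustomDataSource::Register(dsd);

    // The lambda only runs while a session that enabled this source is live.
    CustomDataSource::Trace([](CustomDataSource::TraceContext ctx) {
      auto packet = ctx.NewTracePacket();
      packet->set_timestamp(0);
      packet->set_for_testing()->set_str("hello");
    });
    return 0;
  }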



John Bates <jba...@chromium.org> writes:

I recently opened issue 4262
<https://gitlab.freedesktop.org/mesa/mesa/-/issues/4262> to begin the
discussion on integrating perfetto into mesa.

*Background*

System-wide tracing is an invaluable tool for developers to find and fix
performance problems. The perfetto project enables a combined view of
trace data from kernel ftrace, the GPU driver and various manually-instrumented
tracepoints throughout the application and system. This helps developers
quickly answer questions like:

     - How long are frames taking?
     - What caused a particular frame drop?
     - Is it CPU bound or GPU bound?
     - Did a CPU core frequency drop cause something to go slower than usual?
     - Is something else running that is stealing CPU or GPU time? Could I
       fix that with better thread/context priorities?
     - Are all CPU cores being used effectively? Do I need sched_setaffinity
       to keep my thread on a big or little core?
     - What’s the latency between CPU frame submit and GPU start?

*What Does Mesa + Perfetto Provide?*

Mesa is in a unique position to produce GPU trace data for several GPU
vendors without requiring the developer to build and install additional
tools like gfx-pps <https://gitlab.freedesktop.org/Fahien/gfx-pps>.

The key is making it easy for developers to use. Ideally, perfetto is
eventually available by default in mesa so that if your system has perfetto
traced running, you just need to run perfetto (perhaps along with setting
an environment variable) with the mesa categories to see:

     - GPU processing timeline events.
     - GPU counters.
     - CPU events for potentially slow functions in mesa like shader compiles.
Example of what this data might look like (with fake GPU events):
[image: percetto-gpu-example.png]
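
As a sketch of what the CPU-side instrumentation could look like with the
perfetto C++ SDK's track events (the "mesa" category name and the function
are just illustrative; the eventual integration could equally go through a
C wrapper like percetto):

  #include <perfetto.h>

  PERFETTO_DEFINE_CATEGORIES(
      perfetto::Category("mesa").SetDescription("Mesa driver events"));
  PERFETTO_TRACK_EVENT_STATIC_STORAGE();

  // Hypothetical slow path worth instrumenting.
  static void compile_shader() {
    // Scoped slice; close to free when the category is disabled.
    TRACE_EVENT("mesa", "compile_shader");
    // ... actual compilation work ...
  }

  int main() {
    perfetto::TracingInitArgs args;
    args.backends = perfetto::kSystemBackend;
    perfetto::Tracing::Initialize(args);
    perfetto::TrackEvent::Register();

    compile_shader();
    return 0;
  }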

*Runtime Characteristics*

     - ~500KB additional binary size. Even with using only the basic features
       of perfetto, it will increase the binary size of mesa by about 500KB.
     - Background thread. Perfetto uses a background thread for communication
       with the system tracing daemon (traced) to advertise trace data and get
       notification of trace start/stop.
     - Runtime overhead when disabled is designed to be optimal with one
       predicted branch, typically a few CPU cycles
       <https://perfetto.dev/docs/instrumentation/track-events#performance>
       per event. While enabled, the overhead can be around 1 us per event.

*Integration Challenges*

     - The perfetto SDK is C++ and designed around macros, lambdas, inline
       templates, etc. There are ongoing discussions on providing an official
       perfetto C API, but it is not yet clear when this will land on the
       perfetto roadmap.
     - The perfetto SDK is an amalgamated .h and .cc that adds up to 100K
       lines of code.
     - Anything that includes perfetto.h takes a long time to compile.
     - The current Perfetto SDK design is incompatible with being a shared
       library behind a C API.

*Percetto*

The percetto library <https://github.com/olvaffe/percetto> was recently
implemented to provide an interim C API for perfetto. It provides efficient
support for scoped trace events, multiple categories, counters, custom
timestamps, and debug data annotations. Percetto also provides some
features that are important to mesa, but not yet available in the perfetto
SDK:

     - Trace events from multiple perfetto instances in separate shared
       libraries (like mesa and virglrenderer) show correctly in a single
       process and thread view.
     - Counter tracks and macro API.

Percetto is missing an API for perfetto's GPU DataSource and counter support,
but that feature could be implemented next if it is important for mesa.
With the existing percetto API, mesa could present GPU trace data as named
'slice' events and int64_t counters with custom timestamps, as shown in the
image above (based on this sample
<https://github.com/olvaffe/percetto/blob/main/examples/timestamps.c>).
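
For comparison, this is roughly what those custom-timestamp slices and
counters look like with the perfetto C++ SDK's track-event API (the track
id, timestamps, names and values below are made up, and the exact overloads
may differ between SDK versions):

  #include <cstdint>
  #include <perfetto.h>

  PERFETTO_DEFINE_CATEGORIES(
      perfetto::Category("gpu").SetDescription("GPU events"));
  PERFETTO_TRACK_EVENT_STATIC_STORAGE();

  // Emit a GPU slice on a dedicated track using timestamps that came from
  // the GPU.  Assumes Tracing::Initialize()/TrackEvent::Register() already
  // ran at startup.
  static void emit_gpu_slice(uint64_t begin_ns, uint64_t end_ns) {
    perfetto::Track gpu_track(42);  // arbitrary track id
    TRACE_EVENT_BEGIN("gpu", "draw_call", gpu_track, begin_ns);
    TRACE_EVENT_END("gpu", gpu_track, end_ns);
    // A counter sample with an explicit timestamp, e.g. derived from HW
    // counters.
    TRACE_COUNTER("gpu", "gpu_busy_percent", end_ns, 73);
  }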

*Mesa Integration Alternatives*

Note: we have some pressing needs for performance analysis in Chrome OS, so
I'm intentionally leaving out the alternative of waiting for an official
perfetto C API. Of course, once that C API is available it would become an
option to migrate to it from any of the alternatives below.

Ordered by difficulty with easiest first:

     1. Statically link with percetto as an optional external dependency
        (virglrenderer now has this approach
        <https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/480>).
        - Pros: API already supports most common tracing needs. Tested and
          used by an increasing number of CrOS components.
        - Cons: External dependency for optional mesa build option.
     2. Embed Perfetto SDK + a Percetto fork/copy.
        - Pros: API already supports most common tracing needs. No added
          external dependency for mesa.
        - Cons: Percetto code divergence, bug fixes need to land in two trees.
     3. Embed Perfetto SDK + custom C wrapper.
        - Pros: Tailored API for mesa's needs.
        - Cons: Nontrivial development efforts and maintenance.
     4. Generate C stubs for the Perfetto protobuf and reimplement the
        Perfetto SDK in C.
        - Pros: Tailored API for mesa's needs. Possible smaller binary impact
          from simpler implementation.
        - Cons: Significant development efforts and maintenance.

Regardless of the integration direction, I expect we would disable perfetto
in the default build for now to minimize disruption.

I like #1, because there are some nontrivial subtleties to the C wrapper
that provide both API conveniences and runtime performance that would need
to be reimplemented or maintained with the other options. I will also
volunteer to do #1 or #2, but I'm not sure I have time for #3 or #4 :D.

Any other thoughts on how best to integrate perfetto into mesa?

-jb
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev