Hi,

 Right now the tool works as follows: just after the kernel returns from the 
ioctl, it walks the batchbuffer, updates its internal view of the GPU state as 
it walks, and emits the data to the log file. The log for a single batchbuffer 
is (essentially) just a list of call IDs from the apitrace together with 
"where in the batchbuffer" each call started. 
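The per-batchbuffer log described above could be represented roughly like this (a sketch only; the struct, field, and function names are assumptions, not the logger's actual format):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical record: one entry per API call that emitted commands
 * into the batchbuffer. */
struct apitrace_marker {
    uint32_t call_id;      /* call ID from the apitrace */
    uint32_t batch_offset; /* byte offset in the batchbuffer where the
                            * commands for this call start */
};

/* Given an offset into the batchbuffer (e.g. where a hang was
 * detected), return the index of the marker whose commands contain
 * that offset (markers sorted by ascending offset), or -1 if the
 * offset precedes all markers. */
static int marker_for_offset(const struct apitrace_marker *m,
                             size_t count, uint32_t offset)
{
    int hit = -1;
    for (size_t i = 0; i < count; ++i)
        if (m[i].batch_offset <= offset)
            hit = (int)i;
    return hit;
}
```

With such a list, mapping a hang location in the batchbuffer back to the GL call that produced it is a simple lookup.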

 I confess that I had not realized the potential of using something like this 
to help diagnose GPU hangs! I think it is a really good idea. What I could do 
is the following (and it is not terribly hard to do):

   1. -BEFORE- issuing the ioctl, the logger walks just the API markers in the 
log of the batchbuffer, makes a new GEM BO filled with apitrace data (call 
ID, and maybe GL function data), and modifies the ioctl to have an extra 
buffer.

   2. -AFTER- the ioctl returns, emit the log data (as now) and delete the GEM 
BO. In order to read the GPU state more accurately I need to walk the log and 
update the GPU state after the ioctl (mostly out of paranoia about values 
copied from BOs to pipeline registers).
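The two steps above amount to wrapping the execbuffer ioctl. A minimal sketch of that control flow follows; the helper names and the bookkeeping struct are assumptions, and the real code would create the GEM BO through the DRM API and issue the actual ioctl:

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for one wrapped submission. */
struct logger_state {
    bool bo_created;
    bool log_emitted;
    bool bo_deleted;
};

static void create_and_fill_apitrace_bo(struct logger_state *s)
{
    s->bo_created = true;   /* step 1: make a GEM BO holding the call
                             * IDs and attach it as an extra buffer */
}

static bool do_execbuffer_ioctl(void)
{
    return true;            /* stand-in for the real execbuffer ioctl */
}

static void emit_log_and_delete_bo(struct logger_state *s)
{
    s->log_emitted = true;  /* step 2: walk the log, update GPU state */
    s->bo_deleted = true;   /* the BO only survives if we never return
                             * from the ioctl, i.e. on a hard hang */
}

static bool logged_execbuffer(struct logger_state *s)
{
    create_and_fill_apitrace_bo(s);   /* BEFORE the ioctl */
    bool ok = do_execbuffer_ioctl();
    emit_log_and_delete_bo(s);        /* AFTER the ioctl */
    return ok;
}
```

The point of the ordering is exactly the hang case: if the process dies inside the ioctl, the apitrace BO (or the fallback file mentioned below) is left behind for post-mortem decoding.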

What would happen is that if a batchbuffer made the GPU hang, you would then 
know all the GL commands (trace IDs from the apitrace) that placed data in 
that batchbuffer. One could then go back to the apitrace of the troublesome 
application and have a much better starting place for debugging.

We could also do something evil-looking and add another modification to 
apitrace where it takes a list of call ID ranges and inserts glFinish() after 
each call in those ranges. Those glFinish() calls will then force an ioctl for 
the exact troublesome draw call without needing to tell i965 to flush after 
each draw call.
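The apitrace-side modification could be as simple as checking each replayed call ID against the configured ranges (a sketch; the range representation and names are assumptions):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* A user-supplied inclusive range of apitrace call IDs. */
struct call_range {
    uint32_t first, last;
};

/* Decide whether apitrace should insert a glFinish() after replaying
 * the call with this ID, forcing an execbuffer ioctl right after the
 * suspect draw call. */
static bool should_insert_finish(const struct call_range *ranges,
                                 size_t n, uint32_t call_id)
{
    for (size_t i = 0; i < n; ++i)
        if (call_id >= ranges[i].first && call_id <= ranges[i].last)
            return true;
    return false;
}
```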

Just to make sure: you want the "apitrace" data (call ID list, maybe function 
names) in a GEM BO? Which GEM BO in the list should it be, so that the kernel 
debug code knows which one to dump? I would guess that if the batchbuffer is 
the first buffer, then it would be the last buffer; otherwise, if the 
batchbuffer is the last one, I guess it would be the one just before, but that 
might screw up reloc data if any of the relocs in the batchbuffer refer to 
itself. I could also emit the data to a file, close the file before the ioctl, 
and delete said file if the ioctl returns (assuming a GPU hang always stops 
the process, a hang would then leave the file behind). 
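The index guess above could be sketched like this, covering only the two cases mentioned (this is an assumption about a convention that would need to be agreed with the kernel side, not an existing rule):

```c
#include <stdint.h>

/* Guess where to place the apitrace BO in the execbuffer's object
 * list so kernel debug code can find it: if the batchbuffer is the
 * first object, append the extra BO as the new last slot; if the
 * batchbuffer is the last object, slot the extra BO in just before
 * it (with the caveat from the text that self-referencing relocs in
 * the batchbuffer may rule this out). */
static uint32_t apitrace_bo_index(uint32_t buffer_count,
                                  uint32_t batch_index)
{
    if (batch_index == 0)
        return buffer_count;       /* new last slot */
    return buffer_count - 1;       /* just before the batchbuffer */
}
```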

Let me know what is best, and I will do it.

-Kevin


-----Original Message-----
From: Chris Wilson [mailto:ch...@chris-wilson.co.uk] 
Sent: Tuesday, September 26, 2017 11:20 PM
To: Rogovin, Kevin <kevin.rogo...@intel.com>; mesa-dev@lists.freedesktop.org
Subject: Re: [Mesa-dev] [PATCH 00/22] RFC: Batchbuffer Logger for Intel GPU

Quoting Rogovin, Kevin (2017-09-26 10:35:44)
> Hi,
> 
>   Attached to this message are the following:
>      1. a file giving example usage of the tool with a modified 
> apitrace to produce json output
> 
>      2. the patches to apitrace to make it BatchbufferLogger aware
> 
>      3. the JSON files (gzipped) made from the example.
> 
> 
> I encourage (and hope) people will take a look at the JSON to see the 
> potential of the tool.

The automatic apitrace-esque logging seems very useful. How easy would it be to 
write that trace into a bo and associate it with the execbuffer (from my pov, it 
shouldn't be that hard)? That way you could get the most recent actions before a 
GPU hang, attach them to a bug and decode them at leisure. (An extension may be 
to keep a ring of the last N traces so that you can see some setup a few 
batches ago that triggered a hang in this one.)

I presume you already have such a plan, and I'm just preaching to the choir.
-Chris
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
