Re: [PATCH] accel/qaic: Add crashdump to Sahara

Konrad Dybcio Thu, 19 Sep 2024 08:37:10 -0700

On 19.09.2024 5:00 PM, Jeffrey Hugo wrote:
> On 9/18/2024 5:41 PM, Konrad Dybcio wrote:
>> On 18.09.2024 5:52 PM, Jeffrey Hugo wrote:
>>> The Sahara protocol has a crashdump functionality. In the hello
>>> exchange, the device can advertise it has a memory dump available for
>>> the host to collect. Instead of the device making requests of the host,
>>> the host requests data from the device which can be later analyzed.
>>>
>>> Implement this functionality and utilize the devcoredump framework for
>>> handing the dump over to userspace.
>>>
>>> Similar to how firmware loading in Sahara involves multiple files,
>>> crashdump can consist of multiple files for different parts of the dump.
>>> Structure these into a single buffer that userspace can parse and
>>> extract the original files from.
>>>
>>> Reviewed-by: Carl Vanderlip <quic_ca...@quicinc.com>
>>> Signed-off-by: Jeffrey Hugo <quic_jh...@quicinc.com>
>>> ---
>>
>> I gave this a brief read, but.. aren't you dumping however much DRAM the
>> AIC100 has (and then some SRAM) onto the host machine without the user
>> asking for it (i.e. immediately after the AIC crashes)?
> 
> I'm not entirely clear what the concern is.  Too much host RAM usage maybe?


Yes

> In short, I think the direct answer is yes and no.
> 
> We put the dump content in the host RAM and allow the user to decide if they 
> want to save it.  The user has 5 minutes to do something with the dump, then 
> the devcoredump framework automatically frees the content in RAM.  Typically 
> the user would access the sysfs file provided by devcoredump, and save the 
> contents to the file system for offline processing.
> 
> There are a few other GPUs and several other devices that do the same. 
> Panfrost appears to save every BO the user allocated into the dump, which 
> would suggest that the user could create an arbitrarily large dump.

Right, freedreno does something similar. Perhaps a user concerned about this
could simply disable CONFIG_DEVCOREDUMP.
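
For reference, the userspace side described above (read the devcoredump
sysfs file, save it to the file system before the 5 minute timeout) could
look roughly like the sketch below. This is only an illustration, not part
of the patch: it assumes the usual /sys/class/devcoredump/devcdN/data
layout, and the function name, the out_dir/sysfs_root/release parameters
and the ".bin" suffix are made up for the example.

```python
import shutil
from pathlib import Path


def save_devcoredumps(out_dir, sysfs_root="/sys/class/devcoredump", release=True):
    """Copy every pending devcoredump into out_dir and return the saved paths.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    # Each pending dump is assumed to show up as a devcdN directory.
    for entry in sorted(Path(sysfs_root).glob("devcd*")):
        data = entry / "data"
        if not data.exists():
            continue
        target = out / f"{entry.name}.bin"
        # Stream in 1 MiB chunks; dumps can be hundreds of MB.
        with data.open("rb") as src, target.open("wb") as dst:
            shutil.copyfileobj(src, dst, 1 << 20)
        if release:
            # Assumption: writing anything to the data file asks devcoredump
            # to free the dump right away instead of waiting for the timeout.
            data.write_bytes(b"1")
        saved.append(target)
    return saved
```

Something like save_devcoredumps("/var/tmp/dumps") would normally have to
run as root. Passing release=False leaves the dump in place, so
devcoredump frees it on its own after the timeout.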

> 
> In the case of AIC100, it is technically possible for the entire device DRAM 
> and SRAM to be offloaded.  That is up to the FW to decide if all of that is 
> relevant.  Current implementation of the FW is heavily aggressive on what it 
> selects for the dump, and current dumps are in the 100-200MB range.

OK, this basically invalidates my concerns. Other boards I've had ramdump on
just spat out all 16 (or however many) gigabytes of RAM they have.


> It feels like you are implying the user should somehow request the dump by 
> having devcoredump publish something, and then hook into the user's reads of 
> the sysfs to go collect the dump.  I worry that means the driver would then 
> need to determine when there is no user interested in collecting the dump, in 
> order to continue the reboot process.  I expect that would be a 5 minute 
> delay (devcoredump deciding there is no user interest after 5 minutes).  I 
> worry that users would object to such a delay given customer feedback we've 
> had on getting the devices into service quickly.

Yeah no, this wouldn't be good.

Konrad
