On 19.09.2024 5:00 PM, Jeffrey Hugo wrote:
> On 9/18/2024 5:41 PM, Konrad Dybcio wrote:
>> On 18.09.2024 5:52 PM, Jeffrey Hugo wrote:
>>> The Sahara protocol has a crashdump functionality. In the hello
>>> exchange, the device can advertise it has a memory dump available for
>>> the host to collect. Instead of the device making requests of the host,
>>> the host requests data from the device which can be later analyzed.
>>>
>>> Implement this functionality and utilize the devcoredump framework for
>>> handing the dump over to userspace.
>>>
>>> Similar to how firmware loading in Sahara involves multiple files,
>>> crashdump can consist of multiple files for different parts of the dump.
>>> Structure these into a single buffer that userspace can parse and
>>> extract the original files from.
>>>
>>> Reviewed-by: Carl Vanderlip <quic_ca...@quicinc.com>
>>> Signed-off-by: Jeffrey Hugo <quic_jh...@quicinc.com>
>>> ---
>>
>> I gave this a brief read, but.. aren't you dumping however much DRAM the
>> AIC100 has (and then some SRAM) onto the host machine without the user
>> asking for it (i.e. immediately after the AIC crashes)?
>
> I'm not entirely clear what the concern is. Too much host RAM usage maybe?

Yes

> In short, I think the direct answer is yes and no.
>
> We put the dump content in the host RAM and allow the user to decide if they
> want to save it. The user has 5 minutes to do something with the dump, then
> the devcoredump framework automatically frees the content in RAM. Typically
> the user would access the sysfs file provided by devcoredump, and save the
> contents to the file system for offline processing.
>
> There are a few other GPUs and several other devices that do the same.
> Panfrost appears to save every BO the user allocated into the dump, which
> would suggest that the user could create an arbitrarily large dump.

Right, freedreno does something similar.
Perhaps a user concerned about this could simply disable CONFIG_DEVCOREDUMP.

>
> In the case of AIC100, it is technically possible for the entire device DRAM
> and SRAM to be offloaded. That is up to the FW to decide if all of that is
> relevant. Current implementation of the FW is heavily aggressive on what it
> selects for the dump, and current dumps are in the 100-200MB range.

OK this basically invalidates my concerns.. other boards I had ramdump on me
just spat out all the 16 or however much gigabytes of RAM they have..

> It feels like you are implying the user should somehow request the dump by
> having devcoredump publish something, and then hook into the user's reads of
> the sysfs to go collect the dump. I worry that means the driver would then
> need to determine when there is no user interested in collecting the dump, in
> order to continue the reboot process. I expect that would be a 5 minute
> delay (devcoredump deciding there is no user interest after 5 minutes). I
> worry that users would object to such a delay given customer feedback we've
> had on getting the devices into service quickly.

Yeah no, this wouldn't be good.

Konrad
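
For reference, the userspace flow described in the quoted text (read the devcoredump sysfs file, save it to disk, let the kernel drop its copy) could look roughly like the sketch below. This is only an illustration, not part of the patch: the function name, the destination path and the instance number in the usage note are made up, and it assumes the usual devcoredump behaviour where the dump is exposed as a `data` file and writing anything to that file frees it early.

```python
def save_devcoredump(data_path, dest_path, release=True, chunk_size=1 << 20):
    """Copy a devcoredump 'data' file to dest_path and return the bytes copied.

    Hypothetical helper for illustration only. data_path is expected to be
    something like /sys/class/devcoredump/devcdN/data.
    """
    copied = 0
    # Stream in chunks so a 100-200MB dump is never held twice in memory.
    with open(data_path, "rb") as src, open(dest_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
    if release:
        # Assumption: any write to the data file tells devcoredump to free
        # the dump now instead of waiting for the 5 minute timeout.
        with open(data_path, "wb") as ctl:
            ctl.write(b"1")
    return copied
```

Usage would be along the lines of `save_devcoredump("/sys/class/devcoredump/devcd1/data", "/var/tmp/aic100.dump")`, run within the 5 minute window mentioned above. Splitting the saved buffer back into the original Sahara files is left out, since the buffer layout is not described in this thread.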