On Wed, 2014-09-03 at 16:19 +0200, Daniel Vetter wrote:
> [super-embarrassing resend, the previous one contained html gunk.]
>
> If the idea is to also convert gpu crash dumps to this we should add
> dri-devel. And there the crashes are usually not due to firmware, but
> because the shaders and command batches userspace submitted have
> issues, so this should also be renamed to dev_coredump I think.

I don't know if the idea is to convert gpu crash dumps - I was just
wondering if you could and would want to use such a generic framework.
If the answer turns out to be no, that's perfectly reasonable I think.
However, renaming seems easy to do anyway :)

> On the overall design I wonder whether this shouldn't work more like a
> real core dump and dump to a real file. At least currently the dumps
> i915 creates are only useful as a general guide to where things went
> wrong, but if we actually want to submit them as traces to the
> hardware people we need to dump a _lot_ more. Otoh with the future of
> shared virtual address spaces between gpu/cpu we might just do a real
> core dump, so maybe this use case should be out of scope for your
> patch here.

I'm not really sure I'd want to actually sys_write() to a file here -
sounds like a big can of worms. If you have direct access (like shared
memory space) it seems we could still use the same mechanisms with the
coredumpm() method, no?

> On the logic itself I'm not sure whether the timeout is all that
> useful - at least in i915 our crash recovery works well enough that
> reporters often don't realize right away when it happened, but only
> later on when looking through logs to explain the tiny corruptions. If
> the crash dump has evaporated meanwhile that's not that useful.

Right. We might want to make it configurable, maybe even in Kconfig. I
was thinking that there would be userspace that would (automatically)
pick it up, and if such userspace doesn't exist or isn't running then
we'd want to free the memory eventually.

> Also, at least for gpus it's usually not interesting to grab
> subsequent dumps: Often the gpu is in a bad mood due to the first
> crash, or it's just a massive row of duplicated dumps. So in i915 we
> only record the first crash and keep it around forever. And tooling
> can still free it by writing to the file. This also ensures that we
> don't waste excessive amounts of memory with crash dumps.

Right, we discussed this but then I completely forgot. I think keeping
the first one is reasonable. If userspace has already picked it up
you'll still get multiple and maybe want to have a policy there as well.

> And if we want to use this for i915 we need some way for tools to go
> from the i915 drm class device node to the error state, not just from
> the error state back to the device.

Interesting. That's probably not all that difficult to do (maybe even
set up a child/parent relationship?) but I actually wanted to avoid a
hard dependency since there may be cases where the failing device
disappears, e.g. in the case of USB. I have to think about this case
more, I guess.

johannes
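
For illustration only, the retention policy discussed in this thread
(keep the first dump, let tooling free it explicitly, optionally drop it
after a configurable timeout) could be sketched in plain userspace C
roughly as below. This is not code from the patch or from i915: struct
dump_slot, dump_store(), dump_clear() and dump_expire() are hypothetical
names, and the single-slot layout and the "timeout <= 0 means keep
forever" convention are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical single-slot store for one device's crash dump. */
struct dump_slot {
	void *data;        /* NULL while the slot is empty */
	size_t len;
	time_t stored_at;
	double timeout_s;  /* <= 0: keep until explicitly cleared */
};

/* Free the dump, as tooling would by writing to the dump file. */
static void dump_clear(struct dump_slot *s)
{
	assert(s);
	free(s->data);
	s->data = NULL;
	s->len = 0;
}

/* Keep only the first dump: later ones are dropped while one is held. */
static bool dump_store(struct dump_slot *s, const void *buf, size_t len,
		       time_t now)
{
	assert(s && buf);
	if (s->data || len == 0)
		return false;
	s->data = malloc(len);
	if (!s->data)
		return false;
	memcpy(s->data, buf, len);
	s->len = len;
	s->stored_at = now;
	return true;
}

/* Drop a dump nobody collected within the timeout; true if it was freed. */
static bool dump_expire(struct dump_slot *s, time_t now)
{
	assert(s);
	if (!s->data || s->timeout_s <= 0)
		return false;
	if (difftime(now, s->stored_at) < s->timeout_s)
		return false;
	dump_clear(s);
	return true;
}
```

With timeout_s set to 0 this behaves like the i915 scheme described
above (first crash kept until tooling clears it); a positive value gives
the "free the memory eventually" behaviour, and the value itself is what
would be made configurable.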