On 09/03/2023 17.05, Christian König wrote: > Am 09.03.23 um 07:30 schrieb Asahi Lina: >> On 09/03/2023 05.14, Christian König wrote: >>>> I think you mean wake_up_interruptible(). That would be >>>> drm_sched_job_done(), on the fence callback when a job completes, which >>>> as I keep saying is the same logic used for >>>> hw_rq_count/hw_submission_limit tracking. >>> As the documentation to wait_event says: >>> >>> * wake_up() has to be called after changing any variable that could >>> * change the result of the wait condition. >>> >>> So what you essentially try to do here is to skip that and say >>> drm_sched_job_done() would call that anyway, but when you read any >>> variable to determine that state then as far as I can see nothing is >>> guarantying that order. >> The driver needs to guarantee that any changes to that state precede a >> job completion fence signal of course, that's the entire idea of the >> API. It's supposed to represent a check for per-scheduler (or more >> specific, but not more global) resources that are released on job >> completion. Of course if you misuse the API you could cause a problem, >> but what I'm trying to say is that the API as designed and when used as >> intended does work properly. >> >> Put another way: job completions always need to cause the sched main >> loop to run an iteration anyway (otherwise we wouldn't make forward >> progress), and job completions are exactly the signal that the >> can_run_job() condition may have changed. >> >>> The only other possibility how you could use the callback correctly >>> would be to call drm_fence_is_signaled() to query the state of your hw >>> submission from the same fence which is then signaled. But then the >>> question is once more why you don't give that fence directly to the >>> scheduler? >> But the driver is supposed to guarantee that the ordering is always 1. >> resources freed, 2. fence signaled. So you don't need to check for the >> fence, you can just check for the resource state. > > Yeah, but this is exactly what the dma_fence framework tried to prevent. > We try very hard to avoid such side channel signaling :)
Right, and it's fine, I can use the fences directly easily enough. I'm just trying to explain why my original idea works too, even if it's not the best solution for other reasons! Of course I don't have the context of what other drivers are doing or did historically and what the pitfalls are, so I can't know what the "right" solution for any of this is in that context. I did my best to understand the drm_sched code and come up with a solution that works (which it does) without any more info. When I saw the hw submission limit stuff, I thought "okay, I need the same thing but with slightly more complex logic, so let's add a callback so the driver can customize it and do its own inflight counting". After this discussion, I can see that this is equivalent to doing the same check in prepare_job() followed by returning the oldest running job's fence (as long as there's no race there... it should be fine if the fence reference is taken first, before the resource check, or if everything is done within the same critical section taking the firmware queue lock), so I'm happy to switch to that and drop this patch. But keep in mind none of this is documented, and there's no way for us driver authors to understand what we're supposed to do without documentation. As I said I spent a long time trying to understand drm_sched, and then my original attempt missed the drm_sched_fini() issue with dangling jobs and Alyssa managed to hit an oops on the test branch, I guessed what the problem was from her trace, figured out a way to reproduce it (the kill-loop glmark2 thing), and fixed it in the next patch in this series. So even trying my best to figure out how to do this, reading the code and what scarce docs there are, I managed to miss something that caused a potential oops on the first try. If I can't even get the API usage right after spending hours on it trying really hard not to (because it's not just about my driver, I need the Rust abstraction to be safe for any driver), there's no way I'm going to divine what approaches to resource/dependency signaling are problematic/easy to abuse... the most I can hope for is "I got the wrapper right and the API/driver interaction is correct and guarantees forward progress if the driver follows the rules". So when I submit something, and you reply with "Well complete NAK", that's just not nice. Honestly, I was kind of upset when I got that email. It sounded as if you were saying my solution was completely broken and couldn't work, but no matter how I looked at it I couldn't figure out how it's broken. And then it took several emails to even understand what you were suggesting with the prepare_job callback (and yes, that works too and is probably harder to abuse than a new callback). I'm trying really hard to make this all work and be correct, and of course I make mistakes too... but then I look at the code and no matter what I can come up with it seems to work and be correct, what am I supposed to do? I'm happy to learn and figure out better approaches for everything that lead to better drivers, but I need an actual explanation of the issues, not just a NAK... I also would appreciate it if people give me the benefit of the doubt and let me explain what I'm doing and how I'm doing it and how this hardware works, because the whole thing is subtle to the core and very different to other GPUs. Honestly, I don't think any reviewer that hasn't spent hours poring over the driver/abstraction code could confidently say that a certain subtle sync issue exists at a first pass (other than for really obvious bad code sequences). I'm happy to look into issues and I definitely want to know what cases to look at and what to check for and fix anything we find... but isn't it better if we work together instead of shouting "this is broken" at the first hint of possible trouble? > But putting that issue aside for a moment. What I don't get is when you > have such intra queue dependencies, then why can't you check that at a > much higher level? > > In other words even userspace should be able to predict that for it's > submissions X amount of resources are needed and when all of my > submissions run in parallel that won't work. Technically yes, but we can't trust userspace to honor this, since overflowing the firmware queue breaks everything, so the kernel has to do the check... plus we're trying to insulate userspace from the details of how work is queued at the firmware. We need to support multiple firmware versions including future ones we can't predict yet without breaking UAPI, so the less the UAPI depends on firmware details, the better. That's why at the UAPI level, this is boiled down to a simpler "max commands per submission" limit that gets passed in the params struct, which is conservative, and then the kernel can deal with the actual in-flight count tracking and only submit things to the hardware when they fit. In the future we could even support job splitting on the kernel side and remove the max commands per submission limit altogether (though it probably still makes sense to have for other reasons, like bounding how much kernel/firmware memory a single queue can consume, so I'm not sure this is even worth doing at all). > Asking the firmware for a status is usually a magnitudes slower than > just computing it before submission. I'm not asking the firmware for status, I'm just asking my own firmware queue code how many slots are currently free in each backing queue. That's just based on internal driver state, there is no firmware round trip! I could technically compute this before submission and figure out how much work has been queued and pre-populate fences that ensure we never exceed the max, but honestly that's a lot more code to track job sizes and I don't think it makes sense when I can just ask "Do we have space? No? Okay, return the oldest running job fence for now and try again when it completes" in prepare_job(). Maybe it's faster in pathological cases to do something fancier, but let's wait until Vulkan works and we can run real AAA games and see where the bottlenecks are before going down the optimization road ^^ ~~ Lina