On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin < tvrtko.ursu...@linux.intel.com> wrote:
> > On 10/01/2023 14:08, Jason Ekstrand wrote: > > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin > > <tvrtko.ursu...@linux.intel.com> > > > wrote: > > > > > > > > On 09/01/2023 17:27, Jason Ekstrand wrote: > > > > [snip] > > > > > >>> AFAICT it proposes to have 1:1 between *userspace* > created > > > contexts (per > > > >>> context _and_ engine) and drm_sched. I am not sure > avoiding > > > invasive changes > > > >>> to the shared code is in the spirit of the overall idea > > and instead > > > >>> the opportunity should be used to look at ways to > > refactor/improve > > > drm_sched. > > > > > > > > > Maybe? I'm not convinced that what Xe is doing is an abuse at > > all or > > > really needs to drive a re-factor. (More on that later.) > > There's only > > > one real issue which is that it fires off potentially a lot of > > kthreads. > > > Even that's not that bad given that kthreads are pretty light and > > you're > > > not likely to have more kthreads than userspace threads which are > > much > > > heavier. Not ideal, but not the end of the world either. > > Definitely > > > something we can/should optimize but if we went through with Xe > > without > > > this patch, it would probably be mostly ok. > > > > > > >> Yes, it is 1:1 *userspace* engines and drm_sched. > > > >> > > > >> I'm not really prepared to make large changes to the DRM > > scheduler > > > at the > > > >> moment for Xe as they are not really required nor does > Boris > > > seem to think they > > > >> will be required for his work either. I am interested to > see > > > what Boris > > > >> comes up with. > > > >> > > > >>> Even on the low level, the idea to replace drm_sched > threads > > > with workers > > > >>> has a few problems. 
> > > >>> > > > >>> To start with, the pattern of: > > > >>> > > > >>> while (not_stopped) { > > > >>> keep picking jobs > > > >>> } > > > >>> > > > >>> Feels fundamentally in disagreement with workers (while > > > obviously fits > > > >>> perfectly with the current kthread design). > > > >> > > > >> The while loop breaks and the worker exits if no jobs are > ready. > > > > > > > > > I'm not very familiar with workqueues. What are you saying would > fit > > > better? One scheduling job per work item rather than one big work > > item > > > which handles all available jobs? > > > > Yes and no, it indeed IMO does not fit to have a work item which is > > potentially unbound in runtime. But it is a bit of a moot conceptual > > mismatch > > because it is a worst case / theoretical, and I think due to more > > fundamental concerns. > > > > If we have to go back to the low level side of things, I've picked > this > > random spot to consolidate what I have already mentioned and perhaps > > expand. > > > > To start with, let me pull out some thoughts from workqueue.rst: > > > > """ > > Generally, work items are not expected to hog a CPU and consume many > > cycles. That means maintaining just enough concurrency to prevent > work > > processing from stalling should be optimal. > > """ > > > > For unbound queues: > > """ > > The responsibility of regulating concurrency level is on the users. > > """ > > > > Given the unbound queues will be spawned on demand to service all > > queued > > work items (more interesting when mixing up with the > > system_unbound_wq), > > in the proposed design the number of instantiated worker threads does > > not correspond to the number of user threads (as you have elsewhere > > stated), but pessimistically to the number of active user contexts. > > > > > > Those are pretty much the same in practice. Rather, the number of user > > threads is typically an upper bound on the number of contexts. 
Yes, a single user > > thread could have a bunch of contexts but basically nothing does that > > except IGT. In real-world usage, it's at most one context per user > thread. > Typically is the key here. But I am not sure it is good enough. Consider > this example - Intel Flex 170: > > * Delivers up to 36 streams 1080p60 transcode throughput per card. > * When scaled to 10 cards in a 4U server configuration, it can support > up to 360 streams of HEVC/HEVC 1080p60 transcode throughput. > I had a feeling it was going to be media.... 😅 > One transcode stream from my experience typically is 3-4 GPU contexts > (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a single > CPU thread. 4 contexts * 36 streams = 144 active contexts. Multiply by > 60fps = 8640 jobs submitted and completed per second. > > 144 active contexts in the proposed scheme possibly means 144 > kernel worker threads spawned (driven by 36 transcode CPU threads). (I > don't think the pools would scale down given all are constantly pinged > at 60fps.) > > And then each of the 144 threads goes to grab the single GuC CT mutex. First > the threads are made schedulable, then put to sleep as mutex > contention is hit, then woken again as mutexes are getting released, > rinse, repeat. > Why is every submission grabbing the GuC CT mutex? I've not read the GuC back-end yet but I was under the impression that most run_job() would be just shoving another packet into a ring buffer. If we have to send the GuC a message on the control ring every single time we submit a job, that's pretty horrible. --Jason (And yes this backend contention is there regardless of 1:1:1, it would > require a different re-design to solve that. But it is just a question > whether there are 144 contending threads, or just 6 with the thread per > engine class scheme.) > > Then multiply all by 10 for a 4U server use case and you get 1440 worker > kthreads, yes 10 more CT locks, but contending on how many CPU cores? 
> Just so they can grab a timeslice and maybe contend on a mutex as the > next step. > > This example is where it would hurt on large systems. Imagine only an > even wider media transcode card... > > Second example is only a single engine class used (3d desktop?) but with > a bunch of not-runnable jobs queued and waiting on a fence to signal. > Whether the dependencies are implicit or explicit doesn't matter. Then the fence signals > and callbacks run. N work items get scheduled, but they all submit to > the same HW engine. So we end up with:
>
>         /-- wi1 --\
>        /    ..     \
>  cb --+--- wi.. ---+-- rq1 -- .. -- rqN
>        \    ..     /
>         \-- wiN --/
>
> All that we have achieved is waking up N CPUs to contend on the same > lock and effectively insert the job into the same single HW queue. I > don't see any positives there. > > This example I think can particularly hurt small / low power devices > because of needless waking up of many cores for no benefit. Granted, I > don't have a good feel on how common this pattern is in practice. > > > > > That > > is the number which drives the maximum number of not-runnable jobs > that > > can become runnable at once, and hence spawn that many work items, > and > > in turn unbound worker threads. > > > > Several problems there. > > > > It is fundamentally pointless to have potentially that many more > > threads > > than the number of CPU cores - it simply creates a scheduling storm. > > > > Unbound workers have no CPU / cache locality either and no connection > > with the CPU scheduler to optimize scheduling patterns. This may > matter > > either on large systems or on small ones. Whereas the current design > > allows for the scheduler to notice a userspace CPU thread keeps waking up > the > > same drm scheduler kernel thread, and so it can keep them on the same > > CPU, the unbound workers lose that ability and so a 2nd CPU might be > > getting woken up from low sleep for every submission. 
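[For illustration, the fan-out in the diagram above can be sketched as a userspace analogy in Python. This is not kernel workqueue code and all names are made up; it only demonstrates the claim that N woken work items targeting one HW queue behind one lock make progress strictly one at a time.]

```python
import threading

def fan_out_submit(n_work_items):
    """N work items race to insert one job each into a single HW queue."""
    hw_queue = []
    submit_lock = threading.Lock()  # stand-in for the single backend lock (e.g. GuC CT mutex)
    inside = 0                      # threads currently in the critical section
    max_inside = 0                  # peak concurrency ever observed there
    gauge = threading.Lock()

    def work_item(i):
        nonlocal inside, max_inside
        with submit_lock:           # all N woken workers contend here
            with gauge:
                inside += 1
                max_inside = max(max_inside, inside)
            hw_queue.append(i)      # the actual "submission" to the one HW queue
            with gauge:
                inside -= 1

    threads = [threading.Thread(target=work_item, args=(i,))
               for i in range(n_work_items)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return hw_queue, max_inside
```

Running it with N=16 shows all 16 jobs land in the single queue while at most one worker was ever inside the submission path, i.e. the N wakeups bought no parallelism.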
> > > > Hence, apart from being a bit of an impedance mismatch, the proposal > has > > the potential to change performance and power patterns on both large > > and small machines. > > > > > > Ok, thanks for explaining the issue you're seeing in more detail. Yes, > > deferred kwork does appear to mismatch somewhat with what the scheduler > > needs or at least how it's worked in the past. How much impact will > > that mismatch have? Unclear. > > > > > >>> Secondly, it probably demands separate workers (not > > optional), > > > otherwise > > > >>> behaviour of shared workqueues has either the potential > to > > > explode the number of > > > >>> kernel threads anyway, or add latency. > > > >>> > > > >> > > > >> Right now the system_unbound_wq is used which does have a > > limit > > > on the > > > >> number of threads, right? I do have a FIXME to allow a > > worker to be > > > >> passed in similar to TDR. > > > >> > > > >> WRT latency, the 1:1 ratio could actually have lower > > latency > > > as 2 GPU > > > >> schedulers can be pushing jobs into the backend / > cleaning up > > > jobs in > > > >> parallel. > > > >> > > > > > > > > Thought of one more point here on why in Xe we > > absolutely want > > > a 1 to > > > > 1 ratio between entity and scheduler - the way we implement > > > timeslicing > > > > for preempt fences. > > > > > > > > Let me try to explain. > > > > > > > > Preempt fences are implemented via the generic messaging > > > interface [1] > > > > with suspend / resume messages. If a suspend message is > > received too > > > > soon after calling resume (this is per entity) we simply > > sleep in the > > > > suspend call thus giving the entity a timeslice. This > > completely > > > falls > > > > apart with a many to 1 relationship as now an entity > > waiting for a > > > > timeslice blocks the other entities. Could we work around > > this? > > > Sure, but > > > > it's just another bunch of code we'd have to add in Xe. 
Being able to > > > freely sleep > > > > in the backend without affecting other entities is really, really > > > nice IMO > > > > and I bet Xe isn't the only driver that is going to feel > > this way. > > > > > > > > Last thing I'll say: regardless of how anyone feels about > > Xe using > > > a 1 to > > > > 1 relationship, this patch IMO makes sense, as I hope we can > all > > > agree a > > > > workqueue scales better than kthreads. > > > > > > I don't know for sure what will scale better and for what use > > case, > > > combination of CPU cores vs number of GPU engines to keep > > busy vs other > > > system activity. But I wager someone is bound to ask for some > > > numbers to > > > make sure the proposal is not negatively affecting any other > drivers. > > > > > > > > > Then let them ask. Waving your hands vaguely in the direction of > > the > > > rest of DRM and saying "Uh, someone (not me) might object" is > > profoundly > > > unhelpful. Sure, someone might. That's why it's on dri-devel. > > If you > > > think there's someone in particular who might have a useful > > opinion on > > > this, throw them in the CC so they don't miss the e-mail thread. > > > > > > Or are you asking for numbers? If so, what numbers are you > > asking for? > > > > It was a heads up to the Xe team in case people weren't appreciating > > how > > the proposed change has the potential to influence power and performance > > across the board. And nothing in the follow up discussion made me > think > > it was considered so I don't think it was redundant to raise it. > > > > In my experience it is typical that such core changes come with some > > numbers. Which in the case of the drm scheduler is tricky and probably > > requires explicitly asking everyone to test (rather than count on > > "don't > > miss the email thread"). Real products can fail to ship due to ten mW > here > > or there. Like suddenly an extra core prevented from getting into > deep > > sleep. > > > > If that was "profoundly unhelpful" so be it. 
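[A back-of-the-envelope model of the suspend/resume timeslicing point made above may help. The function and its numbers are hypothetical, purely illustrative: with a dedicated scheduler per entity, sleeping in the suspend handler delays only that entity, while with N entities behind one scheduler the sleeps serialize.]

```python
def last_entity_latency_ms(num_entities, timeslice_ms, dedicated_schedulers):
    """Worst-case delay before the last entity's suspend completes.

    With a 1:1 entity:scheduler mapping, each backend can sleep in its
    suspend call independently, so the timeslice sleeps overlap. With
    N:1, one entity sleeping in the shared backend blocks the entities
    queued behind it, so the sleeps add up.
    """
    if dedicated_schedulers:
        return timeslice_ms                 # sleeps overlap
    return num_entities * timeslice_ms      # sleeps serialize
```

For example, 8 entities each owed a 5 ms timeslice finish suspending after 5 ms in the 1:1 scheme but after 40 ms when they share a single scheduler.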
> > > > > > With your above explanation, it makes more sense what you're asking. > > It's still not something Matt is likely to be able to provide on his > > own. We need to tag some other folks and ask them to test it out. We > > could play around a bit with it on Xe but it's not exactly production > > grade yet and is going to hit this differently from most. Likely > > candidates are probably AMD and Freedreno. > > Whoever is set up to check out power and performance would be good to > give it a spin, yes. > > PS. I don't think I was asking Matt to test with other devices. To start > with I think Xe is a team effort. I was asking for more background on > the design decision since patch 4/20 does not say anything on that > angle, nor was it IMO sufficiently addressed later in the thread. > > > > Also, if we're talking about a design that might paint us into an > > > Intel-HW-specific hole, that would be one thing. But we're not. > > We're > > > talking about switching which kernel threading/task mechanism to > > use for > > > what's really a very generic problem. The core Xe design works > > without > > > this patch (just with more kthreads). If we land this patch or > > > something like it and get it wrong and it causes a performance > > problem > > > for someone down the line, we can revisit it. > > > > For some definition of "it works" - I really wouldn't suggest > > shipping a > > kthread per user context at any point. > > > > > > You have yet to elaborate on why. What resources is it consuming that's > > going to be a problem? Are you anticipating CPU affinity problems? Or > > does it just seem wasteful? > > Well I don't know, the commit message says the approach does not scale. :) > > > I think I largely agree that it's probably unnecessary/wasteful but > > reducing the number of kthreads seems like a tractable problem to solve > > regardless of where we put the gpu_scheduler object. Is this the right > > solution? Maybe not. 
It was also proposed at one point that we could > > split the scheduler into two pieces: A scheduler which owns the kthread, > > and a back-end which targets some HW ring thing where you can have > > multiple back-ends per scheduler. That's certainly more invasive from a > > DRM scheduler internal API PoV but would solve the kthread problem in a > > way that's more similar to what we have now. > > > > > In any case that's a low level question caused by the high > > level design > > > decision. So I'd think first focus on the high level - which > > is the 1:1 > > > mapping of entity to scheduler instance proposal. > > > > > > Fundamentally it will be up to the DRM maintainers and the > > community to > > > bless your approach. And it is important to stress 1:1 is > about > > > userspace contexts, so I believe unlike any other current > > scheduler > > > user. And also important to stress this effectively does not > > make Xe > > > _really_ use the scheduler that much. > > > > > > > > > I don't think this makes Xe nearly as much of a one-off as you > > think it > > > does. I've already told the Asahi team working on Apple M1/2 > > hardware > > > to do it this way and it seems to be a pretty good mapping for > > them. I > > > believe this is roughly the plan for nouveau as well. It's not > > the way > > > it currently works for anyone because most other groups aren't > > doing FW > > > scheduling yet. In the world of FW scheduling and hardware > > designed to > > > support userspace direct-to-FW submit, I think the design makes > > perfect > > > sense (see below) and I expect we'll see more drivers move in this > > > direction as those drivers evolve. (AMD is doing some customish > > thing > > > for how with gpu_scheduler on the front-end somehow. I've not dug > > into > > > those details.) 
> > > > > > I can only offer my opinion, which is that the two options > > mentioned in > > > this thread (either improve drm scheduler to cope with what is > > > required, > > > or split up the code so you can use just the parts of > > drm_sched which > > > you want - which is frontend dependency tracking) shouldn't > be so > > > readily dismissed, given how I think the idea was for the new > > driver to > > > work less in a silo and more in the community (not do kludges > to > > > work around stuff because it is thought to be too hard to > > improve common > > > code), but fundamentally, "goto previous paragraph" as far as > I am > > > concerned. > > > > > > > > > Meta comment: It appears as if you're falling into the standard > > i915 > > > team trap of having an internal discussion about what the > community > > > discussion might look like instead of actually having the > community > > > discussion. If you are seriously concerned about interactions > with > > > other drivers or about setting common direction, the right > > way to > > > do that is to break a patch or two out into a separate RFC series > > and > > > tag a handful of driver maintainers. Trying to predict the > > questions > > > other people might ask is pointless. Cc them and ask for their > > input > > > instead. > > > > I don't follow you here. It's not an internal discussion - I am > raising > > my concerns on the design publicly. I am supposed to write a patch to > > show something, but am I not allowed to comment on a RFC series? > > > > > > I may have misread your tone a bit. It felt a bit like too many > > discussions I've had in the past where people are trying to predict what > > others will say instead of just asking them. Reading it again, I was > > probably jumping to conclusions a bit. Sorry about that. > > Okay no problem, thanks. 
In any case we don't have to keep discussing > it, since, as I wrote one or two emails ago, it is fundamentally on the > maintainers and community to ack the approach. I only felt like the RFC did > not explain the potential downsides sufficiently so I wanted to probe > that area a bit. > > > It is "drm/sched: Convert drm scheduler to use a work queue rather > than > > kthread" which should have Cc-ed _everyone_ who uses the drm scheduler. > > > > > > Yeah, it probably should have. I think that's mostly what I've been > > trying to say. > > > > > > > > Regards, > > > > > > Tvrtko > > > > > > P.S. And as a related side note, there are more areas where > > drm_sched > > > could be improved, like for instance priority handling. > > > Take a look at msm_submitqueue_create / > > msm_gpu_convert_priority / > > > get_sched_entity to see how msm works around the drm_sched > > hardcoded > > > limit of available priority levels, in order to avoid having > > to leave a > > > hw capability unused. I suspect msm would be happier if they > > could have > > > all priority levels equal in terms of whether they apply only > > at the > > > frontend level or completely throughout the pipeline. > > > > > > > [1] https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1 > > > > > >>> What would be interesting to learn is whether the option > of > > > refactoring > > > >>> drm_sched to deal with out of order completion was > > considered > > > and what were > > > >>> the conclusions. > > > >>> > > > >> > > > >> I coded this up a while back when trying to convert the > > i915 to > > > the DRM > > > >> scheduler; it isn't all that hard either. The free flow > > control > > > on the > > > >> ring (e.g. 
set job limit == SIZE OF RING / MAX JOB SIZE) > is > > > really what > > > >> sold me on this design. > > > > > > > > > You're not the only one to suggest supporting out-of-order > > completion. > > > However, it's tricky and breaks a lot of internal assumptions of > the > > > scheduler. It also reduces functionality a bit because it can no > > longer > > > automatically rate-limit HW/FW queues which are often > > fixed-size. (Ok, > > > yes, it probably could but it becomes a substantially harder > > problem.) > > > > > > It also seems like a worse mapping to me. The goal here is to > turn > > > submissions on a userspace-facing engine/queue into submissions > > to a FW > > > queue, sorting out any dma_fence dependencies. Matt's > > > description of saying this is a 1:1 mapping between sched/entity > > doesn't > > > tell the whole story. It's a 1:1:1 mapping between xe_engine, > > > gpu_scheduler, and GuC FW engine. Why make it a 1:something:1 > > mapping? > > > Why is that better? > > > > As I have stated before, what I think would fit well for Xe is > one > > drm_scheduler per engine class. In specific terms on our current > > hardware, one drm scheduler instance for render, compute, blitter, > > video > > and video enhance. Userspace contexts remain scheduler entities. > > > > > > And this is where we fairly strongly disagree. More in a bit. > > > > That way you avoid the whole kthread/kworker story and you have it > > actually use the entity picking code in the scheduler, which may be > > useful when the backend is congested. > > > > > > What back-end congestion are you referring to here? Running out of FW > > queue IDs? Something else? > > CT channel, number of context ids. > > > > > Yes you have to solve the out of order problem so in my mind that is > > something to discuss. What the problem actually is (just TDR?), how > > tricky and why etc. 
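[The ring-based flow control mentioned above (job limit == SIZE OF RING / MAX JOB SIZE) amounts to a simple credit scheme, sketched here in Python. The sizes and class name are invented for illustration, not Xe's actual values.]

```python
RING_SIZE = 16 * 1024                  # hypothetical ring size in bytes
MAX_JOB_SIZE = 1024                    # hypothetical worst-case ring space per job
JOB_LIMIT = RING_SIZE // MAX_JOB_SIZE  # at most 16 jobs in flight

class RingFlowControl:
    """Caps in-flight jobs so the ring buffer can never overflow."""

    def __init__(self, job_limit=JOB_LIMIT):
        self.job_limit = job_limit
        self.inflight = 0

    def can_submit(self):
        # A free credit guarantees worst-case ring space for one more job.
        return self.inflight < self.job_limit

    def submit(self):
        if not self.can_submit():
            raise RuntimeError("ring full: scheduler must hold the job back")
        self.inflight += 1

    def complete(self):
        # Job retired from the ring; its credit becomes available again.
        self.inflight -= 1
```

The scheduler front-end simply stops dequeuing when `can_submit()` is false and resumes on the next completion; this is the rate limiting that out-of-order completion would make substantially harder to provide automatically.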
> > > > And yes you lose the handy LRCA ring buffer size management so you'd > > have to make those entities not runnable in some other way. > > > > Regarding the argument you raise below - would any of that make the > > frontend / backend separation worse and why? Do you think it is less > > natural? If neither is true then all remains is that it appears extra > > work to support out of order completion of entities has been > discounted > > in favour of an easy but IMO inelegant option. > > > > > > Broadly speaking, the kernel needs to stop thinking about GPU scheduling > > in terms of scheduling jobs and start thinking in terms of scheduling > > contexts/engines. There is still some need for scheduling individual > > jobs but that is only for the purpose of delaying them as needed to > > resolve dma_fence dependencies. Once dependencies are resolved, they > > get shoved onto the context/engine queue and from there the kernel only > > really manages whole contexts/engines. This is a major architectural > > shift, entirely different from the way i915 scheduling works. It's also > > different from the historical usage of DRM scheduler which I think is > > why this all looks a bit funny. > > > > To justify this architectural shift, let's look at where we're headed. > > In the glorious future... > > > > 1. Userspace submits directly to firmware queues. The kernel has no > > visibility whatsoever into individual jobs. At most it can pause/resume > > FW contexts as needed to handle eviction and memory management. > > > > 2. Because of 1, apart from handing out the FW queue IDs at the > > beginning, the kernel can't really juggle them that much. Depending on > > FW design, it may be able to pause a client, give its IDs to another, > > and then resume it later when IDs free up. What it's not doing is > > juggling IDs on a job-by-job basis like i915 currently is. > > > > 3. Long-running compute jobs may not complete for days. 
This means > > that memory management needs to happen in terms of pause/resume of > > entire contexts/engines using the memory rather than based on waiting > > for individual jobs to complete or pausing individual jobs until the > > memory is available. > > > > 4. Synchronization happens via userspace memory fences (UMF) and the > > kernel is mostly unaware of most dependencies and when a context/engine > > is or is not runnable. Instead, it keeps as many of them minimally > > active (memory is available, even if it's in system RAM) as possible and > > lets the FW sort out dependencies. (There may need to be some facility > > for sleeping a context until a memory change similar to futex() or > > poll() for userspace threads. There are some details TBD.) > > > > Are there potential problems that will need to be solved here? Yes. Is > > it a good design? Well, Microsoft has been living in this future for > > half a decade or better and it's working quite well for them. It's also > > the way all modern game consoles work. It really is just Linux that's > > stuck with the same old job model we've had since the monumental shift > > to DRI2. > > > > To that end, one of the core goals of the Xe project was to make the > > driver internally behave as close to the above model as possible while > > keeping the old-school job model as a very thin layer on top. As the > > broader ecosystem problems (window-system support for UMF, for instance) > > are solved, that layer can be peeled back. The core driver will already > > be ready for it. > > > > To that end, the point of the DRM scheduler in Xe isn't to schedule > > jobs. It's to resolve syncobj and dma-buf implicit sync dependencies > > and stuff jobs into their respective context/engine queue once they're > > ready. All the actual scheduling happens in firmware and any scheduling > > the kernel does to deal with contention, oversubscriptions, too many > > contexts, etc. is between contexts/engines, not individual jobs. 
Sure, > > the individual job visibility is nice, but if we design around it, we'll > > never get to the glorious future. > > > > I really need to turn the above (with a bit more detail) into a blog > > post.... Maybe I'll do that this week. > > > > In any case, I hope that provides more insight into why Xe is designed > > the way it is and why I'm pushing back so hard on trying to make it more > > of a "classic" driver as far as scheduling is concerned. Are there > > potential problems here? Yes, that's why Xe has been labeled a > > prototype. Are such radical changes necessary to get to said glorious > > future? Yes, I think they are. Will it be worth it? I believe so. > > Right, that's all solid I think. My takeaway is that frontend priority > sorting and that stuff isn't needed and that is okay. And that there are > multiple options to maybe improve the drm scheduler, like the aforementioned > making it deal with out of order completion, or splitting it into functional > components, or splitting frontend/backend as you suggested. For most of them > the cost vs benefit is not completely clear, nor how much effort > was invested in looking into them. > > One thing I missed from this explanation is how drm_scheduler per engine > class interferes with the high level concepts. And I did not manage to > pick up on what exactly is the TDR problem in that case. Maybe the two > are one and the same. > > Bottom line is I still have the concern that conversion to kworkers has > an opportunity to regress. Possibly more opportunity for some Xe use > cases than to affect other vendors, since they would still be using per > physical engine / queue scheduler instances. > > And to put my money where my mouth is I will try to put testing Xe > inside the full blown ChromeOS environment into my team's plans. It would > probably also be beneficial if the Xe team could take a look at real world > behaviour of the extreme transcode use cases too. If the stack is ready > for that and all. 
It would be better to know earlier rather than later > if there is a fundamental issue. > > For the patch at hand, and the cover letter, it certainly feels they would > benefit from recording the past design discussions had with AMD folks, from > explicitly copying other drivers, and from recording the theoretical pros and > cons of threads vs unbound workers as I have tried to highlight them. > > Regards, > > Tvrtko >