Quoting Jason Ekstrand (2017-06-13 22:53:20)
> As I've been working on converting more things in the GL driver over to
> blorp, I've been highly annoyed by all of the hangs on Haswell. About one
> in 3-5 Jenkins runs would hang somewhere. After looking at about a
> half-dozen error states, I noticed that all of the hangs seemed to be on
> fast-clear operations (clear or resolve) that happen at the start of a
> batch, right after STATE_BASE_ADDRESS.
>
> Haswell seems to be a bit more picky than other hardware about having
> fast-clear operations in flight at the same time as regular rendering and
> hangs if the two ever overlap. (Other hardware can get rendering
> corruption but not usually hangs.) Also, Haswell doesn't fully stall if
> you just do a RT flush and a CS stall. The hardware docs refer to
> something they call an "end of pipe sync" which is a CS stall with a write
> to the workaround BO. On Haswell, you also need to read from that same
> address to create a memory dependency and make sure the system is fully
> stalled.
>
> When you call brw_blorp_resolve_color it calls brw_emit_pipe_control_flush
> and does the correct flushes and then calls into core blorp to do the
> actual resolve operation. If the batch doesn't have enough space left in
> it for the fast-clear operation, the batch will get split and the
> fast-clear will happen in the next batch. I believe what is happening is
> that while we're building the second batch that actually contains the
> fast-clear, some other process completes a batch and inserts it between our
> PIPE_CONTROL to do the stall and the actual fast-clear. We then end up
> with more stuff in flight than we can handle and the GPU explodes.
>
> I'm not 100% convinced of this explanation because it seems a bit fishy
> that a context switch wouldn't be enough to fully flush out the GPU.
> However, what I do know is that, without these patches I get a hang in one
> out of three to five Jenkins runs on my wip/i965-blorp-ds branch. With the
> patches (or an older variant that did the same thing), I have done almost 20
> Jenkins runs and have yet to see a hang. I'd call that success.
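
For reference, the "end of pipe sync" described above can be sketched as a toy model. This is not the i965 code: toy_batch, toy_emit, toy_end_of_pipe_sync and the two command kinds are hypothetical names made up for illustration, and the commands are only recorded into an array, not encoded for hardware. It shows just the ordering from the description: a CS stall with a write to the workaround BO, followed on Haswell by a read from the same address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command kinds for illustration; NOT real hardware opcodes. */
enum toy_cmd_kind {
    TOY_PIPE_CONTROL_CS_STALL_WRITE, /* CS stall plus a write to memory */
    TOY_READ_MEM,                    /* read back from memory */
};

struct toy_cmd {
    enum toy_cmd_kind kind;
    uint64_t addr; /* address inside the (assumed) workaround BO */
};

#define TOY_BATCH_MAX 16

/* A toy batch: just an ordered list of recorded commands. */
struct toy_batch {
    struct toy_cmd cmds[TOY_BATCH_MAX];
    size_t count;
};

/* Append one command to the toy batch. */
static void toy_emit(struct toy_batch *b, enum toy_cmd_kind kind,
                     uint64_t addr)
{
    assert(b->count < TOY_BATCH_MAX);
    b->cmds[b->count].kind = kind;
    b->cmds[b->count].addr = addr;
    b->count++;
}

/*
 * End of pipe sync as described in the quoted text: a CS stall with a
 * write to the workaround BO. On Haswell, also read from that same
 * address so that a memory dependency forces a full stall.
 */
static void toy_end_of_pipe_sync(struct toy_batch *b,
                                 uint64_t workaround_addr,
                                 bool is_haswell)
{
    toy_emit(b, TOY_PIPE_CONTROL_CS_STALL_WRITE, workaround_addr);
    if (is_haswell)
        toy_emit(b, TOY_READ_MEM, workaround_addr);
}
```

The point of the model is only that the read targets the same address as the write; if the two were emitted in different batches, another context's work could land in between, which is the scenario described above.
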

Note that a context switch is itself just a batch that restores the
registers and GPU state.

The kernel does

	PIPE_CONTROLs for invalidate-caches
	MI_SET_CONTEXT
	MI_BB_START
	PIPE_CONTROLs for flush-caches
	MI_STORE_DWORD (seqno)
	MI_USER_INTERRUPT

What I believe you are seeing is that MI_SET_CONTEXT is leaving the GPU
in an active state requiring a pipeline barrier before adjusting. It will
be the equivalent of switching between GL and blorp in the middle of a
batch.

The question I have is whether we apply the fix in the kernel, i.e. do a
full end of pipe sync after every MI_SET_CONTEXT. Userspace has the
advantage of knowing if/when such a hammer is required, but equally we
have to learn where by trial-and-error and if a second context user ever
manifests, they will have to be taught the same lessons.
-Chris