On Thu, Jun 15, 2017 at 4:15 AM, Chris Wilson <ch...@chris-wilson.co.uk> wrote:
> Quoting Kenneth Graunke (2017-06-14 21:44:45)
> > On Tuesday, June 13, 2017 2:53:20 PM PDT Jason Ekstrand wrote:
> > > As I've been working on converting more things in the GL driver over to
> > > blorp, I've been highly annoyed by all of the hangs on Haswell. About one
> > > in 3-5 Jenkins runs would hang somewhere. After looking at about a
> > > half-dozen error states, I noticed that all of the hangs seemed to be on
> > > fast-clear operations (clear or resolve) that happen at the start of a
> > > batch, right after STATE_BASE_ADDRESS.
> > >
> > > Haswell seems to be a bit more picky than other hardware about having
> > > fast-clear operations in flight at the same time as regular rendering and
> > > hangs if the two ever overlap. (Other hardware can get rendering
> > > corruption but not usually hangs.) Also, Haswell doesn't fully stall if
> > > you just do a RT flush and a CS stall. The hardware docs refer to
> > > something they call an "end of pipe sync" which is a CS stall with a write
> > > to the workaround BO. On Haswell, you also need to read from that same
> > > address to create a memory dependency and make sure the system is fully
> > > stalled.
> > >
> > > When you call brw_blorp_resolve_color it calls brw_emit_pipe_control_flush
> > > and does the correct flushes and then calls into core blorp to do the
> > > actual resolve operation. If the batch doesn't have enough space left in
> > > it for the fast-clear operation, the batch will get split and the
> > > fast-clear will happen in the next batch. I believe what is happening is
> > > that while we're building the second batch that actually contains the
> > > fast-clear, some other process completes a batch and inserts it between our
> > > PIPE_CONTROL to do the stall and the actual fast-clear. We then end up
> > > with more stuff in flight than we can handle and the GPU explodes.
> > >
> > > I'm not 100% convinced of this explanation because it seems a bit fishy
> > > that a context switch wouldn't be enough to fully flush out the GPU.
> > > However, what I do know is that, without these patches I get a hang in one
> > > out of three to five Jenkins runs on my wip/i965-blorp-ds branch. With the
> > > patches (or an older variant that did the same thing), I have done almost 20
> > > Jenkins runs and have yet to see a hang. I'd call that success.
> > >
> > > Jason Ekstrand (6):
> > >   i965: Flush around state base address
> > >   i965: Take a uint64_t immediate in emit_pipe_control_write
> > >   i965: Unify the two emit_pipe_control functions
> > >   i965: Do an end-of-pipe sync prior to STATE_BASE_ADDRESS
> > >   i965/blorp: Do an end-of-pipe sync around CCS ops
> > >   i965: Do an end-of-pipe sync after flushes
> > >
> > > Topi Pohjolainen (1):
> > >   i965: Add an end-of-pipe sync helper
> > >
> > >  src/mesa/drivers/dri/i965/brw_blorp.c        |  16 +-
> > >  src/mesa/drivers/dri/i965/brw_context.h      |   3 +-
> > >  src/mesa/drivers/dri/i965/brw_misc_state.c   |  38 +++++
> > >  src/mesa/drivers/dri/i965/brw_pipe_control.c | 243 ++++++++++++++++++---------
> > >  src/mesa/drivers/dri/i965/brw_queryobj.c     |   5 +-
> > >  src/mesa/drivers/dri/i965/gen6_queryobj.c    |   2 +-
> > >  src/mesa/drivers/dri/i965/genX_blorp_exec.c  |   2 +-
> > >  7 files changed, 211 insertions(+), 98 deletions(-)
> > >
> >
> > The series is:
> > Reviewed-by: Kenneth Graunke <kenn...@whitecape.org>
> >
> > If Chris is right, and what we're really seeing is that MI_SET_CONTEXT
> > needs additional flushing, it probably makes sense to fix the kernel.
> > If it's really fast clear related, then we should do it in Mesa.
>
> If I'm right, it's more of a userspace problem because you have to
> insert a pipeline stall before STATE_BASE_ADDRESS when switching between
> blorp/normal and back again, in the same batch.
> That the MI_SET_CONTEXT may be restoring the dirty GPU state from the
> previous batch just means that you have to think of batches as being
> one long continuous batch.
> -Chris

Given that, I doubt your explanation is correct. Right now, we should be
correct under the "long continuous batch" assumption and we're hanging. So
I think that either MI_SET_CONTEXT doesn't stall hard enough or we're
conflicting with another process somehow.
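
To make the batch-split theory from the cover letter concrete, here is a toy model in C. It is only a sketch under stated assumptions: every name in it (emit, require_space, fast_clear_reserved, and so on) is hypothetical and none of it is i965 code, batches are counted in whole commands rather than dwords, and it is not a claim about how the patches are implemented. It shows only that if the sync and the fast-clear reserve space separately, a batch split can land between them, whereas reserving room for both up front keeps them in the same batch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every name below is hypothetical, not an i965 function. */
enum cmd { CMD_DRAW, CMD_END_OF_PIPE_SYNC, CMD_FAST_CLEAR };

#define BATCH_SLOTS 4  /* assumed batch capacity, counted in commands */
#define MAX_BATCHES 8

struct batch  { enum cmd cmds[BATCH_SLOTS]; int used; };
struct stream { struct batch batches[MAX_BATCHES]; int count; };

static void stream_init(struct stream *s)
{
    s->count = 1;
    s->batches[0].used = 0;
}

/* Start a new batch if the current one cannot hold n more commands. */
static void require_space(struct stream *s, int n)
{
    assert(n <= BATCH_SLOTS);
    if (s->batches[s->count - 1].used + n > BATCH_SLOTS) {
        assert(s->count < MAX_BATCHES);
        s->batches[s->count++].used = 0;
    }
}

static void emit(struct stream *s, enum cmd c)
{
    require_space(s, 1);
    struct batch *b = &s->batches[s->count - 1];
    b->cmds[b->used++] = c;
}

/* Sync and clear reserve space separately, so a split can land between them. */
static void fast_clear_unreserved(struct stream *s)
{
    emit(s, CMD_END_OF_PIPE_SYNC);
    emit(s, CMD_FAST_CLEAR);
}

/* Reserve room for both up front so they always share a batch. */
static void fast_clear_reserved(struct stream *s)
{
    require_space(s, 2);
    emit(s, CMD_END_OF_PIPE_SYNC);
    emit(s, CMD_FAST_CLEAR);
}

/* True if every fast clear directly follows a sync within its own batch. */
static bool clears_are_guarded(const struct stream *s)
{
    for (int i = 0; i < s->count; i++) {
        const struct batch *b = &s->batches[i];
        for (int j = 0; j < b->used; j++) {
            if (b->cmds[j] == CMD_FAST_CLEAR &&
                (j == 0 || b->cmds[j - 1] != CMD_END_OF_PIPE_SYNC))
                return false;
        }
    }
    return true;
}

/* Emit three draws, then one fast clear, and report whether it is guarded. */
bool demo(bool reserve)
{
    struct stream s;
    stream_init(&s);
    for (int i = 0; i < 3; i++)
        emit(&s, CMD_DRAW);
    if (reserve)
        fast_clear_reserved(&s);
    else
        fast_clear_unreserved(&s);
    return clears_are_guarded(&s);
}
```

With three draws already in a four-slot batch, demo(false) leaves the sync at the tail of the first batch and the fast-clear alone at the head of the second, which is the window where another process's batch could be scheduled in between. demo(true) moves both into the second batch together.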
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev