From: Oscar Mateo <oscar.ma...@intel.com>

Hi all,

This patch series implements execlists for GEN8+. Before continuing, it is 
important to mention that I might have taken it upon myself to assemble the 
series and rewrite it for upstreaming, but many people have worked on this 
series before me. Namely:

Ben Widawsky (benjamin.widaw...@intel.com).
Jesse Barnes (jbar...@virtuousgeek.org).
Michel Thierry (michel.thie...@intel.com).
Thomas Daniel (thomas.dan...@intel.com).
Rafael Barbalho (rafael.barba...@intel.com).

All good ideas in the series belong to these authors, and so I have tried to 
maintain authorship in the patches accordingly (to the extent possible, since 
the patches have suffered a lot of squashing & splitting). These authors do 
not, however, bear any of the blame for errors: I am solely responsible for 
them. 

Now, let's get back to the subject at hand:

With GEN8 comes an expansion of the HW contexts: "Logical Ring Contexts". One 
of the main differences with the legacy HW contexts is that logical ring 
contexts incorporate many more things into the context's state, like PDPs or 
ringbuffer control registers. These logical ring contexts enable a number of 
new abilities, especially "Execlists". Execlists are the new method by which, 
on GEN8+ hardware, workloads are submitted for execution (as opposed to the 
legacy, ringbuffer-based method). With this new method, commands in the context's 
ringbuffer are executed when the GPU moves to this context from a previous one 
(a.k.a. context switch).

On a context switch, the GPU has to remember the current state of the context 
being switched out including the head and tail pointers of the ring buffer, so 
it:

- Flushes the pipe.
- Saves ringbuffer head pointer.
- Saves engine state.

Similarly, on a context restore (when a previously switched out context is 
resubmitted), the GPU restores the saved context and resumes execution where it 
stopped:

- Restores PDPs and sets up PPGTT.
- Restores ringbuffer.
- Restores engine state.

Contexts are submitted for execution through the GPU's ExecLists Submit Port 
(ELSP, for short). This port supports the submission of two contexts at a 
time, which are executed serially (Context-0 first, Context-1 next) upon every 
context completion. The GPU keeps the software informed about the status of 
this list via context switch interrupts and context status buffers, to help 
software keep track of the progress. The existence of a second context ensures 
that some useful work is done in HW while the Context-0 switch status is being 
processed by SW. After Context-1 completion, HW goes IDLE if there are no 
further contexts scheduled in the ELSP.

Submitting a new Execution List to the ELSP while one of its contexts is 
already running results in a Lite Restore (sampling of the new tail pointer).
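
To make the two-port behaviour and the Lite Restore case concrete, here is a 
minimal software model. It is only a sketch: struct elsp_model, struct 
ctx_desc and elsp_model_submit() are names invented for illustration, and 
nothing here reflects the real register interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one engine's ELSP; not the HW interface. */
struct ctx_desc {
	uint32_t id;   /* submission ID of the context */
	uint32_t tail; /* ringbuffer tail to execute up to */
};

enum elsp_action {
	ELSP_CONTEXT_SWITCH, /* a different context gets loaded */
	ELSP_LITE_RESTORE,   /* same context keeps running, new tail sampled */
};

struct elsp_model {
	bool running;            /* is a context currently executing? */
	struct ctx_desc port[2]; /* port[0] runs first, then port[1] */
	bool port1_valid;        /* false when the second slot is empty */
};

/* Submit an execution list; second may be NULL for a one-context list. */
enum elsp_action elsp_model_submit(struct elsp_model *e,
				   const struct ctx_desc *first,
				   const struct ctx_desc *second)
{
	bool lite;

	assert(e && first);
	lite = e->running && e->port[0].id == first->id;

	e->port[0] = *first; /* on a lite restore only the tail changes */
	e->port1_valid = (second != NULL);
	if (second)
		e->port[1] = *second;
	e->running = true;

	return lite ? ELSP_LITE_RESTORE : ELSP_CONTEXT_SWITCH;
}
```

The model only checks the first slot against the running context, which is 
the case the driver relies on when it resubmits a context that is still 
executing.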

Regarding the creation of logical ring contexts, we had before (since PPGTT was 
introduced):

- One global default context.
- One private default context for each opened fd.
- One extra private context for each context create ioctl call.

The global default context existed for future shrinker usage as well as reset 
handling. At the same time, every file got its own context, plus any number of 
extra contexts if the context create ioctl call was used by the userspace 
driver. These private contexts were the ones used by the driver for execbuffer 
calls.

Now that ringbuffers belong to a context (and not to an engine, like before) 
and that contexts are uniquely tied to a given engine (and not reusable, like 
before) we need:

- One global default context per engine.
- Up to one private default context per engine for each opened fd.
- Up to one extra private context per engine for each context create ioctl 
call.

Given that at creation time of a non-global context we don't know which engine 
is going to use it, we have implemented a deferred creation of logical ring 
contexts: the private default context starts its life as a hollow or blank 
holder, that gets populated once we receive an execbuffer ioctl (for a 
particular engine) on that fd. If later on we receive another execbuffer ioctl 
for a different engine, we create a second private default context and so on. 
The same rules apply to the create context ioctl call.
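
The deferred creation can be pictured with the following sketch. It is an 
illustration only: NUM_ENGINES, struct ctx_holder, struct lr_context and the 
two helpers are hypothetical names, and the real context carries a 
ringbuffer and a state object that are left out here.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_ENGINES 5 /* assumption: engine count picked for illustration */

struct lr_context {
	int engine; /* the one engine this context is tied to */
};

/* The "hollow" holder created at open() or context create time. */
struct ctx_holder {
	struct lr_context *per_engine[NUM_ENGINES]; /* NULL until first use */
};

/* Return the holder's context for this engine, creating it on first use. */
struct lr_context *ctx_holder_get(struct ctx_holder *h, int engine)
{
	assert(h && engine >= 0 && engine < NUM_ENGINES);

	if (!h->per_engine[engine]) {
		struct lr_context *c = calloc(1, sizeof(*c));

		if (!c)
			return NULL; /* allocation failed, holder unchanged */
		c->engine = engine;
		h->per_engine[engine] = c;
	}
	return h->per_engine[engine];
}

/* Release every context that was actually populated. */
void ctx_holder_fini(struct ctx_holder *h)
{
	assert(h);
	for (int i = 0; i < NUM_ENGINES; i++) {
		free(h->per_engine[i]);
		h->per_engine[i] = NULL;
	}
}
```

An execbuffer path would call ctx_holder_get() with the target engine, so an 
fd that only ever uses one engine pays for a single context.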

Execlists have been implemented as follows:

When a request is committed, its commands (the BB start and any leading or 
trailing commands, like the seqno breadcrumbs) are placed in the ringbuffer for 
the appropriate context. The tail pointer in the hardware context is not 
updated at this time, but instead, kept by the driver in the ringbuffer 
structure. A structure representing this execution request is added to a 
request queue for the appropriate engine: this structure contains a copy of the 
context's tail after the request was written to the ringbuffer and a pointer to 
the context itself.
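
As a rough sketch of that bookkeeping (all struct and function names below 
are made up for illustration, and the wrap-around in ring_emit() is an 
assumption about how the ringbuffer behaves):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ringbuf {
	uint32_t tail; /* software tail, advanced as commands are written */
	uint32_t size; /* ring size in bytes */
};

struct lr_ctx {
	struct ringbuf ring;
	uint32_t hw_tail; /* tail in the HW context, untouched at commit time */
};

struct exec_req {
	struct lr_ctx *ctx;    /* context this request belongs to */
	uint32_t tail;         /* copy of the context's tail after the write */
	struct exec_req *next;
};

struct engine_queue {
	struct exec_req *head, *last;
};

/* Account for commands written to the ring: only the software tail moves. */
void ring_emit(struct ringbuf *rb, uint32_t bytes)
{
	assert(rb && rb->size > 0);
	rb->tail = (rb->tail + bytes) % rb->size;
}

/* Snapshot the context's tail and append the request to the engine queue. */
void queue_request(struct engine_queue *q, struct exec_req *req,
		   struct lr_ctx *ctx)
{
	assert(q && req && ctx);
	req->ctx = ctx;
	req->tail = ctx->ring.tail;
	req->next = NULL;
	if (q->last)
		q->last->next = req;
	else
		q->head = req;
	q->last = req;
}
```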

If the engine's request queue was empty before the request was added, the queue 
is processed immediately. Otherwise the queue will be processed during a 
context switch interrupt. In any case, elements on the queue will get sent (in 
pairs) to the ELSP with a globally unique 20-bit submission ID (constructed 
with the fd's ID, plus our own context ID, plus the engine's ID).
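
One possible way to pack such an ID is sketched below. Only the 20-bit total 
and the three ingredients come from the description above; the field widths 
(10 + 7 + 3), their order and the name make_submission_id() are assumptions 
made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field widths; only the 20-bit total is given in the text. */
#define ENGINE_BITS 3u
#define CTX_BITS    7u
#define FILE_BITS   10u

_Static_assert(FILE_BITS + CTX_BITS + ENGINE_BITS == 20,
	       "submission ID must be 20 bits wide");

/* Pack fd ID, per-fd context ID and engine ID into one 20-bit value. */
uint32_t make_submission_id(uint32_t file_id, uint32_t ctx_id,
			    uint32_t engine_id)
{
	/* Each field must fit its slot, or two IDs could collide. */
	assert(file_id < (1u << FILE_BITS));
	assert(ctx_id < (1u << CTX_BITS));
	assert(engine_id < (1u << ENGINE_BITS));

	return (file_id << (CTX_BITS + ENGINE_BITS)) |
	       (ctx_id << ENGINE_BITS) |
	       engine_id;
}
```

Because every field has its own bit range, two submissions get the same ID 
only if they agree on fd, context and engine.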

When execution of a request completes, the GPU updates the context status 
buffer with a context complete event and generates a context switch interrupt. 
During context switch interrupt handling, the driver examines the context 
status events in the context status buffer: for each context complete event, if 
the announced ID matches that on the head of the request queue, then that 
request is retired and removed from the queue.

After processing, if any requests were retired and the queue is not empty then 
a new execution list can be submitted. The two requests at the front of the 
queue are next to be submitted but since a context may not occur twice in an 
execution list, if subsequent requests have the same ID as the first then the 
two requests must be combined. This is done simply by discarding requests at 
the head of the queue until either only one request is left (in which case we 
use a NULL second context) or the first two requests have unique IDs.
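
The combining step can be sketched as follows, assuming a singly linked 
queue; struct exec_request and pick_execlist() are illustrative names, and 
freeing of the discarded requests is left to the caller. Discarding is safe 
because a later request for the same context carries a tail that already 
covers the earlier request's commands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct exec_request {
	uint32_t id;   /* submission ID of the request's context */
	uint32_t tail; /* ringbuffer tail after this request was written */
	struct exec_request *next;
};

/*
 * Drop head requests that are followed by a request with the same ID, then
 * report the pair to submit. Returns the (possibly new) head, or NULL for
 * an empty queue; *second is NULL when only one request is left.
 */
struct exec_request *pick_execlist(struct exec_request **head,
				   struct exec_request **second)
{
	assert(head && second);

	while (*head && (*head)->next && (*head)->next->id == (*head)->id)
		*head = (*head)->next; /* the later tail supersedes this one */

	*second = *head ? (*head)->next : NULL;
	return *head;
}
```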

By always executing the first two requests in the queue the driver ensures that 
the GPU is kept as busy as possible. In the case where a single context 
completes but a second context is still executing, the request for the second 
context will be at the head of the queue when we remove the first one. This 
request will then be resubmitted along with a new request for a different 
context, which will cause the hardware to continue executing the second request 
and queue the new request (the GPU detects the condition of a context getting 
preempted with the same context and optimizes the context switch flow by not 
doing preemption, but just sampling the new tail pointer).

Because the GPU continues to execute while the context switch interrupt is 
being handled, there is a race condition in which a second context completes 
while the completion of the previous one is still being handled. This results 
in the second context 
being resubmitted (potentially along with a third), and an extra context 
complete event for that context will occur. The request will be removed from 
the queue at the first context complete event, and the second context complete 
event will not result in removal of a request from the queue because the IDs of 
the request and the event will not match.
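
The ID check that makes the extra event harmless might look like the sketch 
below. struct pending and handle_context_complete() are hypothetical names, 
and the real handler also has to read the events out of the context status 
buffer, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pending {
	uint32_t id; /* submission ID of the queued request */
	struct pending *next;
};

/* Handle one context complete event; true if a request was retired. */
bool handle_context_complete(struct pending **head, uint32_t event_id)
{
	assert(head);

	if (!*head || (*head)->id != event_id)
		return false; /* duplicate or stale event: nothing to retire */

	*head = (*head)->next; /* retire the request at the head */
	return true;
}
```

With a queue holding IDs 1, 2 and 3, a second context complete event for ID 
2 finds ID 3 at the head and leaves the queue untouched.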

Cheers,
Oscar

Ben Widawsky (15):
  drm/i915/bdw: Macro to distinguish LRCs (Logical Ring Contexts)
  drm/i915: s/for_each_ring/for_each_active_ring
  drm/i915: for_each_ring
  drm/i915: Extract trivial parts of ring init (early init)
  drm/i915/bdw: Rework init code for gen8 contexts
  drm/i915: Extract ringbuffer obj alloc & destroy
  drm/i915/bdw: LR context ring init
  drm/i915/bdw: GEN8 semaphoreless ring add request
  drm/i915/bdw: GEN8 new ring flush
  drm/i915/bdw: A bit more advanced context init/fini
  drm/i915/bdw: Allocate ringbuffer for LR contexts
  drm/i915/bdw: Populate LR contexts (somewhat)
  drm/i915/bdw: Status page for LR contexts
  drm/i915/bdw: Enable execlists in the hardware
  drm/i915/bdw: Implement context switching (somewhat)

Michel Thierry (1):
  drm/i915/bdw: Get prepared for a two-stage execlist submit process

Oscar Mateo (30):
  drm/i915: Simplify a couple of functions thanks to for_each_ring
  drm/i915/bdw: New file for logical ring contexts and execlists
  drm/i915: Make i915_gem_create_context outside accessible
  drm/i915: s/intel_ring_buffer/intel_engine
  drm/i915: Split the ringbuffers and the rings
  drm/i915: Rename functions that mention ringbuffers (meaning rings)
  drm/i915/bdw: Execlists ring tail writing
  drm/i915/bdw: Plumbing for user LR context switching
  drm/i915: s/__intel_ring_advance/intel_ringbuffer_advance_and_submit
  drm/i915/bdw: Write a new set of context-aware ringbuffer management
    functions
  drm/i915: Final touches to LR contexts plumbing and refactoring
  drm/i915/bdw: Set the request context information correctly in the LRC
    case
  drm/i915/bdw: Prepare for user-created LR contexts
  drm/i915/bdw: Start creating & destroying user LR contexts
  drm/i915/bdw: Pin context pages at context create time
  drm/i915/bdw: Extract LR context object populating
  drm/i915/bdw: Introduce dependent contexts
  drm/i915/bdw: Create stand-alone and dependent contexts
  drm/i915/bdw: Allow non-default, non-render user LR contexts
  drm/i915/bdw: Fix reset stats ioctl with LR contexts
  drm/i915: Allocate an integer ID for each new file descriptor
  drm/i915/bdw: Prepare for a 20-bits globally unique submission ID
  drm/i915/bdw: Swap the PPGTT PDPs, LRC style
  drm/i915/bdw: Write the tail pointer, LRC style
  drm/i915/bdw: Display execlists info in debugfs
  drm/i915/bdw: Display context ringbuffer info in debugfs
  drm/i915/bdw: Start queueing contexts to be submitted
  drm/i915/bdw: Always write seqno to default context
  drm/i915/bdw: Enable logical ring contexts
  drm/i915/bdw: Document execlists and logical ring contexts

Thomas Daniel (3):
  drm/i915/bdw: Add forcewake lock around ELSP writes
  drm/i915/bdw: LR context switch interrupts
  drm/i915/bdw: Handle context switch events

 drivers/gpu/drm/i915/Makefile              |   1 +
 drivers/gpu/drm/i915/i915_cmd_parser.c     |  14 +-
 drivers/gpu/drm/i915/i915_debugfs.c        | 103 +++-
 drivers/gpu/drm/i915/i915_dma.c            |  57 +-
 drivers/gpu/drm/i915/i915_drv.h            |  90 +++-
 drivers/gpu/drm/i915/i915_gem.c            | 153 +++---
 drivers/gpu/drm/i915/i915_gem_context.c    | 109 ++--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  85 +--
 drivers/gpu/drm/i915/i915_gem_gtt.c        |  39 +-
 drivers/gpu/drm/i915/i915_gem_gtt.h        |   2 +-
 drivers/gpu/drm/i915/i915_gpu_error.c      |  12 +-
 drivers/gpu/drm/i915/i915_irq.c            |  93 ++--
 drivers/gpu/drm/i915/i915_lrc.c            | 826 +++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h            |  10 +
 drivers/gpu/drm/i915/i915_trace.h          |  26 +-
 drivers/gpu/drm/i915/intel_display.c       |  26 +-
 drivers/gpu/drm/i915/intel_drv.h           |   4 +-
 drivers/gpu/drm/i915/intel_overlay.c       |  12 +-
 drivers/gpu/drm/i915/intel_pm.c            |  18 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c    | 796 +++++++++++++++++----------
 drivers/gpu/drm/i915/intel_ringbuffer.h    | 187 ++++---
 drivers/gpu/drm/i915/intel_uncore.c        |  15 +
 22 files changed, 2043 insertions(+), 635 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_lrc.c

-- 
1.9.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
