On Monday, July 06, 2015 11:33:09 AM Chris Wilson wrote: > When submitting commands to the GPU every cycle of latency counts; > mutexes, spinlocks, even atomics quickly add to substantial overhead. > > This "batch manager" acts as thread-local shim over the buffer manager > (drm_intel_bufmgr_gem). As we are only ever used from within a single > context, we can rely on the upper layers providing thread safety. > This allows us to import buffers from the shared screen (sharing buffers > between multiple contexts, threads and users) and wrap that handle in > our own. Similarly, we want to share the buffer cache between all > users on the file and so allocate from the global threadsafe buffer > manager, with a very small and transient local cache of active buffers. > > The batch manager provides a cheap way of busyness tracking and very > efficient batch construction and kernel submission. > > The restrictions over and above the generic submission engine in > intel_bufmgr_gem are: > - not thread-safe > - flat relocations, only the batch buffer itself carries > relocations. Relocations relative to auxiliary buffers > must be performed via STATE_BASE > - direct mapping of the batch for writes, expect reads > from the batch to be slow > - the batch is a fixed 64k in size > - access to the batch must be wrapped by brw_batch_begin/_end > - all relocations must be immediately written into the batch > > The importance of the flat relocation tree with local offset handling is > that it allows us to use the "relocation-less" execbuffer interfaces, > dramatically reducing the overhead of batch submission. However, that > can be relaxed to allow other buffers than the batch buffer to carry > relocations, if need be. > > ivb/bdw OglBatch7 improves by ~20% above and beyond my kernel relocation > speedups. > > ISSUES: > * shared mipmap trees > - we instantiate a context local copy on use, but what are the semantics for > serializing read/writes between them - do we need automagic flushing of > execution on other contexts and common busyness tracking? > - we retain references to the bo past the lifetime of its parent > batchmgr as the mipmap_tree is retained past the lifetime of its > original context, see glx_arb_create_context/default_major_version > * OglMultithread is nevertheless unhappy; but that looks like undefined > behaviour - i.e. a buggy client concurrently executing the same GL > context in multiple threads, unpatched is equally buggy. > * Add full-ppgtt softpinning support (no more relocations, at least for > the first 256TiB), at the moment there is a limited proof-of-principle > demonstration > * polish and move to libdrm; though at the cost of sealing the structs? > > Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk> > Cc: Daniel Vetter <daniel.vet...@ffwll.ch> > Cc: Kristian Høgsberg <k...@bitplanet.net> > Cc: Kenneth Graunke <kenn...@whitecape.org> > Cc: Jesse Barnes <jbar...@virtuousgeek.org> > Cc: Ian Romanick <ian.d.roman...@intel.com> > Cc: Abdiel Janulgue <abdiel.janul...@linux.intel.com> > Cc: Eero Tamminen <eero.t.tammi...@intel.com> > Cc: Martin Peres <martin.pe...@linux.intel.com> > --- > src/mesa/drivers/dri/i965/Makefile.sources | 4 +- > src/mesa/drivers/dri/i965/brw_batch.c | 1946 > ++++++++++++++++++++ > src/mesa/drivers/dri/i965/brw_batch.h | 377 ++++ > src/mesa/drivers/dri/i965/brw_binding_tables.c | 1 - > src/mesa/drivers/dri/i965/brw_blorp.cpp | 46 +- > src/mesa/drivers/dri/i965/brw_cc.c | 16 +- > src/mesa/drivers/dri/i965/brw_clear.c | 1 - > src/mesa/drivers/dri/i965/brw_clip.c | 2 - > src/mesa/drivers/dri/i965/brw_clip_line.c | 2 - > src/mesa/drivers/dri/i965/brw_clip_point.c | 2 - > src/mesa/drivers/dri/i965/brw_clip_state.c | 14 +- > src/mesa/drivers/dri/i965/brw_clip_tri.c | 2 - > src/mesa/drivers/dri/i965/brw_clip_unfilled.c | 2 - > src/mesa/drivers/dri/i965/brw_clip_util.c | 2 - > src/mesa/drivers/dri/i965/brw_compute.c | 42 +- > src/mesa/drivers/dri/i965/brw_conditional_render.c | 2 +- > src/mesa/drivers/dri/i965/brw_context.c | 233 ++- > src/mesa/drivers/dri/i965/brw_context.h | 144 +- > src/mesa/drivers/dri/i965/brw_cs.cpp | 6 +- > src/mesa/drivers/dri/i965/brw_curbe.c | 1 - > src/mesa/drivers/dri/i965/brw_draw.c | 103 +- > src/mesa/drivers/dri/i965/brw_draw_upload.c | 23 +- > src/mesa/drivers/dri/i965/brw_ff_gs.c | 2 - > src/mesa/drivers/dri/i965/brw_ff_gs_emit.c | 1 - > src/mesa/drivers/dri/i965/brw_fs.cpp | 5 +- > src/mesa/drivers/dri/i965/brw_meta_fast_clear.c | 11 +- > src/mesa/drivers/dri/i965/brw_meta_stencil_blit.c | 1 - > src/mesa/drivers/dri/i965/brw_meta_updownsample.c | 1 - > src/mesa/drivers/dri/i965/brw_misc_state.c | 10 +- > src/mesa/drivers/dri/i965/brw_object_purgeable.c | 8 +- > .../drivers/dri/i965/brw_performance_monitor.c | 88 +- > src/mesa/drivers/dri/i965/brw_pipe_control.c | 24 +- > src/mesa/drivers/dri/i965/brw_primitive_restart.c | 2 - > src/mesa/drivers/dri/i965/brw_program.c | 23 +- > src/mesa/drivers/dri/i965/brw_queryobj.c | 68 +- > src/mesa/drivers/dri/i965/brw_reset.c | 14 +- > src/mesa/drivers/dri/i965/brw_sampler_state.c | 8 +- > src/mesa/drivers/dri/i965/brw_sf.c | 2 - > src/mesa/drivers/dri/i965/brw_sf_emit.c | 2 - > src/mesa/drivers/dri/i965/brw_sf_state.c | 21 +- > src/mesa/drivers/dri/i965/brw_state.h | 2 +- > src/mesa/drivers/dri/i965/brw_state_batch.c | 41 +- > src/mesa/drivers/dri/i965/brw_state_cache.c | 70 +- > src/mesa/drivers/dri/i965/brw_state_dump.c | 77 +- > src/mesa/drivers/dri/i965/brw_state_upload.c | 16 +- > src/mesa/drivers/dri/i965/brw_structs.h | 33 +- > src/mesa/drivers/dri/i965/brw_urb.c | 9 +- > src/mesa/drivers/dri/i965/brw_vec4.cpp | 5 +- > src/mesa/drivers/dri/i965/brw_vs_state.c | 33 +- > src/mesa/drivers/dri/i965/brw_vs_surface_state.c | 4 +- > src/mesa/drivers/dri/i965/brw_wm_state.c | 38 +- > src/mesa/drivers/dri/i965/brw_wm_surface_state.c | 76 +- > src/mesa/drivers/dri/i965/gen6_blorp.cpp | 17 +- > src/mesa/drivers/dri/i965/gen6_cc.c | 1 - > src/mesa/drivers/dri/i965/gen6_clip_state.c | 1 - > src/mesa/drivers/dri/i965/gen6_depth_state.c | 1 - > src/mesa/drivers/dri/i965/gen6_depthstencil.c | 1 - > src/mesa/drivers/dri/i965/gen6_gs_state.c | 1 - > src/mesa/drivers/dri/i965/gen6_multisample_state.c | 1 - > src/mesa/drivers/dri/i965/gen6_queryobj.c | 56 +- > src/mesa/drivers/dri/i965/gen6_sampler_state.c | 1 - > src/mesa/drivers/dri/i965/gen6_scissor_state.c | 1 - > src/mesa/drivers/dri/i965/gen6_sf_state.c | 1 - > src/mesa/drivers/dri/i965/gen6_sol.c | 9 +- > src/mesa/drivers/dri/i965/gen6_surface_state.c | 13 +- > src/mesa/drivers/dri/i965/gen6_urb.c | 1 - > src/mesa/drivers/dri/i965/gen6_viewport_state.c | 1 - > src/mesa/drivers/dri/i965/gen6_vs_state.c | 2 +- > src/mesa/drivers/dri/i965/gen6_wm_state.c | 1 - > src/mesa/drivers/dri/i965/gen7_blorp.cpp | 16 +- > src/mesa/drivers/dri/i965/gen7_disable.c | 1 - > src/mesa/drivers/dri/i965/gen7_gs_state.c | 1 - > src/mesa/drivers/dri/i965/gen7_misc_state.c | 3 +- > src/mesa/drivers/dri/i965/gen7_sf_state.c | 1 - > src/mesa/drivers/dri/i965/gen7_sol_state.c | 49 +- > src/mesa/drivers/dri/i965/gen7_urb.c | 1 - > src/mesa/drivers/dri/i965/gen7_viewport_state.c | 1 - > src/mesa/drivers/dri/i965/gen7_vs_state.c | 1 - > src/mesa/drivers/dri/i965/gen7_wm_state.c | 1 - > src/mesa/drivers/dri/i965/gen7_wm_surface_state.c | 60 +- > src/mesa/drivers/dri/i965/gen8_blend_state.c | 1 - > src/mesa/drivers/dri/i965/gen8_depth_state.c | 16 +- > src/mesa/drivers/dri/i965/gen8_disable.c | 1 - > src/mesa/drivers/dri/i965/gen8_draw_upload.c | 1 - > src/mesa/drivers/dri/i965/gen8_gs_state.c | 1 - > src/mesa/drivers/dri/i965/gen8_misc_state.c | 1 - > src/mesa/drivers/dri/i965/gen8_multisample_state.c | 1 - > src/mesa/drivers/dri/i965/gen8_ps_state.c | 1 - > src/mesa/drivers/dri/i965/gen8_sf_state.c | 1 - > src/mesa/drivers/dri/i965/gen8_sol_state.c | 3 +- > src/mesa/drivers/dri/i965/gen8_surface_state.c | 73 +- > src/mesa/drivers/dri/i965/gen8_viewport_state.c | 1 - > src/mesa/drivers/dri/i965/gen8_vs_state.c | 1 - > src/mesa/drivers/dri/i965/gen8_wm_depth_stencil.c | 1 - > src/mesa/drivers/dri/i965/intel_batchbuffer.c | 480 ----- > src/mesa/drivers/dri/i965/intel_batchbuffer.h | 179 -- > src/mesa/drivers/dri/i965/intel_blit.c | 68 +- > src/mesa/drivers/dri/i965/intel_blit.h | 10 +- > src/mesa/drivers/dri/i965/intel_buffer_objects.c | 222 +-- > src/mesa/drivers/dri/i965/intel_buffer_objects.h | 18 +- > src/mesa/drivers/dri/i965/intel_debug.c | 6 - > src/mesa/drivers/dri/i965/intel_extensions.c | 48 +- > src/mesa/drivers/dri/i965/intel_fbo.c | 46 +- > src/mesa/drivers/dri/i965/intel_fbo.h | 4 - > src/mesa/drivers/dri/i965/intel_image.h | 6 +- > src/mesa/drivers/dri/i965/intel_mipmap_tree.c | 98 +- > src/mesa/drivers/dri/i965/intel_mipmap_tree.h | 11 +- > src/mesa/drivers/dri/i965/intel_pixel_bitmap.c | 3 +- > src/mesa/drivers/dri/i965/intel_pixel_copy.c | 3 - > src/mesa/drivers/dri/i965/intel_pixel_draw.c | 2 +- > src/mesa/drivers/dri/i965/intel_pixel_read.c | 28 +- > src/mesa/drivers/dri/i965/intel_screen.c | 68 +- > src/mesa/drivers/dri/i965/intel_screen.h | 16 +- > src/mesa/drivers/dri/i965/intel_syncobj.c | 86 +- > src/mesa/drivers/dri/i965/intel_tex.c | 6 +- > src/mesa/drivers/dri/i965/intel_tex_image.c | 35 +- > src/mesa/drivers/dri/i965/intel_tex_subimage.c | 33 +- > src/mesa/drivers/dri/i965/intel_tiled_memcpy.c | 14 +- > src/mesa/drivers/dri/i965/intel_tiled_memcpy.h | 4 +- > src/mesa/drivers/dri/i965/intel_upload.c | 33 +- > 120 files changed, 3341 insertions(+), 2199 deletions(-) > create mode 100644 src/mesa/drivers/dri/i965/brw_batch.c > create mode 100644 src/mesa/drivers/dri/i965/brw_batch.h > delete mode 100644 src/mesa/drivers/dri/i965/intel_batchbuffer.c > delete mode 100644 src/mesa/drivers/dri/i965/intel_batchbuffer.h
Hi Chris, Thanks so much for taking the time to do this - I've known that we needed to revamp this code for a while, but am not terribly familiar with the details at the kernel level. You understand it much better than most of us. 20% is a huge improvement! While I really like this idea in principle, the current patch is rather huge, making it difficult to review; bisectability would also suffer. Would it be possible to split it up into several smaller patches? One idea I had for how you might accomplish that is to introduce the brw_bo struct and related functions, but make it simply contain a drm_intel_bo * field, and fall back to the existing libdrm-based code. That way, you could put all the mechanical renaming and refactoring across the entire code-base in one patch that should have no functional change. You could then replace the contents or brw_bo and those functions with your new improved batch manager. We've talked about moving away from libdrm for a while, so having our own functions and structures makes a lot of sense. I'm also curious about the performance on non-LLC platforms (CHV or BYT)? You've dropped a number of non-LLC paths - which is probably fine, honestly...they were always of dubious value - but it'd be nice to know the record the effects of this change on non-LLC in the commit message. [snip] > +/** > + * Number of bytes to reserve for commands necessary to complete a batch. > + * > + * This includes: > + * - MI_BATCHBUFFER_END (4 bytes) > + * - Optional MI_NOOP for ensuring the batch length is qword aligned (4 > bytes) > + * - Any state emitted by vtbl->finish_batch(): > + * - Gen4-5 record ending occlusion query values (4 * 4 = 16 bytes) > + * - Disabling OA counters on Gen6+ (3 DWords = 12 bytes) > + * - Ending MI_REPORT_PERF_COUNT on Gen5+, plus associated PIPE_CONTROLs: > + * - Two sets of PIPE_CONTROLs, which become 3 PIPE_CONTROLs each on SNB, > + * which are 4 DWords each ==> 2 * 3 * 4 * 4 = 96 bytes > + * - 3 DWords for MI_REPORT_PERF_COUNT itself on Gen6+. ==> 12 bytes. > + * On Ironlake, it's 6 DWords, but we have some slack due to the lack > of > + * Sandybridge PIPE_CONTROL madness. > + * > + * Total: 140 bytes > + */ > +#define BATCH_RESERVED 140 I see you noticed that 146 was the result of me flubbing the arithmetic. :) I apparently also flubbed up the Gen6 PIPE_CONTROL size, and just landed a patch to fix that. It should be 152 now, IIRC. > + > +/* Surface offsets are limited to a maximum of 64k from the surface base */ > +#define BATCH_SIZE (64 << 10) So...our current batches are 32k (8192 * sizeof(uint32_t)). Doubling the size of the batch has always seemed like a good idea; we have a bunch of overhead when BRW_NEW_BATCH triggers, and the "other driver" uses larger batches too. I'd experienced GPU hangs in the past when going beyond 64k, and I think you've explained that with this comment. Good catch! > +/* XXX Temporary home until kernel patches land */ > +#define I915_PARAM_HAS_EXEC_SOFTPIN 37 > +#define EXEC_OBJECT_PINNED (1<<4) > +#define I915_PARAM_HAS_EXEC_BATCH_FIRST 38 > +#define I915_EXEC_BATCH_FIRST (1<<16) > + > +#define DBG_NO_FAST_RELOC 0 > +#define DBG_NO_HANDLE_LUT 0 > +#define DBG_NO_BATCH_FIRST 0 > +#define DBG_NO_SOFTPIN 0 > + > +#define PERF_IDLE 0 /* ring mask */ > + > +inline static void list_move(struct list_head *from, struct list_head *to) > +{ > + list_del(from); > + list_add(from, to); > +} > + > +inline static void list_move_tail(struct list_head *from, struct list_head > *to) > +{ > + list_del(from); > + list_addtail(from, to); > +} These should go into src/util/list.h in a separate patch. > diff --git a/src/mesa/drivers/dri/i965/brw_batch.h > b/src/mesa/drivers/dri/i965/brw_batch.h > new file mode 100644 > index 0000000..0b5468b > --- /dev/null > +++ b/src/mesa/drivers/dri/i965/brw_batch.h > @@ -0,0 +1,377 @@ > +#ifndef BRW_BATCH_H > +#define BRW_BATCH_H > + > +#include <stdbool.h> > +#include <stdint.h> > +#include <string.h> > +#include <setjmp.h> > +#include <assert.h> > + > +#include <intel_aub.h> > +#include <intel_bufmgr.h> > + > +#include "util/list.h" > + > +#define HAS_GCC(major, minor) defined(__GNUC__) && (__GNUC__ > (major) || > __GNUC__ == (major) && __GNUC_MINOR__ >= (minor)) > + > +#if HAS_GCC(3, 4) > +#define must_check __attribute__((warn_unused_result)) > +#else > +#define must_check > +#endif I've sent out a patch that adds MUST_CHECK to Mesa as a whole, with the appropriate configure.ac magic, so you should be able to just use that. > diff --git a/src/mesa/drivers/dri/i965/brw_blorp.cpp > b/src/mesa/drivers/dri/i965/brw_blorp.cpp > index 2ccfae1..e1a9f56 100644 > --- a/src/mesa/drivers/dri/i965/brw_blorp.cpp > +++ b/src/mesa/drivers/dri/i965/brw_blorp.cpp > @@ -22,7 +22,6 @@ > */ > > #include <errno.h> > -#include "intel_batchbuffer.h" > #include "intel_fbo.h" > > #include "brw_blorp.h" > @@ -211,7 +210,9 @@ brw_blorp_exec(struct brw_context *brw, const > brw_blorp_params *params) > { > struct gl_context *ctx = &brw->ctx; > uint32_t estimated_max_batch_usage = 1500; > - bool check_aperture_failed_once = false; > + > + if (brw_batch_begin(&brw->batch, estimated_max_batch_usage, RENDER_RING) > < 0) > + return; > > /* Flush the sampler and render caches. We definitely need to flush the > * sampler cache so that we get updated contents from the render cache for > @@ -222,13 +223,6 @@ brw_blorp_exec(struct brw_context *brw, const > brw_blorp_params *params) > */ > brw_emit_mi_flush(brw); > > -retry: > - intel_batchbuffer_require_space(brw, estimated_max_batch_usage, > RENDER_RING); > - intel_batchbuffer_save_state(brw); > - drm_intel_bo *saved_bo = brw->batch.bo; > - uint32_t saved_used = brw->batch.used; > - uint32_t saved_state_batch_offset = brw->batch.state_batch_offset; > - > switch (brw->gen) { > case 6: > gen6_blorp_exec(brw, params); > @@ -241,37 +235,18 @@ retry: > unreachable("not reached"); > } > > - /* Make sure we didn't wrap the batch unintentionally, and make sure we > - * reserved enough space that a wrap will never happen. > - */ > - assert(brw->batch.bo == saved_bo); > - assert((brw->batch.used - saved_used) * 4 + > - (saved_state_batch_offset - brw->batch.state_batch_offset) < > - estimated_max_batch_usage); > - /* Shut up compiler warnings on release build */ > - (void)saved_bo; > - (void)saved_used; > - (void)saved_state_batch_offset; > + brw_emit_mi_flush(brw); > > /* Check if the blorp op we just did would make our batch likely to fail > to > * map all the BOs into the GPU at batch exec time later. If so, flush > the > * batch and try again with nothing else in the batch. > */ > - if (dri_bufmgr_check_aperture_space(&brw->batch.bo, 1)) { > - if (!check_aperture_failed_once) { > - check_aperture_failed_once = true; > - intel_batchbuffer_reset_to_saved(brw); > - intel_batchbuffer_flush(brw); > - goto retry; > - } else { > - int ret = intel_batchbuffer_flush(brw); > - WARN_ONCE(ret == -ENOSPC, > - "i965: blorp emit exceeded available aperture space\n"); > - } > + if (brw_batch_end(&brw->batch)) { > + WARN_ONCE(1, "i965: blorp emit exceeded available aperture space\n"); > + return; > } > > - if (unlikely(brw->always_flush_batch)) > - intel_batchbuffer_flush(brw); > + brw_batch_maybe_flush(&brw->batch); > > /* We've smashed all state compared to what the normal 3D pipeline > * rendering tracks for GL. > @@ -279,11 +254,6 @@ retry: > brw->ctx.NewDriverState = ~0ull; > brw->no_depth_or_stencil = false; > brw->ib.type = -1; > - > - /* Flush the sampler cache so any texturing from the destination is > - * coherent. > - */ > - brw_emit_mi_flush(brw); It looks like you moved this flush earlier, so it's in the section of code that retries? That's probably reasonable. Seems worth keeping the comment, and this could be done as a separate patch... [snip] > +static void > +load_sized_register_mem(struct brw_context *brw, > + uint32_t reg, > + struct brw_bo *bo, > + uint32_t read_domains, uint32_t write_domain, > + uint32_t offset, > + int size) > +{ > + int i; > + > + /* MI_LOAD_REGISTER_MEM only exists on Gen7+. */ > + assert(brw->gen >= 7); > + > + if (brw->gen >= 8) { > + BEGIN_BATCH(4 * size); > + for (i = 0; i < size; i++) { > + OUT_BATCH(GEN7_MI_LOAD_REGISTER_MEM | (4 - 2)); > + OUT_BATCH(reg + i * 4); > + OUT_RELOC64(bo, read_domains, write_domain, offset + i * 4); > + } > + ADVANCE_BATCH(); > + } else { > + BEGIN_BATCH(3 * size); > + for (i = 0; i < size; i++) { > + OUT_BATCH(GEN7_MI_LOAD_REGISTER_MEM | (3 - 2)); > + OUT_BATCH(reg + i * 4); > + OUT_RELOC(bo, read_domains, write_domain, offset + i * 4); > + } > + ADVANCE_BATCH(); > + } > +} > + > +void > +brw_load_register_mem(struct brw_context *brw, > + uint32_t reg, > + struct brw_bo *bo, > + uint32_t read_domains, uint32_t write_domain, > + uint32_t offset) > +{ > + load_sized_register_mem(brw, reg, bo, read_domains, write_domain, offset, > 1); > +} > + > +void > +brw_load_register_mem64(struct brw_context *brw, > + uint32_t reg, > + struct brw_bo *bo, > + uint32_t read_domains, uint32_t write_domain, > + uint32_t offset) > +{ > + load_sized_register_mem(brw, reg, bo, read_domains, write_domain, offset, > 2); > +} Maybe just make these static inlines in brw_state.h or somewhere?
signature.asc
Description: This is a digitally signed message part.
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev