On 11.05.2017 00:45, Marek Olšák wrote:
Hi,

This series adds an optional module into gallium/util that wraps
around pipe_context and moves execution of all pipe_context calls into
a separate thread.

It puts a lot of new requirements on the driver, especially on the
thread safety of pipe_context functions, and even expects different
behavior from pipe_context in some cases, so it may be non-trivial
to enable. All of it is necessary for perfectly scalable
threaded execution. (Any new drivers should be built around it from
the beginning.)
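
As an illustration of the general idea only (not the code in this series), here is a minimal sketch of a wrapper that records calls into a ring buffer and executes them on a worker thread. All names (threaded_ctx, tc_enqueue, tc_sync, driver_draw) are hypothetical and are not the u_threaded_context API; the "driver context" is just an opaque pointer.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the u_threaded_context API. */
#define QUEUE_SLOTS 64

typedef void (*call_fn)(void *driver_ctx, int arg);

struct call { call_fn fn; int arg; };

struct threaded_ctx {
   void *driver_ctx;               /* stands in for the wrapped pipe_context */
   struct call ring[QUEUE_SLOTS];
   unsigned head, tail;            /* head: next call to run, tail: next free slot */
   bool stop;
   pthread_mutex_t lock;
   pthread_cond_t changed;
   pthread_t worker;
};

/* Worker thread: pops recorded calls and runs them outside the lock. */
static void *worker_main(void *data)
{
   struct threaded_ctx *tc = data;

   pthread_mutex_lock(&tc->lock);
   for (;;) {
      while (tc->head == tc->tail && !tc->stop)
         pthread_cond_wait(&tc->changed, &tc->lock);
      if (tc->head == tc->tail)
         break;                    /* stop requested and queue drained */

      struct call c = tc->ring[tc->head % QUEUE_SLOTS];
      pthread_mutex_unlock(&tc->lock);
      c.fn(tc->driver_ctx, c.arg);
      pthread_mutex_lock(&tc->lock);

      tc->head++;                  /* the slot is free only after execution */
      pthread_cond_broadcast(&tc->changed);
   }
   pthread_mutex_unlock(&tc->lock);
   return NULL;
}

static void tc_init(struct threaded_ctx *tc, void *driver_ctx)
{
   tc->driver_ctx = driver_ctx;
   tc->head = tc->tail = 0;
   tc->stop = false;
   pthread_mutex_init(&tc->lock, NULL);
   pthread_cond_init(&tc->changed, NULL);
   int ret = pthread_create(&tc->worker, NULL, worker_main, tc);
   assert(ret == 0);
   (void)ret;
}

/* Record a call; blocks only when the ring is full. */
static void tc_enqueue(struct threaded_ctx *tc, call_fn fn, int arg)
{
   pthread_mutex_lock(&tc->lock);
   while (tc->tail - tc->head == QUEUE_SLOTS)
      pthread_cond_wait(&tc->changed, &tc->lock);
   tc->ring[tc->tail % QUEUE_SLOTS] = (struct call){ fn, arg };
   tc->tail++;
   pthread_cond_broadcast(&tc->changed);
   pthread_mutex_unlock(&tc->lock);
}

/* Wait until every recorded call has been executed. */
static void tc_sync(struct threaded_ctx *tc)
{
   pthread_mutex_lock(&tc->lock);
   while (tc->head != tc->tail)
      pthread_cond_wait(&tc->changed, &tc->lock);
   pthread_mutex_unlock(&tc->lock);
}

static void tc_destroy(struct threaded_ctx *tc)
{
   pthread_mutex_lock(&tc->lock);
   tc->stop = true;
   pthread_cond_broadcast(&tc->changed);
   pthread_mutex_unlock(&tc->lock);
   pthread_join(tc->worker, NULL);
   pthread_cond_destroy(&tc->changed);
   pthread_mutex_destroy(&tc->lock);
}

/* Example "driver" entry point: accumulates its argument into a long. */
static void driver_draw(void *driver_ctx, int count)
{
   *(long *)driver_ctx += count;
}
```

The application thread would call tc_enqueue() for each driver call and tc_sync() only where a result is needed immediately. A real implementation would presumably batch calls rather than take a lock per call; the mutex here just keeps the sketch short.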

The performance improvement isn't very high (it only hides the overhead
of pipe_context), but I have tested a lot of apps with this, and with
the majority of them it really doesn't sync the thread except for
SwapBuffers.

It can do these:
- unsynchronized buffer mappings don't sync
- ordinary buffer mappings are promoted to unsynchronized when it's safe
- full buffer invalidations are implemented as reallocations and don't sync
- partial buffer invalidations are implemented as copy_buffer and don't sync
- get_query_result doesn't sync when the threaded context has seen flush()
  (i.e. get_query_result is contextless in that case)
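
To illustrate the "full invalidation as reallocation" item above, here is a minimal sketch, assuming a hypothetical buffer struct whose busy flag means that queued or in-flight work still references the storage. None of these names come from the series; malloc stands in for a driver allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical buffer object; not a Gallium structure. */
struct buffer {
   void *storage;
   size_t size;
   bool busy;   /* true while pending work still references storage */
};

/*
 * Map a buffer whose whole contents are about to be overwritten.
 * If the storage is busy, swap in fresh storage instead of waiting.
 * The old storage is handed back through *retired so the caller can
 * release it once the pending work has finished.
 */
static void *map_discard_whole(struct buffer *buf, void **retired)
{
   *retired = NULL;

   if (buf->busy) {
      void *fresh = malloc(buf->size);
      if (!fresh)
         return NULL;   /* this sketch simply fails; waiting is the alternative */

      *retired = buf->storage;
      buf->storage = fresh;
      buf->busy = false;   /* nothing references the new storage yet */
   }
   return buf->storage;
}
```

The trade-off is the one raised later in this message: avoiding the wait means the old and the new storage are both alive for a while.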

Missing:
- deferred fences - mainly Bioshock Infinite might benefit
- texture mappings (meaning CPU access) always sync; texture_subdata
  avoids syncing only for small uploads, but we can make all texture
  uploads asynchronous by simply copying what is done for buffers

Note that it has a very low overhead when it's always synchronous
(i.e. not multithreaded), because it's really fast to enqueue and
execute calls. The worst case scenario might be -3% performance (just
guessing here).

All requirements on Gallium drivers and other information can be found
in the header file:
https://cgit.freedesktop.org/~mareko/mesa/tree/src/gallium/auxiliary/util/u_threaded_context.h?h=gallium-threaded2#n26

RadeonSI enables threaded Gallium by default for OpenGL Core and
Compatibility profiles and all OpenGL ES variants.

There is a small performance concern for RadeonSI: If non-contiguous
VRAM mappings are not supported (amdgpu - kernel 4.11 and older,
radeon - all kernels), the performance difference might be negative,
because buffer invalidations are done unconditionally, meaning that
there can be more live and mapped VRAM buffers. It's difficult to tell
whether any real apps are affected in a measurable way.

Here are performance numbers:

APPS: MORE IS BETTER
Alien Isolation: +16%
Bioshock Infinite: +13%
Borderlands 2: +12%
Civilization 5: +12%
Civilization 6: +10%
CS:GO: +8%
ET Legacy: +12%
Openarena: +27%
Talos Principle (high details, 1680x1050 internal resolution): +17%
glmark2: no change in the final score

When games are GPU-bound: no change

Because it doesn't take advantage of deferred fences yet, Bioshock runs
asynchronously 80% of the time and synchronously 20% of the time.
All other games run asynchronously 100% of the time.

x11perf: MORE IS BETTER
x11perf: Test: 500px PutImage Square: -3%
x11perf: Test: Scrolling 500 x 500 px: +16%
x11perf: Test: Char in 80-char aa line: +13%
x11perf: Test: PutImage XY 500x500 Square: +1%
x11perf: Test: Fill 300 x 300px AA Trapezoid: NO CHANGE
x11perf: Test: 500px Copy From Window To Window: +14%
x11perf: Test: Copy 500x500 From Pixmap To Pixmap: -1%
x11perf: Test: 500px Compositing From Pixmap To Window: +21%
x11perf: Test: 500px Compositing From Window To Window: +18%

gtkperf: LESS IS BETTER
gtkperf: GTK Widget: Total Time: -2%
gtkperf: GTK Widget: GtkComboBox: +7%
gtkperf: GTK Widget: GtkCheckButton: -15%
gtkperf: GTK Widget: GtkRadioButton: -13%
gtkperf: GTK Widget: GtkToggleButton: -2%
gtkperf: GTK Widget: GtkComboBoxEntry: -1%
gtkperf: GTK Widget: GtkTextView - Scroll: NO CHANGE
gtkperf: GTK Widget: GtkTextView - Add Text: NO CHANGE
gtkperf: GTK Widget: GtkDrawingArea - Circles: -9%
gtkperf: GTK Widget: GtkDrawingArea - Pixbufs: -3%

Hence the decision to enable it by default.

Those are some pretty impressive numbers! I sent comments / questions on patches 3 & 9, the rest are:

Reviewed-by: Nicolai Hähnle <nicolai.haeh...@amd.com>

Some general remarks:

Violating the "async" promise on debug callbacks is a problem. This breaks the OpenGL API in a place where it wasn't broken before, and that's not okay. I'm not sure what to do about this precisely, but the spec is very explicit:

   "When DEBUG_OUTPUT_SYNCHRONOUS is enabled, the driver guarantees
    synchronous calls to the callback routine by the context. When
    synchronous callbacks are enabled, all calls to the callback
    routine will be made by the thread that owns the current context;
    all such calls will be made serially by the current context; and
    each call will be made before the GL command that generated the
    debug message is allowed to return."

The last part is the strictest and implies that sync-ing becomes mandatory.

Maybe this can be handled without a performance impact by swapping out pipe_context function pointers when the debug callback changes to !async.
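
A rough, single-threaded sketch of that idea, with hypothetical names and a counter standing in for the real call queue: the context exposes its entry point through a function pointer, and changing the debug callback mode swaps which implementation the pointer refers to, so the asynchronous path carries no extra check.

```c
#include <assert.h>
#include <stdbool.h>

struct ctx;
typedef void (*draw_fn)(struct ctx *);

/* Hypothetical context; the counters model a queue and its worker. */
struct ctx {
   draw_fn draw;   /* entry point that callers go through */
   int queued;     /* calls recorded but not yet executed */
   int executed;   /* calls that have actually run */
};

/* Stand-in for waiting on the worker thread. */
static void flush_queue(struct ctx *c)
{
   c->executed += c->queued;
   c->queued = 0;
}

/* Normal path: record the call and return immediately. */
static void draw_async(struct ctx *c)
{
   c->queued++;
}

/* Synchronous-debug path: the call has run before we return. */
static void draw_sync(struct ctx *c)
{
   c->queued++;
   flush_queue(c);
}

static void set_debug_callback(struct ctx *c, bool synchronous)
{
   /* Drain pending work before switching to the synchronous path. */
   if (synchronous)
      flush_queue(c);
   c->draw = synchronous ? draw_sync : draw_async;
}
```

With this shape, the cost of syncing is paid only while synchronous debug output is requested; whether that maps cleanly onto every pipe_context entry point is an open question.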

I'm also not too happy about ignoring resource_commit errors. Since the idea of sparse buffers/textures is to potentially allocate lots of memory, getting out-of-memory notifications there is kind of important. On the other hand, we handle out-of-memory inconsistently already, and forcing a sync is too high a price. I think we can live with it for now.

If having more buffers alive due to more invalidations ever becomes a serious issue, we could consider exposing user fences to the threaded_context. I see no reasons why that wouldn't work, but of course it requires some more re-work across the winsys (and wouldn't work with radeon).

Cheers,
Nicolai



Please review.

Marek
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev



--
Learn how the world really is,
but never forget how it ought to be.