Hi,

On 30.08.2016 12:51, Marek Olšák wrote:
Recently I discovered that our GLSL compiler spends a lot of time in
rzalloc_size, so I looked at possible options to optimize that. It's
worth noting that too many existing allocations slow down subsequent
malloc calls, which in turn slows down the GLSL compiler. When I kept
5 instances of LLVMContext alive between compilations (I wanted to
reuse them), the GLSL compiler slowed down. That shows that GLSL
compiler performance is too dependent on the size and complexity of
the heap.

So I decided to write my own linear allocator and then compared it
with jemalloc preloaded via LD_PRELOAD, and with jemalloc linked
statically and used by ralloc only.

The test was shader-db using AMD's shader collection. The command line was:
time GALLIUM_NOOP=1 shader-db/run shaders
The noop driver ensures the compilation process ends with TGSI.


Default Mesa:
real    0m58.343s
user    3m48.828s
sys    0m0.760s

Mesa with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1:
real    0m48.550s (17% less time)
user    3m9.544s
sys    0m1.700s

Ralloc using _mesa_je_{calloc, realloc, free} and Mesa links against
my libmesa_jemalloc_pic.a:
real    0m49.580s (15% less time)
user    3m14.452s
sys    0m0.996s

Ralloc using my own linear allocator that allocates out of 32KB
buffers for 512b and smaller allocations:
real    0m46.521s (20% less time)
user    3m1.304s
sys    0m1.740s


Now let's test complete compilation down to GCN bytecode:

Default Mesa:
real    1m57.634s
user    7m41.692s
sys    0m1.824s

Mesa with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1:
real    1m42.604s (13% less time)
user    6m39.776s
sys    0m3.828s

Ralloc using _mesa_je_{calloc, realloc, free} and Mesa links against
my libmesa_jemalloc_pic.a:
real    1m44.413s (11% less time)
user    6m48.808s
sys    0m2.480s

Ralloc using my own linear allocator:
real    1m40.486s (14.6% less time)
user    6m34.456s
sys    0m2.224s


The linear allocator that I wrote has very high memory usage due to
the inability to free a 32KB block while it still has at least one
live allocation. The workaround would be to do realloc() when
changing a ralloc parent in order to "defragment" the memory, but
that's more involved.
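
For illustration, here is roughly what such a bump allocator looks
like. This is a minimal sketch with made-up names, not the actual
ralloc/linear allocator code:

/* Sketch only: allocations of up to 512 bytes are carved out of
 * 32 KB blocks with a simple bump pointer. Nothing inside a block is
 * ever freed individually; a block only goes away when the whole
 * context is freed, which is why one long-lived allocation can pin an
 * entire 32 KB block. */
#include <stdlib.h>

#define BLOCK_SIZE      (32 * 1024)
#define SMALL_ALLOC_MAX 512

struct linear_block {
   struct linear_block *next;  /* all blocks owned by this context */
   unsigned offset;            /* bump pointer into data[] */
   char data[BLOCK_SIZE];
};

struct linear_ctx {
   struct linear_block *blocks;
};

static void *linear_alloc(struct linear_ctx *ctx, unsigned size)
{
   struct linear_block *b = ctx->blocks;

   size = (size + 7) & ~7u;          /* keep 8-byte alignment */

   /* Larger requests are not served from the blocks at all; in the
    * real allocator they would take the normal malloc path. */
   if (size > SMALL_ALLOC_MAX)
      return NULL;

   if (!b || b->offset + size > BLOCK_SIZE) {
      b = malloc(sizeof(*b));        /* start a fresh 32 KB block */
      if (!b)
         return NULL;
      b->next = ctx->blocks;
      b->offset = 0;
      ctx->blocks = b;
   }

   void *ptr = b->data + b->offset;
   b->offset += size;
   return ptr;
}

static void linear_free_ctx(struct linear_ctx *ctx)
{
   /* Individual allocations are never freed; everything is released
    * together with the context, which keeps the hot path trivial. */
   while (ctx->blocks) {
      struct linear_block *next = ctx->blocks->next;
      free(ctx->blocks);
      ctx->blocks = next;
   }
}

The speed comes from linear_alloc being little more than a pointer
bump in the common case, and from linear_free_ctx freeing whole
blocks instead of walking individual allocations; the price is that
one live allocation keeps its entire 32KB block resident.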

I don't know much about glibc, but it's hard to believe that glibc
people have been purposely ignoring jemalloc for so long. There must
be some anti-performance politics going on, but enough speculation.

Different allocators have different trade-offs:
* single-core speed
* multi-core speed
* memory usage
* long-term memory fragmentation
* alloc debugging support & robustness

And they can behave differently with different allocation patterns and sizes. Jemalloc being better than ptmalloc in one test doesn't necessarily mean that it's better in another.

Here's some discussion on the subject:
        https://lwn.net/Articles/273084/

The algorithms used and some of the trade-offs are described in the allocators' source code.


If we don't care about memory usage, let's use my allocator.

Modern games are the most demanding use-case for the compiler and use the largest number of shaders, but almost all (>90%) Steam games are *still* 32-bit. Before the compiler memory usage optimizations by Ian & Co., several of them crashed because they ran out of 32-bit address space.

(DOTA2 is thankfully 64-bit nowadays, so it no longer crashes because of that.)


If we do,
let's import jemalloc into the Mesa tree and use it for ralloc. That
"11% less time" spent in the shader compiler (which includes LLVM)
would be nice to have.

I don't think the above jemalloc testing is enough; you should also:
* Test performance with 32-bit builds
* Do some memory usage comparisons

I'm not sure what the best way to track memory usage for this is, though. From /proc you get total mapping sizes, but typically dirty memory usage is more relevant, and that you can see from the smaps data.
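
For example, a crude way to get at that (the helper below is a
hypothetical sketch, not anything that exists in Mesa) is to sum the
Private_Dirty fields from the process's /proc/<pid>/smaps; from
inside the process that would look roughly like:

/* Hypothetical helper: sum the Private_Dirty fields from
 * /proc/self/smaps, giving the process's dirty memory in kB. */
#include <stdio.h>

static long private_dirty_kb(void)
{
   FILE *f = fopen("/proc/self/smaps", "r");
   char line[256];
   long total = 0, kb;

   if (!f)
      return -1;
   while (fgets(line, sizeof(line), f)) {
      if (sscanf(line, "Private_Dirty: %ld kB", &kb) == 1)
         total += kb;
   }
   fclose(f);
   return total;
}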

The easiest start could be with Valgrind massif, as it can show heap memory usage over time:
        http://valgrind.org/docs/manual/ms-manual.html


        - Eero

PS. This Valgrind tool can be used to optimize memory allocation efficiency:
        http://valgrind.org/docs/manual/dh-manual.html

It tells you which parts of the allocations are hot and which are cold or completely unused, so that things within allocations can be arranged in the most efficient manner.

