On Tue, Aug 30, 2016 at 4:06 PM, Marek Olšák <mar...@gmail.com> wrote:
> On Tue, Aug 30, 2016 at 3:21 PM, Eero Tamminen
> <eero.t.tammi...@intel.com> wrote:
>> Hi,
>>
>> On 30.08.2016 12:51, Marek Olšák wrote:
>>>
>>> Recently I discovered that our GLSL compiler spends a lot of time in
>>> rzalloc_size, so I looked at possible options to optimize that. It's
>>> worth noting that too many existing allocations slow down subsequent
>>> malloc calls, which in turn slows down the GLSL compiler. When I kept
>>> 5 instances of LLVMContext alive between compilations (I wanted to
>>> reuse them), the GLSL compiler slowed down. That shows that the GLSL
>>> compiler performance is too dependent on the size and complexity of
>>> the heap.
>>>
>>> So I decided to write my own linear allocator and then compared it
>>> with jemalloc preloaded by LD, and jemalloc linked statically and used
>>> by ralloc only.
>>>
>>> The test was shader-db using AMD's shader collection. The command line
>>> was:
>>>     time GALLIUM_NOOP=1 shader-db/run shaders
>>> The noop driver ensures the compilation process ends with TGSI.
>>>
>>> Default Mesa:
>>> real 0m58.343s
>>> user 3m48.828s
>>> sys  0m0.760s
>>>
>>> Mesa with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1:
>>> real 0m48.550s (17% less time)
>>> user 3m9.544s
>>> sys  0m1.700s
>>>
>>> Ralloc using _mesa_je_{calloc, realloc, free} and Mesa links against
>>> my libmesa_jemalloc_pic.a:
>>> real 0m49.580s (15% less time)
>>> user 3m14.452s
>>> sys  0m0.996s
>>>
>>> Ralloc using my own linear allocator that allocates out of 32KB
>>> buffers for 512B and smaller allocations:
>>> real 0m46.521s (20% less time)
>>> user 3m1.304s
>>> sys  0m1.740s
>>>
>>>
>>> Now let's test complete compilation down to GCN bytecode:
>>>
>>> Default Mesa:
>>> real 1m57.634s
>>> user 7m41.692s
>>> sys  0m1.824s
>>>
>>> Mesa with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1:
>>> real 1m42.604s (13% less time)
>>> user 6m39.776s
>>> sys  0m3.828s
>>>
>>> Ralloc using _mesa_je_{calloc, realloc, free} and Mesa links against
>>> my libmesa_jemalloc_pic.a:
>>> real 1m44.413s (11% less time)
>>> user 6m48.808s
>>> sys  0m2.480s
>>>
>>> Ralloc using my own linear allocator:
>>> real 1m40.486s (14.6% less time)
>>> user 6m34.456s
>>> sys  0m2.224s
>>>
>>>
>>> The linear allocator that I wrote has very high memory usage due to
>>> the inability to free 32KB blocks while those blocks have at least one
>>> live allocation. The workaround would be to do realloc() when
>>> changing a ralloc parent in order to "defragment" the memory, but
>>> that's more involved.
>>>
>>> I don't know much about glibc, but it's hard to believe that glibc
>>> people have been purposely ignoring jemalloc for so long. There must
>>> be some anti-performance politics going on, but enough of speculation.
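[The linear-allocator scheme described above, 32KB blocks serving allocations of 512 bytes or less, can be sketched roughly as follows. This is a hypothetical illustration with made-up names (linear_ctx, linear_alloc, linear_free_all), not Marek's actual code or Mesa's ralloc implementation:]

```c
#include <stdlib.h>

#define LINEAR_BLOCK_SIZE (32 * 1024) /* each block is 32KB total */
#define LINEAR_MAX_ALLOC  512         /* larger requests bypass the blocks */

struct linear_block {
   struct linear_block *next; /* blocks owned by this context */
   size_t offset;             /* bump pointer into data[] */
   char data[];               /* allocations are carved out of here */
};

struct linear_ctx {
   struct linear_block *blocks; /* most recently added block first */
};

static void *
linear_alloc(struct linear_ctx *ctx, size_t size)
{
   /* Round every request up to 8 bytes to keep allocations aligned. */
   size = (size + 7) & ~(size_t)7;

   /* Large allocations fall back to malloc. Note this sketch does not
    * track them; a real implementation would keep a list so they can
    * be freed with the context. */
   if (size > LINEAR_MAX_ALLOC)
      return malloc(size);

   struct linear_block *blk = ctx->blocks;
   if (!blk || blk->offset + size > LINEAR_BLOCK_SIZE - sizeof(*blk)) {
      /* Current block is full (or none exists yet): grab a new one. */
      blk = malloc(LINEAR_BLOCK_SIZE);
      if (!blk)
         return NULL;
      blk->next = ctx->blocks;
      blk->offset = 0;
      ctx->blocks = blk;
   }

   /* Bump allocation: no per-allocation bookkeeping, no free list. */
   void *ptr = blk->data + blk->offset;
   blk->offset += size;
   return ptr;
}

static void
linear_free_all(struct linear_ctx *ctx)
{
   /* Blocks can only be freed wholesale -- this is the memory-usage
    * trade-off described above: a 32KB block stays alive as long as
    * any single allocation inside it does. */
   struct linear_block *blk = ctx->blocks;
   while (blk) {
      struct linear_block *next = blk->next;
      free(blk);
      blk = next;
   }
   ctx->blocks = NULL;
}
```

[The speed comes from the fast path being just a bounds check and a pointer bump; the cost is that individual allocations cannot be freed, which is why changing a ralloc parent would need a realloc()-style copy to "defragment".]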
>>
>> Different allocators have different trade-offs:
>> * single-core speed
>> * multi-core speed
>> * memory usage
>> * long-term memory fragmentation
>> * alloc debugging support & robustness
>>
>> And they can behave differently with different allocation patterns and
>> sizes. Jemalloc being better than ptmalloc in one test doesn't
>> necessarily mean it's better in another.
>>
>> Here's some discussion on the subject:
>> https://lwn.net/Articles/273084/
>>
>> The algorithms used and some of the trade-offs are described in the
>> allocators' source code.
>>
>>
>>> If we don't care about memory usage, let's use my allocator.
>>
>> Modern games are the most demanding use case for the compiler and use
>> the largest number of shaders, but almost all (>90%) Steam games are
>> *still* 32-bit. Before the compiler memory usage optimizations by Ian
>> & Co, several of them crashed because they ran out of 32-bit address
>> space.
>
> Did the games crash because i965 was using GLSL IR as its main
> compiler IR? Or was the problem that GLSL IR hadn't been released at
> link time, because the driver had to keep all of it for compiling
> shader variants? The memory usage issue might have been i965-specific
> and not relevant right now.
>
> Note that Gallium releases GLSL IR in glLinkProgram and other drivers
> should do that too. If some drivers don't, they are going to have
> memory usage issues either way.
Just to clarify, I don't care that much about memory usage as long as
the trade-off is worth it, but I understand there are people who do
care, e.g. drivers that don't release GLSL IR, or small devices (ARM,
embedded). If I choose to finish my allocator and put it below ralloc
(which would be easier than importing jemalloc), I will make it
conditional on the driver and CPU architecture.

Marek
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev