Ping.

On 08/03/2015 11:40 AM, Mikhail Maltsev wrote:
> On Jul 26, 2015, at 11:50 AM, Andi Kleen <a...@firstfloor.org> wrote:
>> I've been compiling gcc with tcmalloc to do a similar speedup. It would be
>> interesting to compare that to your patch.
> I repeated the test with TCMalloc and jemalloc. TCMalloc shows nice results,
> though it required some tweaks: this allocator has a threshold block size of
> 32 KB; larger blocks are allocated from the global heap rather than the thread
> cache (and this operation is expensive), so the original patch shows worse
> performance when used with TCMalloc. To fix this, I reduced the block size to
> 8 KB.
> There are 5 columns for each value: pristine version, pristine version +
> TCMalloc (and the difference in parentheses), and patched version with TCMalloc
> (difference is relative to the pristine version). Likewise for memory usage.
>
> 400.perlbench    26.86   26.17 (  -2.57%)   26.17 (  -2.57%) user
>                   0.56    0.64 ( +14.29%)    0.61 (  +8.93%) sys
>                  27.45   26.84 (  -2.22%)   26.81 (  -2.33%) real
> 401.bzip2         2.53     2.5 (  -1.19%)    2.48 (  -1.98%) user
>                   0.07    0.09 ( +28.57%)     0.1 ( +42.86%) sys
>                   2.61     2.6 (  -0.38%)    2.59 (  -0.77%) real
> 403.gcc          73.59   72.62 (  -1.32%)   71.72 (  -2.54%) user
>                   1.59    1.88 ( +18.24%)    1.88 ( +18.24%) sys
>                  75.27   74.58 (  -0.92%)   73.67 (  -2.13%) real
> 429.mcf            0.4    0.41 (  +2.50%)     0.4 (  +0.00%) user
>                   0.03    0.05 ( +66.67%)    0.05 ( +66.67%) sys
>                   0.44    0.47 (  +6.82%)    0.47 (  +6.82%) real
> 433.milc          3.22    3.24 (  +0.62%)    3.25 (  +0.93%) user
>                   0.22    0.32 ( +45.45%)     0.3 ( +36.36%) sys
>                   3.48    3.59 (  +3.16%)    3.59 (  +3.16%) real
> 444.namd          7.54    7.41 (  -1.72%)    7.37 (  -2.25%) user
>                    0.1    0.15 ( +50.00%)    0.15 ( +50.00%) sys
>                   7.66    7.58 (  -1.04%)    7.54 (  -1.57%) real
> 445.gobmk        20.24   19.59 (  -3.21%)    19.6 (  -3.16%) user
>                   0.52    0.67 ( +28.85%)    0.59 ( +13.46%) sys
>                   20.8   20.29 (  -2.45%)   20.23 (  -2.74%) real
> 450.soplex       19.08   18.47 (  -3.20%)   18.51 (  -2.99%) user
>                   0.87    1.11 ( +27.59%)    1.06 ( +21.84%) sys
>                  19.99   19.62 (  -1.85%)    19.6 (  -1.95%) real
> 453.povray       42.27   41.42 (  -2.01%)   41.32 (  -2.25%) user
>                   2.71    3.11 ( +14.76%)    3.09 ( +14.02%) sys
>                  45.04   44.58 (  -1.02%)   44.47 (  -1.27%) real
> 456.hmmer         7.27    7.22 (  -0.69%)    7.15 (  -1.65%) user
>                   0.31    0.36 ( +16.13%)    0.39 ( +25.81%) sys
>                   7.61    7.61 (  +0.00%)    7.57 (  -0.53%) real
> 458.sjeng         3.22    3.14 (  -2.48%)    3.15 (  -2.17%) user
>                   0.09    0.16 ( +77.78%)    0.14 ( +55.56%) sys
>                   3.32    3.32 (  +0.00%)     3.3 (  -0.60%) real
> 462.libquantum    0.86    0.87 (  +1.16%)    0.85 (  -1.16%) user
>                   0.05    0.08 ( +60.00%)    0.08 ( +60.00%) sys
>                   0.92    0.96 (  +4.35%)    0.94 (  +2.17%) real
> 464.h264ref      27.62   27.27 (  -1.27%)   27.16 (  -1.67%) user
>                   0.63    0.73 ( +15.87%)    0.75 ( +19.05%) sys
>                  28.28   28.03 (  -0.88%)   27.95 (  -1.17%) real
> 470.lbm           0.27    0.27 (  +0.00%)    0.27 (  +0.00%) user
>                   0.01    0.01 (  +0.00%)    0.01 (  +0.00%) sys
>                   0.29    0.29 (  +0.00%)    0.29 (  +0.00%) real
> 471.omnetpp      28.29   27.63 (  -2.33%)   27.54 (  -2.65%) user
>                    1.5    1.57 (  +4.67%)    1.62 (  +8.00%) sys
>                  29.84   29.25 (  -1.98%)   29.21 (  -2.11%) real
> 473.astar         1.14    1.12 (  -1.75%)    1.11 (  -2.63%) user
>                   0.05    0.07 ( +40.00%)    0.09 ( +80.00%) sys
>                   1.21    1.21 (  +0.00%)     1.2 (  -0.83%) real
> 482.sphinx3       4.65    4.57 (  -1.72%)    4.59 (  -1.29%) user
>                    0.2     0.3 ( +50.00%)    0.26 ( +30.00%) sys
>                   4.88    4.89 (  +0.20%)    4.88 (  +0.00%) real
> 483.xalancbmk    284.5   276.4 (  -2.85%)  276.48 (  -2.82%) user
>                  20.29   23.03 ( +13.50%)   22.82 ( +12.47%) sys
>                 305.19  299.79 (  -1.77%)  299.67 (  -1.81%) real
>
> 400.perlbench   102308kB  123004kB (+20696kB)  116104kB (+13796kB)
> 401.bzip2        74628kB   86936kB (+12308kB)   84316kB ( +9688kB)
> 403.gcc         190284kB  218180kB (+27896kB)  212480kB (+22196kB)
> 429.mcf          19804kB   24464kB ( +4660kB)   24320kB ( +4516kB)
> 433.milc         36940kB   45308kB ( +8368kB)   44652kB ( +7712kB)
> 444.namd        183548kB  193856kB (+10308kB)  192632kB ( +9084kB)
> 445.gobmk        73724kB   78792kB ( +5068kB)   79192kB ( +5468kB)
> 450.soplex       62076kB   67596kB ( +5520kB)   66856kB ( +4780kB)
> 453.povray      180620kB  208480kB (+27860kB)  207576kB (+26956kB)
> 456.hmmer        39544kB   47380kB ( +7836kB)   46776kB ( +7232kB)
> 458.sjeng        40144kB   48652kB ( +8508kB)   47608kB ( +7464kB)
> 462.libquantum   23464kB   28576kB ( +5112kB)   28260kB ( +4796kB)
> 464.h264ref     708760kB  738400kB (+29640kB)  734224kB (+25464kB)
> 470.lbm          26552kB   31684kB ( +5132kB)   31348kB ( +4796kB)
> 471.omnetpp     152000kB  172924kB (+20924kB)  167204kB (+15204kB)
> 473.astar        27036kB   31472kB ( +4436kB)   31380kB ( +4344kB)
> 482.sphinx3      33100kB   40812kB ( +7712kB)   39496kB ( +6396kB)
> 483.xalancbmk   368844kB  393292kB (+24448kB)  393032kB (+24188kB)
>
>
> jemalloc causes a regression (and that is rather surprising, because my previous
> tests showed the opposite result, but those tests had a very small workload - in
> fact, a single file).
>
> On 07/27/2015 12:13 PM, Richard Biener wrote:
>>>> On Jul 26, 2015, at 11:50 AM, Andi Kleen <a...@firstfloor.org> wrote:
>>>> Another useful optimization is to adjust the allocation size to be >=
>>>> 2MB. Then modern Linux kernels often can give you a large page,
>>>> which cuts down TLB overhead. I did similar changes some time
>>>> ago for the garbage collector.
>>>
>>> Unless you are running with 64k pages which I do all the time on my armv8
>>> system.
>>
>> This can be a host configurable value of course.
> Yes, I actually mentioned that among possible enhancements. I think that code
> from ggc-page.c can be reused (it already implements querying the page size
> from the OS).
>
>> But first of all (without looking at the patch but just reading the
>> description) this
>> sounds like a good idea. Maybe still allow pools to use their own backing if
>> the object size is larger than the block size of the caching pool?
> Yes, I thought about it, but I was not sure whether this should be implemented
> in advance. I attached the updated patch.
>
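
To make the fallback idea concrete, here is a minimal sketch, assuming a cache of fixed 8 KB blocks where any larger request bypasses the cache and goes straight to the backing allocator. The names block_cache, allocate and release are hypothetical and are not taken from the patch; plain malloc/free stand in for whatever backing the pools would really use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical illustration only; not the data structure used in the patch.
class block_cache
{
public:
  // 8 KB: the block size used with TCMalloc in the measurements above.
  static constexpr std::size_t block_size = 8 * 1024;

  block_cache () = default;
  block_cache (const block_cache &) = delete;
  block_cache &operator= (const block_cache &) = delete;

  // Free every block that is still sitting in the cache.
  ~block_cache ()
  {
    for (void *p : m_free)
      std::free (p);
  }

  // Return memory able to hold SIZE bytes.  Requests that do not fit
  // into one cached block bypass the cache entirely.
  void *allocate (std::size_t size)
  {
    if (size > block_size)
      {
        ++m_oversized;
        return std::malloc (size);
      }
    if (!m_free.empty ())
      {
        // Reuse the most recently released block.
        void *p = m_free.back ();
        m_free.pop_back ();
        return p;
      }
    return std::malloc (block_size);
  }

  // Give back memory obtained from allocate () with the same SIZE.
  void release (void *p, std::size_t size)
  {
    assert (p != nullptr);
    if (size > block_size)
      std::free (p);          // oversized: never cached
    else
      m_free.push_back (p);   // regular block: keep it for reuse
  }

  std::size_t cached () const { return m_free.size (); }
  std::size_t oversized () const { return m_oversized; }

private:
  std::vector<void *> m_free;   // released blocks, each block_size bytes
  std::size_t m_oversized = 0;  // number of requests that bypassed the cache
};
```

With this shape, a pool whose objects exceed block_size never touches the free list, so the small block size that suits TCMalloc's thread cache does not constrain large objects.
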
--
Regards,
    Mikhail Maltsev