On Fri, Jun 05, 2015 at 09:26:33AM +0400, Mikhail Maltsev wrote: > > The compiler has much more information than the headers. > > - It can do alias analysis, so to avoid needing to handle overlap > > and similar. > > - It can (sometimes) determine alignment, which is important > > information for tuning. > > - With profile feedback it can use value histograms to determine the > > best code. > Value range information is also passed to, for example, memcpy expander. > Even if exact length of data being copied is not known at compile time, > the known range can help the compiler to select a subset of available > algorithms a choose one of them at run time without using several > branches and a huge jumptable (which is unavoidable in library code, > because it has to deal with general case). > You can't as there is no such thing as optimum algorithm. I did simple memset benchmark that tests 8byte aligned memset(x, 0, c) with constant argument on L1 cache resident data.
You as compiler simply cannot do that as you would need to expand memset(x, 0, 256) into a) rep stosb. On haswell thats fastest which glibc memset-avx2 does. b) sequence of movdqa %xmm0, x(%rdi) Thats optimal on ivy bridge. Thats a table lookup exported from libc that I talked about. c) use -mstringop-strategy=unrolled_loop. Thats a fastest implementation on core2 Now tell me how you as a compiler would do selection? Also data clearly shows that memset suffers same problem as memcmp and using loop should be disabled by default. Its slower than glibc one since 128 where you stop unrolling it into movq sequence for all tested processors except core2 where treshold is at 256 bytes (read below). >From variants only unrolled_loop is sometimes faster but thats slower than a full unrolling that I proposed. As using jumptable its only way to do that as you need to map each size to appropriate address in completely unrolled implementation. Also we could use unrolled implementation using vmovdqa %ymm0, (%rdi) on haswell. Without that I could use jumptable with powers of 2. > BUT comparison functions (like str[n]cmp, memcmp) expand to something > like "repz cmpsb", which is definitely suboptimal. IMHO, a quick and > dirty (but working) solution would be to use library call instead of > relying on HAVE_cmpstr[n]si, (except short strings). A better solution > would be to implement something like movmem pattern (which is aware of > value ranges, alignment, etc.) but for comparison patterns. I could try > to write at least some pro I wrote a recursive always_inline function how it should be done in sibling thread. Main point is for gcc to recognize streq and memeq when you dont care about strcmp sign which makes implementation almost trivial. haswell size 64 glibc.s 0.41 loop.s 0.64 memset-avx2.s 0.52 memset_tbl.s 0.43 rep8.s 0.91 unroll.s 0.44 size 96 glibc.s 0.63 loop.s 0.82 memset-avx2.s 0.52 memset_tbl.s 0.55 rep8.s 0.91 unroll.s 0.56 size 128 glibc.s 0.80 loop.s 1.02 memset-avx2.s 0.77 memset_tbl.s 0.64 rep8.s 0.92 unroll.s 0.64 size 192 glibc.s 1.06 loop.s 1.46 memset-avx2.s 0.95 memset_tbl.s 0.80 rep8.s 1.26 unroll.s 0.88 size 256 glibc.s 1.34 loop.s 1.87 memset-avx2.s 0.70 memset_tbl.s 0.93 rep8.s 1.46 unroll.s 1.10 size 384 glibc.s 1.64 loop.s 3.05 memset-avx2.s 0.85 memset_tbl.s 1.21 rep8.s 1.88 unroll.s 1.54 size 512 glibc.s 1.86 loop.s 3.59 memset-avx2.s 1.07 memset_tbl.s 1.45 rep8.s 2.28 unroll.s 2.00 size 1024 glibc.s 2.94 loop.s 5.80 memset-avx2.s 1.83 memset_tbl.s 2.54 rep8.s 2.81 unroll.s 3.78 size 2048 glibc.s 4.91 loop.s 10.44 memset-avx2.s 3.60 memset_tbl.s 4.65 rep8.s 4.91 unroll.s 7.50 i7_ivy_bridge size 64 glibc.s 0.44 loop.s 0.99 memset-avx2.s 0.57 memset_tbl.s 0.44 rep8.s 0.96 unroll.s 0.48 size 96 glibc.s 0.64 loop.s 1.14 memset-avx2.s 0.56 memset_tbl.s 0.59 rep8.s 1.00 unroll.s 0.62 size 128 glibc.s 0.80 loop.s 1.45 memset-avx2.s 0.84 memset_tbl.s 0.66 rep8.s 1.14 unroll.s 0.69 size 192 glibc.s 1.04 loop.s 2.24 memset-avx2.s 0.97 memset_tbl.s 0.82 rep8.s 1.38 unroll.s 0.93 size 256 glibc.s 1.31 loop.s 2.83 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s 0.95 rep8.s 1.61 unroll.s 1.17 size 384 glibc.s 1.61 loop.s 4.13 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s 1.26 rep8.s 2.01 unroll.s 1.72 size 512 glibc.s 1.89 loop.s 4.99 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s 1.48 rep8.s 2.39 unroll.s 2.05 size 1024 glibc.s 3.01 loop.s 8.26 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s 3.31 rep8.s 3.43 unroll.s 3.84 size 2048 glibc.s 5.01 loop.s 14.89 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s 5.05 rep8.s 5.75 unroll.s 7.83 core2: size 64 glibc.s 1.04 loop.s 1.48 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s Command terminated by signal 4 0.00 rep8.s 1.53 unroll.s 0.85 size 96 glibc.s 1.81 loop.s 1.73 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s Command terminated by signal 4 0.00 rep8.s 1.53 unroll.s 1.11 size 128 glibc.s 1.87 loop.s 2.21 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s Command terminated by signal 4 0.00 rep8.s 1.73 unroll.s 1.29 size 192 glibc.s 2.25 loop.s 3.42 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s Command terminated by signal 4 0.00 rep8.s 2.14 unroll.s 1.86 size 256 glibc.s 2.59 loop.s 4.38 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s Command terminated by signal 4 0.00 rep8.s 2.46 unroll.s 2.24 size 384 glibc.s 3.18 loop.s 6.32 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s Command terminated by signal 4 0.00 rep8.s 3.35 unroll.s 3.21 size 512 glibc.s 3.80 loop.s 8.25 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s Command terminated by signal 4 0.00 rep8.s 4.19 unroll.s 4.07 size 1024 glibc.s 6.27 loop.s 12.92 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s 5.95 rep8.s 7.97 unroll.s 7.64 size 2048 glibc.s 11.37 loop.s 24.11 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s 11.01 rep8.s 14.56 unroll.s 14.97 fx10: size 64 glibc.s 0.65 loop.s 1.06 memset-avx2.s 0.83 memset_tbl.s 0.64 rep8.s 0.92 unroll.s 0.67 size 96 glibc.s 1.15 loop.s 1.25 memset-avx2.s 0.97 memset_tbl.s 0.87 rep8.s 1.14 unroll.s 0.88 size 128 glibc.s 1.19 loop.s 1.91 memset-avx2.s 1.50 memset_tbl.s 0.91 rep8.s 1.20 unroll.s 1.00 size 192 glibc.s 1.59 loop.s 2.78 memset-avx2.s 1.61 memset_tbl.s 1.22 rep8.s 1.58 unroll.s 1.29 size 256 glibc.s 1.76 loop.s 3.38 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s 1.69 rep8.s 1.97 unroll.s 1.78 size 384 glibc.s 2.33 loop.s 4.59 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s 2.12 rep8.s 2.64 unroll.s 2.37 size 512 glibc.s 3.01 loop.s 5.76 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s 2.70 rep8.s 4.45 unroll.s 3.36 size 1024 glibc.s 5.72 loop.s 10.43 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s 5.48 rep8.s 6.56 unroll.s 6.06 size 2048 glibc.s 10.32 loop.s 19.63 memset-avx2.s Command terminated by signal 4 0.00 memset_tbl.s 10.15 rep8.s 11.17 unroll.s 11.12
memset_bt.tar.bz2
Description: Binary data