On Tue, 5 Nov 2013, John-Mark Gurney wrote:

Andriy Gapon wrote this message on Tue, Nov 05, 2013 at 12:12 +0200:
on 08/10/2013 04:38 Xin LI said the following:
Author: delphij
Date: Tue Oct  8 01:38:24 2013
New Revision: 256132
URL: http://svnweb.freebsd.org/changeset/base/256132

Log:
  Improve lzjb decompress performance by reorganizing the code
  to tighten the copy loop.

  Submitted by: Denis Ahrens <denis h3q com>
  MFC after:    2 weeks
  Approved by:  re (gjb)

Modified:
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c

Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c
==============================================================================
--- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c  Mon Oct  7 22:30:03 2013        (r256131)
+++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c  Tue Oct  8 01:38:24 2013        (r256132)
@@ -117,7 +117,9 @@ lzjb_decompress(void *s_start, void *d_s
                        src += 2;
                        if ((cpy = dst - offset) < (uchar_t *)d_start)
                                return (-1);
-                       while (--mlen >= 0 && dst < d_end)
+                       if (mlen > (d_end - dst))
+                               mlen = d_end - dst;
+                       while (--mlen >= 0)
                                *dst++ = *cpy++;
                } else {
                        *dst++ = *src++;


Intuitively it might seem that this change is indeed an improvement.
But given how "not dumb" (not sure if that always amounts to smart) modern
compilers are, has anyone actually measured whether this change indeed
improves performance?

Judging from the conversations on the ZFS mailing lists, this change even
hurt performance in some environments.
Looks like upstream is not going to take this change.

So does it make any sense for us to make the code more obscure and different
from upstream, unless performance benefits can indeed be demonstrated?

The old code compiles to:
  62c13:       49 f7 de                neg    %r14
  62c16:       44 8d 58 ff             lea    -0x1(%rax),%r11d
  62c1a:       4c 89 d3                mov    %r10,%rbx
  62c1d:       0f 1f 00                nopl   (%rax)
  62c20:       42 8a 14 33             mov    (%rbx,%r14,1),%dl
  62c24:       88 13                   mov    %dl,(%rbx)
  62c26:       48 ff c3                inc    %rbx
  62c29:       ff c8                   dec    %eax
  62c2b:       85 c0                   test   %eax,%eax
  62c2d:       7f f1                   jg     62c20 <lzjb_decompress+0xa0>

The load/store line is addresses 62c20-62c29...  So it looks like clang
is already doing the optimization that was added...

It is unlikely to make a difference, but this is fairly bad source and
object code:
- byte-at-a-time accesses.  The loop just copies data.  Even calling
  memcpy might be faster.  Compilers used to inline memcpy too much,
  but they now know that they don't understand memory, so the call
  would probably be external, though it should be inlined iff the size
  of the copy is small.
- the source code asks for, and gets, a slow signed comparison.  The
  compiler is apparently not able to optimize this, and the extra
  instruction at 62c2b is needed.
- rewrites to change the loop termination condition often change it
  from (--v >= 0) to (--v == 0).  That isn't done here, so the extra
  instruction at 62c2b remains.  I don't like signed variables; this
  optimization often occurs automatically because size_t is used for
  byte counts and it is unsigned.  mlen is apparently signed and not
  size_t, else the optimization would occur automatically (writing
  ">= 0" makes no sense for unsigned variables, so the compiler will
  translate the end test to "== 0"; see the sketch after this list).
- oops, the extra instruction at 62c2b is not needed.  The compiler could
  have used "subl $1,%eax" instead of "decl %eax" to set the sign flags.
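
To make the termination-condition point concrete, here is a sketch (an
illustration, not code from the tree) of the loop shapes involved; mlen in
lzjb is the signed form, and which of these a given compiler turns into the
minimal decrement-and-branch loop varies:

#include <stddef.h>

/* Sketch only: three ways to count a bytewise copy down to zero. */

/* Signed count with "--v >= 0", as in the lzjb loop: the compiler has to
 * branch on a signed greater-than test after the decrement. */
static void
copy_signed(unsigned char *dst, const unsigned char *src, int mlen)
{
	while (--mlen >= 0)
		*dst++ = *src++;
}

/* Unsigned count: ">= 0" would be meaningless, so the test becomes "!= 0". */
static void
copy_size_t(unsigned char *dst, const unsigned char *src, size_t len)
{
	while (len-- != 0)
		*dst++ = *src++;
}

/* The "--v == 0" shape made explicit: the decrement can set the flags for
 * the final branch, at the cost of rejecting len == 0 up front. */
static void
copy_exact(unsigned char *dst, const unsigned char *src, size_t len)
{
	if (len == 0)
		return;
	do {
		*dst++ = *src++;
	} while (--len != 0);
}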

A minimal bytewise copying loop on 64-bit x86 looks like:

1:
        movb    (%rsi,%rcx,1),%al       # from top down
        movb    %al,(%rdi,%rcx,1)
        decq    %rcx
        jne     1b                      # end with last byte not copied
        ...     # copy last byte now, or modify end condition

4 instructions instead of 6.  But this probably makes little difference on
out-of-order (OOE) x86.
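
In C, that loop shape is roughly the following sketch (an illustration only;
it assumes the copy direction does not matter, which is not true for the
overlapping lzjb match copy, where cpy trails dst and the copy has to run
from low addresses upward):

#include <stddef.h>

/* Sketch of the top-down loop above: indices run n-1 down to 1 inside the
 * loop, and the byte at index 0 is copied afterwards. */
static void
copy_topdown(unsigned char *dst, const unsigned char *src, size_t n)
{
	size_t i;

	if (n == 0)
		return;
	for (i = n - 1; i != 0; i--)
		dst[i] = src[i];
	dst[0] = src[0];	/* copy the last byte now */
}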

The assembly loop above can use subl too, to allow using jg so as not to
need an extra step to copy the last byte.  Using decl is a good optimization
for space on 32-bit values, but IIRC decq is not a single-byte instruction,
so it is not such a good space optimization and using subl loses less space
by comparison.

This already copies from the top down so as to allow an end condition
of nearly %rcx == 0.  Perhaps the pointers and counters can be adjusted
a bit more to get to exactly %rcx == 0, but this probably has more
setup overheads.  Getting the end condition to be exactly '--v == 0'
tends to take more C code too.  Compilers know about this problem, and
sometimes generate an extra step outside of the loop even if you
structure the loop unnaturally to try to avoid this.
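
One C shape (again just a sketch, with the same direction caveat as above)
that gets the end condition to exactly zero without a separate final copy is
to decrement before the access:

#include <stddef.h>

/* Sketch: decrement first, so the loop covers indices n-1 down to 0 and
 * ends on exactly n == 0 with no leftover byte. */
static void
copy_topdown_exact(unsigned char *dst, const unsigned char *src, size_t n)
{
	while (n != 0) {
		n--;
		dst[n] = src[n];
	}
}

Whether the compiler then folds the flags from the decrement into the
back-branch, instead of re-testing n, is of course up to the compiler.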

The loop could also be unrolled a bit.  ISTR noticing clang doing automatic
unrolling.  Perhaps this is the main thing affected by manual optimizations.
They might be harmful because they confuse the compiler into not doing
automatic unrolling, or harmful because they help the compiler do
automatic unrolling when it shouldn't, or helpful...
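
For illustration, a hand-unrolled version might look like the sketch below
(the factor of 4 and the scalar remainder loop are arbitrary choices);
whether this helps or just fights the compiler's own unrolling is exactly
the question:

#include <stddef.h>

/* Sketch: bytewise copy unrolled by hand by 4.  The accesses stay in the
 * same strictly increasing order as the plain byte loop, so an overlapping
 * match copy would still behave the same. */
static void
copy_unrolled(unsigned char *dst, const unsigned char *src, size_t n)
{
	while (n >= 4) {
		dst[0] = src[0];
		dst[1] = src[1];
		dst[2] = src[2];
		dst[3] = src[3];
		dst += 4;
		src += 4;
		n -= 4;
	}
	while (n-- != 0)
		*dst++ = *src++;
}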

In my copy benchmarks, I didn't find reducing the loop overhead by
adjusting only the count register to be very useful, since loop overhead
is easy to hide in the src/dst access overhead.  When there are cache
misses for src/dst, loop overhead is in the noise, so all of this is
only interesting for the L1-cached case.  It usually takes only minor
(but often some) unrolling to amortize the loop overhead in the
L1-cached case.
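
(For anyone who wants to repeat that kind of measurement, a minimal
user-space harness for the L1-cached case could look like the sketch below.
The buffer size, the match offset and the iteration count are arbitrary
assumptions, and a compiler may well see through a toy like this, so timing
real lzjb decompression would be the more convincing test.)

#include <stdio.h>
#include <string.h>
#include <time.h>

typedef unsigned char uchar_t;

#define	BUFSIZE		4096	/* small enough to stay L1-cached (assumed) */
#define	MATCHOFF	3	/* arbitrary short match offset */
#define	ITERATIONS	200000

/* Old shape: bounds check inside the loop. */
static void
copy_old(uchar_t *dst, uchar_t *cpy, int mlen, uchar_t *d_end)
{
	while (--mlen >= 0 && dst < d_end)
		*dst++ = *cpy++;
}

/* New shape: clamp mlen once, then a plain copy loop. */
static void
copy_new(uchar_t *dst, uchar_t *cpy, int mlen, uchar_t *d_end)
{
	if (mlen > (d_end - dst))
		mlen = d_end - dst;
	while (--mlen >= 0)
		*dst++ = *cpy++;
}

static double
run(void (*copy)(uchar_t *, uchar_t *, int, uchar_t *))
{
	static uchar_t buf[BUFSIZE];
	struct timespec t0, t1;
	int i;

	memset(buf, 'x', MATCHOFF);
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERATIONS; i++)
		copy(buf + MATCHOFF, buf, BUFSIZE - MATCHOFF, buf + BUFSIZE);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return ((t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
}

int
main(void)
{
	printf("old loop: %.3f s\n", run(copy_old));
	printf("new loop: %.3f s\n", run(copy_new));
	return (0);
}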

Bruce