On Sun, 22 Jan 2017, Mateusz Guzik wrote:

On Sun, Jan 22, 2017 at 12:07:56PM +1100, Bruce Evans wrote:
...
Later commits further unimproved style by adding __predict_ugly() and
long lines from blind substitution of that.   __predict_ugly() is almost
as useful as 'register', but more invasive.

There were no __predict changes to this function.

I did add some to vn_closefile in r312602.

Indeed, one line is now 82 chars and arguably could be modified to fit
the limit.

I have to disagree about the usefulness remark. If you check generated
assembly for amd64 (included below), you will see the uncommon code is
moved out of the way and in particular there are no forward jumps in the
common case.

Check benchmarks.  A few cycles here and there are in the noise.  Kernel
code has very few possibilities for useful optimizations since it doesn't
have many inner loops.

With __predict_false:

[snip prologue]
  0xffffffff8084ecaf <+31>:       mov    0x24(%rbx),%eax
  0xffffffff8084ecb2 <+34>:       test   $0x40,%ah
  0xffffffff8084ecb5 <+37>:       jne    0xffffffff8084ece2 <vn_closefile+82>

All that this does is as you say -- invert the condition to jump to the
uncommon code.  This made more of a difference on old CPUs (possiblly
still on low end/non-x86).  Now branch predictors are too good for the
slow case to be much slower.

I think x86 branch predictors initially predict forward branches as not
taken and backward branches as taken.  So this branch was initially
mispredicted, and the change fixes this.  But the first branch doesn't
really matter.  Starting up takes hundreds or thousands of cycles for
cache misses.

Moving the uncommon code out of the way reduces Icache pressure a bit.
I've never been able to measure a significant difference from this.

However, I usually compile with -Os to reduce Icache pressure.  This
is almost as good for speed as -O (typically at most 5% slower than
-O2 for kernels). The other day, I tried compiling with -O2
-march=highest-supported and found that on a network microbenchmark
which was supposed to benefit from this, it was significantly slower.
It was with a very old version of gcc.  I think this is because -O2
rearranges the code a lot, and since at least the old version of gcc
doesn't understand caches at all, the arrangement accidentally ended
up much worse and caused more cache misses.  With another compiler and
NIC, -O2 gave the expected speeding.

  0xffffffff8084ecb7 <+39>:       mov    0x24(%rbx),%esi
  0xffffffff8084ecba <+42>:       mov    0x10(%rbx),%rdx
  0xffffffff8084ecbe <+46>:       mov    %r14,%rdi
  0xffffffff8084ecc1 <+49>:       mov    %r15,%rcx
  0xffffffff8084ecc4 <+52>:       callq  0xffffffff80850000 <vn_close>
  0xffffffff8084ecc9 <+57>:       mov    %eax,%r15d
  0xffffffff8084eccc <+60>:       mov    0x24(%rbx),%eax
  0xffffffff8084eccf <+63>:       test   $0x40,%ah
  0xffffffff8084ecd2 <+66>:       jne    0xffffffff8084ecf5 <vn_closefile+101>
[snip cleanup]
  0xffffffff8084ece1 <+81>:       retq

The code for he uncommon case is now here (not shown).  It will be no
smaller (actually slightly larger for a branch back) unless it can be
shared.


Without __predict_false:
[snip prologue]
  0xffffffff8084ecaf <+31>:       mov    0x24(%rbx),%eax
  0xffffffff8084ecb2 <+34>:       test   $0x40,%ah
  0xffffffff8084ecb5 <+37>:       je     0xffffffff8084ecc8 <vn_closefile+56>

The branch is not very far away, so perhaps the optimization from the move
is null.

  0xffffffff8084ecb7 <+39>:       movzwl 0x20(%rbx),%eax
  0xffffffff8084ecbb <+43>:       cmp    $0x1,%eax
  0xffffffff8084ecbe <+46>:       jne    0xffffffff8084ecc8 <vn_closefile+56>
  0xffffffff8084ecc0 <+48>:       mov    %r14,%rdi
  0xffffffff8084ecc3 <+51>:       callq  0xffffffff8083e270 <vrefact>

The branch was to here <+56>.  The function is not very well aligned.  It
has address 0x90 mod 0x100, so starts 0x10 into a 64-bit cache line (is that
still the line size for all modern x86?).  So 48 bytes are in a line, but
this +56 is in the next line.

But aligning functions and branches to cache line boundaries works poorly
in general.  It wastes Icache in another way.  -march affects this a lot,
and -march=higher is often worse since the compiler doesn't really understand
this and does more or less alignment than is good.  Here it didn't bother
aligning even 1 of the branch targets.

  0xffffffff8084ecc8 <+56>:       mov    0x24(%rbx),%esi
  0xffffffff8084eccb <+59>:       mov    0x10(%rbx),%rdx
  0xffffffff8084eccf <+63>:       mov    %r14,%rdi
  0xffffffff8084ecd2 <+66>:       mov    %r15,%rcx

The main code ends up just +17 later at <+69>.  Fitting it all below <+64>
would be better but is not necessary if the instructions are slow enough
to all the instruction fetch to run ahead.  This depends on good prediction
to not start too many instructions in paths that won't be followed.

  0xffffffff8084ecd5 <+69>:       callq  0xffffffff80850000 <vn_close>
  0xffffffff8084ecda <+74>:       mov    %eax,%r15d
  0xffffffff8084ecdd <+77>:       mov    0x24(%rbx),%eax
  0xffffffff8084ece0 <+80>:       test   $0x40,%ah
  0xffffffff8084ece3 <+83>:       je     0xffffffff8084ed45 <vn_closefile+181>
  0xffffffff8084ece5 <+85>:       movzwl 0x20(%rbx),%eax
  0xffffffff8084ece9 <+89>:       cmp    $0x1,%eax
  0xffffffff8084ecec <+92>:       jne    0xffffffff8084ed45 <vn_closefile+181>
  0xffffffff8084ecee <+94>:       movw   $0x0,-0x22(%rbp)
[skip some parts, +181 marks the beginning of cleanup]
  0xffffffff8084ed52 <+194>:      retq

Bruce
_______________________________________________
svn-src-all@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"

Reply via email to