On Sat, 24 Dec 2011, Alexander Best wrote:

On Sat Dec 24 11, Bruce Evans wrote:
On Sat, 24 Dec 2011, Alexander Best wrote:

On Sat Dec 24 11, Bruce Evans wrote:
On Fri, 23 Dec 2011, Alexander Best wrote:
...
the gcc(1) man page states the following:

"
This extra alignment does consume extra stack space, and generally
increases code size.  Code that is sensitive to stack space usage,
such as embedded systems and operating system kernels, may want to
reduce the preferred alignment to -mpreferred-stack-boundary=2.
"

the comment in sys/conf/kern.mk, however, sort of suggests that an
alignment of 4 bytes might improve performance.

The default stack alignment is 16 bytes, which worsens performance here
rather than improving it.
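The difference is easy to see in generated code.  A sketch (the file
and function names are only illustrative):

    /* align_demo.c: compare the code generated with the default stack
     * alignment and with -mpreferred-stack-boundary=2, e.g.:
     *
     *     gcc -O2 -S align_demo.c
     *     gcc -O2 -S -mpreferred-stack-boundary=2 align_demo.c
     *
     * With the default (16-byte) preferred boundary, gcc pads each
     * frame so that %esp stays 16-byte aligned at every call; with =2
     * it only keeps 4-byte alignment, so frames are smaller.
     */
    void callee(int a, int b, int c);

    void
    caller(void)
    {
            /* 12 bytes of arguments: the default boundary typically
             * rounds the outgoing area up to 16 bytes; =2 leaves it
             * at 12.
             */
            callee(1, 2, 3);
    }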

maybe the part of the comment in sys/conf/kern.mk which mentions that a
stack alignment of 16 bytes might improve micro-benchmark results should
be removed.  this would prevent people (like me) from thinking that using
a stack alignment of 4 bytes is a compromise between size and efficiency.
it isn't!  currently a stack alignment of 16 bytes has no advantages over
one of 4 bytes on i386.

I think the comment is clear enough.  It mentions all the tradeoffs.
It is only slightly cryptic in saying that these are tradeoffs and that
the configuration is our best guess at the best tradeoff -- it just says
"while" for both.  It goes without saying that we don't use our worst
guess.  Anyone wanting to change this should run benchmarks and beware
that micro-benchmarks are especially useless.  The changed comment is not
as good, since it no longer mentions micro-benchmarks or says "while".

if micro-benchmark results aren't of any use, why should the claim that
the default stack alignment of 16 bytes might produce a better outcome
stay?

Because:
- the actual claim is the opposite of that (it is that the default 16-byte
  alignment is probably a loss overall)
- the claim that the default 16-byte alignment may benefit micro-benchmarks
  is true, even without the weaselish miswording of "might" in it.  There
  is always at least 1 micro-benchmark that will benefit from almost any
  change, and here we expect a benefit in many micro-benchmarks that don't
  bust the caches (a sketch of such a benchmark follows below).  Except,
  16-byte alignment isn't supported (*) in the kernel, so we actually
  expect a loss from many micro-benchmarks that don't bust the caches.
- the second claim warns inexperienced benchmarkers not to claim that the
  default is better because it is better in micro-benchmarks.
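To make "micro-benchmark that doesn't bust the caches" concrete, a
user-mode sketch (the names and sizes are made up; this is not a
benchmark that was actually run here):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* A hot leaf function whose frame holds a 64-bit object; its speed
     * can depend on whether that object straddles a cache line, which
     * in turn depends on the stack alignment it is called with.
     */
    static uint64_t
    hot(uint64_t x)
    {
            volatile uint64_t local = x;

            return (local + 1);
    }

    int
    main(void)
    {
            struct timespec t0, t1;
            uint64_t sum = 0;
            int i;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (i = 0; i < 10000000; i++)
                    sum += hot((uint64_t)i);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            printf("%.0f ns (sum %llu)\n",
                (t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec), (unsigned long long)sum);
            return (0);
    }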

it doesn't seem as if anybody has micro-benchmarked 16-byte vs. 4-byte
stack alignment until now.  so the micro-benchmark statement in the
comment seems to be pure speculation.

No, it is obviously true.

even worse...it indicates that by removing the
-mpreferred-stack-boundary=2 flag, one can gain a performance boost by
sacrificing a few more bytes of kernel (and module) size.

No, it is part of the sentence explaining why removing the
-mpreferred-stack-boundary=2 flag would probably bring back the "overall
loss" that the flag avoids.

this suggests that the behavior of -mpreferred-stack-boundary=2 vs. not
specifying it loosely equals the semantics of -Os vs. -O2.

No, -Os guarantees slower execution by forcing optimization to prefer
space savings over time savings in more ways.  Except, -Os is completely
broken in -current (in the kernel), and gives very large negative space
savings (about 50%).  It last worked with gcc-3.  Its brokenness with
gcc-4 is related to kern.pre.mk still specifying -finline-limit flags
that are more suitable for gcc-3 (gcc has _many_ flags for giving more
delicate control over inlining, and better defaults for them) and
excessive inlining in gcc-4 given by -funit-at-a-time
-finline-functions-called-once.  These apparently cause gcc's inliner
to go insane with -Os.  When I tried to fix this by reducing inlining,
I couldn't find any threshold that fixed -Os without breaking inlining
of functions that are declared inline.
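For illustration, the threshold problem looks something like this
(hypothetical file, not kernel code): compile with various
-finline-limit=N and watch whether a trivial function declared inline
still gets inlined:

    /* Compile with e.g. "gcc -Os -finline-limit=N -S inline_demo.c"
     * for various N and inspect the assembly: below some N, small()
     * is no longer inlined into use(), even though it is declared
     * inline.
     */
    static inline int
    small(int x)
    {
            return (x * x + 1);
    }

    int
    use(int x)
    {
            return (small(x) + small(x + 1));
    }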

(*) A primary part of the lack of support for 16-byte stack alignment in
the kernel is that there is no special stack alignment for the main
kernel entry point, namely syscall().  From i386/exception.s:

%       SUPERALIGN_TEXT
% IDTVEC(int0x80_syscall)

At this point, the stack has 5 words on it: the %ss, %esp, %eflags, %cs
and %eip that the hardware pushed for the ring transition (the stack was
16-byte aligned before that).

%       pushl   $2                      /* sizeof "int 0x80" */
%       subl    $4,%esp                 /* skip over tf_trapno */
%       pushal
%       pushl   %ds
%       pushl   %es
%       pushl   %fs
%       SET_KERNEL_SREGS
%       cld
%       FAKE_MCOUNT(TF_EIP(%esp))
%       pushl   %esp

We "push" 14 more words.  This gives perfect misaligment to the worst odd
word boundary (perfect if only word boundaries are allowed).  gcc wants
the stack to be aligned to a 4*n word boundary before function calls,
but here we have a 4*n+3 word boundary.  (4*n+3 is worse than 4*n+1
since 2 more words instead of 4 will cross the next 16-byte boundary).

%       call    syscall

Using the default -mpreferred-stack-boundary will preserve the perfect
misalignment across all C functions called by syscall().

%       add     $4, %esp
%       MEXITCOUNT
%       jmp     doreti

Old versions didn't have the pessimization of pushing the frame pointer.
This is a minor pessimization, except it uses more stack, unless you
use the default -mpreferred-stack-boundary.  Without this, only 18 words
were pushed, so the misalignment was imperfect (to a 4*n+2 word
boundary).  If the default stack alignment is any use at all (in the
kernel), then it is mainly to prevent 64-bit data types being laid out
across cache line boundaries.  Alignment to a 4*n+2 word boundary gives
that just as well as alignment to a 4*n+0 word boundary.
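A user-mode sketch of that point (assuming 64-byte cache lines; the
names are illustrative): an 8-byte object in an 8-byte-aligned frame can
never straddle a line, while odd-word alignment lets it:

    #include <stdint.h>
    #include <stdio.h>

    /* Report whether an 8-byte stack object straddles a 64-byte cache
     * line: it does iff its address modulo 64 is greater than 56.
     */
    static void
    show(void)
    {
            uint64_t local;

            printf("local at %p: %s\n", (void *)&local,
                ((uintptr_t)&local % 64) > 56 ?
                "straddles a line" : "within one line");
    }

    int
    main(void)
    {
            show();
            return (0);
    }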

I tested using the default -mpreferred-stack-boundary in FreeBSD-~5.2,
which doesn't push the frame pointer.  This gave the expected results,
except the optimization for a micro-benchmark was surprisingly large.
For a macro-benchmark, I built some kernels.  This seemed to take a
little longer (about 0.2%, and not statistically significant).  But
the time for a clock_gettime() micro-benchmark was reduced from 271 ns
per call to 263.5 ns per call.  That's with the stack for clock_gettime()
imperfectly misaligned to a 4*n+2 word boundary.  But changing the
stack alignment by subtracting more from the stack in the syscall entry
made little difference, unless it was changed to an odd byte boundary
(then clock_gettime() took about 324 ns).
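The micro-benchmark was presumably along these lines (a reconstruction,
not the actual test program): each call enters the kernel through the
int0x80 path shown above, so the per-call time is sensitive to the
kernel stack alignment:

    #include <stdio.h>
    #include <time.h>

    int
    main(void)
    {
            struct timespec ts, t0, t1;
            int i, n = 1000000;

            clock_gettime(CLOCK_REALTIME, &t0);
            for (i = 0; i < n; i++)
                    clock_gettime(CLOCK_REALTIME, &ts);
            clock_gettime(CLOCK_REALTIME, &t1);
            printf("%.1f ns/call\n",
                ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec)) / n);
            return (0);
    }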

amd64 is of course more careful about this (since its ABI requires
16-byte alignment).  According to log messages, the initial %rsp
(before anything is pushed onto it in the above) is offset by 8
bytes or so, as necessary to make the final %rsp come out aligned.
Pushing the frame pointer would have broken this.  However, on
amd64, the first arg is passed in %rdi, so there is no push to
pass the frame pointer and the stack remains aligned.  When the
frame pointer was passed "by reference", adjusting the stack
after the pushes would have broken the reference, so the offset
method was essential.  Now it is not needed (unless we want or
need the frame to be aligned), since %rdi can pass the frame
pointer wherever the frame is, and the offset method becomes a
minor optimization.  If you remove the -mpreferred-stack-boundary=2
optimization, be sure to remove this one too, since it is tinier.

Bruce