On Thu, 10 May 2018, Konstantin Belousov wrote:

On Fri, May 11, 2018 at 03:31:46AM +1000, Bruce Evans wrote:
On Thu, 10 May 2018, Konstantin Belousov wrote:

Log:
 Make fpusave() and fpurestore() on amd64 ifuncs.

 From now on, linking amd64 kernel requires either lld or newer ld.bfd.

This breaks building with gcc:

XX cc1: warnings being treated as errors
XX ../../../amd64/amd64/fpu.c:195: warning: 'ifunc' attribute directive ignored 
[-Wattributes]
...
XX ./machine/fpu.h:62: warning: previous declaration of 'fpurestore' was here
Yes.  Nothing unexpected.

Thus it is not suitable for commit.

After building fpu.o with clang, linking doesn't reqire a newer ld, but
booting does -- after linking with an older ld, the boot panics with a
page fault near fork().
Yes.  I noted this in the commit message.

emaste will commit the check shortly which would prevent using inadequate
linker for building kernel.

Now it is broken even for clang:

XX make: "../../../conf/../../../conf/kern.pre.mk" line 125: amd64 kernel 
requires linker ifunc support

This uses a LINKER_FEATURES, but LINKER_FEATURES is not defined anywhere in
the sys tree.  It is defined in the src amd host tree in bsd.linker.mk.
Kernel makefiles have been broken by adding a lot of dependencies on files
not in the sys tree, but bsd.linker.mk doesn't seem to be included.

Even "make -V LINKER_FEATURES" to determine what LINKER_FEATURES is is broken
(it gives the above error).

After backing out r333470, "make -V LINKER_FEATURES" works and gives
" build-id filter retpoline" with both clang and gcc.  LINKER_FEATURES
seems to be hard-coded somewhere.

Only the ld that will be used by the build knows its features.  The
ld is hard to find.  My gcc is a shell script with lots of -B paths that
eventually find ld.  Links should be done by ${CC} to find this ld.  But
the kernel uses ${LD}.  When this became incompatible earlier this year
after working for about 15 years despite its logical incompatibility,
I worked around the problem by adding "makeoptions LD=<path to my ld>".
My ld was pre-lld until yesterday.  For this test, it is a symlink to
the host's ld.  This supports ifuncs, but LINKER_FEATURES says that
it doesn't.

The host (freefall) also doesn't support ifuncs according to
"make -V LINKER_FEATURES" in src/bin/cat.

Only bsd.pre.mk was broken in r333470.  kern.mk doesn't have this check.

It is a layering bug to not put such checks in kern.mk.

...
This looks like a small or null pessimization.  The branch in the old
version is probably faster.  The following simple test on Haswell shows
that all reasonable versions have almost the same speed when cached:
Yes, I already noted and mjg noted that ifuncs are directed through PLT.
I remember that it was not the case when I did it the first time, but then
both compiler and linker were different.

The test failed to explain why the branch-free version worked slower only
when written in asm (7.25 cycles instead of 7).  It was because I changed
the return value from (xsave ? 1 : 0) to (xsave ? 2 : 1) to better match
some asm versions.  gcc -O only produces branch-free code for '1 : 0', and
that is what I first tested for the C version (before that, I had forgotten
that brranch-free cide might be better).  gcc -O2 produces branch-free code
for '2 : 1'.  It generates the same code as for '1 : 0', then increments this.
Neither -O nor -O2 uses cmov for some reason.  The increment is 1 more
instruction, but doesn't increase the time of 7 cycles.  This is all with
gcc-3.3.3.  gcc-4.2.1 -O is "smarter" and generates "sbb %eax,%eax; add $2,
%eax" to generate '2 : 1'.  This also desn't change the time of 7 cycles.

I tried a quick experiment with adding -fvisibility=hidden to the kernel
compilation, but it still leaves ifunc relocations on PLT (and PLT correctly
consists only of ifuncs).

I might look some more on it later.  Anyway, there are more uses, and SMAP
plus pmap applications of ifunc are more clear.  I selected amd64/fpu.c
for the initial commit to check toolchain in wild since this case is
simplest and easy to diagnose if failing, I hope.

You need to find about 16 million uses per second at 4 GHz for this to give
a 1% optimization ("this" = 0.25 cycles).  -O2 and other optimizations give
an improvement in this range, but break debugging so I turn them off.  Except
for clang, the -finline* flags for turning off some optimizations don't
even work.  A typical stack trace for clang usually has
<value optimized out> for almost all values even for debuggers that should
be able to find the values in registers.  ifuncs might also give
<function name optimized out>.  Debugging is already very broken by inlining
small functions -- stack traces don't show the small functions or their
args, but who a caller several layers high.  All for optimizations in the
1-5% range for kernels, or a small fraction of that for total time.

Bruce
_______________________________________________
svn-src-all@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"

Reply via email to