Large, modular C++ application performance ...

2005-07-29 Thread michael meeks
Hi there,

I've been doing a little thinking about how to improve OO.o startup
performance recently; and - well, relocation processing happens to be
the single, biggest thing that most tools flag.

Anyhow - so I wrote up the problem, and a couple of potential
solutions / extensions / workarounds, and - being of a generally
clueless nature, was hoping to solicit instruction from those of a more
enlightened disposition.

All input much appreciated; no doubt my terminology is irritatingly up
the creek, hopefully the sentiment will win through.

http://go-oo.org/~michael/OOoStartup.pdf

Two solutions are proposed - there are almost certainly more that I'm
not thinking of. I'm interested in people's views as to which approach
is best. So far the constructor hook approach seems to be the path of
least resistance.

Thanks,

Michael.

-- 
 [EMAIL PROTECTED]  <><, Pseudo Engineer, itinerant idiot



Re: Large, modular C++ application performance ...

2005-08-01 Thread michael meeks
Hi Giovanni,

On Sat, 2005-07-30 at 15:36 +0200, Giovanni Bajo wrote:
> I'm slow, but I can't understand why a careful design of the interfaces of
> the dynamic libraries

Well - sure, depends how 'careful' you are ;-) clearly if no C++
classes with virtual methods form the interface of any library, then
there is no problem ;-) unfortunately, mandating that would rather
cripple C++.

>  together with the new -fvisibility flags, should not
> be sufficient. It worked well in other scenarios

-fvisibility is helpful - as the paper says, not as helpful as the old
-Bsymbolic (or link maps exposing only 3 or so functions) were. However
- -fvisibility can only help so much - if you have:

    class LibraryAClass {
        virtual void doFoo(void);
    };
    class LibraryBClass : public LibraryAClass {
        virtual void doBaa(void);
    };

then there are 2 problems:

a) there is no symbol visibility that will trigger internal
   binding in addition to a symbol export. ie. if 
   'LibraryBClass' is a public interface - no useful
   visibility markup can be done; and hence we have a named
   relocation for 'doBaa's vtable slot.
   [ IMHO this is a feature-gap, we need a new ('export'?)
 visibility attribute for this case ].

b) even if LibraryBClass is a 'hidden' class - to build its
   vtable we have to have a slot for 'doFoo' which is in
   an external library (A) => another named relocation. An 
   unavoidable consequence of using virtual classes as part of
   a library's API.
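
To make the two cases concrete, here is a minimal single-file sketch, assuming GCC or Clang on an ELF platform. The API_EXPORT / API_HIDDEN macro names and the callFoo helper are hypothetical (not from the paper), and in a real build the two classes would live in separate DSOs (A and B) rather than in one translation unit:

```cpp
#include <string>

// Hypothetical macros wrapping GCC's visibility attribute.
#define API_EXPORT __attribute__((visibility("default")))
#define API_HIDDEN __attribute__((visibility("hidden")))

// Imagine this class is defined in, and exported from, library A.
class API_EXPORT LibraryAClass {
public:
    virtual ~LibraryAClass() {}
    virtual std::string doFoo() { return "A::doFoo"; }
};

// Imagine this class is defined in library B. Even though it is hidden,
// its vtable inherits the doFoo slot from library A, so filling that
// slot needs a named relocation against A's symbol (case b above).
class API_HIDDEN LibraryBClass : public LibraryAClass {
public:
    virtual std::string doBaa() { return "B::doBaa"; }
};

// Calls made through the base class go via the inherited vtable slot.
inline std::string callFoo(LibraryAClass &obj) { return obj.doFoo(); }
```

Marking LibraryBClass hidden removes its own symbols from B's export table, but it cannot remove the reference to A::doFoo: that slot still has to be resolved by name at load time.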

> IMHO, it's unreasonable to break the C++ ABI for 1 second of warm time
> startup.

Well - it's an option that was considered, although - as you say -
highly unpleasant, and probably quite unnecessary - as the paper
explains.

Regards,

Michael.

-- 
 [EMAIL PROTECTED]  <><, Pseudo Engineer, itinerant idiot



Re: Large, modular C++ application performance ...

2005-08-01 Thread michael meeks

On Sat, 2005-07-30 at 18:25 +0100, Andrew Haley wrote:
>  > > All input much appreciated; no doubt my terminology is irritatingly up
>  > > the creek, hopefully the sentiment will win through.
>  > >
>  > > http://go-oo.org/~michael/OOoStartup.pdf
> 
> One thing I don't understand is the formula where you write linking
> time is proportional to the log of the total number of symbols.  Does
> this come from drepper's paper, or somewhere else?

I defer to Ulrich's text, section 1.5 of:
http://people.redhat.com/drepper/dsohowto.pdf

"Deficiencies in the ELF hash table function and various ELF extensions
modifying the symbol lookup functionality may well increase the factor
to O(R + r.n.log(s)) where s is the number of symbols. This should make
clear that for improved performance it is significant to reduce the
number of relocations and symbols as much as possible".

However - the log(s) term is rather irrelevant to my argument :-)
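
As a rough illustration of why the named-relocation term dominates, the quoted bound can be written as a toy cost model. This is only a sketch: it assumes, as in Ulrich's paper, that R counts relative relocations, r named relocations and n the participating DSOs; the function name and the unitless result are my own invention:

```cpp
#include <cmath>

// Toy model of the quoted bound O(R + r.n.log(s)).
//   R: relative relocations      r: named relocations
//   n: participating DSOs        s: total number of symbols
// The result is unitless and only useful for comparing scenarios.
inline double relocationCost(double R, double r, double n, double s) {
    return R + r * n * std::log2(s);
}
```

With, say, n = 10 and s = 1024, each named relocation costs about as much as 100 relative ones in this model, whatever the exact base of the logarithm turns out to be.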

HTH,

Michael.

-- 
 [EMAIL PROTECTED]  <><, Pseudo Engineer, itinerant idiot



Re: Large, modular C++ application performance ...

2005-08-01 Thread michael meeks
Hi Dan,

On Sat, 2005-07-30 at 11:19 -0400, [EMAIL PROTECTED] wrote:
> MM wrote in http://go-oo.org/~michael/OOoStartup.pdf:
> "... not one slot was overridden by an implementation
> method external to the implementing library."

This is really an issue rather orthogonal to that of 'final', what I'm
trying to say (clearly, rather badly) - is that in those 3 libraries
there were 0 instances of virtual functions of a given class implemented
in that DSO, being implemented outside that DSO.[1]

The significance of this is that - if we can mark up classes to generate
internal relocations for their overridden slots, and copy the parent
library's (also internally) relocated version for inherited slots
(during this proposed idle vtable relocation process) - then we would
avoid needing ~any named relocations at all to construct these vtables.
ie. go from many tens of thousands of the slowest type of relocation, to
none.

HTH,

Michael.

[1] - further research AFAIR showed only a handful of these instances
across all OO.o libraries.
-- 
 [EMAIL PROTECTED]  <><, Pseudo Engineer, itinerant idiot



Re: Large, modular C++ application performance ...

2005-08-02 Thread michael meeks

On Mon, 2005-08-01 at 14:18 +0200, Steven Bosscher wrote:
> On Monday 01 August 2005 11:44, michael meeks wrote:
> > However - the log(s) term is rather irrelevant to my argument :-)
> 
> Not really.  Maybe the oprofile results for the linker show that the
> behavior is worse, or maybe better - who knows :-)
> Have you looked at any profiles btw?  Just for the curious...

Yes - identifying the linker and relocation processing as the root
cause of the problem isn't just a stab in the dark :-)

This flags up as the no.1 (individual) performance killer with whatever
profiling tools you use eg.:

* vtune
* speedprof
* instrumenting top/tail of dlopen calls

etc. :-)
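
For the last of those, a minimal sketch of what "instrumenting top/tail" can look like, assuming a C++11 compiler. The timedCall helper is hypothetical and not what OO.o actually uses; it simply wraps any call, such as a dlopen, with a monotonic clock:

```cpp
#include <chrono>
#include <utility>

// Runs fn() and returns its result together with the elapsed
// wall-clock time in microseconds.
template <typename Fn>
auto timedCall(Fn fn) -> std::pair<decltype(fn()), long long> {
    auto start = std::chrono::steady_clock::now();
    auto result = fn();
    auto stop = std::chrono::steady_clock::now();
    long long us =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
            .count();
    return std::make_pair(result, us);
}

// Example use around a real load (needs <dlfcn.h>; the library name is
// a placeholder):
//   auto r = timedCall([] { return dlopen("libexample.so", RTLD_NOW); });
//   // r.first is the handle, r.second the time spent in dlopen
```

Using RTLD_NOW in such a measurement keeps lazily bound function relocations from being deferred past the timed region.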

Regards,

Michael.

-- 
 [EMAIL PROTECTED]  <><, Pseudo Engineer, itinerant idiot



Re: Large, modular C++ application performance ...

2005-08-02 Thread michael meeks
Hi H.J.,

On Mon, 2005-08-01 at 08:55 -0700, H. J. Lu wrote:
> > -fvisibility is helpful - as the paper says, not as helpful as the old
> > -Bsymbolic (or link maps exposing only 3 or so functions) were. However
> > - -fvisibility can only help so much - if you have:
>
> Since you were comparing Windows vs. ELF, doesn't Windows need a file
> to define which symbols to export for a shared library ?

Apparently so - here is my (fragmentary) understanding of that -
Martin - please do correct me. OO.o builds the .defs on Win32 with a
custom tool called 'ldump4'. That (interestingly) goes groping in some
binary file format, reads the symbol table, groks symbols tagged with
'EXPORT:', and builds a .def file. ie. it *looks* like it's automated,
and can use the API markup (__dllexport etc.) where appropriate.

>  Why can't you do it with ELF using a linker map? Libstdc++.so is
> built with a linker map. Any C++ shared library should use one if the
> startup time is a big concern. Of course, if gcc can generate a list
> of symbols suitable for linker map, which needs to be exported, it will
> be very helpful. I don't think it will be too hard to implement.

So - the thing about linker maps (cf. the ldump4 tool) is that they
tend to be hard to maintain, not portable across platforms, a source of
grief and problems etc. ;-) [ we have several strata of old, now defunct
link maps lying around from previous investments of effort that
subsequently became useless ].

As I recall, I saw a suggestion (from you I think), for a new
visibility attribute 'export' or somesuch, that would resolve names
internally to the library, while still exporting the symbols.

That would suit our needs beautifully - if, when used to annotate a
class, it would allow the various typeinfo / vague-linkage pieces
through as 'default'. Is it a realistic suggestion ? / if so, am happy
to knock up a patch.

[ and of course, this is only 1/2 the problem - the other half isn't
much helped by visibility markup as previously discussed ;-]

Thanks,

Michael.

-- 
 [EMAIL PROTECTED]  <><, Pseudo Engineer, itinerant idiot



Re: Large, modular C++ application performance ...

2005-08-02 Thread michael meeks

On Tue, 2005-08-02 at 06:57 -0700, H. J. Lu wrote:
> Maintaining a C++ linker map isn't easy. I think gcc should help out
> here.

What do you suggest ? - something separate from the visibility markup ?
perhaps what I'm suggesting is some horrible misuse of that. Clearly
adding a new visibility attribute that would bind that symbol
internally, yet export it would be a simple approach; did you have a
better idea ? and/or suggestions for a name ? - or is this a total
non-starter for some other reason ?

> > That would suit our needs beautifully - if, when used to annotate a
> > class, it would allow the various typeinfo / vague-linkage pieces
> > through as 'default'. Is it a realistic suggestion ? / if so, am happy
> > to knock up a patch.
> > 
> > [ and of course, this is only 1/2 the problem - the other half isn't
> > much helped by visibility markup as previously discussed ;-]
>
> Why not? If you know a symbol in DSO won't be overridden by others,
> you can resolve it locally via a linker map.

Sure - the other (more than) 1/2 of the performance problem comes from
named relocations to symbols external to the DSO.

Thanks,

Michael.

-- 
 [EMAIL PROTECTED]  <><, Pseudo Engineer, itinerant idiot



Re: Moving to git - bibisect ...

2015-08-24 Thread Michael Meeks
Hi Jakub,

On Mon, 2015-08-24 at 10:17 +0200, Jakub Jelinek wrote:
> > Jakub: How about using git bisect instead, and identify the compiler
> > binaries with the git commit sha1?
> 
> That is really not useful.  While you speed up bisection somewhat by avoiding
> network traffic and communication with a server, there is still significant
> time spent on actually building the compiler.

In LibreOffice land (thanks to Bjoern Michaelsen) we use and publish
binary bisection repositories (bibisect). It takes on the order of an
hour+ on some cutting edge hardware to build each of our binaries - for
most people longer - so we archive our live, runnable commit as you do -
but we check those images into a new git repository.

Each of those is checked in with a commit message that points to the
source hash.

> The way I use bisection is that either I have for every 50-200
> commits a cc1/cc1plus/f951 compiler already built (that is on my ws) or for
> every non-library commit to the branch that could affect the compiler (no
> testsuite changes etc.).

So in our model, those would all go in git and get packed with an
aggressive git gc. We publish these repositories too[1] - with thousands
of binaries built inside them, so non-technical QA guys can download and
locate the right developer to blame for their pet regression (long after
the date). Interestingly, mostly non-technical QA guys have done this for
several hundred regressions in the last few years.

>   And for those really identifying them by sha1
> hashes is significantly worse than using monotonically increasing small
> number, sha1 hashes are impossible to remember, and you don't know what is
> earlier and what is later from just looking at it.

That's of course true; the hashes are a pain - but bisecting in the
binary repository is easy enough I think - and there is IIRC some degree
of built-in tooling for running scripts/tests on each version to
automate that (I'm sure you have something like that already).

https://wiki.documentfoundation.org/QA/Bibisect#Introduction

has some fluff on our approach.
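
The search itself is just a binary search over the linear history of checked-in binaries, which is what git bisect (and git bisect run with a test script) drives for you. A minimal sketch of that idea, with a hypothetical firstBadBuild helper and an isBad callback standing in for "run the test against build i":

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Returns the index of the first bad build in [0, count).
// Assumes builds are ordered oldest to newest, that once a build is bad
// all later ones are bad too, and that the newest build is bad.
inline std::size_t firstBadBuild(
    std::size_t count, const std::function<bool(std::size_t)> &isBad) {
    assert(count > 0);
    std::size_t lo = 0;
    std::size_t hi = count - 1;  // invariant: build hi is bad
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (isBad(mid)) {
            hi = mid;        // first bad build is at mid or earlier
        } else {
            lo = mid + 1;    // first bad build is after mid
        }
    }
    return lo;
}
```

With a few thousand archived binaries this needs only a dozen or so test runs, and none of them involves building anything.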

Of course, aside from that git takes quite some learning to love ;-)
but as/when you're there you wonder how you lived through RCS, CVS, SVN,
etc.

HTH,

Michael.

[1] - this of course involves some horrors of different Linux and ABI
issues and so on that (I hope) gcc would be less prone to problems with.
-- 
 michael.me...@collabora.com  <><, Pseudo Engineer, itinerant idiot



Re: libstdc++ c++98 & c++11 ABI incompatibility

2012-07-02 Thread Michael Meeks

On Thu, 2012-06-14 at 15:14 +0200, Matthias Klose wrote:
> While PR53646 claims that c++98 and c++11 should be ABI
> compatible (modulo bugs), the addition of the _M_size member
> to std::_List_base::_List_impl makes libraries using
> std::list in headers incompatible

This is pretty nasty for LibreOffice (and no doubt others). We can, and
often do depend on rather a number of system C++ libraries and at a very
minimum, having no simple way to detect which C++ ABI we have to build
against 'old' vs. 'new' - is profoundly unpleasant.

Is there no chance of having a bug fix that is a revision of the
(unintended?) ABI breakage in this compiler series ?

> And is there a way to tell which mode a shared object/an
> executable was built for, when just looking at the stripped
> or unstripped object?

I guess here we have a compile-time checking problem; we would need
some more or less gross configure hack to try to detect which ABI is
deployed; suggestions appreciated.
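
One candidate for such a hack, offered purely as a sketch: since the breakage comes from the extra _M_size member, a probe can compare the size of a std::list against the two node links it must always hold. The function name is hypothetical, and the assumption that the list header is exactly two pointers wide without the size member is a heuristic about libstdc++'s layout, not a documented guarantee:

```cpp
#include <list>

// Heuristic probe: a list header that stores only the two node links is
// two pointers wide; an additional size member makes it larger.
// Assumption: the allocator is an empty base and adds no storage.
inline bool listCarriesSizeMember() {
    return sizeof(std::list<int>) > 2 * sizeof(void *);
}
```

A configure check could compile this once per candidate -std= mode with the system compiler and record the answer; it says nothing about what already-built system libraries were compiled with, which is the harder half of the problem.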

Many thanks for the (otherwise) excellent gcc :-)

ATB,

Michael.

-- 
michael.me...@suse.com  <><, Pseudo Engineer, itinerant idiot



vtrelocs: large/modular C++ app speedup ...

2008-04-02 Thread Michael Meeks
Hi guys,

I spent a little time recently researching ways to reduce the number of
unique named relocations that must be processed at dlopen time for large
C++ libraries[1]. Apologies for spamming all 3 lists like this, but it
touches all 3 projects.

Since almost all function relocations of this type are inside vtables,
I implemented a new way of relocating vtables. This is a new
'.suse.vtrelocs' section.

As we inherit a class across a shared library boundary we construct new
vtables that are often extremely similar to their parents. However -
this similarity is not exposed - instead we fill the new vtable with
many unique named relocations, one per method. This generates lots
of .rel entries, and emits lots of external symbols; worse these symbols
tend to be duplicated across ~all libraries deriving from the base
class.

Instead, a vtreloc section contains a sorted array of:

	struct {
		void **src, **dest;
		int  copy_slot_bitmask;
	} vtreloc_entries[] = { ... };

The run-time cost of processing these is insignificant in comparison to
the cost of processing the remaining relocations, giving a pleasant
speed win.
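
A minimal sketch of how a dynamic linker might process such entries. This is not the code from the attached patches: the VtRelocEntry and applyVtRelocs names are hypothetical, and it assumes that bit i of copy_slot_bitmask means "copy slot i from the parent vtable", which is my reading of the struct rather than something spelled out above:

```cpp
#include <cstddef>

// Mirrors the struct shown above; void* stands in for a vtable slot.
struct VtRelocEntry {
    void **src;             // parent vtable slots (already relocated)
    void **dest;            // derived vtable slots to fill in
    int copy_slot_bitmask;  // assumption: bit i set => copy slot i
};

// Applies the entries in array order. Parents must come before their
// children so that src is fully fixed up before it is copied from.
inline void applyVtRelocs(const VtRelocEntry *entries, std::size_t count) {
    for (std::size_t e = 0; e < count; ++e) {
        unsigned mask = static_cast<unsigned>(entries[e].copy_slot_bitmask);
        for (std::size_t slot = 0; mask != 0; ++slot, mask >>= 1) {
            if (mask & 1u) {
                entries[e].dest[slot] = entries[e].src[slot];
            }
        }
    }
}
```

Each inherited slot then costs one pointer copy instead of a symbol lookup by name, and slots the derived class overrides are left alone for ordinary (ideally internal) relocations.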

A brief slide-deck with the results of my research is here:

http://www.gnome.org/~michael/vtrelocs-gcc.pdf

and has a comparison against the current state of the art wrt. reducing
relocations: -Bsymbolic-functions [ in itself a substantial
optimisation ].

The 3 prototype patches for discussion are attached. There are a number
of trivial hacks in there (of course) - eg. environment variables to
turn the feature on, leaving an empty .vtrelocs section in object files
etc.

The more interesting problems are:

* glibc - the memory protection semantics need adjusting - since
  we need to fixup relocations in 'init' order: shouldn't be
  impossibly hard to fix but I just turn off protection ;-)
    + subsequent dlopens can (I think) avoid touching
      already relocated libraries they don't own, avoiding
      this sort of problem.

* gcc - the code to generate the vtreloc sections is  
  written for comfort not speed. This is a fall-back from having
  initially tried to integrate the work into 
  build_vtbl_initializer & friends with some success, but rather
  a tangling of the code.

* vtreloc section design - the section should be readonly, and
  probably refer by offset to .bss relocations that can be re-used
  for implementing indirect calls via parent vtable to virtual
  functions. That should save relocs, but make each entry
  slightly larger.

Of course, apart from the run-time speed wins, some of the nicest
potential size wins come from breaking the ABI[2] & depending on the
vtrelocs to fixup vtables: eg. hiding all thunks (implemented), or
potentially hiding all virtual function symbols & invoking them via
their parent vtable (not implemented).

Wrt. testing, I can build & run an OO.o built with this - clearly not a
unit-test ;-) but perhaps helpful.

Feedback much appreciated,

Thanks,

Michael.

[1] - specifically OpenOffice.org ;-)
[2] - which while bad, can be done in isolated islands like OO.o.
-- 
 [EMAIL PROTECTED]  <><, Pseudo Engineer, itinerant idiot

diff -u -r -x '*~' -x testsuite -x libjava -x cc-nptl -x build-dir -x '*.orig' -x obj-i586-suse-linux -x texis -x Makeconfig -x version.h -x '*.o' -x '*.1' -x 'Makefile*' -x 'config*' -x libtool -x '*.info' -x '*.tex' pristine-binutils-2.17.50/bfd/elf.c binutils-2.17.50/bfd/elf.c
--- pristine-binutils-2.17.50/bfd/elf.c	2008-01-09 16:45:22.0 +
+++ binutils-2.17.50/bfd/elf.c	2008-01-23 16:48:45.0 +
@@ -1240,6 +1240,7 @@
 	case DT_USED: name = "USED"; break;
 	case DT_FILTER: name = "FILTER"; stringp = TRUE; break;
 	case DT_GNU_HASH: name = "GNU_HASH"; break;
+	case DT_SUSE_VTRELOC: name = "SUSE_VTRELOC"; break;
 	}
 
 	  fprintf (f, "  %-11s ", name);

diff -u -r -x '*~' -x testsuite -x libjava -x cc-nptl -x build-dir -x '*.orig' -x obj-i586-suse-linux -x texis -x Makeconfig -x version.h -x '*.o' -x '*.1' -x 'Makefile*' -x 'config*' -x libtool -x '*.info' -x '*.tex' pristine-binutils-2.17.50/bfd/elflink.c binutils-2.17.50/bfd/elflink.c
--- pristine-binutils-2.17.50/bfd/elflink.c	2008-01-09 16:45:22.0 +
+++ binutils-2.17.50/bfd/elflink.c	2008-01-23 16:50:07.0 +
@@ -5652,6 +5652,13 @@
 	return FALSE;
 	}
 
+  s = bfd_get_section_by_name (output_bfd, ".suse.vtrelocs");
+  if (s != NULL)
+	{
+  if (!_bfd_elf_add_dynamic_entry (info, DT_SUSE_VTRELOC, 0))
+	return FALSE;
+	}
+
   dynstr = bfd_get_section_by_name (dynobj, ".dynstr");
   /* If .dynstr is excluded from the link, we don't want any of
 	 these tags.  Strictly, we should be checking each section
@@ -10869,6 +10876

Re: vtrelocs: large/modular C++ app speedup ...

2008-04-02 Thread Michael Meeks
Hi Ian / Andi,

On Wed, 2008-04-02 at 07:56 -0700, Ian Lance Taylor wrote:
> * Use GNU instead of SUSE, as this is for the GNU tools.

Ah yes; you noticed the subliminal advertising ;-) If you're happy for
me to trample on the GNU section namespace that's fine, but I hesitate
to tread there by default.

> * Don't check for explicit section names.  Instead, give the section a
>   magic type.
> * It seems that this is not backward compatible--an executable built
>   in this way will not work if the dynamic linker does not know about
>   it.  The section should have the SHF_OS_NONCONFORMING bit set.

Not clear how to fix either of those :-) I binned a redundant string
section name lookup in the binutils patch though.

> * Aren't you going to get a lot of duplicate vtreloc entries?
>   Shouldn't they be grouped with the vtables themselves?

That's entirely possible; perhaps I misunderstand the question, but I
had hoped that by making the _ZVTR_ section weak the linker would discard
any duplicate vtreloc records for the same vtable.

> * The idea is useless without support in the dynamic linker, so you
>   need to get signoff there first.

Naturally :-)

On Wed, 2008-04-02 at 17:06 +0200, Andi Kleen wrote:
> I wonder if it could be made backwards compatible. As in keep the old
> style relocations too, but the new linker would not process them
> when seeing the new special relocations.
 
It's certainly possible; of course it loses you any size savings. I
imagine that using the dynsort code we could shuffle the relevant relocs
to the end of the list fairly easily - that is if we could identify
whether they overlapped with the vtrelocs (or not): perhaps some big
bit-mask for the whole data section or something (?).

Thanks,

Michael.

-- 
 [EMAIL PROTECTED]  <><, Pseudo Engineer, itinerant idiot