Large, modular C++ application performance ...
Hi there, I've been doing a little thinking about how to improve OO.o startup performance recently; and - well, relocation processing happens to be the single, biggest thing that most tools flag. Anyhow - so I wrote up the problem, and a couple of potential solutions / extensions / workarounds, and - being of a generally clueless nature, was hoping to solicit instruction from those of a more enlightened disposition. All input much appreciated; no doubt my terminology is irritatingly up the creek, hopefully the sentiment will win through. http://go-oo.org/~michael/OOoStartup.pdf Two solutions are proposed - there are almost certainly more that I'm not thinking of. I'm interested in people's views as to which approach is best. So far the constructor hook approach seems to be the path of least resistance. Thanks, Michael. -- [EMAIL PROTECTED] <><, Pseudo Engineer, itinerant idiot
Re: Large, modular C++ application performance ...
Hi Giovanni, On Sat, 2005-07-30 at 15:36 +0200, Giovanni Bajo wrote: > I'm slow, but I can't understand why a careful design of the interfaces of > the dynamic libraries Well - sure, depends how 'careful' you are ;-) clearly if no C++ classes with virtual methods form the interface of any library, then there is no problem ;-) unfortunately, mandating that would rather cripple C++. > together with the new -fvisibility flags, should not > be sufficient. It worked well in other scenarios -fvisibility is helpful - as the paper says, not as helpful as the old -Bsymbolic (or link maps exposing only 3 or so functions) were. However - -fvisibility can only help so much - if you have:

    class LibraryAClass { virtual void doFoo(void); };
    class LibraryBClass : public LibraryAClass { virtual void doBaa(void); };

then there are 2 problems: a) there is no symbol visibility that will trigger internal binding in addition to a symbol export. ie. if 'LibraryBClass' is a public interface - no useful visibility markup can be done; and hence we have a named relocation for 'doBaa's vtable slot. [ IMHO this is a feature-gap, we need a new ('export'?) visibility attribute for this case ]. b) even if LibraryBClass is a 'hidden' class - to build its vtable we have to have a slot for 'doFoo' which is in an external library (A) => another named relocation. An unavoidable consequence of using virtual classes as part of a library's API. > IMHO, it's unreasonable to break the C++ ABI for 1 second of warm time > startup. Well - it's an option that was considered, although - as you say - highly unpleasant, and probably quite unnecessary - as the paper explains. Regards, Michael. -- [EMAIL PROTECTED] <><, Pseudo Engineer, itinerant idiot
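A minimal sketch of the two-class case from this message, using GCC's existing visibility attributes. Everything sits in one translation unit purely so that it compiles on its own; in the scenario described, LibraryAClass would live in one shared library and LibraryBClass in another. The method bodies, the virtual destructor and the call_do_foo helper are illustrative assumptions, not OO.o code.

```cpp
// "Library A": a public base class whose virtual methods are part of the API.
class __attribute__((visibility("default"))) LibraryAClass {
public:
    virtual ~LibraryAClass() {}
    virtual int doFoo() { return 1; }
};

// "Library B": a hidden (library-internal) class deriving from A.
// Hiding it keeps doBaa out of B's export table, but B's vtable still
// needs a slot for the inherited doFoo, which is defined in library A.
// Across a real DSO boundary that slot is filled by a named relocation
// (problem b in the message above).
class __attribute__((visibility("hidden"))) LibraryBClass : public LibraryAClass {
public:
    virtual int doBaa() { return 2; }
};

// Calls doFoo through the base-class interface, i.e. through the vtable slot.
inline int call_do_foo(LibraryAClass &obj) { return obj.doFoo(); }
```

If LibraryBClass instead had to be public (problem a), it would need "default" visibility, and the doBaa slot would become a named relocation as well.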
Re: Large, modular C++ application performance ...
On Sat, 2005-07-30 at 18:25 +0100, Andrew Haley wrote: > > > All input much appreciated; no doubt my terminology is irritatingly up > > > the creek, hopefully the sentiment will win through. > > > > > > http://go-oo.org/~michael/OOoStartup.pdf > > One thing I don't understand is the formula where you write linking > time is proportional to the log of the total number of symbols. Does > this come from drepper's paper, or somewhere else? I defer to Ulrich's text: http://people.redhat.com/drepper/dsohowto.pdf Section 1.5 of: "Deficiencies in the ELF hash table function and various ELF extensions modifying the symbol lookup functionality may well increase the factor to O(R + r.n.log(s)) where s is the number of symbols. This should make clear that for improved performance it is significant to reduce the number of relocations and symbols as much as possible". However - the log(s) term is rather irrelevant to my argument :-) HTH, Michael. -- [EMAIL PROTECTED] <><, Pseudo Engineer, itinerant idiot
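To make the quoted bound concrete, here is a rough sketch that evaluates R + r.n.log(s) for a given library profile. Only s is defined in the quotation; reading R as the number of relative relocations, r as the number of named relocations and n as the number of participating DSOs is my understanding of the paper's notation and should be treated as an assumption. The struct and function names are made up for illustration, and the result is a unitless figure for comparing scenarios, not a time.

```cpp
#include <cmath>

// Inputs to the quoted bound O(R + r*n*log(s)).
struct RelocProfile {
    double relative_relocs;  // R (assumed meaning: cheap, relative relocations)
    double named_relocs;     // r (assumed meaning: relocations needing a symbol lookup)
    double dso_count;        // n (assumed meaning: DSOs searched per lookup)
    double symbol_count;     // s: number of symbols, as in the quotation
};

// Unitless cost estimate: R + r * n * log2(s).
// The log term is clamped to at least 1 so tiny symbol counts do not
// make named relocations look free.
inline double estimated_cost(const RelocProfile &p) {
    double lookup = p.symbol_count > 2.0 ? std::log2(p.symbol_count) : 1.0;
    return p.relative_relocs + p.named_relocs * p.dso_count * lookup;
}
```

With these definitions, removing named relocations shrinks the dominant term directly, which is the point made in the message regardless of the log(s) factor.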
re: Large, modular C++ application performance ...
Hi Dan, On Sat, 2005-07-30 at 11:19 -0400, [EMAIL PROTECTED] wrote: > MM wrote in http://go-oo.org/~michael/OOoStartup.pdf: > "... not one slot was overridden by an implementation > method external to the implementing library." This is really an issue rather orthogonal to that of 'final'; what I'm trying to say (clearly, rather badly) - is that in those 3 libraries there were 0 instances of a virtual function of a class implemented in that DSO being overridden by an implementation outside that DSO.[1] The significance of this is that - if we can markup classes to generate internal relocations for their overridden slots, and copy the parent library's (also internally) relocated version for inherited slots (during this proposed idle vtable relocation process), then we would avoid needing ~any named relocations at all to construct these vtables. ie. go from many tens of thousands of the slowest type of relocation, to none. HTH, Michael. [1] - further research AFAIR showed only a handful of these instances across all OO.o libraries. -- [EMAIL PROTECTED] <><, Pseudo Engineer, itinerant idiot
Re: Large, modular C++ application performance ...
On Mon, 2005-08-01 at 14:18 +0200, Steven Bosscher wrote: > On Monday 01 August 2005 11:44, michael meeks wrote: > > However - the log(s) term is rather irrelevant to my argument :-) > > Not really. Maybe the oprofile results for the linker show that the > behavior is worse, or maybe better - who knows :-) > Have you looked at any profiles btw? Just for the curious... Yes - identifying the linker and relocation processing as the root cause of the problem isn't just a stab in the dark :-) This flags up as the no.1 (individual) performance killer with whatever profiling tools you use eg.:
* vtune
* speedprof
* instrumenting top/tail of dlopen calls
etc. :-) Regards, Michael. -- [EMAIL PROTECTED] <><, Pseudo Engineer, itinerant idiot
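For the "instrumenting top/tail of dlopen calls" item, a minimal sketch of the kind of timing wrapper one might use. The helper name time_call_ms is hypothetical and this is not the instrumentation actually used on OO.o; it simply times any callable with a monotonic clock.

```cpp
#include <chrono>
#include <utility>

// Runs fn once and returns the elapsed wall-clock time in milliseconds.
// steady_clock is used so the measurement is not affected by clock changes.
template <typename Fn>
double time_call_ms(Fn &&fn) {
    auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

To time a library load, one would wrap the call in a lambda, for example a lambda that calls dlopen(path, RTLD_NOW) from <dlfcn.h>; RTLD_NOW makes the dynamic linker resolve symbols before dlopen returns, so the relocation work lands inside the measured interval rather than being deferred.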
Re: Large, modular C++ application performance ...
Hi H.J., On Mon, 2005-08-01 at 08:55 -0700, H. J. Lu wrote: > > -fvisibility is helpful - as the paper says, not as helpful as the old > > -Bsymbolic (or link maps exposing only 3 or so functions) were. However > > - -fvisibility can only help so much - if you have: > > Since you were comparing Windows vs. ELF, doesn't Windows need a file > to define which symbols to export for a shared library ? Apparently so - here is my (fragmentary) understanding of that - Martin - please do correct me. OO.o builds the .defs on Win32 with a custom tool called 'ldump4'. That (interestingly) goes groping in some binary file format, reads the symbol table, groks symbols tagged with 'EXPORT:', and builds a .def file. ie. it *looks* like it's automated, and can use the API marked up (__dllexport etc.) where appropriate. > Why can't you do it with ELF using a linker map? Libstdc++.so is > built with a linker map. Any C++ shared library should use one if the > startup time is a big concern. Of course, if gcc can generate a list > of symbols suitable for linker map, which needs to be exported, it will > be very helpful. I don't think it will be too hard to implement. So - the thing about linker maps (cf. the ldump4 tool) is that they tend to be hard to maintain, not portable across platforms, a source of grief and problems etc. ;-) [ we have several strata of old, now defunct link maps lying around from previous investments of effort that subsequently became useless ]. As I recall, I saw a suggestion (from you I think), for a new visibility attribute 'export' or somesuch, that would resolve names internally to the library, while still exporting the symbols. That would suit our needs beautifully - if, when used to annotate a class, it would allow the various typeinfo / vague-linkage pieces through as 'default'. Is it a realistic suggestion ? / if so, am happy to knock up a patch.
[ and of course, this is only 1/2 the problem - the other half isn't much helped by visibility markup as previously discussed ;-] Thanks, Michael. -- [EMAIL PROTECTED] <><, Pseudo Engineer, itinerant idiot
Re: Large, modular C++ application performance ...
On Tue, 2005-08-02 at 06:57 -0700, H. J. Lu wrote: > Maintaining a C++ linker map isn't easy. I think gcc should help out > here. What do you suggest ? - something separate from the visibility markup ? perhaps what I'm suggesting is some horrible misuse of that. Clearly adding a new visibility attribute that would bind that symbol internally, yet export it would be a simple approach; did you have a better idea ? and/or suggestions for a name ? - or is this a total non-starter for some other reason ? > > That would suit our needs beautifully - if, when used to annotate a > > class, it would allow the various typeinfo / vague-linkage pieces > > through as 'default'. Is it a realistic suggestion ? / if so, am happy > > to knock up a patch. > > > > [ and of course, this is only 1/2 the problem - the other half isn't > > much helped by visibility markup as previously discussed ;-] > > Why not? If you know a symbol in DSO won't be overridden by others, > you can resolve it locally via a linker map. Sure - the other (more than) 1/2 of the performance problem comes from named relocations to symbols external to the DSO. Thanks, Michael. -- [EMAIL PROTECTED] <><, Pseudo Engineer, itinerant idiot
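The proposed 'export' attribute does not exist, so it cannot be shown. As a point of comparison, here is a sketch of a workaround available today for plain functions on ELF targets with GCC-style attributes: export one name and give the same code a second, hidden name that the library uses for its own calls. The function names are hypothetical, and this does nothing for vtable slots, which is the case the thread is really about.

```cpp
extern "C" {

// Exported entry point: other libraries bind to this name.
__attribute__((visibility("default")))
int lib_do_foo(int x) { return x + 1; }

// Hidden second name for the same code. References to a hidden symbol are
// bound inside the library at link time, so calls through this name need
// no symbol lookup at load time.
__attribute__((visibility("hidden"), alias("lib_do_foo")))
int lib_do_foo_internal(int x);

}  // extern "C"

// Library-internal caller: uses the hidden alias, not the exported name.
int lib_caller(int x) { return lib_do_foo_internal(x) * 2; }
```

The cost of this pattern is exactly the maintenance burden complained about above: every exported function needs a second declaration, and every internal call site has to use it, which is why a single attribute with both effects would be attractive.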
Re: Moving to git - bibisect ...
Hi Jakub, On Mon, 2015-08-24 at 10:17 +0200, Jakub Jelinek wrote: > > Jakub: How about using git bisect instead, and identify the compiler > > binaries with the git commit sha1? > > That is really not useful. While you speed up bisection somewhat by avoiding > network traffic and communication with a server, there is still significant > time spent on actually building the compiler. In LibreOffice land (thanks to Bjoern Michaelsen) we use and publish binary bisection repositories (bibisect). It takes on the order of an hour+ on some cutting edge hardware to build each of our binaries - for most people longer - so we archive our live, runnable commit as you do - but we check those images into a new git repository. Each of those is checked in with a commit message that points to the source hash. > The way I use bisection is that either I have for every 50-200 > commits a cc1/cc1plus/f951 compiler already built (that is on my ws) or for > every non-library commit to the branch that could affect the compiler (no > testsuite changes etc.). So in our model, those would all go in git and get packed with an aggressive git gc. We publish these repositories too[1] - with thousands of binaries built inside them so non-technical QA guys can download and locate the right developer to blame for their pet regression long after the date. Interestingly mostly non-technical QA guys have done this for several hundred regressions in the last few years. > And for those really identifying them by sha1 > hashes is significantly worse than using monotonically increasing small > number, sha1 hashes are impossible to remember, and you don't know what is > earlier and what is later from just looking at it. That's of course true; the hashes are a pain - but bisecting in the binary repository is easy enough I think - and there is IIRC some degree of built-in tooling for running scripts/tests on each version to automate that (I'm sure you have something like that already).
https://wiki.documentfoundation.org/QA/Bibisect#Introduction has some fluff on our approach. Of course, aside from that git takes quite some learning to love ;-) but as/when you're there you wonder how you lived through RCS, CVS, SVN, etc. HTH, Michael. [1] - this of course involves some horrors of different Linux and ABI issues and so on that (I hope) gcc would be less prone to problems with. -- michael.me...@collabora.com <><, Pseudo Engineer, itinerant idiot
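The search that bisecting a binary repository performs can be sketched as a plain binary search over an ordered list of archived builds. This is an illustration of the idea only, not how git bisect or the bibisect repositories are implemented; first_bad_build and the is_bad callback are hypothetical names, and the sketch assumes the regression, once introduced, is present in every later build.

```cpp
#include <cstddef>
#include <functional>

// Returns the index of the first bad build in [0, count), or count if none
// is bad. Assumes builds are ordered oldest to newest and that every build
// after the first bad one is also bad.
inline std::size_t first_bad_build(
        std::size_t count,
        const std::function<bool(std::size_t)> &is_bad) {
    std::size_t lo = 0;      // every build before lo is known good
    std::size_t hi = count;  // every build from hi onwards is known bad
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (is_bad(mid))
            hi = mid;        // regression is at mid or earlier
        else
            lo = mid + 1;    // regression is after mid
    }
    return lo;
}
```

Because each probe is just "check out an archived binary and run it", a repository of a thousand builds needs only about ten probes, with no compile step in between.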
Re: libstdc++ c++98 & c++11 ABI incompatibility
On Thu, 2012-06-14 at 15:14 +0200, Matthias Klose wrote: > While PR53646 claims that c++98 and c++11 should be ABI > compatible (modulo bugs), the addition of the _M_size member > to std::_List_base::_List_impl makes libraries using > std::list in headers incompatible This is pretty nasty for LibreOffice (and no doubt others). We can, and often do depend on rather a number of system C++ libraries and at a very minimum, having no simple way to detect which C++ ABI we have to build against 'old' vs. 'new' - is profoundly unpleasant. Is there no chance of having a bug fix that is a revision of the (unintended?) ABI breakage in this compiler series ? > And is there a way to tell which mode a shared object/an > executable was built for, when just looking at the stripped > or unstripped object? I guess here we have a compile-time checking problem; we would need some more or less gross configure hack to try to detect which ABI is deployed; suggestions appreciated. Many thanks for the (otherwise) excellent gcc :-) ATB, Michael. -- michael.me...@suse.com <><, Pseudo Engineer, itinerant idiot
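One possible shape for the "gross configure hack" mentioned above: a probe that reports whether std::list, as laid out by the compiler and flags in use, carries the extra size member. It rests on an assumption that should be checked against the libstdc++ headers in question: that the older layout is exactly two pointers (the sentinel node's links) and the newer one adds a stored element count. The function name is hypothetical. Note also that it only reports the layout the current compilation would produce; it cannot tell how an already-built system library was compiled, which is the harder half of the question quoted above.

```cpp
#include <cstddef>
#include <list>

// True if std::list<int> is larger than two pointers, which is taken here
// (as an assumption) to mean that the list stores its element count.
inline bool list_stores_element_count() {
    return sizeof(std::list<int>) > 2 * sizeof(void *);
}
```

A configure script could compile and run a tiny program that prints this value under each -std= setting of interest and compare the results.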
vtrelocs: large/modular C++ app speedup ...
Hi guys, I spent a little time recently researching ways to reduce the number of unique named relocations that must be processed at dlopen time for large C++ libraries[1]. Apologies for spamming all 3 lists like this, but it touches all 3 projects. Since almost all function relocations of this type are inside vtables, I implemented a new way of relocating vtables. This is a new '.suse.vtrelocs' section. As we inherit a class across a shared library boundary we construct new vtables that are often extremely similar to their parents. However - this similarity is not exposed - instead we fill the new vtable with many unique named relocations, one per method. This generates lots of .rel entries, and emits lots of external symbols; worse these symbols tend to be duplicated across ~all libraries deriving from the base class. Instead a vtreloc section contains a (sorted) array:

    struct { void **src, **dest; int copy_slot_bitmask; } vtreloc_entries[] = { ... }

The run-time cost of processing these is insignificant in comparison to the cost of processing the remaining relocations, giving a pleasant speed win. A brief slide-deck with the results of my research is here: http://www.gnome.org/~michael/vtrelocs-gcc.pdf and has a comparison against the current state of the art wrt. reducing relocations: -Bsymbolic-functions [ in itself a substantial optimisation ]. The 3 prototype patches for discussion are attached. There are a number of trivial hacks in there (of course) - eg. environment variables to turn the feature on, leaving an empty .vtrelocs section in object files etc. The more interesting problems are:
* glibc - the memory protection semantics need adjusting - since we need to fixup relocations in 'init' order: shouldn't be impossibly hard to fix but I just turn off protection ;-) + subsequent dlopens can (I think) avoid touching already relocated libraries they don't own avoiding this sort of problem.
* gcc - the code to generate the vtreloc sections is written for comfort not speed. This is a fall-back from having initially tried to integrate the work into build_vtbl_initializer & friends with some success, but rather a tangling of the code. * vtreloc section design - the section should be readonly, and prolly refer by offset to .bss relocations that can be re-used for implementing indirect calls via. parent vtable to virtual functions. That should save relocs, but make each entry slightly larger. Of course, apart from the run-time speed wins, some of the nicest potential size wins come from breaking the ABI[2] & depending on the vtrelocs to fixup vtables: eg. hiding all thunks (implemented), or potentially hiding all virtual function symbols & invoking them via their parent vtable (not implemented). Wrt. testing, I can build & run an OO.o built with this - clearly not a unit-test ;-) but perhaps helpful. Feedback much appreciated, Thanks, Michael. [1] - specifically OpenOffice.org ;-) [2] - which while bad, can be done in isolated islands like OO.o. 
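A minimal sketch of how entries of the shape given in this message might be processed at load time. The field names come from the message; everything else is an assumption made for illustration, in particular that bit i of copy_slot_bitmask means "copy slot i from the parent vtable (src) into the derived vtable (dest)". The mask is declared unsigned here, rather than int as in the message, only to keep the bit shifts well defined. The attached prototype patches may well differ.

```cpp
#include <cstddef>

// One record per derived vtable: where to copy from, where to copy to,
// and which slots to copy.
struct VtRelocEntry {
    void **src;                  // already-relocated parent vtable slots
    void **dest;                 // derived vtable slots to fill in
    unsigned copy_slot_bitmask;  // assumed: bit i set => dest[i] = src[i]
};

// Applies every entry in order. Parents must be processed before the
// classes deriving from them, which matches the 'init' ordering
// requirement mentioned for glibc above.
inline void apply_vtrelocs(const VtRelocEntry *entries, std::size_t count) {
    const unsigned bits = sizeof(unsigned) * 8;
    for (std::size_t e = 0; e < count; ++e) {
        const VtRelocEntry &entry = entries[e];
        for (unsigned slot = 0; slot < bits; ++slot) {
            if (entry.copy_slot_bitmask & (1u << slot))
                entry.dest[slot] = entry.src[slot];  // plain pointer copy, no symbol lookup
        }
    }
}
```

Each inherited slot costs one load and one store here, against a hash-table symbol lookup per slot for a named relocation, which is consistent with the claim that processing these entries is cheap by comparison.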
-- [EMAIL PROTECTED] <><, Pseudo Engineer, itinerant idiot

diff -u -r -x '*~' -x testsuite -x libjava -x cc-nptl -x build-dir -x '*.orig' -x obj-i586-suse-linux -x texis -x Makeconfig -x version.h -x '*.o' -x '*.1' -x 'Makefile*' -x 'config*' -x libtool -x '*.info' -x '*.tex' pristine-binutils-2.17.50/bfd/elf.c binutils-2.17.50/bfd/elf.c
--- pristine-binutils-2.17.50/bfd/elf.c 2008-01-09 16:45:22.0 +
+++ binutils-2.17.50/bfd/elf.c 2008-01-23 16:48:45.0 +
@@ -1240,6 +1240,7 @@
 	    case DT_USED: name = "USED"; break;
 	    case DT_FILTER: name = "FILTER"; stringp = TRUE; break;
 	    case DT_GNU_HASH: name = "GNU_HASH"; break;
+	    case DT_SUSE_VTRELOC: name = "SUSE_VTRELOC"; break;
 	    }
 
 	  fprintf (f, " %-11s ", name);
diff -u -r -x '*~' -x testsuite -x libjava -x cc-nptl -x build-dir -x '*.orig' -x obj-i586-suse-linux -x texis -x Makeconfig -x version.h -x '*.o' -x '*.1' -x 'Makefile*' -x 'config*' -x libtool -x '*.info' -x '*.tex' pristine-binutils-2.17.50/bfd/elflink.c binutils-2.17.50/bfd/elflink.c
--- pristine-binutils-2.17.50/bfd/elflink.c 2008-01-09 16:45:22.0 +
+++ binutils-2.17.50/bfd/elflink.c 2008-01-23 16:50:07.0 +
@@ -5652,6 +5652,13 @@
 	    return FALSE;
 	}
 
+      s = bfd_get_section_by_name (output_bfd, ".suse.vtrelocs");
+      if (s != NULL)
+	{
+	  if (!_bfd_elf_add_dynamic_entry (info, DT_SUSE_VTRELOC, 0))
+	    return FALSE;
+	}
+
       dynstr = bfd_get_section_by_name (dynobj, ".dynstr");
       /* If .dynstr is excluded from the link, we don't want any of
 	 these tags. Strictly, we should be checking each section
@@ -10869,6 +10876
Re: vtrelocs: large/modular C++ app speedup ...
Hi Ian / Andi, On Wed, 2008-04-02 at 07:56 -0700, Ian Lance Taylor wrote: > * Use GNU instead of SUSE, as this is for the GNU tools. Ah yes; you noticed the subliminal advertising ;-) If you're happy for me to trample on the GNU section namespace that's fine, but I hesitate to tread there by default. > * Don't check for explicit section names. Instead, give the section a > magic type. > * It seems that this is not backward compatible--an executable built > in this way will not work if the dynamic linker does not know about > it. The section should have the SHF_OS_NONCONFORMING bit set. Not clear how to fix either of those :-) I binned a redundant string section name lookup in the binutils patch though. > * Aren't you going to get a lot of duplicate vtreloc entries? > Shouldn't they be grouped with the vtables themselves? That's entirely possible; perhaps I misunderstand the question, but I had hoped that by making the _ZVTR_ section weak the linker would discard any duplicate vtreloc records for the same vtable. > * The idea is useless without support in the dynamic linker, so you > need to get signoff there first. Naturally :-) On Wed, 2008-04-02 at 17:06 +0200, Andi Kleen wrote: > I wonder if it could be made backwards compatible. As in keep the old > style relocations too, but the new linker would not process them > when seeing the new special relocations. It's certainly possible; of course it loses you any size savings. I imagine that using the dynsort code we could shuffle the relevant relocs to the end of the list fairly easily - that is if we could identify whether they overlapped with the vtrelocs (or not): perhaps some big bit-mask for the whole data section or something (?). Thanks, Michael. -- [EMAIL PROTECTED] <><, Pseudo Engineer, itinerant idiot