On Sat, 9 Jan 2021 at 14:09, Christophe Lyon <christophe.l...@linaro.org> wrote: > > On Sat, 9 Jan 2021 at 13:27, Daniel Engel <lib...@danielengel.com> wrote: > > > > On Thu, Jan 7, 2021, at 4:56 AM, Richard Earnshaw wrote: > > > On 07/01/2021 00:59, Daniel Engel wrote: > > > > --snip-- > > > > > > > > On Wed, Jan 6, 2021, at 9:05 AM, Richard Earnshaw wrote: > > > > > > > >> > > > >> Thanks for working on this, Daniel. > > > >> > > > >> This is clearly stage1 material, so we've got time for a couple of > > > >> iterations to sort things out. > > > > > > > > I appreciate your feedback. I had been hoping that with no regressions > > > > this might still be eligible for stage2. Christophe never indicated > > > > either way. but the fact that he was looking at it seemed positive. > > > > I thought I would be a couple of weeks faster with this last > > > > iteration, but holidays got in the way. > > > > > > GCC doesn't have a stage 2 any more (historical wart). We were in > > > (late) stage3 when this was first posted, and because of the significant > > > impact this might have on not just CM0 but other targets as well, I > > > don't think it's something we should try to squeeze in at the last > > > minute. We're now in stage 4, so that is doubly the case. > > > > Of course I meant stage3. Oops. I actually thought stage 3 would > > continue through next week based on the average historical dates. > > I expected stage4 to start on Jan 1st :-) > > > It would have been nice to get this feedback when I emailed you a > > preview version of this patch (2020-Nov-11). Christophe's logs have > > been very helpful on the technical integration, but it's proving a chore > > to go back and re-create some of the intermediate chunks. > > > > Regardless, I still have free time for at least a little while longer to > > work on this, so I'll push forward with any further feedback you are > > willing to give me. I have failed to free up any time during the last 2 > > years to actually work on this during stage1, and I have no guarantee > > this coming year will be different. > > > > > > > > Christophe is a very valuable member of our community, but he's not a > > > port maintainer and thus cannot really rule on what can go into the > > > tools, or when. > > > > > > > > > > > I actually think your comments below could all be addressable within a > > > > couple of days. But, I'm not accounting for the review process. > > > > > > > >> Firstly, the patch is very large, but contains a large number of > > > >> distinct changes, so it would really benefit from being broken down > > > >> into > > > >> a number of distinct patches. This will make reviewing the individual > > > >> changes much more straight-forward. > > And if you can generate the patch with git, that will help: the reason for the > previous build errors was because I had to "reformat" your patch before > submitting it for validation, and I messed things up. > > > > > > > > > I have no context for "large" or "small" with respect to gcc. This > > > > patch comprises about 30% of a previously-monolithic library that's > > > > been shipping since ~2016 (the rest is libm material). Other than > > > > (1) the aforementioned change to div0(), (2) a nascent adaptation > > > > for __truncdfsf2() (not enabled), and (3) the gratuitous addition of > > > > the bitwise functions, the library remains pretty much as it was > > > > originally released. > > > > > > Large, like many other terms is relative. For assembler file changes, > > > which this is primarily, the overall size can be much smaller and still > > > be considered 'large'. > > > > > > > > > > > The driving force in the development of this library was small size, > > > > which of course was never possible with the softfp routines. It's not > > > > half-slow, either, for the limitations of the M0 architecture. And, > > > > it's IEEE compliant. But, that means that most of the functions are > > > > highly interconnected. So, some of it can be broken up as you outline > > > > below, but that last patch is still worth more than half of the total. > > > > > > Nevertheless, having the floating-point code separated out will make > > > reviewing more straight forward. I'll likely need to ask one of our FP > > > experts to have a specific look at that part and that will be easier if > > > it is disentangled from the other changes. > > > > > > > > I also have ~70k lines of test vectors that seem mostly redundant, but > > > > not completely. I haven't decided what to do here. For example, I have > > > > coverage for __aeabi_u/ldivmod, while GCC does not. If I do anything > > > > with this code it will be in a separate thread. > > > > > > Publishing the test code, even if it isn't integrated into the GCC > > > testsuite would be useful. Perhaps someone else could then help with > > > that. > > > > Very brute force stuff, not production quality: > > <http://danielengel.com/cm0_test_vectors.tgz> (160 kb) > > > > > >> I'd suggest: > > > >> > > > >> 1) Some basic makefile cleanups to ease initial integration - in > > > >> particular where we have things like > > > >> > > > >> LIB1FUNCS += <long list of functions> > > > >> > > > >> that this be rewritten with one function per line (and sorted > > > >> alphabetically) - then we can see which functions are being changed in > > > >> subsequent patches. It makes the Makefile fragments longer, but the > > > >> improvement in clarity for makes this worthwhile. > > > > > > > > I know next to nothing about Makefiles, particularly ones as complex as > > > > GCC's. I was just trying to work with the existing style to avoid > > > > breaking something. However, I can certainly adopt this suggestion. > > > > > > > >> 2) The changes for the existing integer functions - preferably one > > > >> function per patch. > > > >> > > > >> 3) The new integer functions that you're adding > > > > > > > > These wouldn't be too hard to do, but what are the expectations for > > > > testing? A clean build of GCC takes about 6 hours in my VM, and > > > > regression testing takes about 4 hours per architecture. You would want > > > > a full regression report for each incremental patch? I have no idea how > > > > to target regression tests that apply to particular runtime functions > > > > without the risk of missing something. > > > > > > > > > > Most of this can be tested in a cross-compile environment using qemu as > > > a model. A cross build shouldn't take that long (especially if you > > > restrict the compiler to just C and C++ - other languages are > > > vanishingly unlikely to pick up errors in the parts of the compiler > > > you're changing). But build checks will be mostly sufficient for most > > > of the intermediate patches. > > > > > > >> 4) The floating-point support. > > > >> > > > >> Some more general observations: > > > >> > > > >> - where functions are already in lib1funcs.asm, please leave them > > > >> there. > > > > > > > > I guess I have a different vision here. I have had a really hard time > > > > following all of the nested #ifdefs in lib1funcs, so I thought it would > > > > be helpful to begin breaking it up into logical units. > > > > > > Agreed, it's not easy. But the restructuring, if any, should be done > > > separately from other changes, not mixed up with them. > > > > > > > > > > > The functions removed were all functions for which I had THUMB1 > > > > sequences faster/smaller than lib1funcs: __clzsi2, __clzdi2, __ctzsi2, > > > > __ashrdi3, __lshrdi3, __ashldi3. In fact, the new THUMB1 of __clzsi2 is > > > > the same number of instructions as the previous ARM/THUMB2 version. > > > > > > > > You will find all of the previous ARM versions of these functions merged > > > > into the new files (with attribution) and the same preprocessor > > > > selection path. So no architecture variant should be any worse off than > > > > before this patch, and some beyond v6m should benefit. > > > > > > > > In the future, I think that my versions of __divsi3 and __divdi3 will > > > > prove faster/smaller than the existing THUMB2 versions. I know that my > > > > float routines are less than half the compiled size of THUMB2 versions > > > > in 'ieee754-sf.S'. However, I haven't profiled the exact performance > > > > differences so I have left all this work for future patches. (It's also > > > > quite likely that my version can be further-refined with a few judicious > > > > uses of THUMB2 alternatives.) > > > > > > > > My long-term vision would be use lib1funcs as an architectural wrapper > > > > distinct from the implementation code. > > > > > > > >> - lets avoid having the cm0 subdirectory - in particular we should do > > > >> this when there is existing code for other targets in the same source > > > >> files. It's OK to have any new files in the main 'arm' directory of > > > >> the > > > >> source tree - just name the files appropriately if really needed. > > > > > > > > Fair point on the name. In v1 of this patch, all these files were all > > > > preprocessor-selected for v6m only. However, as I've stumbled through > > > > the finer points of integration, that line has blurred. Name aside, > > > > the subdirectory does still represent a standalone library. I think > > > > I've managed to add enough integration hooks that it works well in > > > > a libgcc context, but it still has a very distinct implementation style. > > > > > > > > I don't have a strong opinion on this, just preference. But, keeping > > > > the subdirectory with a neutral name will probably make maintenance > > > > easier in the short term. I would suggest "lib0" (since it caters to > > > > the lowest common denominator) or "eabi" (since that was the original > > > > target). There are precedents in other architectures (e.g. avr). > > > > > > The issue here is that the selection of code from the various > > > subdirectories is not consistent. In some cases we might be pulling in > > > a thumb1 implementation into a thumb2 environment, so having the code in > > > a directory that doesn't reflect this makes maintaining the code harder. > > > I don't mind too much if some new files are introduced and their names > > > reflect both their function and the architecture they support - eg > > > t1-di-shift.S would obviously contain code for di-mode shifts in thumb1. > > > > You didn't say that a neutral directory name is off the table. > > I will propose something other than 'cm0'. > > > > > > > > > >> - let's avoid the CM0 prefix - this is 'thumb1' code, for want of a > > > >> better term, and that is used widely elsewhere in the compiler. So if > > > >> you really need a term just use THUMB1, or even T1. > > > > > > > > Maybe. The Cortex M0 includes a subset of THUMB2 instructions. Most > > > > of this is probably THUMB1 clean, but it wasn't a design requirement. > > > > > > It's particularly the Thumb1 issue, just more the name is for a specific > > > CPU which might cause confusion later. v6m would be preferable to that > > > if there really is a dependency on the instructions that are not in the > > > original Thumb1 ISA. > > > > I will remove the CM0 prefix and use/extend the standard macro names. > > > > > > > > > > > > > The CM0_FUNC_START exists so that I can specify subsections of ".text" > > > > for each function. This was a fairly fundamental design decision that > > > > allowed me to make a number of branch optimizations between functions. > > > > The other macros are just duplicates for naming symmetry. > > > > > > This is something we'll have to get to during the main review of the > > > code - we used to have support for PE-COFF object files. That might now > > > be obsolete, wince support is certainly deprecated - but we can't assume > > > that ELF is the only object format we'll ever have to support. > > > > > > > > > > > The existing FUNC_START macro inserts extra conflicting ".text" > > > > directives that would break the build. Of course, the prefix was > > > > arbitrary; I just took CM0 from the library name. But, there's nothing > > > > architecturally significant about this macro at all, so THUMB1 and T1 > > > > seems just about as wrong. Maybe define a FUNC_START_SECTION macro with > > > > two parameters? For example: > > > > > > > > FUNC_START_SECTION clzdi2 .text.sorted.libgcc.clz2.clzdi2 > > > > > > > > Instead of: > > > > > > > > .section .text.sorted.libgcc.clz2.clzdi2,"x" > > > > CM0_FUNC_START clzdi2 > > > > > > > >> - For the 64-bit shift functions, I expect the existing code to be > > > >> preferable whenever possible - I think it can even be tweaked to build > > > >> for thumb2 by inserting a suitable IT instruction. So your code should > > > >> only be used when > > > >> > > > >> #if !__ARM_ARCH_ISA_ARM && __ARM_ARCH_ISA_THUMB == 1 > > > > > > > > That is the definition of NOT_ISA_TARGET_32BIT, which I am respecting. > > > > (The name doesn't seem quite right for Cortex M0, since it does support > > > > some 32 bit instructions, but that's beside the point.) > > > > > > The terms Thumb1 and Thumb2 predate the arm-v8m architecture > > > specifications - even then the term Thumb1 was interpreted as "mostly > > > 16-bit instructions" and thumb2 as "a mix of 16- and 32-bit". Yes, the > > > 16/32-bit spilt has become more blurred and that will likely continue in > > > future since the 16-bit encoding space is pretty full. > > > > > > > > > > > The current lib1funcs ARM code path still exists, as described above. My > > > > THUMB1 implementations were 1 - 3 instructions shorter than the current > > > > versions, which is why I took the time to merge the files. > > > > > > > > Unfortunately, the Cortex M0 THUMB2 subset does not provide IT. I don't > > > > see an advantage to eliminating the branch unless these functions were > > > > written with cryptographic side channel attacks in mind. > > > > > > On high performance cores branches are predicted - if the branch is > > > predictable then the common path will be taken and the unneeded > > > instructions will never be used. But library functions like this tend > > > to have very unpredictable values used for calling them, so it's much > > > less likely that the hardware will predict the right path - at this > > > point conditional instructions tend to win (especially if there aren't > > > very many of them) because the cost (on average) of not executing the > > > unneeded instructions is much lower than the cost (on average) of > > > unwinding the processor state to execute the other arm of the > > > conditional branch. > > > > Got it. I have been counting branches as 3 cycles of fixed cost, and > > ignoring penalties if a branch skips at least 2 instructions. > > > > Going forward, I will add 'IT<c>' compile options for any new code with > > scope beyond v6m. > > > > > >> - most, if not all, of your LSYM symbols should not be needed after > > > >> assembly, so should start with a captial 'L' (and no leading > > > >> underscores), the assembler will then automatically discard any that > > > >> are > > > >> not needed for relocations. > > > > > > > > You don't want debugging symbols for libgcc internals :) ? I sort of > > > > understand that, but those symbols have been useful to me in the past. > > > > The "." by itself seems to keep visibility local, so the extra symbols > > > > won't cause linker issuess. Would you object to a macro variant (e.g. > > > > LLSYM) that prepends the "L" but is easier to disable? > > > > > > It is a matter of taste, but I really prefer the local symbols to > > > disappear entirely once the file is compiled - it makes things like > > > backtrace gdb show the proper call heirarchy. > > > > Hearing no objection to LLSYM, I'll implement that for debugging. > > The released version will have ".L" symbols stripped. > > > > > >> - you'll need to write suitable commit messages for each patch, which > > > >> also contain a suitable ChangeLog style entry. > > > > > > > > OK. > > > > > > > >> - finally, your popcount implementations have data in the code segment. > > > >> That's going to cause problems when we have compilation options such > > > >> as > > > >> -mpure-code. > > > > > > > > I am just following the precedent of existing lib1funcs (e.g. __clz2si). > > > > If this matters, you'll need to point in the right direction for the > > > > fix. I'm not sure it does matter, since these functions are PIC anyway. > > > > > > That might be a bug in the clz implementations - Christophe: Any thoughts? > > > > __clzsi2() has test coverage in "gcc.c-torture/execute/builtin-bitops-1.c" > Thanks, I'll have a closer look at why I didn't see problems. >
So, that's because the code goes to the .text section (as opposed to .text.noread) and does not have the PURECODE flag. The compiler takes care of this when generating code with -mpure-code. And the simulator does not complain because it only checks loads from the segment with the PURECODE flag set. > > The 'clzs' and 'ctz' functions should never have problems. -mpure-code > > appears to be valid only when the 'movt' instruction is available, which > > means that the 'clz' instruction will also be available, so no array loads. > No, -mpure-code is also supported with v6m. > > > Is the -mpure-code state detectable as a preprocessor flag? While > No. > > > 'movw'/'movt' appears to be the canonical solution, I'm not sure it > > should be the default just because a processor supports Thumb-2. > > > > Do users wanting to use -mpure-code recompile the toolchain to avoid > > constant data in compiled C functions? I don't think this is the > > default for the typical toolchain scripts. > No, users of -mpure-code do not recompile the toolchain. > > > > >> I strongly suggest that, rather than using gcc snapshots (I'm assuming > > > >> this based on the diff style and directory naming in your patches), you > > > >> switch to using a git tree, then you'll be able to use tools such as > > > >> rebasing and the git posting tools to send the patch series for > > > >> subsequent review. > > > > > > > > Your assumption is correct. I didn't think that I would have to get so > > > > deep into the gcc development process for this contribution. I used > > > > this library as a bare metal alternative for libgcc/libm in the product > > > > for years, so I thought it would just drop in. But, the libgcc compile > > > > mechanics have proved much more 'interesting'. I'm assuming this > > > > architecture was created years before the introduction of -ffunction- > > > > sections... > > > > > > > > > > I don't think I've time to write a history lesson, even if you wanted > > > it. Suffice to say, this does date back to the days of a.out format > > > object files (with 4 relocation types, STABS debugging, and one code, > > > one data and one bss section). > > > > > > >> > > > >> Richard. > > > >> > > > > > > > > Thanks again, > > > > Daniel > > > > > > > > > > R. > > > > > > > To reiterate what I said above, I intend to push forward and incorporate > > your current recommendations plus any further feedback I may get. I > > expect you to say that this doesn't merit inclusion yet, but I'd rather > > spend the time while I have it. > > > > I'll post a patch series for review within the next day or so. > > Here are the results of the validation of your latest version (20210105): > https://people.linaro.org/~christophe.lyon/cross-validation/gcc-test-patches/r11-5993-g159b0bd9ce263dfb791eff5133b0ca0207201c84-cortex-m0-fplib-20210105.patch/report-build-info.html > > "BIG-REGR" just means the regression report is large enough that it's > provided in compressed form to avoid overloading the browser. > > So it really seems your patch introduces regressions in arm*linux* configs. > For the 2 arm-none-eabi configs which show regressions (cortex-m0 and > cortex-m3), the logs seem to indicate some tests timed out, and it's > possible the server used was overloaded. > The same applies to the 3 aarch64*elf cases, where the regressions > seem only caused by timed out; there's no reason your patch would have > an impact on aarch64. > (there 5 configs were tested on the same machine, so overload is indeed > likely). > > I didn't check why all the ubsan tests now seem to fail, they are in > the "unstable" category because in the past some of them had some > randomness. > I do not see such noise in trunk validation though. > > Thanks, > > Christophe > > > > > Thanks again, > > Daniel