Re: GCC Optimisation status update
> We are working on a patch which will improve decimal
> itoa by up to 10X. It will take a while to finish it.

What's the method?

I have a function converting 32-bit unsigned integers to decimal which costs one 32x32->64 multiply by a constant (a single constant, not a look-up table) plus a loop of at most 8 iterations involving a few 64-bit adds and shifts, which can be unrolled for speed (there's very little in the loop body, really). There's also an initial overhead of up to three 32-bit compare-and-subtract steps.

The 64-bit unsigned to decimal conversion costs two calls to the above routine, three 32x32->64 multiplies and a few preparation steps, which are simple 64-bit add/sub operations.

The routines are used on 32-bit ARM chips where multiply is dirt cheap; on chips with no 32x32->64 multiply they might not be feasible. The routines are also quite simple. If they would be useful to you, they have been released under the GPL (with an additional relaxing clause, but that's irrelevant here).

I don't know whether the method is already well known. A casual search on the Net did not turn up binary to decimal conversion using this technique at the time I came up with it (a couple of years ago), so it may not be that widespread.

I also have routines to convert 32- and 64-bit numbers to an arbitrary base without using division but, again, they rely heavily on the cheap 32x32->64 multiply and cheap 64-bit shifts.

Zoltan
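For readers wondering what a division-free conversion with this cost profile can look like, here is a minimal sketch of one well-known fixed-point technique. It is not necessarily the method described in the message above. The function name `u32_to_dec` and the choice of constant (ceil(2^57 / 10^8) = 1441151881) are assumptions made for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (illustration only): writes the decimal form of n
 * into out (at least 11 bytes), NUL-terminated, and returns its length. */
static int u32_to_dec(uint32_t n, char *out)
{
    char buf[10];
    unsigned lead = 0;
    int i, start;
    uint64_t t;

    /* Up to three compare-and-subtract steps peel off the 10^9 digit (0..4). */
    if (n >= 4000000000u) {
        n -= 4000000000u;
        lead = 4;
    } else {
        if (n >= 2000000000u) { n -= 2000000000u; lead = 2; }
        if (n >= 1000000000u) { n -= 1000000000u; lead += 1; }
    }
    buf[0] = (char)('0' + lead);

    /* n < 10^9 here. One 32x32->64 multiply turns it into a fixed-point
     * number with 57 fraction bits whose integer part is the next digit.
     * 1441151881 is ceil(2^57 / 10^8); the rounding error stays below one
     * unit of the last decimal place for every n < 10^9. */
    t = (uint64_t)n * 1441151881u;
    buf[1] = (char)('0' + (int)(t >> 57));

    /* Eight more digits: drop the integer part, then multiply the fraction
     * by 10 using only shifts and an add. */
    for (i = 2; i < 10; i++) {
        t &= (UINT64_C(1) << 57) - 1;
        t = (t << 3) + (t << 1);
        buf[i] = (char)('0' + (int)(t >> 57));
    }

    /* Skip leading zeros, but always keep the last digit. */
    start = 0;
    while (start < 9 && buf[start] == '0')
        start++;
    memcpy(out, buf + start, (size_t)(10 - start));
    out[10 - start] = '\0';
    return 10 - start;
}
```

Calling `u32_to_dec(4294967295u, buf)` with a `char buf[11]` fills `buf` with the ten digits of the largest 32-bit value. The loop body is small enough to unroll, as the message suggests for its own routine.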
Re: Bitfields
On Sun, 20 Sep 2009, Joseph S. Myers wrote:

> On Sun, 20 Sep 2009, Zoltán Kócsi wrote:
>
> > I wonder if there would be at least a theoretical support by the
> > developers to a proposal for volatile bitfields:
>
> It has been proposed (and not rejected, but not yet implemented) that
> volatile bit-fields should follow the ARM EABI specification (on all
> targets); that certainly seems better than inventing something new unless
> you have a very good reason to prefer the something new on some targets.

Yes, that discussion was what got me thinking and made me suggest this *before* the ARM EABI behaviour gets implemented. I am not suggesting implementing something instead of the ARM EABI; I am suggesting implementing something on top of it. The suggested behaviour is also architecture-neutral. It amounts to nothing more than this: if the user expressly asks the compiler to break the standard in a particular way, then the compiler does so. The standard is broken at one single point.

The ARM EABI spec clearly states that bitfield operations are never to be combined, not even when consecutive bitfield assignments refer to bitfields located in the same machine word. My suggestion is that if a new command line switch is present, then in the special case where consecutive assignments are made to bitfields within the same word and the assignments are separated by the comma operator, the compiler combines those assignments.

The rationale for such behaviour is writing low-level code dealing with HW registers. As a practical example, take a SoC chip with multi-function pins. Let's assume that there is a register that has 2 bits for each pin, and the value of those 2 bits selects the function of the pin; a 32-bit register can thus control 16 pins.

Now if you want to, say, assign 4 pins to the SPI interface, without bitfields you would (and indeed do) write something along these lines:

    temp = *pin_control_reg;
    temp &= ~(PIN_03_MASK | PIN_04_MASK | PIN_05_MASK | PIN_06_MASK);
    temp |= PIN_03_MISO | PIN_04_MOSI | PIN_05_SCLK | PIN_06_SSEL;
    *pin_control_reg = temp;

You can't really use bitfields to achieve the above, because if you write

    pin_control_reg->pin_03 = MISO;
    pin_control_reg->pin_04 = MOSI;

and so on, pin_xx being 2-bit wide bitfields, then according to the ARM EABI spec each statement would be translated to a

    temp = *pin_control_reg; temp &= ...; temp |= ...; *pin_control_reg = temp;

sequence. What I suggest is that if you write

    pin_control_reg->pin_03 = MISO,   // Note the comma
    pin_control_reg->pin_04 = MOSI,
    pin_control_reg->pin_05 = SCLK,
    pin_control_reg->pin_06 = SSEL;

and compile it with a -fcomma-combines-bitfields switch, then you get the equivalent of the first code fragment, where you manually combined the masks and the settings and only a single load and a single store were used. If the switch is not given, or the consecutive assignments are not separated by commas, or the bitfields do not belong to the same word, then the behaviour falls back to the default ARM EABI spec.

The advantage of the suggested behaviour is that it would allow the use of the more elegant and expressive bitfields in place of the many hundreds of #define REGNAME_FIELDNAME_MASK and #define REGNAME_FIELDNAME_SHIFT macros that you can currently find in code that deals with HW. The suggestion does not introduce any new functionality or performance advantage; it just provides a way of writing (in my opinion) more readable and more maintainable code than what we have now with all the #defines. The fact that structure members live in their own namespace, as opposed to the global #define namespace, is an added benefit, of course.

The suggested extension does not break backward compatibility, because the #define approach would not be affected and the ARM EABI behaviour is not yet implemented anyway. It would not break the expected behaviour either, because it becomes active only when an explicit command line switch is given and it has no side effects outside the single expression whose subexpressions are separated by commas.

The change, I believe, would benefit gcc users who deal with HW a lot, i.e. low-level embedded system and device driver designers. Outside of that circle the suggested behaviour would bring only a small performance benefit.

Zoltan
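The manual read-modify-write that the message wants to replace can be sketched as follows. The macro names, the function name `assign_spi_pins` and the 2-bit function codes are all hypothetical, chosen only to make the fragment self-contained; real values would come from the chip's data sheet. The register layout (2 bits per pin, 16 pins per 32-bit word) follows the message's example.

```c
#include <stdint.h>

/* Hypothetical layout helpers: pin n occupies bits 2n and 2n+1. */
#define PIN_SHIFT(n)      ((n) * 2u)
#define PIN_MASK(n)       ((uint32_t)3u << PIN_SHIFT(n))
#define PIN_FUNC(n, f)    ((uint32_t)(f) << PIN_SHIFT(n))

/* Hypothetical 2-bit function codes, not taken from any real chip. */
enum { FUNC_MISO = 1, FUNC_MOSI = 2, FUNC_SCLK = 3, FUNC_SSEL = 1 };

/* Assign pins 3..6 to SPI with exactly one load and one store of the
 * register, which is what matters when it is a volatile HW register. */
static void assign_spi_pins(volatile uint32_t *pin_control_reg)
{
    uint32_t temp = *pin_control_reg;                 /* single load  */

    temp &= ~(PIN_MASK(3) | PIN_MASK(4) | PIN_MASK(5) | PIN_MASK(6));
    temp |= PIN_FUNC(3, FUNC_MISO) | PIN_FUNC(4, FUNC_MOSI)
          | PIN_FUNC(5, FUNC_SCLK) | PIN_FUNC(6, FUNC_SSEL);

    *pin_control_reg = temp;                          /* single store */
}
```

The proposal in the message is about getting this same single load and single store from comma-separated bitfield assignments; the switch it names is a suggestion, not an existing GCC option.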
Re: arm-elf multilib issues
On Thu, 1 Oct 2009, Paul Brook wrote:

> > Do we want to enable more multilibs in arm-elf?
>
> Almost certainly not. As far as I'm concerned arm-elf is obsolete, and in
> maintenance only mode. You should be using arm-eabi.

I'm possibly (probably?) wrong, but as far as I know, arm-eabi forces alignment of 64-bit data (namely, doubles and long longs) to 8-byte boundaries, which does not make sense on small 32-bit cores with 32-bit buses and no caches (e.g. practically all ARM7TDMI based chips). Memory is a scarce resource on those, and wasting bytes on alignment with no performance benefit makes arm-eabi less attractive.

Also, as far as I know, passing such data to functions can cause some headache, because 64-bit arguments are aligned to even-numbered registers, which can force arguments onto the stack unnecessarily (memory access is rather expensive on a cache-less ARM7TDMI). If you have to write assembly routines that take long long or double arguments among other types, that forces you to shuffle registers and fetch data from the stack. You lose code space, data space and CPU cycles with absolutely nothing in return.

For resource-constrained embedded systems built around one of those 32-bit cores, arm-elf is actually rather more attractive than arm-eabi.

Zoltan
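To put a number on the padding cost discussed above, here is a small worked example. It assumes a hypothetical struct with a 32-bit member, a 64-bit member and another 32-bit member, and computes its size by hand for a given alignment of the 64-bit member. The helper names `align_up` and `sample_struct_size` are made up for this sketch; the layout rules are the usual C ones (each member at the next suitably aligned offset, total size rounded up to the largest member alignment).

```c
#include <stddef.h>

/* Round off up to the next multiple of align (align is a power of two). */
static size_t align_up(size_t off, size_t align)
{
    return (off + align - 1) & ~(align - 1);
}

/* Size of the hypothetical
 *     struct { uint32_t a; uint64_t d; uint32_t b; }
 * when 64-bit members require align64 bytes of alignment. */
static size_t sample_struct_size(size_t align64)
{
    size_t off = 0;
    size_t max_align = 4;              /* alignment of the 32-bit members */

    off = align_up(off, 4) + 4;        /* a */
    off = align_up(off, align64) + 8;  /* d, possibly preceded by padding */
    if (align64 > max_align)
        max_align = align64;
    off = align_up(off, 4) + 4;        /* b */

    return align_up(off, max_align);   /* tail padding */
}
```

With 8-byte alignment the struct takes 24 bytes, with 4-byte alignment 16 bytes, so this particular (deliberately unfavourable) member order wastes 8 bytes per instance. Ordering the members by decreasing size removes the padding in both cases.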
Re: arm-elf multilib issues
> Meh. Badly written code on antique hardware.
> I realise this sounds harsh, but in all seriousness if you take a bit of care

Yes, I think it does sound harsh, considering that, I believe, at least as many chips are sold with an ARM7TDMI core as the nice fat chips with MMU, caches, and 64- and 128-bit buses.

> (and common sense) you should get the alignment for free in pretty much all
> cases, and it can make a huge difference on ARMv5te cores.
> If you're being really pedantic then old-abi targets tend to pad all
> structures to a word boundary. I'd expect this to have much more
> detrimental overall effect than alignment of doubleword quantities,
> which in my experience are pretty rare to start with.

Well, I have to agree with the above.

Zoltan
Serious code generation/optimisation bug (I think)
I was debugging a function, and inserting a debug statement crashed the system. Some investigation revealed that gcc 4.3.2 arm-eabi (compiled from sources) with -O2 under some circumstances assumes that if a pointer is dereferenced, it cannot be NULL, and therefore explicit tests against NULL can later be eliminated. Here is a short function that demonstrates the erroneous behaviour:

    extern void Debug( unsigned int x );

    typedef struct s_head {
        struct s_head *next;
        unsigned int value;
    } A_STRUCT;

    void InsertByValue( A_STRUCT **queue, A_STRUCT *ptr )
    {
        A_STRUCT *tst;

        for ( tst = *queue ; ; queue = &tst->next, tst = *queue ) {

            // Debug( tst->value );

            if ( ! tst ) {
                ptr->next = (void *) 0;
                break;
            }
            if ( tst->value < ptr->value ) {
                ptr->next = tst;
                break;
            }
        }
        *queue = ptr;
    }

Compiling this function with

    arm-eabi-gcc -O2 -S foo.c

generates perfect code. However, if the Debug( tst->value ); line is not commented out, then the generated code looks like this:

    InsertByValue:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        stmfd   sp!, {r4, r5, r6, lr}
        mov     r6, r0
        ldr     r4, [r0, #0]
        mov     r5, r1
        b       .L3
    .L2:
        mov     r6, r4
        ldr     r4, [r4, #0]
    .L3:
        ldr     r0, [r4, #4]
        bl      Debug
        ldr     r2, [r4, #4]
        ldr     r3, [r5, #4]
        cmp     r2, r3
        bcs     .L2
        str     r4, [r5, #0]
        str     r5, [r6, #0]
        ldmfd   sp!, {r4, r5, r6, lr}
        bx      lr

As you can see, when 'tst' is fetched into R4, it is not checked against 0 anywhere, and the whole if ( ! tst ) { ... } bit is completely eliminated from the code. Indeed, the actual compiled code crashes because the loop does not stop when the end of the list is reached.

I know that you are not supposed to dereference a NULL pointer; however, on the microcontroller I have it is perfectly legal: what you get is an element of the exception vector table that resides at 0x0. I don't think that the compiler has a right to remove my test just because it assumes that if I dereferenced a pointer then it surely was not NULL.
At least it should give me a warning (which it does not, not even with -W -Wall -Wextra).

Zoltan
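Some background on the behaviour described above. The C standard leaves dereferencing a null pointer undefined, and that is what GCC relies on when it drops the later test. GCC has an option, -fno-delete-null-pointer-checks, that turns this inference off for targets where address 0 is readable. The other route is to keep the source well-defined by testing the pointer before the first dereference. The fragment below is a minimal sketch of that second route; the stub `Debug` and the counter `debug_calls` stand in for the real external function and are assumptions of this example.

```c
#include <stddef.h>

typedef struct s_head {
    struct s_head *next;
    unsigned int   value;
} A_STRUCT;

/* Stand-in for the external Debug() of the original post (assumption). */
static unsigned int debug_calls;
static void Debug(unsigned int x)
{
    (void)x;
    debug_calls++;
}

void InsertByValue(A_STRUCT **queue, A_STRUCT *ptr)
{
    A_STRUCT *tst;

    for (tst = *queue; ; queue = &tst->next, tst = *queue) {
        /* Test first: nothing has dereferenced tst yet, so the compiler
         * has no basis for assuming it is non-null. */
        if (!tst) {
            ptr->next = NULL;
            break;
        }

        Debug(tst->value);      /* safe: tst is known to be non-null here */

        if (tst->value < ptr->value) {
            ptr->next = tst;
            break;
        }
    }
    *queue = ptr;
}
```

Inserting elements with values 5, 3 and 9 into an empty queue with this version yields the list 9, 5, 3, and the end-of-list test survives optimisation because it precedes every dereference.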
ARM compiler rewriting code to be longer and slower
Using gcc 4.4.0, I compiled a function and found the result a tad long. The command line is:

    gcc -Os -mcpu=arm7tdmi-s -S func.c

although the output is pretty much the same with -O2 or -O3 as well (only a few instructions longer). The function is basically an unrolled 32-bit unsigned division by 1E9:

    unsigned int divby1e9( unsigned int num, unsigned int *quotient )
    {
        unsigned int dig;
        unsigned int tmp;

        tmp = 1000000000u;
        dig = 0;
        if ( num >= tmp ) {
            tmp <<= 2;
            if ( num >= tmp ) {
                num -= tmp;
                dig = 4;
            }
            else {
                tmp >>= 1;
                if ( num >= tmp ) {
                    num -= tmp;
                    dig = 2;
                }
                tmp >>= 1;
                if ( num >= tmp ) {
                    num -= tmp;
                    dig++;
                }
            }
        }
        *quotient = dig;
        return num;
    }

The compiler generated the following code:

    divby1e9:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r3, .L10
        cmp     r0, r3
        movls   r3, #0
        bls     .L3
        ldr     r2, .L10+4
        cmp     r0, r2
        addhi   r0, r0, #293601280
        addhi   r0, r0, #1359872
        addhi   r0, r0, #6144
        movhi   r3, #4
        bhi     .L3
    .L4:
        ldr     r2, .L10+8
        cmp     r0, r2
        movls   r3, #0
        bls     .L6
        add     r0, r0, #-2013265920
        add     r0, r0, #13238272
        add     r0, r0, #27648
        cmp     r0, r3
        movls   r3, #2
        bls     .L3
        mov     r3, #2
    .L6:
        add     r0, r0, #-1006632960
        add     r0, r0, #6619136
        add     r0, r0, #13824
        add     r3, r3, #1
    .L3:
        str     r3, [r1, #0]
        bx      lr
    .L11:
        .align  2
    .L10:
        .word   999999999
        .word   -294967297
        .word   1999999999

Note that it is sub-optimal on two counts. First, each loading of a constant takes 3 instructions and 3 clocks. Storing the constant and fetching it using an ldr also takes 3 clocks but only two 32-bit words, and identical constants need to be stored only once. The speed argument applies only to the ARM7TDMI-S, which has no caches, so that's just a minor issue, but the memory saving holds no matter what ARM core you have (note that -Os was specified).

Second, and this is the real problem, if the compiler did not try to be overly clever and compiled the code as it was written, then instead of loading the constants 4 times, at the cost of 3 instructions each, it could have loaded the constant only once and then generated the next constants at the cost of a single-word, single-clock shift. The code would have been rather shorter *and* faster, plus some of the jumps could have been eliminated. Practically each C statement line (except the braces) corresponds to one assembly instruction, so without being clever, just translating what's written, it could be done in 20 words instead of 30.

Is it a problem that is worth putting onto bugzilla, or do I just have to do some trickery to save the compiler from being smarter than it is?

Zoltan
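One kind of "trickery" sometimes used for this purpose is an empty inline-asm statement that makes a value opaque to the optimiser, so that the constant cannot be propagated and the shifts have to be emitted as written. The sketch below applies it to the function from the message. The macro name `OPAQUE` is made up, the construct is specific to GCC-compatible compilers, and whether it actually yields the shorter code the message hopes for would have to be checked in the generated assembly for the compiler version in question.

```c
/* Empty asm with a read-write register operand: generates no instruction,
 * but the compiler can no longer assume it knows the value of x.
 * GCC-compatible compilers only; elsewhere the macro does nothing. */
#if defined(__GNUC__)
#define OPAQUE(x) __asm__("" : "+r"(x))
#else
#define OPAQUE(x) ((void)0)
#endif

/* Unrolled division of a 32-bit unsigned number by 10^9:
 * stores the quotient (0..4) and returns the remainder. */
unsigned int divby1e9(unsigned int num, unsigned int *quotient)
{
    unsigned int dig = 0;
    unsigned int tmp = 1000000000u;

    OPAQUE(tmp);                /* hide the constant from the optimiser */

    if (num >= tmp) {
        tmp <<= 2;              /* 4 * 10^9 still fits in 32 bits */
        if (num >= tmp) {
            num -= tmp;
            dig = 4;
        } else {
            tmp >>= 1;          /* 2 * 10^9 */
            if (num >= tmp) {
                num -= tmp;
                dig = 2;
            }
            tmp >>= 1;          /* 1 * 10^9 */
            if (num >= tmp) {
                num -= tmp;
                dig++;
            }
        }
    }
    *quotient = dig;
    return num;
}
```

The function's results are unchanged by the macro; only the information available to the optimiser differs.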
Optimising for size
Just a tentative question about a problem:

I have a piece of C code. The code, compiled for an ARM THUMB target using gcc 4.0.2 with -Os, results in 230 instructions. The exact same code, using the exact same switches, compiles to 437 instructions with gcc 4.3.1. Considering that the compiler optimises for size and the much newer compiler emits almost twice as much code as the old one, I think it is an issue.

So the question is, how should I report it? It is not a bug as such, it is a performance issue, but I think one that should be considered.

Overall, on a source resulting in a binary about 4000 insns long, the newer version generates only about 150 instructions more than the old one. That indicates that it actually saved some space on pieces of the code other than the very sick case mentioned above, but the savings weren't enough to compensate for the 230 -> 437 instruction blowout.

Thanks,

Zoltan
Signed-unsigned comparison question
Gcc 8.2.0 (arm-none-eabi) throws a warning on the following construct:

    uint32_t a;
    uint16_t b;

    if ( a > b ) ...

complaining that a signed integer is compared against an unsigned one. Of course, it is correct, as 'b' was promoted to int.

But shouldn't it be smart enough to know that (int) b is restricted to the range of [0,65535], which it can safely compare against the range of [0,0xffffffffu]?

Thanks,

Zoltan
Re: Signed-unsigned comparison question
Correction:

The construct gcc complains about is not

    if ( a > b ) ...

but

    if ( a > b - ( b >> 2 ) ) ...

but the same still applies. The RHS of the > operator can never be negative or overflow on 32 bits.

On Fri, 8 Mar 2019 10:40:06 +1100 Zoltan Kocsi wrote:

> Gcc 8.2.0 (arm-none-eabi) throws a warning on the following construct:
>
> uint32_t a;
> uint16_t b;
>
> if ( a > b ) ...
>
> complaining that a signed integer is compared against an unsigned.
> Of course, it is correct, as 'b' was promoted to int.
>
> But shouldn't it be smart enough to know that (int) b is restricted to
> the range of [0,65535] which it can safely compare against the range
> of [0,0xffffffffu]?
>
> Thanks,
>
> Zoltan
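The usual way to state the range argument above in the source itself is an explicit cast of the right-hand side. The fragment below is a minimal sketch, assuming a target where int is at least 32 bits wide (as on arm-none-eabi), so that both b and b >> 2 are promoted to int. The function name `above_three_quarters` is made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a is greater than b minus a quarter of b.
 * b and (b >> 2) are promoted to int, and since b >> 2 <= b the
 * difference lies in [0, 65535]; the cast to uint32_t therefore
 * preserves the value and makes both operands of > unsigned. */
static bool above_three_quarters(uint32_t a, uint16_t b)
{
    return a > (uint32_t)(b - (b >> 2));
}
```

For b = 65535 the right-hand side evaluates to 65535 - 16383 = 49152, so the function returns false for a = 49152 and true for a = 49153.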