Re: GCC Optimisation status update

2011-06-14 Thread zoltan

> We are working on a patch which will improve decimal
> itoa by up to 10X.  It will take a while to finish it.

What's the method?

I have a function converting 32 bit unsigneds to decimal which costs one
32x32->64 multiply with a constant (a single constant, not a look-up
table) plus a max. 8-times loop involving a few 64-bit adds and shifts,
which can be unrolled for speed (there's very little in the loop body,
really). There's also an initial overhead of up to three 32-bit compare
and subtracts.
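
One way to read the 32-bit description above is as a fixed-point reciprocal scheme. The sketch below is only an illustration under that assumption, not the released routine: the function name u32_to_dec, the constant ceil(2^57 / 10^8) and the 57-bit fraction are choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ceil(2^57 / 10^8): scales n < 10^9 so that bits 57 and up hold the
 * leading decimal digit and the low 57 bits hold the fraction. */
#define RECIP_1E8  1441151881u
#define FRAC_BITS  57
#define FRAC_MASK  ((UINT64_C(1) << FRAC_BITS) - 1)

/* Hypothetical helper: writes n in decimal to buf (NUL-terminated, no
 * leading zeros) and returns the digit count. buf needs 11 bytes. */
size_t u32_to_dec(uint32_t n, char *buf)
{
    char     digits[10];
    unsigned top = 0;
    uint64_t t;
    size_t   i, first, len = 0;

    assert(buf != NULL);

    /* Billions digit (0..4) by compare and subtract, no division. */
    if (n >= 4000000000u) {
        n -= 4000000000u;
        top = 4;
    } else {
        if (n >= 2000000000u) { n -= 2000000000u; top = 2; }
        if (n >= 1000000000u) { n -= 1000000000u; top += 1; }
    }
    digits[0] = (char)top;

    /* One 32x32->64 multiply; n < 10^9, so the product stays below 2^61. */
    t = (uint64_t)n * RECIP_1E8;

    /* Peel nine digits: the integer part is the digit, then the fraction
     * is scaled by ten. The multiply after the last digit is redundant
     * but harmless. */
    for (i = 1; i <= 9; i++) {
        digits[i] = (char)(t >> FRAC_BITS);
        t = (t & FRAC_MASK) * 10u;
    }

    /* Skip leading zeros but always keep the last digit. */
    for (first = 0; first < 9 && digits[first] == 0; first++)
        ;
    for (i = first; i < 10; i++)
        buf[len++] = (char)('0' + digits[i]);
    buf[len] = '\0';
    return len;
}
```

The rounding error of the constant is below 0.25 units per input count, which keeps the accumulated error under one unit in the last digit for every n below 10^9. The multiply by ten can be written as two shifts and an add on cores where that is cheaper.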

The 64 bit unsigned to decimal conversion costs two calls to the above
routine, three 32x32->64 multiplies and a few preparation steps, which
are simple 64-bit add/sub things.

The routines are used on 32-bit ARM chips where multiply is dirt cheap;
for chips with no 32x32->64 multiply they might not be feasible. The
routines are also quite simple. If they would be useful to you, they have
been released under the GPL (with an additional relaxing clause, but that's
irrelevant here). I don't know if the method is already well known; a casual
search on the Net did not find binary to decimal conversion using the
above technique at the time when I came up with it (a couple of years ago),
so it may not be that widespread.

I also have routines to convert 32 and 64 bit numbers to arbitrary base
without using division but again, they are heavily reliant on the cheap
32x32->64 multiply and cheap 64-bit shifts.

Zoltan



Re: Bitfields

2009-09-20 Thread zoltan
On Sun, 20 Sep 2009, Joseph S. Myers wrote:

> On Sun, 20 Sep 2009, Zoltán Kócsi wrote:
>
> > I wonder if there would be at least a theoretical support by the
> > developers to a proposal for volatile bitfields:
>
> It has been proposed (and not rejected, but not yet implemented) that
> volatile bit-fields should follow the ARM EABI specification (on all
> targets); that certainly seems better than inventing something new unless
> you have a very good reason to prefer the something new on some targets.

Yes, it was that discussion that made me think about this and suggest it
*before* the ARM EABI gets implemented. I don't suggest implementing
something instead of the ARM EABI, I suggest implementing something on top
of it. The suggested behaviour is also architecture-neutral.

It is nothing more than if the user expressly asks the compiler to break
the standard in a particular way, then the compiler does so. The breaking
of the standard is at one single point. The ARM EABI spec clearly states
that bitfield operations are never to be combined, not even in the case
where consecutive bitfield assignments refer to bitfields located in the
same machine word. My suggestion was that if a new command line switch is
present, then in the special case of consecutive bitfield assignments
being made to fields within the same word and the assignments being
separated by the comma operator, then the compiler combines those
assignments. The rationale of such behaviour is writing low-level code
dealing with HW registers. To have a practical example, let's have a SoC
chip with multi-function pins. Let's assume that we have a register that
has 2 bits for each actual pin and the value of the 2 bits selects the
actual function for the pin; a 32 bit register can thus control 16 pins.
Now if you want to, say, assign 4 pins to the SPI interface, without
bitfields you would (and indeed do) write something along these lines:

temp = *pin_control_reg;
temp &= ~(PIN_03_MASK | PIN_04_MASK | PIN_05_MASK | PIN_06_MASK);
temp |= PIN_03_MISO | PIN_04_MOSI | PIN_05_SCLK | PIN_06_SSEL;
*pin_control_reg = temp;

You can't really use bitfields to achieve the above, because if you write

pin_control_reg->pin_03 = MISO;
pin_control_reg->pin_04 = MOSI;

and so on, pin_xx being 2-bit wide bitfields, then according to the ARM
EABI spec each statement would be translated to a temp = *pin_control_reg;
temp &=...; temp |=...;  *pin_control_reg=temp; sequence. What I suggest
is that if you write

pin_control_reg->pin_03 = MISO, // Note the comma
pin_control_reg->pin_04 = MOSI,
pin_control_reg->pin_05 = SCLK,
pin_control_reg->pin_06 = SSEL;

and compile it with a -fcomma-combines-bitfields switch, then you get the
equivalent of the first code fragment where you manually combined the
masks and the settings and only a single load and a single store was used.

If the switch is not given or the consecutive assignments are not
separated by commas or the bitfields do not belong to the same word, then
the behaviour falls back to the default ARM EABI spec.
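
For comparison, the combined access can already be written by hand with bitfields by copying the register into a non-volatile temporary. The sketch below only illustrates that idea; the union pin_control_t, its field layout and the function codes MISO, MOSI, SCLK and SSEL are hypothetical, and the order of bitfields within the word is implementation-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2-bit function codes; real values come from the chip manual. */
enum { MISO = 1, MOSI = 2, SCLK = 3, SSEL = 1 };

/* Hypothetical pin control register: 16 pins, 2 bits each. */
typedef union {
    uint32_t word;                 /* whole register, for the single load/store */
    struct {
        uint32_t pin_00_02 : 6;    /* pins 0..2, not touched here */
        uint32_t pin_03    : 2;
        uint32_t pin_04    : 2;
        uint32_t pin_05    : 2;
        uint32_t pin_06    : 2;
        uint32_t pin_07_15 : 18;   /* pins 7..15, not touched here */
    } f;
} pin_control_t;

void assign_spi_pins(volatile pin_control_t *pin_control_reg)
{
    pin_control_t tmp;

    assert(pin_control_reg != NULL);

    tmp.word = pin_control_reg->word;   /* one volatile read */
    tmp.f.pin_03 = MISO,                /* plain, non-volatile updates */
    tmp.f.pin_04 = MOSI,
    tmp.f.pin_05 = SCLK,
    tmp.f.pin_06 = SSEL;
    pin_control_reg->word = tmp.word;   /* one volatile write */
}
```

This needs an extra union and a temporary for every register type, which is the boilerplate the suggested switch would remove.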

The advantage of the suggested behaviour is that it would allow the use of
the more elegant and expressive bitfields in place of the many hundreds of
#define REGNAME_FIELDNAME_MASK and #define REGNAME_FIELDNAME_SHIFT macros
that you can currently find in code that deals with HW. The suggestion
does not introduce any new functionality or performance advantage, it just
provides a way of writing (in my opinion) more readable and more
maintainable code than what we have now with all the #defines. The fact
that structure members live in their own namespace as opposed to the
global #define namespace is an added benefit, of course.

The suggested extension does not break backward compatibility, because the
#define stuff would not be affected and the ARM EABI is not yet
implemented anyway; it would not break the expected behaviour because it
becomes active only when an explicit command line switch is given and has
no side-effects outside the single expression where the subexpressions are
separated by commas.

The change, I believe, would benefit gcc users who deal with HW a lot,
i.e. low level embedded system and device driver designers. Outside of
that circle the suggested behavior would have only a little performance
benefit.

Zoltan



Re: arm-elf multilib issues

2009-10-01 Thread zoltan


On Thu, 1 Oct 2009, Paul Brook wrote:

> > Do we want to enable more multilibs in arm-elf?
>
> Almost certainly not. As far as I'm concerned arm-elf is obsolete, and in
> maintenance only mode. You should be using arm-eabi.

I'm possibly (probably?) wrong, but as far as I know, arm-eabi forces
alignment of 64-bit data (namely, doubles and long longs) to 8 byte
boundaries, which does not make sense on small 32-bit cores with 32-bit
buses and no caches (e.g. practically all ARM7TDMI based chips). Memory is
a scarce resource on those and wasting bytes for alignment with no
performance benefit is something that makes arm-eabi less attractive. Also,
as far as I know, passing such data to functions might cause some headache
because 64-bit arguments are aligned to even-numbered registers,
effectively forcing arguments to be passed on the stack
unnecessarily (memory access is rather expensive on a cache-less
ARM7TDMI). If you have to write assembly routines that take long long or
double arguments among other types, that forces you to shuffle registers
and fetch data from the stack. You lose code space, data space and CPU
cycles with absolutely nothing in return.
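
The data-space cost mentioned above is easy to measure on a given toolchain. The following is a small sketch with a made-up structure, struct sample; with 8-byte alignment of 64-bit types one would expect 4 bytes of padding, with 4-byte alignment none.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record mixing a 32-bit and a 64-bit member. */
struct sample {
    uint32_t tag;
    uint64_t stamp;
};

/* Bytes of padding the ABI inserted into struct sample. */
size_t sample_padding(void)
{
    size_t used = sizeof(uint32_t) + sizeof(uint64_t);

    assert(sizeof(struct sample) >= used);
    return sizeof(struct sample) - used;
}

/* Offset of the 64-bit member: 4 when it is only word aligned,
 * 8 when it is doubleword aligned. */
size_t stamp_offset(void)
{
    return offsetof(struct sample, stamp);
}
```

Printing the two values from builds made with each target shows the per-structure difference directly.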

For resource constrained embedded systems built around one of those
32-bit cores arm-elf is actually rather more attractive than arm-eabi.

Zoltan



Re: arm-elf multilib issues

2009-10-01 Thread zoltan
> Meh. Badly written code on antique hardware.
> I realise this sounds harsh, but in all seriousness if you take a bit of care

Yes, I think it does sound harsh, considering that, I believe, at least as
many chips are sold with ARM7TDMI core as the nice fat chips with MMU,
caches, 64 and 128 bit buses.

> (and common sense) you should get the alignment for free in pretty much all
> cases, and it can make a huge difference on ARMv5te cores.
> If you're being really pedantic then old-abi targets tend to pad all
> structures to a word boundary. I'd expect this to have much more
> detrimental overall effect than alignment of doubleword quantities,
> which in my experience are pretty rare to start with.

Well, I have to agree with the above.

Zoltan




Serious code generation/optimisation bug (I think)

2009-01-26 Thread zoltan
I was debugging a function, and inserting a debug statement crashed
the system. Some investigation revealed that gcc 4.3.2 arm-eabi (compiled
from sources) with -O2 under some circumstances assumes that if a pointer
is dereferenced, it cannot be NULL, and therefore explicit tests against
NULL can later be eliminated. Here is a short function that demonstrates
the erroneous behaviour:

extern void Debug( unsigned int x );

typedef struct s_head {

    struct s_head   *next;
    unsigned int    value;

} A_STRUCT;

void InsertByValue( A_STRUCT **queue, A_STRUCT *ptr )
{
A_STRUCT *tst;

    for ( tst = *queue ; ; queue = &tst->next, tst = *queue ) {

        // Debug( tst->value );

        if ( ! tst ) {
            ptr->next = (void *) 0;
            break;
        }

        if ( tst->value < ptr->value ) {
            ptr->next = tst;
            break;
        }
    }
    *queue = ptr;
}

Compiling this function with

arm-eabi-gcc -O2 -S foo.c

generates perfect code. However, if the Debug( tst->value ); is not
commented out, then the generated code looks like this:

InsertByValue:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
stmfd   sp!, {r4, r5, r6, lr}
mov r6, r0
ldr r4, [r0, #0]
mov r5, r1
b   .L3
.L2:
mov r6, r4
ldr r4, [r4, #0]
.L3:
ldr r0, [r4, #4]
bl  Debug
ldr r2, [r4, #4]
ldr r3, [r5, #4]
cmp r2, r3
bcs .L2
str r4, [r5, #0]
str r5, [r6, #0]
ldmfd   sp!, {r4, r5, r6, lr}
bx  lr

As you can see, when 'tst' is fetched to R4, it is not checked against
being 0 anywhere and the whole if ( ! tst ) { ... } bit is completely
eliminated from the code. Indeed, the actual compiled code crashes because
the loop does not stop when the end of the list is reached.

I know that you are not supposed to dereference a NULL pointer, however,
on the microcontroller I have it is perfectly legal: what you get is an
element of the exception vector table that resides at 0x0.

I don't think that the compiler has a right to remove my test, just
because it assumes that if I dereferenced a pointer then it surely was not
NULL. At least it should give me a warning (which it does not, not even
with -W -Wall -Wextra).
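
For reference, GCC has an option, -fno-delete-null-pointer-checks, that turns this inference off. Independently of any switch, the test survives if the debug call is placed after it, so that the pointer is never dereferenced before being compared against NULL. The sketch below shows that ordering; the Debug stub and the debug_calls counter are stand-ins added for this example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct s_head {
    struct s_head *next;
    unsigned int   value;
} A_STRUCT;

/* Stand-in for the real Debug(): only counts how often it is called. */
unsigned int debug_calls;

void Debug(unsigned int x)
{
    (void)x;
    debug_calls++;
}

void InsertByValue(A_STRUCT **queue, A_STRUCT *ptr)
{
    A_STRUCT *tst;

    assert(queue != NULL && ptr != NULL);

    for (tst = *queue; ; queue = &tst->next, tst = *queue) {

        /* Test first: nothing has dereferenced tst yet, so the
         * compiler has no basis for assuming it is non-NULL. */
        if (!tst) {
            ptr->next = NULL;
            break;
        }

        Debug(tst->value);      /* safe, tst is known to be non-NULL */

        if (tst->value < ptr->value) {
            ptr->next = tst;
            break;
        }
    }
    *queue = ptr;
}
```

This does not help when reading address 0 is intended, as on the microcontroller described above; for that case the command line option is the relevant tool.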

Zoltan




ARM compiler rewriting code to be longer and slower

2009-03-12 Thread zoltan
Using 4.4.0 gcc, I compiled a function and found it a tad long. The
command line is:

gcc -Os -mcpu=arm7tdmi-s -S func.c

although the output is pretty much the same with -O2 or -O3 as well (only
a few instructions longer).

The function is basically an unrolled 32 bit unsigned division by 1E9:

unsigned int divby1e9( unsigned int num, unsigned int *quotient )
{
unsigned int dig;
unsigned int tmp;

    tmp = 1000000000u;
    dig = 0;
    if ( num >= tmp ) {
        tmp <<= 2;
        if ( num >= tmp ) {
            num -= tmp;
            dig  = 4;
        }
        else {
            tmp >>= 1;
            if ( num >= tmp ) {
                num -= tmp;
                dig  = 2;
            }
            tmp >>= 1;
            if ( num >= tmp ) {
                num -= tmp;
                dig++;
            }
        }
    }
    *quotient = dig;
    return num;
}

The compiler generated the following code:

divby1e9:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
ldr r3, .L10
cmp r0, r3
movls   r3, #0
bls .L3
ldr r2, .L10+4
cmp r0, r2
addhi   r0, r0, #293601280
addhi   r0, r0, #1359872
addhi   r0, r0, #6144
movhi   r3, #4
bhi .L3
.L4:
ldr r2, .L10+8
cmp r0, r2
movls   r3, #0
bls .L6
add r0, r0, #-2013265920
add r0, r0, #13238272
add r0, r0, #27648
cmp r0, r3
movls   r3, #2
bls .L3
mov r3, #2
.L6:
add r0, r0, #-1006632960
add r0, r0, #6619136
add r0, r0, #13824
add r3, r3, #1
.L3:
str r3, [r1, #0]
bx  lr
.L11:
.align  2
.L10:
.word   999999999
.word   -294967297
.word   1999999999


Note that it is sub-optimal on two counts.

First, each loading of a constant takes 3 instructions and 3 clocks.
Storing the constant and fetching it using an ldr also takes 3 clocks but
only two 32-bit words and identical constants need to be stored only once.
The speed increase is only true on the ARM7TDMI-S, which has no caches, so
that's just a minor issue, but the memory saving is true no matter what
ARM core you have (note that -Os was specified).

Second, and this is the real problem, if the compiler did not want to be
overly clever and compiled the code as it was written, then instead of
loading the constants 4 times, at the cost of 3 instructions each, it could
have loaded it only once and then generated the next constants at the cost
of a single-word, single clock shift. The code would have been rather
shorter *and* faster, plus some of the jumps could have been eliminated.
Practically each C statement line (except the braces) corresponds to one
assembly instruction, so without being clever, just translating what's
written, it could be done in 20 words instead of 30.

Is it a problem that is worth putting onto bugzilla, or do I just have to do
some trickery to save the compiler from being smarter than it is?
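
One kind of trickery that might be tried, shown here only as a sketch, is an empty GCC inline asm statement that makes the value of tmp opaque to the optimiser, so the later constants have to be derived by shifting. The OPAQUE macro is a name made up for this example, 32-bit unsigned int is assumed, and whether the result is actually shorter depends on the compiler version.

```c
#include <assert.h>
#include <stddef.h>

/* GCC extension: an empty asm that claims to read and modify v, so the
 * compiler can no longer treat v as a known constant. */
#define OPAQUE(v) __asm__("" : "+r"(v))

unsigned int divby1e9(unsigned int num, unsigned int *quotient)
{
    unsigned int dig = 0;
    unsigned int tmp = 1000000000u;

    assert(quotient != NULL);

    OPAQUE(tmp);                /* hide the constant once */

    if (num >= tmp) {
        tmp <<= 2;              /* 4e9, still fits in 32 bits */
        if (num >= tmp) {
            num -= tmp;
            dig  = 4;
        } else {
            tmp >>= 1;          /* 2e9 */
            if (num >= tmp) {
                num -= tmp;
                dig  = 2;
            }
            tmp >>= 1;          /* 1e9 */
            if (num >= tmp) {
                num -= tmp;
                dig++;
            }
        }
    }
    *quotient = dig;
    return num;
}
```

The function returns the remainder and stores the quotient digit (0 to 4), exactly as the plain C version does; only the compiler's knowledge of tmp changes.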

Zoltan




Optimising for size

2008-07-13 Thread zoltan
Just a tentative question about a problem:

I have a piece of C code. The code, compiled to an ARM THUMB target using
gcc 4.0.2, with -Os results in 230 instructions. The exact same code,
using the exact same switches compiles to 437 instructions with gcc 4.3.1.
Considering that the compiler optimises to size and the much newer
compiler emits almost twice as much code as the old one, I think it is an
issue.

So the question is, how should I report it? It is not a bug as such, it is
a performance issue, but I think one that should be considered. Overall on
a source resulting in a 4000 insns long binary the newer version compiles
only about 150 instructions more than the old one, indicating that it
actually saved some space on pieces of the code other than the above
mentioned very sick case, but the savings weren't enough to compensate for
the 230 -> 437 instruction blowout.

Thanks,

Zoltan



Signed-unsigned comparison question

2019-03-07 Thread Zoltan Kocsi
Gcc 8.2.0 (arm-none-eabi) throws a warning on the following construct:

uint32_t a;
uint16_t b;

if ( a > b ) ...

complaining that a signed integer is compared against an unsigned.
Of course, it is correct, as 'b' was promoted to int.

But shouldn't it be smart enough to know that (int) b is restricted to
the range of [0,65535] which it can safely compare against the range of
[0,0xffffffffu]?

Thanks,

Zoltan


Re: Signed-unsigned comparison question

2019-03-07 Thread Zoltan Kocsi
Correction:

The construct gcc complains about is not

if ( a > b ) ...

but

if ( a > b - ( b >> 2 ) ) ...

but still the same applies. The RHS of the > operator can never be
negative or have an overflow on 32 bits.
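
A common way to express that range knowledge in the source is to convert the right-hand side explicitly. The sketch below assumes the comparison from this thread; the function name above_three_quarters is made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

bool above_three_quarters(uint32_t a, uint16_t b)
{
    /* b and (b >> 2) are promoted to int, so the difference is an int.
     * Since (b >> 2) <= b, it always lies in [0, 65535]. */
    int rhs = b - (b >> 2);

    assert(rhs >= 0 && rhs <= 65535);

    /* The explicit conversion makes both operands unsigned; the value
     * is unchanged because rhs is never negative. */
    return a > (uint32_t)rhs;
}
```

The cast states in code what the compiler could in principle derive from the operand types.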

On Fri, 8 Mar 2019 10:40:06 +1100
Zoltan Kocsi wrote:

> Gcc 8.2.0 (arm-none-eabi) throws a warning on the following construct:
> 
> uint32_t a;
> uint16_t b;
> 
> if ( a > b ) ...
> 
> complaining that a signed integer is compared against an unsigned.
> Of course, it is correct, as 'b' was promoted to int.
> 
> But shouldn't it be smart enough to know that (int) b is restricted to
> the range of [0,65535] which it can safely compare against the range
> of [0,0xffffffffu]?
> 
> Thanks,
> 
> Zoltan