Hi,
Here is the new version of the patch
I have fixed two errors in the previous version,
       on mips32 the compiler could not expand division and terminated
with ICE, this change fixed the issue:
      /* Extrating the higher part of the sum */
      tmp = gen_highpart (SImode, tmp);
      tmp = force_reg (SImode, tmp);
+      tmp = force_reg (SImode, tmp);
+      tmp = convert_to_mode (DImode, tmp, 1);

       and another error on i686 and mips32r2: I found that overflow
check in multiplication was not working. For example
0xfe34b4190a392b23/257 produced wrong result. Following change
resolved the issue:
       -emit_store_flag_force (c, GT, u0, tmp, DImode, 1, -1);
       +emit_store_flag_force (c, GT, u0, tmp, DImode, 1, 1);
Tested this new version of the patch on -none-linux-gnu with arm-7l,
mips-32r2 (74k), i686 without new regressions.

On Thu, May 3, 2012 at 5:40 PM, Richard Earnshaw <rearn...@arm.com> wrote:
> On 03/05/12 11:27, Dinar Temirbulatov wrote:
>
>
> This clearly needs more work.
>
> Comments:  Need to end with a period and two spaces.
> Binary Operators: Need to be surrounded with white space.
sorry for this, hope I resolved such issues with the new version.
>
> As Andrew Haley has already said, some documentation of the algorithm is
> needed.
General documentation for the issue could be found here
gmplib.org/~tege/divcnst-pldi94.pdf.
Multiplication to get higher 128-bit was developed by me and Alexey
Kravets, I attached C version of the algorithm.
>
> Why is this restricted to BITS_PER_WORD == 32?
I am checking here that we are generating code for 32-bit target with
32-bit wide general propose registers, and with 64-bit wide registers
usually there is an instruction to get the higher 64-bit of 64x64-bit
multiplication cheaply.

>
> Costs: This code clearly needs to understand the relative cost of doing
> division this way; there is a limit to the amount of inline expansion
> that we should tolerate, particularly at -O2 and it's not clear that
> this will be much better than a library call if we don't have a widening
> multiply operation (as is true for older ARM chips, and presumably some
> other CPUs).  In essence, you need to work out the cost of a divide
> instruction (just as rtx costs for this) and the approximate cost of the
> expanded algorithm.
Added cost calculation.
>
> Another issue that worries me is the number of live DImode values; on
> machines with few registers this could cause excessive spilling to start
> happening.
The cost calculation suppose to take this into account.
>
> I also wonder whether it would be beneficial to generate custom
> functions for division by specific constants (using this algorithm) and
> then call those functions rather than the standard lib-call.  On ELF
> systems the functions can marked as hidden and put into common sections
> so that we only end up with one instance of each function in a program.
yes, I think it is a good approach, I could redo my patch with such
intrinsic function implementation with pre-shift, post-shift, and
64-bit constant as function parameters.
>
> Finally, do you have a copyright assignment with the FSF?  We can't use
> this code without one.
yes, I do have a copyright assignment with the FSF.
Also I am in process of implementing signed integer 64-bit division by constant.

                    thanks, Dinar.

Attachment: 18.patch
Description: Binary data

unsigned long long mul(unsigned long long a, unsigned long long b)
{
        unsigned long long x1=a>>32;
        unsigned long long x0=a&0x00000000ffffffff;
        unsigned long long y1=b>>32;
        unsigned long long y0=b&0x00000000ffffffff;
        unsigned long long z2, z0, tmp, u0, u1;
        unsigned char c=0;
        z2=x1*y1;
        z0=x0*y0;
        u0=x0*y1+(z0>>32);
        u1=x1*y0;
        tmp = (u0+u1);
        c = (tmp < u0) || (tmp < u1);
        return z2+(tmp>>32)+(((unsigned long long)c)<<32);
}

Reply via email to