Hi, Here is the new version of the patch I have fixed two errors in the previous version, on mips32 the compiler could not expand division and terminated with ICE, this change fixed the issue: /* Extrating the higher part of the sum */ tmp = gen_highpart (SImode, tmp); tmp = force_reg (SImode, tmp); + tmp = force_reg (SImode, tmp); + tmp = convert_to_mode (DImode, tmp, 1);
and another error on i686 and mips32r2: I found that overflow check in multiplication was not working. For example 0xfe34b4190a392b23/257 produced wrong result. Following change resolved the issue: -emit_store_flag_force (c, GT, u0, tmp, DImode, 1, -1); +emit_store_flag_force (c, GT, u0, tmp, DImode, 1, 1); Tested this new version of the patch on -none-linux-gnu with arm-7l, mips-32r2 (74k), i686 without new regressions. On Thu, May 3, 2012 at 5:40 PM, Richard Earnshaw <rearn...@arm.com> wrote: > On 03/05/12 11:27, Dinar Temirbulatov wrote: > > > This clearly needs more work. > > Comments: Need to end with a period and two spaces. > Binary Operators: Need to be surrounded with white space. sorry for this, hope I resolved such issues with the new version. > > As Andrew Haley has already said, some documentation of the algorithm is > needed. General documentation for the issue could be found here gmplib.org/~tege/divcnst-pldi94.pdf. Multiplication to get higher 128-bit was developed by me and Alexey Kravets, I attached C version of the algorithm. > > Why is this restricted to BITS_PER_WORD == 32? I am checking here that we are generating code for 32-bit target with 32-bit wide general propose registers, and with 64-bit wide registers usually there is an instruction to get the higher 64-bit of 64x64-bit multiplication cheaply. > > Costs: This code clearly needs to understand the relative cost of doing > division this way; there is a limit to the amount of inline expansion > that we should tolerate, particularly at -O2 and it's not clear that > this will be much better than a library call if we don't have a widening > multiply operation (as is true for older ARM chips, and presumably some > other CPUs). In essence, you need to work out the cost of a divide > instruction (just as rtx costs for this) and the approximate cost of the > expanded algorithm. Added cost calculation. > > Another issue that worries me is the number of live DImode values; on > machines with few registers this could cause excessive spilling to start > happening. The cost calculation suppose to take this into account. > > I also wonder whether it would be beneficial to generate custom > functions for division by specific constants (using this algorithm) and > then call those functions rather than the standard lib-call. On ELF > systems the functions can marked as hidden and put into common sections > so that we only end up with one instance of each function in a program. yes, I think it is a good approach, I could redo my patch with such intrinsic function implementation with pre-shift, post-shift, and 64-bit constant as function parameters. > > Finally, do you have a copyright assignment with the FSF? We can't use > this code without one. yes, I do have a copyright assignment with the FSF. Also I am in process of implementing signed integer 64-bit division by constant. thanks, Dinar.
18.patch
Description: Binary data
unsigned long long mul(unsigned long long a, unsigned long long b) { unsigned long long x1=a>>32; unsigned long long x0=a&0x00000000ffffffff; unsigned long long y1=b>>32; unsigned long long y0=b&0x00000000ffffffff; unsigned long long z2, z0, tmp, u0, u1; unsigned char c=0; z2=x1*y1; z0=x0*y0; u0=x0*y1+(z0>>32); u1=x1*y0; tmp = (u0+u1); c = (tmp < u0) || (tmp < u1); return z2+(tmp>>32)+(((unsigned long long)c)<<32); }