On Fri, 2 Jun 2023 09:58:59 GMT, Andrew Dinn <ad...@openjdk.org> wrote:
>> Yes, of course, you are right that 0<= U_2 < 6 at the point where that >> second multiply by 5 occurs (i.e. after the loop). >> >> I believe it is safe to use the same optimization inside the loop for >> reasons given below. Of course it is a bit late to change this now and >> retest but if my reasoning is correct then we could consider updating this >> post release and, maybe, a backport. >> >> The only thing that needs to be determined is what value could sit in U2 >> when we enter the loop. That's the only important case because we already >> agreed that at the loop back edge that 0 <= U2 < 6. >> >> The incoming value for U2 at loop entry is derived by the following >> subsequence of the instruction stream >> >> __ adcs(S_1, U_1, S_1); >> __ adc(S_2, U_2, zr); // A.1 >> __ add(S_2, S_2, 1); // A.2 >> . . . >> wide_mul(U_1, U_1HI, S_0, R_1); wide_madd(U_1, U_1HI, S_1, R_0); >> wide_madd(U_1, U_1HI, S_2, RR_1); // B >> . . . >> __ andr(U_2, R_0, 3); // C >> __ mul(U_2, S_2, U_2); // D >> . . . >> >> __ adc(U_2, U_1HI, U_2); // E >> >> At A.1 we know that 0 <= U_2 <= 3 (since it was initialized by unpack26) >> So, at A.2 we know that 0 <= S2 <= 5 >> >> At B we know that 0 <= RR_1 <= (2^60 - 2^2) = FFFFFFF_FFFFFFFC (top 4 and >> bottom 2 bits of RR_1 are clear) >> So 0 <= U1_HI < 5 * FFFFFFF_FFFFFFFC = 4FFFFFFF_FFFFFFEC >> >> At C we know 0 <= U_2 <= 3 >> >> At D we know 0 <= U_2 <= 15 >> >> So at E we know that 0 <= U_2 <= 4FFFFFFF_FFFFFFEC + 15 + 1 >> >> So, the highest possible value for U_2 at loop entry is 50000000_00000002. >> >> Clearly we can shift this down by two and add without any danger of >> overflowing >> >> 50000000_00000002 >> 2 + 50000000_00000002 = 64000000_00000002 > > Ah, no scratch that. I have made a wrong assumption at B. The value of U1_HI > is bounded by the sum of the 3 64 bit * 64 bit multiplies. I think there is still a proof of validity to be salvaged though. We compute a 128 bit product: U1_HI:U1 = S_0 * R_1 + S_1 * R_0 + S_2 * RR_1 We know that R_0 and R_1 have four top bits clear and S2 <= 5. So, I think we can guarantee that the top word of the 128 bit product is small enough to not to overflow when we do the shift. Even if we assume S_0, S_1 and RR_1 have the maximum possible value we have S_0 * R_1 <= (2^64 - 1) * (2^60 - 1) S_1 * R_0 <= (2^64 - 1) * (2^60 - 1) S_2 * RR_1 <= 5 * (2^64 - 1) So, the top word U1_HI is bounded by at most (2 * (2^60 - 1)) + 5. That leaves more than enough room to guarantee that the shift and add will not overflow. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1214230004