On Fri, 2 Jun 2023 09:51:57 GMT, Andrew Dinn <ad...@openjdk.org> wrote:
>>> This comment and the next one both need correcting. They mention U_0HI and >>> U_1HI and, as the previous comment says, those registers are dead. >>> >>> What actually happens here is best summarized as >>> >>> // U_2:U_1:U_0 += (U2 >> 2) * 5 >>> >>> or, if we actually want to be clearer about the current encoding which does >>> it in several steps >>> >>> // rscratch1 = (U2 >> 2) // U2 = U2[1:0] // U_2:U_1:U_0 += rscratch1 // >>> U_2:U_1:U_0 += (rscratch1 << 2) >>> >>> i.e. any bits that are set from 130 upwards are masked off, treated as an >>> integer in their own right, multiplied by 5 and the result added back in at >>> the bottom to update the 130 bit result U2[1:0]:U1[63:0]:U0[63:0]. >> >> OK. >> >>> I'm not sure whether this provides an opportunity for you to optimize this >>> by doing the multiply by five earlier i.e. replace the code with this >>> version >> >> I'm not sure either, which is why it's done in two separate steps. I think >> you may be right, but it's a bit late to be optimizing this version any >> further. That would require careful analysis and a redo of all the testing. >> >>> The obvious concern is that the multiply of rscratch1 by 5 might overflow >>> 64 bits. Is that why you have implemented two add and carry steps? >> >> Indeed. >> >>> If so then why is it legitimate to do the multiply by 5 up front in the >>> final reduction that follows the loop? >> >> I assume that you're referring to the multiply by 5 in >> >> >> // Further reduce modulo 2^130 - 5 >> >> >> >> >> __ lsr(rscratch1, U_2, 2); >> __ add(rscratch1, rscratch1, rscratch1, Assembler::LSL, 2); // rscratch1 >> = U_2 * 5 >> >> >> >> >> >> `U_2`, at this point, has only a few lower set bits. This is because `U_2` >> was previously ANDed with 3, and subsequently twice was the target of >> adc(U_2, U_2, zr). So I think that `U_2 <= 6`. > > Yes, of course, you are right that 0<= U_2 < 6 at the point where that > second multiply by 5 occurs (i.e. after the loop). > > I believe it is safe to use the same optimization inside the loop for reasons > given below. Of course it is a bit late to change this now and retest but if > my reasoning is correct then we could consider updating this post release > and, maybe, a backport. > > The only thing that needs to be determined is what value could sit in U2 when > we enter the loop. That's the only important case because we already agreed > that at the loop back edge that 0 <= U2 < 6. > > The incoming value for U2 at loop entry is derived by the following > subsequence of the instruction stream > > __ adcs(S_1, U_1, S_1); > __ adc(S_2, U_2, zr); // A.1 > __ add(S_2, S_2, 1); // A.2 > . . . > wide_mul(U_1, U_1HI, S_0, R_1); wide_madd(U_1, U_1HI, S_1, R_0); > wide_madd(U_1, U_1HI, S_2, RR_1); // B > . . . > __ andr(U_2, R_0, 3); // C > __ mul(U_2, S_2, U_2); // D > . . . > > __ adc(U_2, U_1HI, U_2); // E > > At A.1 we know that 0 <= U_2 <= 3 (since it was initialized by unpack26) > So, at A.2 we know that 0 <= S2 <= 5 > > At B we know that 0 <= RR_1 <= (2^60 - 2^2) = FFFFFFF_FFFFFFFC (top 4 and > bottom 2 bits of RR_1 are clear) > So 0 <= U1_HI < 5 * FFFFFFF_FFFFFFFC = 4FFFFFFF_FFFFFFEC > > At C we know 0 <= U_2 <= 3 > > At D we know 0 <= U_2 <= 15 > > So at E we know that 0 <= U_2 <= 4FFFFFFF_FFFFFFEC + 15 + 1 > > So, the highest possible value for U_2 at loop entry is 50000000_00000002. > > Clearly we can shift this down by two and add without any danger of > overflowing > > 50000000_00000002 >> 2 + 50000000_00000002 = 64000000_00000002 Ah, no scratch that. I have made a wrong assumption at B. The value of U1_HI is bounded by the sum of the 3 multiplies. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1214174121