On Thu, 1 Jun 2023 15:00:26 GMT, Andrew Haley <a...@openjdk.org> wrote:
> This comment and the next one both need correcting. They mention U_0HI and > U_1HI and, as the previous comment says, those registers are dead. > > What actually happens here is best summarized as > > // U_2:U_1:U_0 += (U2 >> 2) * 5 > > or, if we actually want to be clearer about the current encoding which does > it in several steps > > // rscratch1 = (U2 >> 2) // U2 = U2[1:0] // U_2:U_1:U_0 += rscratch1 // > U_2:U_1:U_0 += (rscratch1 << 2) > > i.e. any bits that are set from 130 upwards are masked off, treated as an > integer in their own right, multiplied by 5 and the result added back in at > the bottom to update the 130 bit result U2[1:0]:U1[63:0]:U0[63:0]. OK. > I'm not sure whether this provides an opportunity for you to optimize this by > doing the multiply by five earlier i.e. replace the code with this version I'm not sure either, which is why it's done in two separate steps. I think you may be right, but it's a bit late to be optimizing this version any further. That would require careful analysis and a redo of all the testing. > The obvious concern is that the multiply of rscratch1 by 5 might overflow 64 > bits. Is that why you have implemented two add and carry steps? Indeed. > If so then why is it legitimate to do the multiply by 5 up front in the final > reduction that follows the loop? I assume that you're referring to the multiply by 5 in // Further reduce modulo 2^130 - 5 __ lsr(rscratch1, U_2, 2); __ add(rscratch1, rscratch1, rscratch1, Assembler::LSL, 2); // rscratch1 = U_2 * 5 `U_2`, at this point, has only a few lower set bits. This is because `U_2` was previously ANDed with 3, and subsequently was the target of adc(U_2, U_2, zr). ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/14085#discussion_r1213382344