On Fri, 2 Jun 2023 09:58:59 GMT, Andrew Dinn wrote:
>> Yes, of course, you are right that 0 <= U_2 < 6 at the point where that
>> second multiply by 5 occurs (i.e. after the loop).
>>
>> I believe it is safe to use the same optimization inside the loop for
>> reasons given below. Of course it
On Fri, 2 Jun 2023 09:51:57 GMT, Andrew Dinn wrote:
>>> This comment and the next one both need correcting. They mention U_0HI and
>>> U_1HI and, as the previous comment says, those registers are dead.
>>>
>>> What actually happens here is best summarized as
>>>
>>> // U_2:U_1:U_0 += (U_2 >> 2) *
On Thu, 1 Jun 2023 16:06:40 GMT, Andrew Haley wrote:
>> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7135:
>>
>>> 7133: regs = (regs.remaining() + U_0HI + U_1HI).begin();
>>> 7134:
>>> 7135: // U_2:U_1:U_0 += (U_1HI >> 2)
>>
>> This comment and the next one both need corr
On Thu, 1 Jun 2023 15:00:26 GMT, Andrew Haley wrote:
> This comment and the next one both need correcting. They mention U_0HI and
> U_1HI and, as the previous comment says, those registers are dead.
>
> What actually happens here is best summarized as
>
> // U_2:U_1:U_0 += (U_2 >> 2) * 5
>
> or,
On Thu, 1 Jun 2023 12:16:45 GMT, Andrew Dinn wrote:
> This comment and the next one both need correcting. They mention U_0HI and
> U_1HI and, as the previous comment says, those registers are dead.
>
> What actually happens here is best summarized as
>
> // U_2:U_1:U_0 += (U_2 >> 2) * 5
>
> or, i
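
For readers following the review comments above, the fold described by "U_2:U_1:U_0 += (U_2 >> 2) * 5" can be sketched in portable C++. It relies on 2^130 being congruent to 5 modulo 2^130 - 5, so any bits at or above 2^130 can be multiplied by 5 and added back into the low limbs. This is only an illustration, not the stub code from the patch: the `Acc` struct and `partial_reduce` function are hypothetical names, and clearing U_2 down to its low two bits after the fold is an assumption about what accompanies the addition.

```cpp
#include <cstdint>

// Hypothetical 3-limb accumulator (not from the patch):
// value = u2 * 2^128 + u1 * 2^64 + u0.
struct Acc {
    std::uint64_t u0;
    std::uint64_t u1;
    std::uint64_t u2;
};

// Partially reduce modulo 2^130 - 5 by folding the bits at and above
// 2^130 back into the low limbs: 2^130 is congruent to 5 (mod 2^130 - 5).
// Assumes u2 is small enough that (u2 >> 2) * 5 fits in 64 bits.
Acc partial_reduce(Acc a) {
    std::uint64_t hi = a.u2 >> 2;   // the part of the value at or above 2^130
    std::uint64_t fold = hi * 5;    // its equivalent below 2^130

    Acc r;
    r.u0 = a.u0 + fold;                          // add into the low limb
    std::uint64_t c0 = (r.u0 < fold) ? 1 : 0;    // carry out of u0
    r.u1 = a.u1 + c0;                            // propagate into u1
    std::uint64_t c1 = (r.u1 < c0) ? 1 : 0;      // carry out of u1
    r.u2 = (a.u2 & 3) + c1;                      // keep bits 128..129, add carry
    return r;
}
```

The result is congruent to the input modulo 2^130 - 5 but is not necessarily fully reduced; a final conditional subtraction would still be needed to bring it into canonical range.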
On Wed, 24 May 2023 16:17:14 GMT, Andrew Haley wrote:
>> This provides a solid speedup of about 3-4x over the Java implementation.
>>
>> I have a vectorized version of this which uses a bunch of tricks to speed it
>> up, but it's complex and can still be improved. We're getting close to ramp
>
On Wed, 24 May 2023 19:16:36 GMT, Claes Redestad wrote:
> Thanks for your patience in answering my questions and addressing my comments.
Thank you for asking questions that made the patch better, and even removed an
instruction in what I thought was a tightly-written intrinsic!
-
On Wed, 24 May 2023 16:17:14 GMT, Andrew Haley wrote:
>> This provides a solid speedup of about 3-4x over the Java implementation.
>>
>> I have a vectorized version of this which uses a bunch of tricks to speed it
>> up, but it's complex and can still be improved. We're getting close to ramp
>
> This provides a solid speedup of about 3-4x over the Java implementation.
>
> I have a vectorized version of this which uses a bunch of tricks to speed it
> up, but it's complex and can still be improved. We're getting close to ramp
> down, so I'm submitting this simple intrinsic so that we ca