Re: RFR: 8350126: Regression ~3% on Crypto-ChaCha20Poly1305.encrypt for MacOSX aarch64 [v4]

2025-04-22 Thread Andrew Haley
On Mon, 21 Apr 2025 21:53:33 GMT, Jamil Nimeh wrote: >> This fix addresses a performance regression found on some aarch64 >> processors, namely the Apple M1, when we moved to a quarter round parallel >> implementation in JDK-8349106. After making some improvements in the >> ordering of the in

Re: RFR: 8350126: Regression ~3% on Crypto-ChaCha20Poly1305.encrypt for MacOSX aarch64 [v3]

2025-04-22 Thread Andrew Haley
On Mon, 21 Apr 2025 17:03:25 GMT, Jamil Nimeh wrote: > Hi @theRealAph, just wanted to check in and see if you were happy with the > function for the column/diagonal reassignments. Yes, it looks fine. I think we're good to go. - PR Comment: https://git.openjdk.org/jdk/pull/24420#is

Re: RFR: 8350126: Regression ~3% on Crypto-ChaCha20Poly1305.encrypt for MacOSX aarch64

2025-04-04 Thread Andrew Haley
On Thu, 3 Apr 2025 16:31:39 GMT, Jamil Nimeh wrote: > This fix addresses a performance regression found on some aarch64 processors, > namely the Apple M1, when we moved to a quarter round parallel implementation > in JDK-8349106. After making some improvements in the ordering of the > instruc

Re: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5]

2025-02-27 Thread Andrew Haley
On Tue, 25 Feb 2025 15:58:18 GMT, Ferenc Rakoczi wrote: >> Aha! >> >> >> aph@Andrews-MacBook-Pro ~ % as t.s >> t.s:1:19: error: expected 'sxtx' 'uxtx' or 'lsl' with optional integer in >> range [0, 4] >> sub x1, x10, x23, sxth #2 >> ^ >> aph@Andrews-MacBook-Pro ~ % as --ve

Re: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5]

2025-02-25 Thread Andrew Haley
On Tue, 25 Feb 2025 13:15:49 GMT, Andrew Haley wrote: >>> I have not found the place in the manual where it allows/encourages the use >>> of x instead of w, but I admit I haven't read through all of the 14568 >>> pages. >> >> Yes, you've

Re: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5]

2025-02-25 Thread Andrew Haley
On Tue, 25 Feb 2025 11:15:39 GMT, Ferenc Rakoczi wrote: >>> You might have to use an assembler from the latest binutils build (if the >>> system default isn't the latest) and add the path to the assembler in the >>> "AS" variable. Also you can run it something like - `python >>> aarch64-asmtes

Re: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5]

2025-02-25 Thread Andrew Haley
On Tue, 25 Feb 2025 13:14:52 GMT, Andrew Haley wrote: >> @theRealAlph, maybe we are not reading the same manual (ARM DDI 0487K.a). In >> my copy: >> SUB (extended register) is defined as >> SUB <Xd|SP>, <Xn|SP>, <R><m>{, <extend> {#<amount>}} >> and <R> should be W when <extend> is SXTH >> and the as I

Re: RFR: 8348561: Add aarch64 intrinsics for ML-DSA [v5]

2025-02-25 Thread Andrew Haley
On Mon, 24 Feb 2025 17:11:24 GMT, Andrew Dinn wrote: >> I have tried that, but the python script (actually the as command that it >> started) threw error messages: >> >> aarch64ops.s:338:24: error: index must be a multiple of 8 in range [0, >> 32760]. >> prfm PLDL1KEEP, [x15, 43] >>

Re: RFR: 8349106: Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64 [v2]

2025-02-04 Thread Andrew Haley
On Mon, 3 Feb 2025 23:56:18 GMT, Jamil Nimeh wrote: >> This enhancement makes a change to the ChaCha20 block function intrinsic on >> aarch64, moving away from the block parallel implementation and to the >> quarter-round parallel implementation that was done on x86_64. Assembly >> language p

Re: RFR: 8349106: Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64

2025-02-03 Thread Andrew Haley
On Mon, 3 Feb 2025 16:14:23 GMT, Jamil Nimeh wrote: > In terms of explaining the algorithm changes, I could add some comment text > to the header of the stub function that better explains the general idea > behind what is being done. It would certainly help anyone maintaining it down > the lin

Re: RFR: 8349106: Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64

2025-02-03 Thread Andrew Haley
On Fri, 31 Jan 2025 16:48:09 GMT, Jamil Nimeh wrote: > This enhancement makes a change to the ChaCha20 block function intrinsic on > aarch64, moving away from the block parallel implementation and to the > quarter-round parallel implementation that was done on x86_64. Assembly > language prof

Re: RFR: 8344766: AES/CTR slow at big payloads [v2]

2024-11-27 Thread Andrew Haley
On Wed, 27 Nov 2024 13:49:51 GMT, Jatin Bhateja wrote: >> src/java.base/share/classes/com/sun/crypto/provider/CounterMode.java line 57: >> >>> 55: >>> 56: // chunkSize is a multiple of block size and used to divide up >>> 57: // input data to trigger the intrinsic. >> >> This comment l

Re: RFR: 8341903: Implementation of Scoped Values (Fourth Preview) [v2]

2024-10-10 Thread Andrew Haley
> The fourth preview of scoped values. Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: Fix javadoc - Changes: - all: https://git.openjdk.org/jdk/pull/21456/files - new: https://git.openjdk.org/jdk/pull/21456/fi

RFR: 8341903: Implementation of Scoped Values (Fourth Preview)

2024-10-10 Thread Andrew Haley
The fourth preview of scoped values. - Commit messages: - Scoped Values API changes - Scoped Values API changes - Scoped Values API changes Changes: https://git.openjdk.org/jdk/pull/21456/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21456&range=00 Issue: https://bugs

Re: RFR: 8330842: Support AES CBC with Ciphertext Stealing (CTS) in SunPKCS11

2024-04-24 Thread Andrew Haley
On Mon, 22 Apr 2024 18:31:37 GMT, Francisco Ferrari Bihurriet wrote: > Hi, > > I would like to propose an implementation to support AES CBC with Ciphertext > Stealing (CTS) in SunPKCS11, according to what has been specified in > [JDK-8330843 CSR](https://bugs.openjdk.org/browse/JDK-8330843).

Integrated: 8296411: AArch64: Accelerated Poly1305 intrinsics

2023-06-02 Thread Andrew Haley
On Mon, 22 May 2023 14:23:15 GMT, Andrew Haley wrote: > This provides a solid speedup of about 3-4x over the Java implementation. > > I have a vectorized version of this which uses a bunch of tricks to speed it > up, but it's complex and can still be improved. We're

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v5]

2023-06-01 Thread Andrew Haley
1305DigestBench.updateBytes 1048576 thrpt 3 > 2587.132 ± 40.240 ops/s > > Benchmark (dataSize) (provider) Mode Cnt > Score Error Units > Poly1305DigestBench.updateBytes 64 thrpt 3

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

2023-06-01 Thread Andrew Haley
On Thu, 1 Jun 2023 15:00:26 GMT, Andrew Haley wrote: > This comment and the next one both need correcting. They mention U_0HI and > U_1HI and, as the previous comment says, those registers are dead. > > What actually happens here is best summarized as > > // U_2:U_1:U_0 += (U

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

2023-06-01 Thread Andrew Haley
On Thu, 1 Jun 2023 12:16:45 GMT, Andrew Dinn wrote: > This comment and the next one both need correcting. They mention U_0HI and > U_1HI and, as the previous comment says, those registers are dead. > > What actually happens here is best summarized as > > // U_2:U_1:U_0 += (U2 >> 2) * 5 > > or, i
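
The reduction step quoted above, `U_2:U_1:U_0 += (U2 >> 2) * 5`, relies on the fact that 2^130 is congruent to 5 modulo the Poly1305 prime 2^130 - 5. What follows is only a minimal Python sketch of that arithmetic, not the code from the patch: the function names, the 64-bit limb layout, and the explicit masking of U_2 down to its low two bits are assumptions made for illustration.

```python
# Poly1305 works modulo P = 2^130 - 5, so 2^130 is congruent to 5 (mod P).
P = (1 << 130) - 5
MASK64 = (1 << 64) - 1


def limbs_to_int(u0, u1, u2):
    """Combine three limbs (u0 lowest) into one integer; u0 and u1 are 64-bit."""
    return (u2 << 128) | (u1 << 64) | u0


def partial_reduce(u0, u1, u2):
    """Fold the bits of weight 2^130 and above back into the low limbs.

    Hypothetical helper: it models U_2:U_1:U_0 += (U_2 >> 2) * 5 and
    assumes U_2 is first masked to the two bits that lie below 2^130.
    """
    high = u2 >> 2                      # everything at 2^130 and above
    low = limbs_to_int(u0, u1, u2 & 3)  # everything below 2^130
    total = low + high * 5              # each unit of 2^130 is worth 5 mod P
    return total & MASK64, (total >> 64) & MASK64, total >> 128


if __name__ == "__main__":
    u0, u1, u2 = 0xFFFFFFFFFFFFFFFF, 0x0123456789ABCDEF, 0x1F
    reduced = partial_reduce(u0, u1, u2)
    # The result is congruent to the input modulo P but is not fully reduced.
    print(limbs_to_int(u0, u1, u2) % P == limbs_to_int(*reduced) % P)
```

The result is only partially reduced: it is congruent to the input modulo P and fits in a little over 130 bits, which is why a final conditional subtraction of P would still be needed wherever a canonical value is required.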

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

2023-05-26 Thread Andrew Haley
On Wed, 24 May 2023 19:16:36 GMT, Claes Redestad wrote: > Thanks for your patience in answering my questions and addressing my comments. Thank you for asking questions that made the patch better, and even removed an instruction in what I thought was a tightly-written intrinsic! -

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v4]

2023-05-24 Thread Andrew Haley
1305DigestBench.updateBytes 1048576 thrpt 3 > 2587.132 ± 40.240 ops/s > > Benchmark (dataSize) (provider) Mode Cnt > Score Error Units > Poly1305DigestBench.updateBytes 64 thrpt 3

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v2]

2023-05-24 Thread Andrew Haley
On Wed, 24 May 2023 13:39:10 GMT, Claes Redestad wrote: >> See https://loup-vaillant.fr/tutorials/poly1305-design for more explanation > > Thanks for the link! > > So `r` refers to the value passed via `r_start` and it wasn't clear from the > immediate context that `r_start` is already split i

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v2]

2023-05-24 Thread Andrew Haley
On Wed, 24 May 2023 10:18:39 GMT, Andrew Haley wrote: >> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7097: >> >>> 7095: // together partial products without any risk of needing to >>> 7096: // propagate a carry out. >>> 70

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v3]

2023-05-24 Thread Andrew Haley
1305DigestBench.updateBytes 1048576 thrpt 3 > 2587.132 ± 40.240 ops/s > > Benchmark (dataSize) (provider) Mode Cnt > Score Error Units > Poly1305DigestBench.updateBytes 64 thrpt 3

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v2]

2023-05-24 Thread Andrew Haley
On Wed, 24 May 2023 10:07:47 GMT, Claes Redestad wrote: >> Andrew Haley has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Whitespace > > src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 7097: > &

Re: RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics [v2]

2023-05-24 Thread Andrew Haley
1305DigestBench.updateBytes 1048576 thrpt 3 > 2587.132 ± 40.240 ops/s > > Benchmark (dataSize) (provider) Mode Cnt > Score Error Units > Poly1305DigestBench.updateBytes 64 thrpt 3

RFR: 8296411: AArch64: Accelerated Poly1305 intrinsics

2023-05-22 Thread Andrew Haley
This provides a solid speedup of about 3-4x over the Java implementation. I have a vectorized version of this which uses a bunch of tricks to speed it up, but it's complex and can still be improved. We're getting close to ramp down, so I'm submitting this simple intrinsic so that we can get it r

Re: RFR: 8296507: GCM using more memory than necessary with in-place operations [v7]

2022-12-08 Thread Andrew Haley
On Tue, 6 Dec 2022 20:29:56 GMT, Anthony Scarpino wrote: >> I would like a review of an update to the GCM code. A recent report showed >> that GCM memory usage for TLS was very large. This was a result of in-place >> buffers, which TLS uses, and how the code handled the combined intrinsic >>

Re: RFR: JDK-8286666: JEP 429: Implementation of Scoped Values (Incubator) [v3]

2022-11-15 Thread Andrew Haley
> JEP 429 implementation. Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: Reviewer feedback - Changes: - all: https://git.openjdk.org/jdk/pull/10952/files - new: https://git.openjdk.org/jdk/pull/10952/files/4bd44

Re: RFR: JDK-8286666: JEP 429: Implementation of Scoped Values (Incubator) [v2]

2022-11-15 Thread Andrew Haley
> JEP 429 implementation. Andrew Haley has updated the pull request incrementally with two additional commits since the last revision: - Update test - Reviewer feedback - Changes: - all: https://git.openjdk.org/jdk/pull/10952/files - new: https://git.openjdk.org/jdk/p

Re: RFR: JDK-8286666: JEP 429: Implementation of Scoped Values (Incubator)

2022-11-15 Thread Andrew Haley
On Wed, 2 Nov 2022 16:23:34 GMT, Andrew Haley wrote: > JEP 429 implementation. src/jdk.incubator.concurrent/share/classes/jdk/incubator/concurrent/ScopedValue.java line 475: > 473: // ??? Do we want to search cache for this? In most cases we > don't expect > 474

Re: RFR: JDK-8286666: JEP 429: Implementation of Scoped Values (Incubator)

2022-11-15 Thread Andrew Haley
On Mon, 14 Nov 2022 17:34:31 GMT, Alan Bateman wrote: >> JEP 429 implementation. > > src/java.base/share/classes/java/lang/VirtualThread.java line 318: > >> 316: } >> 317: } >> 318: @Hidden > > Can we rename this to runWith(Runnable, Object) in both Thread and > VirtualThread t

Re: RFR: JDK-8286666: JEP 429: Implementation of Scoped Values (Incubator)

2022-11-15 Thread Andrew Haley
On Thu, 3 Nov 2022 11:50:17 GMT, ExE Boss wrote: >> JEP 429 implementation. > > src/java.base/share/classes/java/lang/Thread.java line 1610: > >> 1608: ensureMaterializedForStackWalk(bindings); >> 1609: task.run(); >> 1610: Reference.reachabilityFence(bindings

Re: RFR: JDK-8286666: JEP 429: Implementation of Scoped Values (Incubator)

2022-11-15 Thread Andrew Haley
On Thu, 10 Nov 2022 17:42:38 GMT, Andrew Haley wrote: >> src/hotspot/share/prims/jvm.cpp line 1410: >> >>> 1408: loc = 3; >>> 1409: } else if (method == resolver.thread_run_method) { >>> 1410: loc = 2; >> >> This depends

Re: RFR: JDK-8286666: JEP 429: Implementation of Scoped Values (Incubator)

2022-11-15 Thread Andrew Haley
On Fri, 4 Nov 2022 23:17:32 GMT, Dean Long wrote: >> JEP 429 implementation. > > src/hotspot/share/prims/jvm.cpp line 1410: > >> 1408: loc = 3; >> 1409: } else if (method == resolver.thread_run_method) { >> 1410: loc = 2; > > This depends on how javac numbers locals, right? It

RFR: JDK-8286666: JEP 429: Implementation of Scoped Values (Incubator)

2022-11-15 Thread Andrew Haley
JEP 429 implementation. - Commit messages: - Update StressStackOverflow - Release _scopedValueCache after use - Merge branch 'JDK-828' of https://github.com/theRealAph/jdk into JDK-828 - Update src/java.base/share/classes/jdk/internal/vm/ScopedValueContainer.java - Renam

Re: RFR: 8295146: Clean up native code with newer C/C++ language features [v2]

2022-11-14 Thread Andrew Haley
On Mon, 14 Nov 2022 08:28:04 GMT, Thomas Stuefe wrote: > unfortunately, your patch will make backporting more difficult. We cannot > downport it to older releases compiled with older compilers. But since it > touches a lot of files it will sit smack in the middle of patch sequences, > requirin

Re: RFR: 8247645: ChaCha20 intrinsics

2022-11-06 Thread Andrew Haley
On Thu, 18 Aug 2022 14:43:51 GMT, Jamil Nimeh wrote: >> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4306: >> >>> 4304: __ subs(loopCtr, loopCtr, 1); >>> 4305: __ cmp(loopCtr, (u1)0); >>> 4306: __ br(Assembler::NE, L_twoRounds); >> >> Same thing about subs-cmp0-bne. > > Th

Re: RFR: 8247645: ChaCha20 intrinsics

2022-11-06 Thread Andrew Haley
On Fri, 2 Sep 2022 16:52:02 GMT, Jamil Nimeh wrote: >> src/hotspot/cpu/aarch64/assembler_aarch64.hpp line 2521: >> >>> 2519: #undef INSN3 >>> 2520: #undef INSN4 >>> 2521: >> >> This code to handle the AdvSIMD load/store single structure and AdvSIMD >> load/store single structure (post-indexed

Re: RFR: 8247645: ChaCha20 intrinsics

2022-11-06 Thread Andrew Haley
On Fri, 4 Mar 2022 16:47:54 GMT, Jamil Nimeh wrote: > This PR delivers ChaCha20 intrinsics that accelerate the core block function > that generates key stream from the key, counter and nonce. Intrinsics have > been written for the following platforms and instruction sets: > > - x86_64: AVX, A

Re: Private Keys are cached "forever" leading to inop HTTP-TLS-servers

2022-06-21 Thread Andrew Haley
On 6/16/22 21:02, Lothar Kimmeringer wrote: > If they are allowed to become unuseable (as explained, I see that as > something that is to be expected IRL) I don't think they are. There is nothing in PKCS#11 that gives an implementation any permission to time out. -- Andrew Haley (he/him)