Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access

Aleksey Shipilev Fri, 26 May 2023 01:43:41 -0700

On Fri, 26 May 2023 00:13:57 GMT, Andrei Pangin <[email protected]> wrote:


>> UUID is the very important class that is used to track identities of objects 
>> in large scale systems. On some of our systems, `UUID.randomUUID` takes >1% 
>> of total CPU time, and is frequently a scalability bottleneck due to 
>> `SecureRandom` synchronization.
>> 
>> The major issue with UUID code itself is that it reads from the single 
>> `SecureRandom` instance by 16 bytes. So the heavily contended `SecureRandom` 
>> is bashed with very small requests. This also has a chilling effect on other 
>> users of `SecureRandom`, when there is a heavy UUID generation traffic.
>> 
>> We can improve this by doing the bulk reads from the backing SecureRandom 
>> and possibly striping the reads across many instances of it. 
>> 
>> 
>> Benchmark               Mode  Cnt  Score   Error   Units
>> 
>> ### AArch64 (m6g.4xlarge, Graviton, 16 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  3.545 ± 0.058  ops/us
>> UUIDRandomBench.max     thrpt   15  1.832 ± 0.059  ops/us ; negative scaling
>> 
>> # After
>> UUIDRandomBench.single  thrpt   15  4.421 ± 0.047  ops/us 
>> UUIDRandomBench.max     thrpt   15  6.658 ± 0.092  ops/us ; positive 
>> scaling, ~1.5x
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before
>> UUIDRandomBench.single  thrpt   15  2.710 ± 0.038  ops/us
>> UUIDRandomBench.max     thrpt   15  1.880 ± 0.029  ops/us  ; negative 
>> scaling 
>> 
>> # After
>> Benchmark                Mode  Cnt  Score   Error   Units
>> UUIDRandomBench.single  thrpt   15  3.099 ± 0.022  ops/us
>> UUIDRandomBench.max     thrpt   15  3.555 ± 0.062  ops/us  ; positive 
>> scaling, ~1.2x
>> 
>> 
>> Note that there is still a scalability bottleneck in current default random 
>> (`NativePRNG`), because it synchronizes over a singleton instance for SHA1 
>> mixer, then the engine itself, etc. -- it is quite a whack-a-mole to figure 
>> out the synchronization story there. The scalability fix in current default 
>> `SecureRandom` would be much more intrusive and risky, since it would change 
>> a core crypto class with unknown bug fanout.
>> 
>> Using the bulk reads even when the underlying PRNG is heavily synchronized 
>> is still a win. A more scalable PRNG would benefit from this as well. This 
>> PR adds a system property to select the PRNG implementation, and there we 
>> can clearly see the benefit with more scalable PRNG sources:
>> 
>> 
>> Benchmark               Mode  Cnt   Score   Error   Units
>> 
>> ### x86_64 (c6.8xlarge, Xeon, 18 cores)
>> 
>> # Before, hacked `new SecureRandom()` to 
>> `SecureRandom.getInstance("SHA1PRNG")`
>> UUIDRandomBench.single  thrpt  ...
>
> src/java.base/share/classes/java/util/UUID.java line 224:
> 
>> 222:                     // Try to pull the UUID from the current buffer at 
>> current position.
>> 223:                     if (stamp != 0) {
>> 224:                         int p = (int)VH_POS.getAndAdd(this, UUID_CHUNK);
> 
> An atomic update together with an optimistic lock looks non-idiomatic use of 
> StampedLock to me. Won't a simple CAS loop be simpler? E.g. in pseudocode:
> 
> while ((p = this.pos) + 16) < buf.length) {
>     long msb = getLong(buf, p);
>     long lsb = getLong(buf, p + 8);
>     if (cas(this.pos, p, p + 16)) {
>         return new UUID(msb, lsb);
>     }
> }
> 
> // refill buffer under lock

Nope, I don't think you cannot do this, because there is a race on 
concurrently-updating `buf`. The StampedLock (RW lock), protects the 
buffer-reading threads from seeing the `buf` undergoing the initialization by 
buffer-writing threads. Atomic pos only arbitrates the concurrent 
buffer-threads. That atomic does not help to resolve the buf races.

> src/java.base/share/classes/java/util/UUID.java line 260:
> 
>> 258:                         buf[c + 8] &= 0x3f;  /* clear variant        */
>> 259:                         buf[c + 8] |= (byte) 0x80;  /* set to IETF 
>> variant  */
>> 260:                     }
> 
> I'm not sure I understand the point about initialization. Why not just 
> setting the required version bits right in UUID constructor without updating 
> the array? This will be faster and simpler IMO.

Right, we can do things straight on locals, without ever touching the heap, 
let's see...

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206430447
PR Review Comment: https://git.openjdk.org/jdk/pull/14135#discussion_r1206433588

Re: RFR: 8308804: Improve UUID.randomUUID performance with bulk/scalable PRNG access

Reply via email to