Hi Tim!

On Tue, Jul 09, 2024 at 07:34:57PM +0200, Tim Düsterhus wrote:
> Hi
>
> On 7/8/24 06:38, Willy Tarreau wrote:
> > Well, I must confess that while I understand what you tried to do, it's
> > less clear what is the problem you wanted to address. I think that your
> > concern is the possible lack of monotonicity, is that it ?
>
> An example is probably best to explain the requirements. When I do the
> following:
>
>   http-after-response set-header test1 %[uuid(7)]
>   http-after-response set-header test2 %[uuid(7)]
>   http-after-response set-header test3 %[uuid(7)]
>   http-after-response set-header test4 %[uuid(7)]
>   http-after-response set-header test5 %[uuid(7)]
>   http-after-response set-header test6 %[uuid(7)]
>   http-after-response set-header test7 %[uuid(7)]
>   http-after-response set-header test8 %[uuid(7)]
>
> then the returned UUIDs need to be "ascending" as per RFC 9562#6.2. The
> current implementation does not guarantee that, because the embedded
> timestamp only has millisecond resolution and the rest is completely
> random. That means that the test6 UUID might be lower than the test2
> UUID.
I initially read that the monotonicity was only required for the clock
part (because that's what the initial implementations were seeking: being
able to keep caches hot and batch-purge elements without searching), but
it indeed seems they went a bit too far by requiring full monotonicity.
The problem is that it automatically excludes multi-node setups. There's
even a section (6.4) about multi-node which carefully avoids discussing
anything related to monotonicity, since that's a well-known physically
unsolvable problem; it only discusses collision avoidance.

> RFC 9562 provides several options in section 6.2. Either the bits
> following the millisecond part are a simple counter that resets every
> millisecond, they are used to store additional clock precision (e.g.
> micro or nanoseconds), or the random part is not actually random and
> instead starts at a random value and is incremented by a random amount
> every time.

These are indeed solutions for a single-node generator. Normally, by
picking a new random starting point at each new timestamp and
incrementing until the next one, you'll play in pseudo-random ranges and
will not get any local collision. But the problem in a distributed system
is that the more synchronized your clocks are, the more likely a
collision becomes, since two nodes picking relatively close starting
points can easily walk into each other's range, whereas pure randoms will
not do that (but are not monotonic).

> My patch chooses option (2): Storing additional clock precision.

The problem is that on many systems, often VMs but also various devices
which are typically used to host components such as haproxy, this extra
precision does not exist, or relies on non-monotonic sources, resulting
in the clock_gettime() call effectively returning discrete values that
only differ by a counter. For example here on this system:

  $ ./a.out | tail
  0002507749.074553230
  0002507749.074557730
  0002507749.074562230
  0002507749.074566730
  0002507749.074571230
  0002507749.074575730
  0002507749.074580230
  0002507749.074584730
  0002507749.074589230
  0002507749.074593730

See? What you're really getting is a fixed 500 ns counter. It's not that
bad because at least it changes, but on other setups you'd see something
completely stable.

The second problem with CLOCK_MONOTONIC is that its base is unspecified
and is often the system's uptime, like above. In short-lived VMs, you'll
keep generating UUIDs that all start with many zeroes and are
concentrated in a tiny, recycled range. It's even worse in containers,
which can live even shorter (a few seconds). In all such environments,
CLOCK_MONOTONIC is a no-go.

> > In any case there are much lower risks of collisions by filling bits
> > with random than with clock bits. Here if I read correctly, you're
> > only keeping 64 bits of random, which means that on average you'll
> > experience a collision every 2^32 generations. At 50k/s, it happens
> > once a day. With the 96 bits used previously, it will only happen
> > every 2^48, which is once every 178 years (which obviously reduces
> > in distributed systems).
>
> My patch does the following:
>
>   48 bits: Unix timestamp with milliseconds.
>    4 bits: UUID version.
>   12 bits: How many 256 ns periods happened since the last millisecond?
>    2 bits: UUID variant.
>   62 bits: Random.
>
> Indeed a collision with 64 bits of randomness is expected every 2^32
> values, but you need to keep in mind that a collision will only be a
> collision if it happens in the same millisecond + nanosecond offset.
> So it actually is a comparison between:
>
>   1 ms resolution + 74 bits of randomness
>   256 ns resolution + 62 bits of randomness

The poor time resolution can turn many of these bits into pure constants,
so they are effectively lost.
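By the way, the trivial probe used for the CLOCK_MONOTONIC output above
boils down to something like this (rewritten from memory, so take it as a
sketch only); running it on a few machines quickly shows how coarse the
effective resolution can really be:

  #include <stdio.h>
  #include <time.h>

  int main(void)
  {
          struct timespec ts;
          int i;

          /* dump raw CLOCK_MONOTONIC readings to observe the effective
           * resolution of the clock source on this machine.
           */
          for (i = 0; i < 1000000; i++) {
                  clock_gettime(CLOCK_MONOTONIC, &ts);
                  printf("%010ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
          }
          return 0;
  }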
> > Finally, the purpose of having the timestamp in UUIDv7 (like ULID)
> > is to group simultaneous operations. They keep caches hot, they allow
> > events to be found grouped in logs, and offer easier purge mechanisms
> > for recycling entries (no longer need to look up, just apply a time
> > offset and eliminate all those below). In that sense there's no quest
> > for extreme precision (which cannot work on distributed systems
> > anyway), rather for a reasonably good and reasonably monotonic
> > ordering that still prevents collisions across multiple systems.
>
> RFC 9562 section 6.2 specifies that:
>
> > Take care to ensure UUIDs generated in batches are also monotonic.
> > That is, if one thousand UUIDs are generated for the same timestamp,
> > there should be sufficient logic for organizing the creation order of
> > those one thousand UUIDs.
>
> and that is what this patch attempts to do.
>
> > So if that's also what you're looking for, I'm proposing that we
> > instead change the use of the unreliable date variable with a call
> > to now_mono_date_ns() that we can then use as the date source, and
> > keep the rest of the randoms.
>
> As explained above, your suggested patch does not do the right thing.

I see, but actually neither one does in the end :-) And we have not
covered the space dimension yet:

  1) threads from the same process
  2) processes on the same machine
  3) multiple nodes

I can suggest several approaches:

  1) first, since space-based monotonicity is simply not compatible with
     the laws of physics, it must be addressed differently. This means
     that in a multi-node system, if monotonicity is required, then it
     must be per-generator, hence a generator ID (e.g. node name and pid)
     must be conveyed along with the UUID.

  2) or we offer an option so that only the time part is monotonic (which
     is the initial technical requirement for those seeking this feature)
     and we preserve full randomness on the random bits, which are
     unlikely to collide even across systems.

Then within one generator (process): I think we can consider that
intra-thread monotonicity is non-negotiable (e.g. your example above),
and that inter-thread monotonicity is best effort (anyway, how do you
compare the arrival of events? Even if one thread gets a monotonic ID
before its sibling and uses it later, it breaks apparent monotonicity).

This can be covered by ranges + counters, but the risk of collision is
large. Or we can just reserve the 12 lowest bits of the random/counter
field for the thread number, to guarantee that different threads always
get different numbers while using monotonic, possibly overlapping,
sequences.

An alternate approach that doesn't require reserving bits is to use
random increments in the rand part: at each new ms, you start with a
small random number (typically less than half of the space), and you add
a tiny positive random increment upon each call. If the max range is
large enough, we can fit all threads in there without reserving space,
and in practice the increments not being overly large means the value
will not wrap. In any case the extra counter bits are usable for this
(those used to deal with wrapping).
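To make that last idea a bit more concrete, the per-call logic for a
single generator could look roughly like the sketch below. It is only an
illustration: rand() stands in for whatever PRNG we'd really use, the
field is shrunk to 32 bits for readability, and wrap handling (the extra
counter bits mentioned above) is left out:

  #include <stdint.h>
  #include <stdlib.h>

  static uint64_t last_ms;
  static uint32_t seq;    /* stands for the rand_a/rand_b bits */

  static uint32_t next_seq(uint64_t now_ms)
  {
          if (now_ms != last_ms) {
                  /* new millisecond: restart from a random value below
                   * half of the space, leaving plenty of room to grow.
                   */
                  last_ms = now_ms;
                  seq = rand() & 0x7fffffff;
          } else {
                  /* same millisecond: small positive random increment,
                   * so values within the millisecond keep increasing.
                   */
                  seq += (rand() & 0xffff) + 1;
          }
          return seq;
  }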
Such an approach also provides more secure (less predictable) UUIDs. For
many users, having predictable UUIDs is a security concern (just like
disclosing an overly precise local system time, which opens the way to
remote SPECTRE-like attacks). But the problem with all of these
approaches is that they require a generator shared between all threads,
and that will just crawl on large machines (i.e. exactly the machines
employed by those who hope to offload a large chunk of the application
onto the load balancer).

Or we can admit that, since inter-thread monotonicity does not make much
sense beyond the time part anyway, each thread can live its own life with
its own sequence, but then we need to try hard to avoid collisions (the
pseudo-random increments can work remarkably well for this, just like
fixed discriminant bits, which can then be in the low or high positions).

Then I think we could produce this:

  unix_ts_ms = the system date corrected for monotonicity, as returned
               by now_mono_date_ns().

  rand_a + rand_b (74 bits): thread-local:
    - initialized to rand64() * nbthreads + thread_id on each new ms
    - incremented by rand44() * nbthreads upon each call

  => this maintains the thread_id discriminant in the lowest bits to
     guarantee non-collision between threads, while guaranteeing a
     minimum of 1 million random increments per millisecond that are not
     trivially guessable (44 bits each).

In this case, chances of collisions between multiple nodes (or processes
on the same system) are extremely faint. And the inter-system
monotonicity remains relatively good since it is based on the system
clock (which can be sufficient to group logs per day, or to drop database
tables based on a prefix, etc).

And I think it would be useful to have an option to only keep
monotonicity on the time part and drop it from the random part (it is
what I would personally prefer to deal with if I were still writing
applications, since it is easier to reason about and more reliable). That
could be UUIDv8, since everything is custom there.

Please just let me know what you think about this.

Thanks,
Willy
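P.S.: in case it helps, here is a rough, not even compile-tested sketch
of what I have in mind for the rand_a/rand_b part. rand64()/rand44()
stand for 64/44 bits from whatever PRNG we pick, thread_id/nbthreads for
the obvious values, and gcc's unsigned __int128 is only there to keep the
74-bit arithmetic readable; wrapping past 74 bits is ignored here:

  #include <stdint.h>

  /* placeholders, to be wired to the real equivalents */
  extern uint64_t rand64(void);            /* 64 random bits */
  extern uint64_t rand44(void);            /* 44 random bits */
  extern uint64_t now_mono_date_ns(void);  /* monotonic-corrected date */
  extern unsigned int thread_id, nbthreads;

  static __thread uint64_t last_ms;
  static __thread unsigned __int128 seq;   /* carries rand_a + rand_b */

  /* returns the 74 bits to spread over rand_a (12) and rand_b (62);
   * unix_ts_ms is simply now_mono_date_ns() / 1000000 from the same call.
   */
  static unsigned __int128 next_rand74(void)
  {
          uint64_t now_ms = now_mono_date_ns() / 1000000ULL;

          if (now_ms != last_ms) {
                  /* new ms: restart the per-thread sequence; the value is
                   * congruent to thread_id modulo nbthreads, so no two
                   * threads can ever produce the same value.
                   */
                  last_ms = now_ms;
                  seq = (unsigned __int128)rand64() * nbthreads + thread_id;
          } else {
                  /* same ms: random increment that is a multiple of
                   * nbthreads, preserving the thread discriminant while
                   * making successive values hard to guess.
                   */
                  seq += (unsigned __int128)rand44() * nbthreads;
          }
          return seq & (((unsigned __int128)1 << 74) - 1);
  }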