Hi Tim!

On Tue, Jul 09, 2024 at 07:34:57PM +0200, Tim Düsterhus wrote:
> Hi
> 
> On 7/8/24 06:38, Willy Tarreau wrote:
> > Well, I must confess that while I understand what you tried to do, it's
> > less clear what is the problem you wanted to address. I think that your
> > concern is the possible lack of monotonicity, is that it ?
> 
> An example is probably best to explain the requirements. When I do the
> following:
> 
>       http-after-response set-header test1 %[uuid(7)]
>       http-after-response set-header test2 %[uuid(7)]
>       http-after-response set-header test3 %[uuid(7)]
>       http-after-response set-header test4 %[uuid(7)]
>       http-after-response set-header test5 %[uuid(7)]
>       http-after-response set-header test6 %[uuid(7)]
>       http-after-response set-header test7 %[uuid(7)]
>       http-after-response set-header test8 %[uuid(7)]
> 
> then the returned UUIDs need to be "ascending" as per RFC 9562#6.2. The
> current implementation does not guarantee that, because the embedded
> timestamp only has millisecond resolution and the rest is completely random.
> That means that the test6 UUID might be lower than the test2 UUID.

I initially read that the monotonicity was only required for the clock
part (because that's what initial implementations were seeking: being
able to keep caches hot and batch-purge elements without searching), but
it indeed seems they went a bit too far by requiring full monotonicity.
The problem is that this automatically excludes multi-node setups. There's
even a section (6.4) about multi-node deployments that carefully avoids
discussing anything related to monotonicity, since that's a well-known
physically unsolvable problem; it only discusses collision avoidance.

> RFC 9562 provides several options in section 6.2. Either the bits following
> the millisecond part are a simple counter that resets every millisecond,
> they are used to store additional clock precision (e.g. micro- or nanoseconds),
> or the random part is not actually random and instead starts at a random
> value and is incremented by a random amount every time.

These are indeed solutions for a single-node generator. Normally, by
starting from a new random value at each new timestamp and incrementing
until the next one, you'll walk through pseudo-random ranges and will not
get any local collision. But the problem in a distributed system is that
the more synchronized your clocks are, the more likely a collision
becomes, since two nodes picking relatively close starting values can
easily reach each other's range, whereas purely random values will not
do that (but are not monotonic).

> My patch chooses option (2): Storing additional clock precision.

The problem is that on many systems, often VMs but also various devices
which are typically used to host components such as haproxy, this extra
precision does not exist, or relies on non-monotonic sources, resulting
in the clock_gettime() call effectively returning discrete values just
differing by a counter. For example here on this system:

  $ ./a.out  | tail
  0002507749.074553230
  0002507749.074557730
  0002507749.074562230
  0002507749.074566730
  0002507749.074571230
  0002507749.074575730
  0002507749.074580230
  0002507749.074584730
  0002507749.074589230
  0002507749.074593730

See? What you actually get is a fixed 500 ns counter. It's not that bad
because at least it changes, but on other setups you'd see something
completely stable.
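
For reference, the a.out above is just a trivial loop around
clock_gettime(); the exact source isn't shown here, but it is essentially
equivalent to this (illustrative only):

  #include <stdio.h>
  #include <time.h>

  /* print successive CLOCK_MONOTONIC readings to observe the clock's
   * effective resolution (rough equivalent of the ./a.out used above)
   */
  int main(void)
  {
          struct timespec ts;

          for (int i = 0; i < 1000000; i++) {
                  clock_gettime(CLOCK_MONOTONIC, &ts);
                  printf("%010ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
          }
          return 0;
  }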

The second problem with CLOCK_MONOTONIC is that its base is unspecified
and is often the system's uptime, like above. In short-lived VMs, you'll
be generating UUIDs whose timestamps all start with many zeroes, are
concentrated in a tiny range, and repeat after every restart. It's even
worse in containers, which can live even shorter (a few seconds). In all
such environments, CLOCK_MONOTONIC is a no-go.

> > In any case there are much less risks of collisions by filling bits
> > with random than with clock bits. Here if I read correctly, you're
> > only keeping 64 bits of random, which means that on average you'll
> > experience a collision every 2^32 generations. At 50k/s, it happens
> > once a day. With the 96 bits used previously, it will only happen
> > every 2^48, which is once every 178 years (which obviously reduces
> > in distributed systems).
> 
> My patch does the following:
> 
> 48 bits: Unix timestamp with milliseconds.
>  4 bits: UUID version.
> 12 bits: How many 256ns periods happened since the last millisecond?
>  2 bits: UUID variant.
> 62 bits: Random.
> 
> Indeed a collision with 64 bits of randomness is expected every 2^32 values,
> but you need to keep in mind that a collision will only be a collision if it
> happens in the same millisecond + nanosecond offset. So it actually is a
> comparison between:
> 
> 1 ms resolution + 74 bits of randomness
> 256 ns resolution + 62 bits of randomness

The poor resolution of the time source can turn many of these extra bits
into pure constants, effectively losing them.
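
As a side note, the collision horizons quoted earlier (about once a day at
50k/s with 64 random bits, ~178 years with 96) can be sanity-checked with
a quick birthday-bound computation; a throwaway example:

  #include <stdio.h>
  #include <math.h>

  /* with N random bits, a collision is expected after about 2^(N/2)
   * generations (birthday bound); convert that into time at 50k UUIDs/s
   */
  int main(void)
  {
          double rate = 50000.0;
          double h64  = pow(2, 64 / 2) / rate; /* seconds until expected collision */
          double h96  = pow(2, 96 / 2) / rate;

          printf("64 bits: ~%.1f days\n",  h64 / 86400.0);
          printf("96 bits: ~%.0f years\n", h96 / (86400.0 * 365));
          return 0;
  }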

> > Finally, the purpose of having the timestamp in UUIDv7 (like ULID)
> > is to group simultaneous operations. They keep caches hot, they allow
> > events to be found grouped in logs, and offer easier purge mechanisms
> > for recycling entries (no longer need to look up, just apply a time
> > offset and eliminate all those below). In that sense there's no quest
> > for extreme precision (which cannot work on distributed systems anyway),
> > rather for a reasonably good and reasonably monotonic ordering that
> > still prevents collisions across multiple systems.
> 
> RFC 9562 section 6.2 specifies that:
> 
> > Take care to ensure UUIDs generated in batches are also monotonic. That is,
> > if one thousand UUIDs are generated for the same timestamp, there should be
> > sufficient logic for organizing the creation order of those one thousand
> > UUIDs.
> 
> and that is what this patch attempts to do.
> 
> > So if that's also what you're looking for, I'm proposing that we
> > instead change the use of the unreliable date variable with a call
> > to now_mono_date_ns() that we can then use as the date source, and
> > keep the rest of the randoms.
> 
> As explained above, your suggested patch does not do the right thing.

I see, but actually neither one does in the end :-)

And we have not yet covered the space dimension: 1) threads within the
same process; 2) processes on the same machine; 3) multiple nodes.

I can suggest several approaches:

  1) first, we should probably consider that since space-based
     monotonicity is not compatible with the laws of physics, it must
     be addressed differently. This means that in a multi-node system,
     if monotonicity is required, then it must be per-generator, hence
     the generator ID (e.g. node name and pid) must be conveyed along
     with the UUID.

  2) or we offer an option so that only the time part is monotonic (which
     is the initial technical requirement for those seeking this feature)
     and we preserve full randomness on the random bits, which is unlikely
     to collide even across systems.

Then within one generator (process): I think we can consider that
intra-thread monotonicity is non-negotiable (e.g. your example above),
and that inter-thread monotonicity is best effort (anyway, how do you
even compare the arrival order of events? If one thread gets a monotonic
ID before its sibling but only uses it later, apparent monotonicity is
broken anyway). This approach can be covered by ranges + counters, but
the risk of collision is large. Or we can simply reserve the 12 lowest
bits of the random/counter part for the thread number, to guarantee that
different threads will always get different numbers while using
monotonic, possibly overlapping, sequences.

An alternate approach that doesn't require reserving bits is to use
random increments in the random part: at each new ms, you start with a
small random number (typically less than half of the space), and you add
a tiny positive random increment upon each call. If the range is large
enough we can fit all threads in there without reserving space, and in
practice keeping the increments small enough means the value will not
wrap. In any case the extra counter bits remain usable for this (to
absorb a possible wrap). Such an approach also provides more secure
(less predictable) UUIDs. For many users, having predictable UUIDs is a
security concern (just like disclosing too precise a local system time,
which opens the way to remote SPECTRE-like attacks).
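
As an illustration only, a per-thread sequence along those lines could
look like this (the PRNG and bit widths below are arbitrary placeholders,
not an actual implementation):

  #include <stdint.h>

  /* placeholder PRNG (splitmix64); a real implementation would use a
   * proper per-thread random source
   */
  static __thread uint64_t prng_state = 0x9e3779b97f4a7c15ULL;

  static uint64_t rand64(void)
  {
          uint64_t z = (prng_state += 0x9e3779b97f4a7c15ULL);

          z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
          z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
          return z ^ (z >> 31);
  }

  static __thread uint64_t cur_seq; /* 62-bit random/counter field */
  static __thread uint64_t cur_ms;  /* last millisecond seen */

  /* restart from a small random value on each new millisecond, then add
   * a tiny strictly positive random increment on every call within the
   * same millisecond: values stay monotonic yet hard to guess
   */
  static uint64_t next_seq(uint64_t now_ms)
  {
          if (now_ms != cur_ms) {
                  cur_ms  = now_ms;
                  cur_seq = rand64() >> 3;         /* start below half of the 62-bit space */
          } else {
                  cur_seq += (rand64() >> 44) + 1; /* small 20-bit positive increment */
          }
          return cur_seq & ((1ULL << 62) - 1);
  }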

But the problem with all of these is that they require a shared
generator between all threads, and that will just crawl on large
machines (i.e. exactly the machines employed by those who hope to
offload a large chunk of the application onto the load balancer).

Or we can admit that since inter-thread monotonicity does not make much
sense beyond the time part anyway, each thread can live its own life
with its own sequence, but then we need to try hard to avoid collisions
(the pseudo-random increments can work remarkably well for this, just
like reserved fixed bits, which can then be placed low or high).

Then I think we could produce this:
   unix_ts_ms = the system date corrected for monotonicity as
                returned by now_mono_date_ns().

   rand_a + rand_b (74 bits): thread-local:
      - initialized to rand64() * nbthreads + thread_id on new ms
      - incremented by rand44() * nbthreads.
      => this maintains the thread_id discriminant in the lowest
         bits to guarantee non-collision while guaranteeing a
         minimum of 1 million random increments per millisecond
         that are not trivially guessable (44 bits each).
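
As a very rough sketch, that could translate into something like the
following (rand44()/rand64() and the thread variables are hypothetical
helpers, and now_mono_date_ns() refers to the function mentioned above;
this only illustrates the layout, it is not an implementation):

  #include <stdint.h>

  /* hypothetical helpers and thread identifiers, declared here only to
   * make the sketch readable
   */
  extern uint64_t rand64(void), rand44(void), now_mono_date_ns(void);
  extern unsigned int tid, nbthreads;

  static __thread unsigned __int128 seq74; /* thread-local rand_a+rand_b (74 bits) */
  static __thread uint64_t last_ms;

  /* build one UUIDv7 into dst[16] following the layout described above */
  static void mk_uuidv7(unsigned char dst[16])
  {
          uint64_t ms = now_mono_date_ns() / 1000000ULL;
          uint64_t rand_a, rand_b, hi, lo;

          if (ms != last_ms) {
                  last_ms = ms;
                  seq74 = (unsigned __int128)rand64() * nbthreads + tid;
          } else {
                  seq74 += (unsigned __int128)rand44() * nbthreads;
          }
          seq74 &= ((unsigned __int128)1 << 74) - 1;     /* keep 74 bits */

          rand_a = (uint64_t)(seq74 >> 62);              /* top 12 bits */
          rand_b = (uint64_t)seq74 & ((1ULL << 62) - 1); /* low 62 bits, thread_id in the lowest ones */

          /* 48-bit unix_ts_ms | version 7 | rand_a, then variant 10 | rand_b */
          hi = (ms << 16) | (7ULL << 12) | rand_a;
          lo = (2ULL << 62) | rand_b;

          for (int i = 0; i < 8; i++) {
                  dst[i]     = hi >> (56 - 8 * i);
                  dst[8 + i] = lo >> (56 - 8 * i);
          }
  }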

In this case, the chances of collisions between multiple nodes (or
processes on the same system) are extremely small. And the inter-system
monotonicity remains relatively good since it is based on the system
clock (which can be sufficient to group logs per day or drop database
tables based on a prefix, etc).

And I think it would be useful to have an option to keep monotonicity only
on the time part and drop it from the random part (that is what I would
personally prefer if I were still writing applications, since it is easier
to reason about and more reliable). That could be UUIDv8, since there
everything is custom.

Please just let me know what you think about this.

Thanks,
Willy
