Re: [DISCUSS] UUID type

Jacques Nadeau Fri, 17 Sep 2021 15:05:22 -0700

I already added it to Substrait because of Iceberg lazy consensus :D


On Fri, Sep 17, 2021 at 2:05 PM Ryan Blue <[email protected]> wrote:

> Let's move forward with it. I'm not hearing much dissent after saying the
> general trend is to keep UUID. So let's call it lazy consensus.
>
> Ryan
>
> On Fri, Sep 17, 2021 at 1:32 PM Piotr Findeisen <[email protected]>
> wrote:
>
>> Hi Ryan,
>>
>> Please advise whatever feels more appropriate from your perspective.
>> From my perspective, we could just go ahead and merge Trino Iceberg
>> support for UUID, since this is just fulfilling the spec as it is defined
>> today.
>>
>> Best
>> PF
>>
>>
>> On Wed, Sep 15, 2021 at 10:17 PM Ryan Blue <[email protected]> wrote:
>>
>>> I don't think we necessarily reached consensus, but I think the general
>>> trend toward the end was to keep support for UUID. Should we start a vote
>>> to validate consensus?
>>>
>>> On Wed, Sep 15, 2021 at 1:15 PM Joshua Howard <[email protected]>
>>> wrote:
>>>
>>>> Just following up on Piotr's message here.
>>>>
>>>> Have we converged? I think most people would assume that silence is a
>>>> vote for the status-quo.
>>>>
>>>> On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> It seems we converged here that UUID should remain included.
>>>>> I read this as a consensus reached, but it may be subjective. Did we
>>>>> objectively reach consensus on this?
>>>>>
>>>>> From the Iceberg project's perspective there isn't anything to do, as UUID
>>>>> already *is* part of the spec (
>>>>> https://iceberg.apache.org/spec/#schemas-and-data-types).
>>>>> Trino Iceberg PR adding support for UUID
>>>>> https://github.com/trinodb/trino/pull/8747 was pending merge while
>>>>> this conversation has been ongoing.
>>>>>
>>>>> Best,
>>>>> PF
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Aug 2, 2021 at 6:22 AM Kyle B <[email protected]> wrote:
>>>>>
>>>>>> Hi Ryan and all,
>>>>>>
>>>>>> That sounds like a reasonable reason to leave IP address types out.
>>>>>> In my experience, dedicated IP address types are mostly found in logging
>>>>>> tools and other things for sysadmins / DevOps etc.
>>>>>>
>>>>>> When querying data with IP addresses, I’ve seen it done quite a lot
>>>>>> (e.g. for security reasons), but the addresses are usually stored as
>>>>>> strings or manipulated in a UDF. They’re not commonly supported types.
>>>>>>
>>>>>> I would also draw the line at UUID types.
>>>>>>
>>>>>> - Kyle Bendickson
>>>>>>
>>>>>> On Jul 30, 2021, at 3:15 PM, Ryan Blue <[email protected]> wrote:
>>>>>>
>>>>>> 
>>>>>> Jacques, you make some good points here. I think my argument about
>>>>>> usability leading to performance issues is a stronger argument for
>>>>>> engines than for Iceberg. Still, there are inefficiencies in Iceberg if
>>>>>> someone chooses to use a string in an engine that doesn't have a UUID
>>>>>> type.
>>>>>>
>>>>>> Another thing to consider is cross-engine support. If Iceberg removes
>>>>>> UUID, then Trino would probably translate to fixed[16]. That results in a
>>>>>> table that's difficult to query in other engines, where people would
>>>>>> probably choose to store the data as a string. On the other hand, if
>>>>>> Iceberg keeps the UUID type then integrations would simply translate to
>>>>>> the UUID string representation before passing data to the other engines.
>>>>>> While the engines would be using 36-byte values in join keys, the user
>>>>>> experience issue is fixed and the data is more compact on disk and in
>>>>>> Iceberg's bounds metadata.
>>>>>>
>>>>>> While having a UUID type in Iceberg can't really help engines that
>>>>>> don't support UUID take advantage of the type at runtime, it does seem
>>>>>> slightly better to have the UUID type in general since at least one
>>>>>> engine supports it and it provides the expected user experience with a
>>>>>> compact representation.
>>>>>>
>>>>>> IPv4 addresses are a good thing to think about as well, since most of
>>>>>> the same arguments apply. If we keep the UUID type, should we also add
>>>>>> IPv4 or IPv6 types? I would probably draw the line at UUID because it
>>>>>> helps in joins, which are an important operation. IPv4 representations
>>>>>> aren't that big of an inconvenience unless you need to do IP
>>>>>> manipulation, which is typically in a UDF and not the query engine. And
>>>>>> you can always keep both representations in a table fairly
>>>>>> inexpensively. Does this sound like a valid rationale for having UUID but
>>>>>> not IP types?
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a
>>>>>>> native type. Which engines are you thinking of that have a native UUID
>>>>>>> type besides the Presto derivatives and support Iceberg?
>>>>>>>
>>>>>>> I agree that Trino should expose a UUID type on top of Iceberg
>>>>>>> tables. All the user experience things that you are describing as
>>>>>>> important (compact storage, friendly display, DDL, clean literals) are
>>>>>>> possible without it being a first-class type in Iceberg, using a
>>>>>>> Trino-specific property.
>>>>>>>
>>>>>>> I don't really have a strong opinion about UUID. In general, type
>>>>>>> bloat is probably just a part of this kind of project. Generally,
>>>>>>> CHAR(X) and VARCHAR(X) feel like much bigger concerns given that they
>>>>>>> exist in all of the engines but not Iceberg--especially when we start
>>>>>>> talking about views.
>>>>>>>
>>>>>>> Some of this argues for physical vs logical type abstraction.
>>>>>>> (Something that was always challenging in Parquet but also helped to
>>>>>>> resolve how these types are managed in engines that don't support them.)
>>>>>>>
>>>>>>> thanks,
>>>>>>> Jacques
>>>>>>>
>>>>>>> PS: Funny aside, the bloat on an IP address is actually worse than a
>>>>>>> UUID, right? IPv4 = 4 bytes. IPv4 string = up to 15 bytes, so
>>>>>>> 15/4 => 275% bloat. UUID 36/16 => 125% bloat.
>>>>>>>
>>>>>>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <[email protected]> wrote:
>>>>>>>
>>>>>>>> I don't think this is just a problem in Trino.
>>>>>>>>
>>>>>>>> If there is no UUID type, then a user must choose between a 36-byte
>>>>>>>> string and a 16-byte binary. That's not a good choice to force people
>>>>>>>> into. If someone chooses binary, then it's harder to work with rows and
>>>>>>>> construct queries even though there is a standard representation for
>>>>>>>> UUIDs. To avoid the user headache, people will probably choose to store
>>>>>>>> values as strings. Using a string would mean that more than half the
>>>>>>>> value is needlessly discarded by default in Iceberg lower/upper bounds
>>>>>>>> instead of keeping the entire value. And since engines don't know what's
>>>>>>>> in the string, the full value must be used in comparison, which is extra
>>>>>>>> work and extra space.
>>>>>>>>
>>>>>>>> Inflated values may not be a problem in some cases. IPv4 addresses
>>>>>>>> are one case where you could argue that it doesn't matter very much
>>>>>>>> that they are typically stored as strings. But I expect the use of
>>>>>>>> UUIDs to be common for ID columns because you can generate them without
>>>>>>>> coordination (unlike an incrementing ID), and that's a concern because
>>>>>>>> the use as an ID makes them likely to be join keys.
>>>>>>>>
>>>>>>>> If we want the values to be stored as 16-byte fixed, then we need
>>>>>>>> to make it easy to get the expected string representation in and out,
>>>>>>>> just like we do with date/time types. I don't think that's specific to
>>>>>>>> any engine.
>>>>>>>>
>>>>>>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I think points 1&2 don't really apply since a fixed width binary
>>>>>>>>> already covers those properties.
>>>>>>>>>
>>>>>>>>> It seems like this isn't really a concern of Iceberg but rather a
>>>>>>>>> cosmetic layer that exists primarily (only?) in Trino. In that case I
>>>>>>>>> would be inclined to say that Trino should just use custom metadata and
>>>>>>>>> a fixed binary type. That way you still have the desired UX without
>>>>>>>>> exposing those extra concepts to Iceberg. It actually feels like better
>>>>>>>>> encapsulation imo.
>>>>>>>>>
>>>>>>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I agree with Ryan that it takes some precautions before one can
>>>>>>>>>> assume uniqueness of UUID values, and that this shouldn't be anything
>>>>>>>>>> special for UUIDs at all.
>>>>>>>>>> After all, this is just a primitive type, which is commonly used
>>>>>>>>>> for certain things, but "commonly" doesn't mean "always".
>>>>>>>>>>
>>>>>>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>>>>>>> The compact representation in the file and the compact
>>>>>>>>>> representation in memory in the query engine are the ones mentioned
>>>>>>>>>> above.
>>>>>>>>>>
>>>>>>>>>> The third layer is usability. Seeing a UUID column, I know what
>>>>>>>>>> values I can expect, so it's more descriptive than `id char(36)`.
>>>>>>>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), ....
>>>>>>>>>> without the need for casting to varchar.
>>>>>>>>>> It also removes the temptation of casting uuid to varbinary to
>>>>>>>>>> achieve a compact representation.
>>>>>>>>>>
>>>>>>>>>> Thus I think it would be good to have them.
>>>>>>>>>>
>>>>>>>>>> Best
>>>>>>>>>> PF
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The original reason why I added UUID to the spec was that I
>>>>>>>>>>> thought there would be opportunities to take advantage of UUIDs as
>>>>>>>>>>> unique values and to optimize the use of UUIDs. I was thinking about
>>>>>>>>>>> auto-increment ID fields and how we might do something similar in
>>>>>>>>>>> Iceberg.
>>>>>>>>>>>
>>>>>>>>>>> The reason we have thought about removing UUID is that there
>>>>>>>>>>> aren't as many opportunities to take advantage of UUIDs as I
>>>>>>>>>>> thought. My original assumption was that we could do things like
>>>>>>>>>>> bucket on UUID fields or assume that a UUID field has a high NDV
>>>>>>>>>>> (number of distinct values). But that's not necessarily the case when
>>>>>>>>>>> a UUID field is a foreign key, only when it is used as an identifier
>>>>>>>>>>> or primary key. Before Jack added tracking for row identifier fields,
>>>>>>>>>>> we couldn't know that a UUID was unique in a table. As a result, we
>>>>>>>>>>> didn't invest in support for UUID.
>>>>>>>>>>>
>>>>>>>>>>> Quick aside: Now that row identifier fields are tracked, we can
>>>>>>>>>>> do some of these things with the row identifier fields. Engines can
>>>>>>>>>>> assume that the tuple of row identifier fields is unique in a table
>>>>>>>>>>> for join estimation. And engines can use row identifier fields in
>>>>>>>>>>> sort keys to ensure lots of partition split locations (this is
>>>>>>>>>>> really important for Spark).
>>>>>>>>>>>
>>>>>>>>>>> Coming back to UUIDs, the second reason to have a UUID type is
>>>>>>>>>>> still valid: it is better to represent UUIDs as fixed[16] than as
>>>>>>>>>>> 36-byte UTF-8 strings that are more than twice as large, or even
>>>>>>>>>>> worse, UTF-16 strings that are more than 4x as large. Since UUIDs are
>>>>>>>>>>> likely to be used in joins, this could really help engines as long as
>>>>>>>>>>> they can keep the values as fixed-width binary.
>>>>>>>>>>>
>>>>>>>>>>> I could go either way on this. I think it is valuable to have a
>>>>>>>>>>> compact representation for UUIDs rather than using the string
>>>>>>>>>>> representation. But that will require investing in the type and
>>>>>>>>>>> building support in engines that won't take advantage of it. If Trino
>>>>>>>>>>> can use this, I think it may be worth keeping and investing in.
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, I agree with Jacques that fixed binary is what it is in the
>>>>>>>>>>>> end. I think it is more about user experience: whether the
>>>>>>>>>>>> conversion is done on the user side or on the Iceberg and engine
>>>>>>>>>>>> side. Many people just store UUIDs as a 36-byte string instead of a
>>>>>>>>>>>> 16-byte binary, so with an explicit UUID type, Iceberg can optimize
>>>>>>>>>>>> this common use case internally for users. There might be some other
>>>>>>>>>>>> benefits I overlooked, but maybe the complication introduced by this
>>>>>>>>>>>> type does not really justify the slightly better user experience. I
>>>>>>>>>>>> am also on the fence about it.
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack Ye
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> What specific arguments are there for it being a first-class
>>>>>>>>>>>>> type besides that it exists elsewhere? Is there some kind of
>>>>>>>>>>>>> optimization Iceberg or an engine could do if it was typed versus
>>>>>>>>>>>>> just a bucket of bits? Fixed-width binary seems to cover the cases
>>>>>>>>>>>>> I see in terms of actual functionality in the Iceberg libraries or
>>>>>>>>>>>>> engines…
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> One conversation I came across regarding UUID
>>>>>>>>>>>>>> deprecation was in
>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/1611
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Yan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Joshua,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I do not have a strong preference about the UUID type, but I
>>>>>>>>>>>>>>> would like to highlight that the type is handled inconsistently
>>>>>>>>>>>>>>> in Iceberg across different file formats. (See:
>>>>>>>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If we keep the type, it would be good to standardize the
>>>>>>>>>>>>>>> handling in every file format.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but
>>>>>>>>>>>>>>>> there seems to have been some discussion about removing it? I
>>>>>>>>>>>>>>>> could not find the original discussion, but a reference to it
>>>>>>>>>>>>>>>> can be found here (https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I generally agree with the consensus in the Trino issue to
>>>>>>>>>>>>>>>> keep UUID in Iceberg. To summarize…
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - It makes sense to keep the type now that row identifiers
>>>>>>>>>>>>>>>> are supported
>>>>>>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map it
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Josh Howard
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
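
The size arithmetic quoted in this thread (fixed[16] versus a 36-byte string,
plus the IPv4 aside) can be checked with a short Python sketch. The example
UUID value, the `bloat_percent` helper and the variable names are illustrative
assumptions; none of them come from the thread or from Iceberg itself.

```python
import ipaddress
import uuid


def bloat_percent(text_len: int, binary_len: int) -> float:
    """Extra bytes of the text form, as a percentage of the binary form."""
    return (text_len - binary_len) / binary_len * 100


# Hypothetical example value, chosen only for illustration.
u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")

uuid_binary = len(u.bytes)                    # what fixed[16] would hold
uuid_utf8 = len(str(u).encode("utf-8"))       # canonical hyphenated string
uuid_utf16 = len(str(u).encode("utf-16-le"))  # same string, 2 bytes per char

# Longest possible dotted-quad IPv4 string versus its packed form.
ip = ipaddress.IPv4Address("255.255.255.255")
ip_binary = len(ip.packed)
ip_text = len(str(ip).encode("utf-8"))

# The binary form round-trips to the same UUID, so nothing is lost.
round_tripped = uuid.UUID(bytes=u.bytes)

print(uuid_binary, uuid_utf8, uuid_utf16)
print(bloat_percent(uuid_utf8, uuid_binary))
print(bloat_percent(ip_text, ip_binary))
```

The round trip at the end is the part that matters for the usability argument:
an engine without a native UUID type could still show the 36-character form to
users while storing only the 16 bytes.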
