Re: [DISCUSS] UUID type

Ryan Blue Wed, 15 Sep 2021 13:17:38 -0700

I don't think we necessarily reached consensus, but I think the general
trend toward the end was to keep support for UUID. Should we start a vote
to validate consensus?


On Wed, Sep 15, 2021 at 1:15 PM Joshua Howard <joshthow...@gmail.com> wrote:

> Just following up on Piotr's message here.
>
> Have we converged? I think most people would assume that silence is a vote
> for the status quo.
>
> On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen <pi...@starburstdata.com>
> wrote:
>
>> Hi,
>>
>> It seems we converged here that UUID should remain included.
>> I read this as a consensus reached, but it may be subjective. Did we
>> objectively reach consensus on this?
>>
>> From the Iceberg project's perspective there isn't anything to do, as UUID
>> already *is* part of the spec (
>> https://iceberg.apache.org/spec/#schemas-and-data-types).
>> The Trino Iceberg PR adding support for UUID
>> (https://github.com/trinodb/trino/pull/8747) was pending merge while this
>> conversation was ongoing.
>>
>> Best,
>> PF
>>
>>
>>
>> On Mon, Aug 2, 2021 at 6:22 AM Kyle B <kjbendick...@gmail.com> wrote:
>>
>>> Hi Ryan and all,
>>>
>>> That sounds like a reasonable reason to leave IP address types out. In
>>> my experience, dedicated IP address types are mostly found in logging tools
>>> and other things for sysadmins / DevOps etc.
>>>
>>> When querying data with IP addresses, I’ve seen it done quite a lot (e.g.
>>> for security reasons), but the addresses are usually stored as strings or
>>> manipulated in a UDF. IP address types are not commonly supported.
>>>
>>> I would also draw the line at UUID types.
>>>
>>> - Kyle Bendickson
>>>
>>> On Jul 30, 2021, at 3:15 PM, Ryan Blue <b...@tabular.io> wrote:
>>>
>>>
>>> Jacques, you make some good points here. I think my argument about
>>> usability leading to performance issues is a stronger argument for engines
>>> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
>>> chooses to use a string in an engine that doesn't have a UUID type.
>>>
>>> Another thing to consider is cross-engine support. If Iceberg removes
>>> UUID, then Trino would probably translate to fixed[16]. That results in a
>>> table that's difficult to query in other engines, where people would
>>> probably choose to store the data as a string. On the other hand, if
>>> Iceberg keeps the UUID type then integrations would simply translate to the
>>> UUID string representation before passing data to the other engines.
>>> While the engines would be using 36-byte values in join keys, the user
>>> experience issue is fixed and the data is more compact on disk and in
>>> Iceberg's bounds metadata.
>>>
>>> While having a UUID type in Iceberg can't really help engines that don't
>>> support UUID take advantage of the type at runtime, it does seem slightly
>>> better to have the UUID type in general since at least one engine supports
>>> it and it provides the expected user experience with a compact
>>> representation.
>>>
>>> IPv4 addresses are a good thing to think about as well, since most of
>>> the same arguments apply. If we keep the UUID type, should we also add IPv4
>>> or IPv6 types? I would probably draw the line at UUID because it helps in
>>> joins, which are an important operation. IPv4 representations aren't that
>>> big of an inconvenience unless you need to do IP manipulation, which is
>>> typically in a UDF and not the query engine. And you can always keep both
>>> representations in a table fairly inexpensively. Does this sound like a
>>> valid rationale for having UUID but not IP types?
>>>
>>> Ryan
>>>
>>> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <jacquesnad...@gmail.com>
>>> wrote:
>>>
>>>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
>>>> type. Which engines are you thinking of that have a native UUID type
>>>> besides the Presto derivatives and support Iceberg?
>>>>
>>>> I agree that Trino should expose a UUID type on top of Iceberg tables.
>>>> All the user experience things that you are describing as important
>>>> (compact storage, friendly display, DDL, clean literals) are possible
>>>> without it being a first-class type in Iceberg, using a Trino-specific
>>>> property.
>>>>
>>>> I don't really have a strong opinion about UUID. In general, type bloat
>>>> is probably just a part of this kind of project. Generally, CHAR(X) and
>>>> VARCHAR(X) feel like much bigger concerns given that they exist in all of
>>>> the engines but not Iceberg--especially when we start talking about views.
>>>>
>>>> Some of this argues for physical vs logical type abstraction.
>>>> (Something that was always challenging in Parquet but also helped to
>>>> resolve how these types are managed in engines that don't support them.)
>>>>
>>>> thanks,
>>>> Jacques
>>>>
>>>> PS: Funny aside, the bloat on an ip address is actually worse than a
>>>> UUID, right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat.
>>>> UUID 36/16 => 125% bloat.
>>>>
>>>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I don't think this is just a problem in Trino.
>>>>>
>>>>> If there is no UUID type, then a user must choose between a 36-byte
>>>>> string and a 16-byte binary. That's not a good choice to force people
>>>>> into. If someone chooses binary, then it's harder to work with rows and
>>>>> construct queries even though there is a standard representation for
>>>>> UUIDs. To avoid the user headache, people will probably choose to store
>>>>> values as strings. Using a string would mean that more than half the
>>>>> value is needlessly discarded by default in Iceberg lower/upper bounds
>>>>> instead of keeping the entire value. And since engines don't know what's
>>>>> in the string, the full value must be used in comparison, which is extra
>>>>> work and extra space.
>>>>>
>>>>> Inflated values may not be a problem in some cases. IPv4 addresses are
>>>>> one case where you could argue that it doesn't matter very much that they
>>>>> are typically stored as strings. But I expect the use of UUIDs to be
>>>>> common for ID columns because you can generate them without coordination
>>>>> (unlike an incrementing ID) and that's a concern because the use as an ID
>>>>> makes them likely to be join keys.
>>>>>
>>>>> If we want the values to be stored as 16-byte fixed, then we need to
>>>>> make it easy to get the expected string representation in and out, just
>>>>> like we do with date/time types. I don't think that's specific to any
>>>>> engine.
>>>>>
>>>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <
>>>>> jacquesnad...@gmail.com> wrote:
>>>>>
>>>>>> I think points 1&2 don't really apply since a fixed width binary
>>>>>> already covers those properties.
>>>>>>
>>>>>> It seems like this isn't really a concern of Iceberg but rather a
>>>>>> cosmetic layer that exists primarily (only?) in Trino. In that case I
>>>>>> would be inclined to say that Trino should just use custom metadata and
>>>>>> a fixed binary type. That way you still have the desired UX without
>>>>>> exposing those extra concepts to Iceberg. It actually feels like better
>>>>>> encapsulation, IMO.
>>>>>>
>>>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <
>>>>>> pi...@starburstdata.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I agree with Ryan that it takes some precautions before one can
>>>>>>> assume uniqueness of UUID values, and that this isn't specific to
>>>>>>> UUIDs at all.
>>>>>>> After all, this is just a primitive type, which is commonly used for
>>>>>>> certain things, but "commonly" doesn't mean "always".
>>>>>>>
>>>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>>>> The compact representation in the file, and compact representation
>>>>>>> in memory in the query engine are the ones mentioned above.
>>>>>>>
>>>>>>> The third layer is usability. Seeing a UUID column, I know what
>>>>>>> values I can expect, so it's more descriptive than `id char(36)`.
>>>>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), ....
>>>>>>> without needing to cast to varchar.
>>>>>>> It also removes the temptation to cast uuid to varbinary to achieve
>>>>>>> a compact representation.
>>>>>>>
>>>>>>> Thus I think it would be good to have them.
>>>>>>>
>>>>>>> Best
>>>>>>> PF
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> The original reason why I added UUID to the spec was that I thought
>>>>>>>> there would be opportunities to take advantage of UUIDs as unique
>>>>>>>> values and to optimize the use of UUIDs. I was thinking about
>>>>>>>> auto-increment ID fields and how we might do something similar in
>>>>>>>> Iceberg.
>>>>>>>>
>>>>>>>> The reason we have thought about removing UUID is that there aren't
>>>>>>>> as many opportunities to take advantage of UUIDs as I thought. My
>>>>>>>> original assumption was that we could do things like bucket on UUID
>>>>>>>> fields or assume that a UUID field has a high NDV. But that's not
>>>>>>>> necessarily the case when a UUID field is a foreign key, only when it
>>>>>>>> is used as an identifier or primary key. Before Jack added tracking
>>>>>>>> for row identifier fields, we couldn't know that a UUID was unique in
>>>>>>>> a table. As a result, we didn't invest in support for UUID.
>>>>>>>>
>>>>>>>> Quick aside: Now that row identifier fields are tracked, we can do
>>>>>>>> some of these things with the row identifier fields. Engines can assume
>>>>>>>> that the tuple of row identifier fields is unique in a table for join
>>>>>>>> estimation. And engines can use row identifier fields in sort keys to
>>>>>>>> ensure lots of partition split locations (this is really important for
>>>>>>>> Spark).
>>>>>>>>
>>>>>>>> Coming back to UUIDs, the second reason to have a UUID type is
>>>>>>>> still valid: it is better to represent UUIDs as fixed[16] than as
>>>>>>>> 36-byte UTF-8 strings that are more than twice as large, or even worse
>>>>>>>> UTF-16 strings that are more than 4x as large. Since UUIDs are likely
>>>>>>>> to be used in joins, this could really help engines as long as they
>>>>>>>> can keep the values as fixed-width binary.
>>>>>>>>
>>>>>>>> I could go either way on this. I think it is valuable to have a
>>>>>>>> compact representation for UUIDs rather than using the string
>>>>>>>> representation. But that will require investing in the type and
>>>>>>>> building support in engines that won't take advantage of it. If Trino
>>>>>>>> can use this, I think it may be worth keeping and investing in.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <yezhao...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Yes, I agree with Jacques that fixed binary is what it is in the
>>>>>>>>> end. I think it is more about user experience: whether the conversion
>>>>>>>>> is done on the user side or on the Iceberg and engine side. Many
>>>>>>>>> people just store UUID as a 36-byte string instead of a 16-byte
>>>>>>>>> binary, so with an explicit UUID type, Iceberg can optimize this
>>>>>>>>> common use case internally for users. There might be some other
>>>>>>>>> benefits I overlooked, but maybe the complication introduced by this
>>>>>>>>> type does not really justify the slightly better user experience. I
>>>>>>>>> am also on the fence about it.
>>>>>>>>>
>>>>>>>>> -Jack Ye
>>>>>>>>>
>>>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <
>>>>>>>>> jacquesnad...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> What specific arguments are there for it being a first-class type
>>>>>>>>>> besides that it exists elsewhere? Is there some kind of optimization
>>>>>>>>>> Iceberg or an engine could do if it was typed versus just a bucket
>>>>>>>>>> of bits? Fixed-width binary seems to cover the cases I see in terms
>>>>>>>>>> of actual functionality in the Iceberg libraries or engines…
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yyany...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> One conversation I came across regarding UUID deprecation was in
>>>>>>>>>>> https://github.com/apache/iceberg/pull/1611
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yan
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary
>>>>>>>>>>> <pv...@cloudera.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Joshua,
>>>>>>>>>>>>
>>>>>>>>>>>> I do not have a strong preference about the UUID type, but I
>>>>>>>>>>>> would like to highlight that the type is handled inconsistently
>>>>>>>>>>>> in Iceberg across different file formats. (See:
>>>>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>>>
>>>>>>>>>>>> If we keep the type, it would be good to standardize the
>>>>>>>>>>>> handling in every file format.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <
>>>>>>>>>>>> joshthow...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi.
>>>>>>>>>>>>>
>>>>>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but there
>>>>>>>>>>>>> seems to have been some discussion about removing it? I could
>>>>>>>>>>>>> not find the original discussion, but a reference to the
>>>>>>>>>>>>> discussion can be found here (
>>>>>>>>>>>>> https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I generally agree with the consensus in the Trino issue to
>>>>>>>>>>>>> keep UUID in Iceberg. To summarize…
>>>>>>>>>>>>>
>>>>>>>>>>>>> - It makes sense to keep the type now that row identifiers are
>>>>>>>>>>>>> supported
>>>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>>
>
> --
> Josh Howard
>


-- 
Ryan Blue
Tabular
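
For reference, the size figures quoted in the thread (36-byte string vs.
16-byte fixed, and the 125% / 275% bloat numbers) can be reproduced with a
short Python sketch. The UUID value below is a made-up example, the helper
name bloat_percent is hypothetical, and the 16-character truncation of string
bounds is an assumption about Iceberg's default metrics setting rather than
something stated in the thread.

```python
import uuid

# Hypothetical value, used only for illustration.
value = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")

text = str(value).encode("utf-8")  # canonical 36-character string form
fixed = value.bytes                # 16-byte binary form, i.e. fixed[16]


def bloat_percent(expanded: int, compact: int) -> float:
    """Extra size of the expanded form relative to the compact one, in percent."""
    return (expanded - compact) / compact * 100


uuid_bloat = bloat_percent(len(text), len(fixed))  # 36 bytes vs 16 bytes
ipv4_bloat = bloat_percent(15, 4)  # "255.255.255.255" vs 4 bytes

# Assumption: string bounds are truncated to 16 characters by default, so
# more than half of a 36-character UUID string would be dropped.
truncated_bound = text[:16]

# The binary form converts back to the standard string without loss.
round_trip = str(uuid.UUID(bytes=fixed))
```

Under those assumptions the binary form fits entirely within a 16-byte
bound, while the string form keeps only its first 16 characters.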
