Re: [DISCUSS] UUID type

Piotr Findeisen Thu, 29 Jul 2021 03:01:11 -0700

Hi,

I agree with Ryan that it takes some precautions before one can assume
uniqueness of UUID values, and that UUIDs shouldn't be special in this
respect.
After all, this is just a primitive type that is commonly used for
certain things, but "commonly" doesn't mean "always".


A dedicated type has advantages on three layers.
Two of them, compact representation in the file and compact representation
in memory in the query engine, are the ones mentioned above.
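
To make the size difference concrete, here is a minimal sketch in Python
using only the standard `uuid` module. The UUID value is an arbitrary
example, and the variable names are mine, not anything from Iceberg or Trino.

```python
import uuid

# Arbitrary example value; every UUID has the same sizes.
u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")

binary_size = len(u.bytes)                    # 16-byte binary form, i.e. fixed[16]
utf8_size = len(str(u).encode("utf-8"))       # canonical hyphenated text as UTF-8
utf16_size = len(str(u).encode("utf-16-le"))  # same text, 2 bytes per character

# The binary form round-trips to the same value without loss.
round_tripped = uuid.UUID(bytes=u.bytes)

print(binary_size, utf8_size, utf16_size, round_tripped == u)
```

The text form is 2.25x the binary size in UTF-8 and 4.5x in UTF-16.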

The third layer is usability. Seeing a UUID column, I know what values I
can expect, so it's more descriptive than `id char(36)`.
This also means I can CREATE TABLE ... AS SELECT uuid(), .... without
needing to cast to varchar.
It also removes the temptation to cast uuid to varbinary to achieve a compact
representation.
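
As a rough illustration of the "I know what values I can expect" point, the
sketch below shows that a 36-character string is not necessarily a UUID.
It is Python with the standard `uuid` module; `is_uuid_text` is a
hypothetical helper of mine, not an API of any engine.

```python
import uuid


def is_uuid_text(value: str) -> bool:
    """Return True if value is a UUID in canonical hyphenated text form."""
    try:
        # uuid.UUID also accepts braces, a urn:uuid: prefix and missing
        # hyphens, so compare against the canonical form explicitly.
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


valid = "123e4567-e89b-12d3-a456-426614174000"
# Exactly 36 characters, so it fits in char(36), but it is not a UUID.
not_a_uuid = "this-string-is-36-characters-long!!!"

print(is_uuid_text(valid), is_uuid_text(not_a_uuid))
```

With `id char(36)` every reader has to apply a check like this; with a
UUID column the type already guarantees it.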

Thus I think it would be good to have them.

Best
PF



On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <[email protected]> wrote:

> The original reason why I added UUID to the spec was that I thought there
> would be opportunities to take advantage of UUIDs as unique values and to
> optimize the use of UUIDs. I was thinking about auto-increment ID fields
> and how we might do something similar in Iceberg.
>
> The reason we have thought about removing UUID is that there aren't as
> many opportunities to take advantage of UUIDs as I thought. My original
> assumption was that we could do things like bucket on UUID fields or assume
> that a UUID field has a high NDV. But that's not necessarily the case
> when a UUID field is a foreign key, only when it is used as an identifier
> or primary key. Before Jack added tracking for row identifier fields, we
> couldn't know that a UUID was unique in a table. As a result, we didn't
> invest in support for UUID.
>
> Quick aside: Now that row identifier fields are tracked, we can do some of
> these things with the row identifier fields. Engines can assume that the
> tuple of row identifier fields is unique in a table for join estimation.
> And engines can use row identifier fields in sort keys to ensure lots of
> partition split locations (this is really important for Spark).
>
> Coming back to UUIDs, the second reason to have a UUID type is still
> valid: it is better to represent UUIDs as fixed[16] than as 36 byte UTF-8
> strings that are more than twice as large, or even worse UTF-16 strings
> that are more than 4x as large. Since UUIDs are likely to be used in joins, this
> could really help engines as long as they can keep the values as
> fixed-width binary.
>
> I could go either way on this. I think it is valuable to have a compact
> representation for UUIDs rather than using the string representation. But
> that will require investing in the type and building support in engines
> that won't take advantage of it. If Trino can use this, I think it may be
> worth keeping and investing in.
>
> Ryan
>
> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <[email protected]> wrote:
>
>> Yes I agree with Jacques that fixed binary is what it is in the end. I
>> think it is more about user experience, whether the conversion is done at
>> the user side or Iceberg and engine side. Many people just store UUID as a
>> 36 byte string instead of a 16 byte binary, so with an explicit UUID type,
>> Iceberg can optimize this common use case internally for users. There might
>> be some other benefits I overlooked, but maybe the complication introduced
>> by this type does not really justify the slightly better user experience. I
>> am also on the fence about it.
>>
>> -Jack Ye
>>
>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <[email protected]>
>> wrote:
>>
>>> What specific arguments are there for it being a first class type
>>> besides that it exists elsewhere? Is there some kind of optimization Iceberg or an
>>> engine could do if it was typed versus just a bucket of bits? Fixed width
>>> binary seems to cover the cases I see in terms of actual functionality in
>>> the Iceberg libraries or engines…
>>>
>>>
>>>
>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <[email protected]> wrote:
>>>
>>>> One conversation I came across regarding UUID deprecation was in
>>>> https://github.com/apache/iceberg/pull/1611
>>>>
>>>> Thanks,
>>>> Yan
>>>>
>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Joshua,
>>>>>
>>>>> I do not have a strong preference about the UUID type, but I would
>>>>> like to highlight that the type is handled inconsistently in Iceberg
>>>>> with different file formats. (See:
>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>
>>>>> If we keep the type, it would be good to standardize the handling in
>>>>> every file format.
>>>>>
>>>>> Thanks, Peter
>>>>>
>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>> https://iceberg.apache.org/spec/#primitive-types), but there seems
>>>>>> to have been some discussion about removing it? I could not find the
>>>>>> original discussion, but a reference to the discussion can be found here (
>>>>>> https://github.com/trinodb/trino/issues/6663).
>>>>>>
>>>>>> I generally agree with the consensus in the Trino issue to keep UUID
>>>>>> in Iceberg. To summarize…
>>>>>>
>>>>>> - It makes sense to keep the type now that row identifiers are
>>>>>> supported
>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>
>>>>>> Does anyone want to remove the type? If so, why?
>>>>>
>>>>>
>
> --
> Ryan Blue
> Tabular
>
