I don't think we necessarily reached consensus, but I think the general trend toward the end was to keep support for UUID. Should we start a vote to validate consensus?
On Wed, Sep 15, 2021 at 1:15 PM Joshua Howard <joshthow...@gmail.com> wrote:

> Just following up on Piotr's message here.
>
> Have we converged? I think most people would assume that silence is a vote
> for the status quo.
>
> On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>
>> Hi,
>>
>> It seems we converged here that UUID should remain included.
>> I read this as a consensus reached, but it may be subjective. Did we
>> objectively reach consensus on this?
>>
>> From the Iceberg project perspective there isn't anything to do, as UUID
>> already *is* part of the spec (https://iceberg.apache.org/spec/#schemas-and-data-types).
>> The Trino Iceberg PR adding support for UUID,
>> https://github.com/trinodb/trino/pull/8747, was pending merge while this
>> conversation has been ongoing.
>>
>> Best,
>> PF
>>
>> On Mon, Aug 2, 2021 at 6:22 AM Kyle B <kjbendick...@gmail.com> wrote:
>>
>>> Hi Ryan and all,
>>>
>>> That sounds like a reasonable reason to leave IP address types out. In
>>> my experience, dedicated IP address types are mostly found in logging tools
>>> and other things for sysadmins / DevOps etc.
>>>
>>> When querying data with IP addresses, I’ve seen it done quite a lot (e.g. for
>>> security reasons) but usually stored as string or manipulated in a UDF.
>>> They’re not commonly supported types.
>>>
>>> I would also draw the line at UUID types.
>>>
>>> - Kyle Bendickson
>>>
>>> On Jul 30, 2021, at 3:15 PM, Ryan Blue <b...@tabular.io> wrote:
>>>
>>> Jacques, you make some good points here. I think my argument about
>>> usability leading to performance issues is a stronger argument for engines
>>> than for Iceberg. Still, there are inefficiencies in Iceberg if someone
>>> chooses to use a string in an engine that doesn't have a UUID type.
>>>
>>> Another thing to consider is cross-engine support. If Iceberg removes
>>> UUID, then Trino would probably translate to fixed[16].
>>> That results in a table that's difficult to query in other engines, where
>>> people would probably choose to store the data as a string. On the other
>>> hand, if Iceberg keeps the UUID type then integrations would simply
>>> translate to the UUID string representation before passing data to the
>>> other engines. While the engines would be using 36-byte values in join
>>> keys, the user experience issue is fixed and the data is more compact on
>>> disk and in Iceberg's bounds metadata.
>>>
>>> While having a UUID type in Iceberg can't really help engines that don't
>>> support UUID take advantage of the type at runtime, it does seem slightly
>>> better to have the UUID type in general since at least one engine supports
>>> it and it provides the expected user experience with a compact
>>> representation.
>>>
>>> IPv4 addresses are a good thing to think about as well, since most of
>>> the same arguments apply. If we keep the UUID type, should we also add IPv4
>>> or IPv6 types? I would probably draw the line at UUID because it helps in
>>> joins, which are an important operation. IPv4 representations aren't that
>>> big of an inconvenience unless you need to do IP manipulation, which is
>>> typically in a UDF and not the query engine. And you can always keep both
>>> representations in a table fairly inexpensively. Does this sound like a
>>> valid rationale for having UUID but not IP types?
>>>
>>> Ryan
>>>
>>> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <jacquesnad...@gmail.com> wrote:
>>>
>>>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native
>>>> type. Which engines are you thinking of that have a native UUID type
>>>> besides the Presto derivatives and support Iceberg?
>>>>
>>>> I agree that Trino should expose a UUID type on top of Iceberg tables.
>>>> All the user experience things that you are describing as important
>>>> (compact storage, friendly display, DDL, clean literals) are possible
>>>> without it being a first class type in Iceberg using a Trino-specific
>>>> property.
>>>>
>>>> I don't really have a strong opinion about UUID. In general, type bloat
>>>> is probably just a part of this kind of project. Generally, CHAR(X) and
>>>> VARCHAR(X) feel like much bigger concerns given that they exist in all of
>>>> the engines but not Iceberg--especially when we start talking about views.
>>>>
>>>> Some of this argues for physical vs logical type abstraction.
>>>> (Something that was always challenging in Parquet but also helped to
>>>> resolve how these types are managed in engines that don't support them.)
>>>>
>>>> thanks,
>>>> Jacques
>>>>
>>>> PS: Funny aside, the bloat on an IP address is actually worse than a
>>>> UUID, right? IPv4 = 4 bytes. IPv4 String = 15 bytes.... 15/4 => 275% bloat.
>>>> UUID 36/16 => 125% bloat.
>>>>
>>>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> I don't think this is just a problem in Trino.
>>>>>
>>>>> If there is no UUID type, then a user must choose between a 36-byte
>>>>> string and a 16-byte binary. That's not a good choice to force people
>>>>> into. If someone chooses binary, then it's harder to work with rows and
>>>>> construct queries even though there is a standard representation for
>>>>> UUIDs. To avoid the user headache, people will probably choose to store
>>>>> values as strings. Using a string would mean that more than half the
>>>>> value is needlessly discarded by default in Iceberg lower/upper bounds
>>>>> instead of keeping the entire value. And since engines don't know what's
>>>>> in the string, the full value must be used in comparison, which is extra
>>>>> work and extra space.
>>>>>
>>>>> Inflated values may not be a problem in some cases.
>>>>> IPv4 addresses are one case where you could argue that it doesn't matter
>>>>> very much that they are typically stored as strings. But I expect the use
>>>>> of UUIDs to be common for ID columns because you can generate them
>>>>> without coordination (unlike an incrementing ID) and that's a concern
>>>>> because the use as an ID makes them likely to be join keys.
>>>>>
>>>>> If we want the values to be stored as 16-byte fixed, then we need to
>>>>> make it easy to get the expected string representation in and out, just
>>>>> like we do with date/time types. I don't think that's specific to any
>>>>> engine.
>>>>>
>>>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <jacquesnad...@gmail.com> wrote:
>>>>>
>>>>>> I think points 1&2 don't really apply since a fixed width binary
>>>>>> already covers those properties.
>>>>>>
>>>>>> It seems like this isn't really a concern of Iceberg but rather a
>>>>>> cosmetic layer that exists primarily (only?) in Trino. In that case I
>>>>>> would be inclined to say that Trino should just use custom metadata and
>>>>>> a fixed binary type. That way you still have the desired UX without
>>>>>> exposing those extra concepts to Iceberg. It actually feels like better
>>>>>> encapsulation imo.
>>>>>>
>>>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I agree with Ryan that it takes some precautions before one can
>>>>>>> assume uniqueness of UUID values, and that this shouldn't be anything
>>>>>>> special for UUIDs at all.
>>>>>>> After all, this is just a primitive type, which is commonly used for
>>>>>>> certain things, but "commonly" doesn't mean "always".
>>>>>>>
>>>>>>> The advantages of having a dedicated type are on 3 layers.
>>>>>>> The compact representation in the file, and compact representation
>>>>>>> in memory in the query engine are the ones mentioned above.
>>>>>>>
>>>>>>> The third layer is the usability. Seeing a UUID column I know what
>>>>>>> values I can expect, so it's more descriptive than `id char(36)`.
>>>>>>> This also means I can CREATE TABLE ... AS SELECT uuid(), ....
>>>>>>> without need for casting to varchar.
>>>>>>> It also removes the temptation of casting uuid to varbinary to achieve
>>>>>>> compact representation.
>>>>>>>
>>>>>>> Thus I think it would be good to have them.
>>>>>>>
>>>>>>> Best
>>>>>>> PF
>>>>>>>
>>>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> The original reason why I added UUID to the spec was that I thought
>>>>>>>> there would be opportunities to take advantage of UUIDs as unique
>>>>>>>> values and to optimize the use of UUIDs. I was thinking about
>>>>>>>> auto-increment ID fields and how we might do something similar in
>>>>>>>> Iceberg.
>>>>>>>>
>>>>>>>> The reason we have thought about removing UUID is that there aren't
>>>>>>>> as many opportunities to take advantage of UUIDs as I thought. My
>>>>>>>> original assumption was that we could do things like bucket on UUID
>>>>>>>> fields or assume that a UUID field has a high NDV. But that's not
>>>>>>>> necessarily the case when a UUID field is a foreign key, only when it
>>>>>>>> is used as an identifier or primary key. Before Jack added tracking
>>>>>>>> for row identifier fields, we couldn't know that a UUID was unique in
>>>>>>>> a table. As a result, we didn't invest in support for UUID.
>>>>>>>>
>>>>>>>> Quick aside: Now that row identifier fields are tracked, we can do
>>>>>>>> some of these things with the row identifier fields. Engines can assume
>>>>>>>> that the tuple of row identifier fields is unique in a table for join
>>>>>>>> estimation. And engines can use row identifier fields in sort keys to
>>>>>>>> ensure lots of partition split locations (this is really important for
>>>>>>>> Spark).
>>>>>>>>
>>>>>>>> Coming back to UUIDs, the second reason to have a UUID type is
>>>>>>>> still valid: it is better to represent UUIDs as fixed[16] than as
>>>>>>>> 36-byte UTF-8 strings that are more than twice as large, or even worse
>>>>>>>> UTF-16 strings that are 4x as large. Since UUIDs are likely to be used
>>>>>>>> in joins, this could really help engines as long as they can keep the
>>>>>>>> values as fixed-width binary.
>>>>>>>>
>>>>>>>> I could go either way on this. I think it is valuable to have a
>>>>>>>> compact representation for UUIDs rather than using the string
>>>>>>>> representation. But that will require investing in the type and
>>>>>>>> building support in engines that won't take advantage of it. If Trino
>>>>>>>> can use this, I think it may be worth keeping and investing in.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Yes I agree with Jacques that fixed binary is what it is in the
>>>>>>>>> end. I think it is more about user experience, whether the conversion
>>>>>>>>> is done at the user side or Iceberg and engine side. Many people just
>>>>>>>>> store UUID as a 36-byte string instead of a 16-byte binary, so with an
>>>>>>>>> explicit UUID type, Iceberg can optimize this common use case
>>>>>>>>> internally for users.
>>>>>>>>> There might be some other benefits I overlooked, but maybe the
>>>>>>>>> complication introduced by this type does not really justify the
>>>>>>>>> slightly better user experience. I am also on the fence about it.
>>>>>>>>>
>>>>>>>>> -Jack Ye
>>>>>>>>>
>>>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <jacquesnad...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> What specific arguments are there for it being a first class type
>>>>>>>>>> besides it is elsewhere?
>>>>>>>>>> Is there some kind of optimization Iceberg or an engine could do if
>>>>>>>>>> it was typed versus just a bucket of bits? Fixed width binary seems
>>>>>>>>>> to cover the cases I see in terms of actual functionality in the
>>>>>>>>>> Iceberg libraries or engines…
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yyany...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> One conversation I used to come across regarding UUID
>>>>>>>>>>> deprecation was from https://github.com/apache/iceberg/pull/1611
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yan
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Joshua,
>>>>>>>>>>>>
>>>>>>>>>>>> I do not have a strong preference about the UUID type, but I
>>>>>>>>>>>> would like to highlight that the type is handled inconsistently
>>>>>>>>>>>> in Iceberg with different file formats. (See:
>>>>>>>>>>>> https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>>>
>>>>>>>>>>>> If we keep the type, it would be good to standardize the
>>>>>>>>>>>> handling in every file format.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <joshthow...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi.
>>>>>>>>>>>>>
>>>>>>>>>>>>> UUID is a current data type according to the Iceberg spec (
>>>>>>>>>>>>> https://iceberg.apache.org/spec/#primitive-types), but there
>>>>>>>>>>>>> seems to have been some discussion about removing it? I could not
>>>>>>>>>>>>> find the original discussion, but a reference to the discussion
>>>>>>>>>>>>> can be found here (https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I generally agree with the consensus in the Trino issue to
>>>>>>>>>>>>> keep UUID in Iceberg.
>>>>>>>>>>>>> To summarize…
>>>>>>>>>>>>>
>>>>>>>>>>>>> - It makes sense to keep the type now that row identifiers are
>>>>>>>>>>>>>   supported
>>>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>
> --
> Josh Howard

--
Ryan Blue
Tabular
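
For reference, the size figures quoted in this thread (16-byte fixed vs. 36-byte string, and the 125% / 275% "bloat" numbers) can be reproduced with a short Python sketch. This is only an illustration using the standard-library `uuid` module; the helper names `uuid_sizes` and `bloat_percent` and the sample UUID are made up for this example and are not part of Iceberg or any engine.

```python
import uuid


def uuid_sizes(value: uuid.UUID) -> dict:
    """Byte size of each UUID representation mentioned in the thread."""
    text = str(value)  # canonical 36-character hyphenated form
    return {
        "fixed16": len(value.bytes),                    # raw 128-bit value
        "utf8_string": len(text.encode("utf-8")),       # 1 byte per character
        "utf16_string": len(text.encode("utf-16-le")),  # 2 bytes per character
    }


def bloat_percent(string_bytes: int, binary_bytes: int) -> float:
    """Extra size of the string form relative to the binary form, in percent."""
    return (string_bytes / binary_bytes - 1) * 100


# Hypothetical sample value, chosen only for illustration.
value = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")
sizes = uuid_sizes(value)

uuid_bloat = bloat_percent(sizes["utf8_string"], sizes["fixed16"])
# Longest dotted-quad IPv4 string (15 characters) vs. its 4-byte binary form.
ipv4_bloat = bloat_percent(len("255.255.255.255"), 4)

# The 16-byte form converts back to the same UUID, so nothing is lost.
round_trip = uuid.UUID(bytes=value.bytes)

print(sizes, uuid_bloat, ipv4_bloat, round_trip == value)
```

Under these assumptions the three sizes come out as 16, 36 and 72 bytes, and the two ratios match the 125% and 275% figures from Jacques's aside.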