I already added it to Substrait because of Iceberg lazy consensus :D
On Fri, Sep 17, 2021 at 2:05 PM Ryan Blue <b...@tabular.io> wrote:

> Let's move forward with it. I'm not hearing much dissent after saying the general trend is to keep UUID. So let's call it lazy consensus.
>
> Ryan
>
> On Fri, Sep 17, 2021 at 1:32 PM Piotr Findeisen <pi...@starburstdata.com> wrote:
>
>> Hi Ryan,
>>
>> Please advise whatever feels more appropriate from your perspective. From my perspective, we could just go ahead and merge Trino Iceberg support for UUID, since this is just fulfilling the spec as it is defined today.
>>
>> Best
>> PF
>>
>> On Wed, Sep 15, 2021 at 10:17 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> I don't think we necessarily reached consensus, but I think the general trend toward the end was to keep support for UUID. Should we start a vote to validate consensus?
>>>
>>> On Wed, Sep 15, 2021 at 1:15 PM Joshua Howard <joshthow...@gmail.com> wrote:
>>>
>>>> Just following up on Piotr's message here.
>>>>
>>>> Have we converged? I think most people would assume that silence is a vote for the status quo.
>>>>
>>>> On Mon, Sep 13, 2021 at 7:30 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> It seems we converged here that UUID should remain included. I read this as a consensus reached, but it may be subjective. Did we objectively reach consensus on this?
>>>>>
>>>>> From the Iceberg project's perspective there isn't anything to do, as UUID already *is* part of the spec (https://iceberg.apache.org/spec/#schemas-and-data-types). The Trino Iceberg PR adding support for UUID, https://github.com/trinodb/trino/pull/8747, was pending merge while this conversation has been ongoing.
>>>>>
>>>>> Best,
>>>>> PF
>>>>>
>>>>> On Mon, Aug 2, 2021 at 6:22 AM Kyle B <kjbendick...@gmail.com> wrote:
>>>>>
>>>>>> Hi Ryan and all,
>>>>>>
>>>>>> That sounds like a reasonable reason to leave IP address types out.
>>>>>> In my experience, dedicated IP address types are mostly found in logging tools and other things for sysadmins / DevOps etc.
>>>>>>
>>>>>> When querying data with IP addresses, I’ve seen it done quite a lot (e.g. for security reasons), but the addresses are usually stored as strings or manipulated in a UDF. They’re not commonly supported types.
>>>>>>
>>>>>> I would also draw the line at UUID types.
>>>>>>
>>>>>> - Kyle Bendickson
>>>>>>
>>>>>> On Jul 30, 2021, at 3:15 PM, Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>> Jacques, you make some good points here. I think my argument about usability leading to performance issues is a stronger argument for engines than for Iceberg. Still, there are inefficiencies in Iceberg if someone chooses to use a string in an engine that doesn't have a UUID type.
>>>>>>
>>>>>> Another thing to consider is cross-engine support. If Iceberg removes UUID, then Trino would probably translate to fixed[16]. That results in a table that's difficult to query in other engines, where people would probably choose to store the data as a string. On the other hand, if Iceberg keeps the UUID type then integrations would simply translate to the UUID string representation before passing data to the other engines. While the engines would be using 36-byte values in join keys, the user experience issue is fixed and the data is more compact on disk and in Iceberg's bounds metadata.
>>>>>>
>>>>>> While having a UUID type in Iceberg can't really help engines that don't support UUID take advantage of the type at runtime, it does seem slightly better to have the UUID type in general since at least one engine supports it and it provides the expected user experience with a compact representation.
>>>>>>
>>>>>> IPv4 addresses are a good thing to think about as well, since most of the same arguments apply.
>>>>>> If we keep the UUID type, should we also add IPv4 or IPv6 types? I would probably draw the line at UUID because it helps in joins, which are an important operation. IPv4 representations aren't that big of an inconvenience unless you need to do IP manipulation, which is typically in a UDF and not the query engine. And you can always keep both representations in a table fairly inexpensively. Does this sound like a valid rationale for having UUID but not IP types?
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Thu, Jul 29, 2021 at 5:08 PM Jacques Nadeau <jacquesnad...@gmail.com> wrote:
>>>>>>
>>>>>>> It seems like Spark, Hive, Dremio and Impala all lack UUID as a native type. Which engines are you thinking of that have a native UUID type besides the Presto derivatives and support Iceberg?
>>>>>>>
>>>>>>> I agree that Trino should expose a UUID type on top of Iceberg tables. All the user experience things that you are describing as important (compact storage, friendly display, DDL, clean literals) are possible without it being a first-class type in Iceberg, using a Trino-specific property.
>>>>>>>
>>>>>>> I don't really have a strong opinion about UUID. In general, type bloat is probably just a part of this kind of project. Generally, CHAR(X) and VARCHAR(X) feel like much bigger concerns given that they exist in all of the engines but not Iceberg--especially when we start talking about views.
>>>>>>>
>>>>>>> Some of this argues for physical vs logical type abstraction. (Something that was always challenging in Parquet but also helped to resolve how these types are managed in engines that don't support them.)
>>>>>>>
>>>>>>> thanks,
>>>>>>> Jacques
>>>>>>>
>>>>>>> PS: Funny aside, the bloat on an IP address is actually worse than a UUID, right? IPv4 = 4 bytes. IPv4 String = 15 bytes....
>>>>>>> 15/4 => 275% bloat. UUID 36/16 => 125% bloat.
>>>>>>>
>>>>>>> On Thu, Jul 29, 2021 at 4:39 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> I don't think this is just a problem in Trino.
>>>>>>>>
>>>>>>>> If there is no UUID type, then a user must choose between a 36-byte string and a 16-byte binary. That's not a good choice to force people into. If someone chooses binary, then it's harder to work with rows and construct queries even though there is a standard representation for UUIDs. To avoid the user headache, people will probably choose to store values as strings. Using a string would mean that more than half the value is needlessly discarded by default in Iceberg lower/upper bounds instead of keeping the entire value. And since engines don't know what's in the string, the full value must be used in comparison, which is extra work and extra space.
>>>>>>>>
>>>>>>>> Inflated values may not be a problem in some cases. IPv4 addresses are one case where you could argue that it doesn't matter very much that they are typically stored as strings. But I expect the use of UUIDs to be common for ID columns because you can generate them without coordination (unlike an incrementing ID) and that's a concern because the use as an ID makes them likely to be join keys.
>>>>>>>>
>>>>>>>> If we want the values to be stored as 16-byte fixed, then we need to make it easy to get the expected string representation in and out, just like we do with date/time types. I don't think that's specific to any engine.
>>>>>>>>
>>>>>>>> On Thu, Jul 29, 2021 at 9:00 AM Jacques Nadeau <jacquesnad...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I think points 1&2 don't really apply since a fixed width binary already covers those properties.
>>>>>>>>>
>>>>>>>>> It seems like this isn't really a concern of Iceberg but rather a cosmetic layer that exists primarily (only?) in Trino. In that case I would be inclined to say that Trino should just use custom metadata and a fixed binary type. That way you still have the desired UX without exposing those extra concepts to Iceberg. It actually feels like better encapsulation imo.
>>>>>>>>>
>>>>>>>>> On Thu, Jul 29, 2021, 3:00 AM Piotr Findeisen <pi...@starburstdata.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I agree with Ryan that it takes some precautions before one can assume uniqueness of UUID values, and that this shouldn't be anything special to UUIDs at all. After all, this is just a primitive type, which is commonly used for certain things, but "commonly" doesn't mean "always".
>>>>>>>>>>
>>>>>>>>>> The advantages of having a dedicated type are on 3 layers. The compact representation in the file and the compact representation in memory in the query engine are the ones mentioned above.
>>>>>>>>>>
>>>>>>>>>> The third layer is usability. Seeing a UUID column, I know what values I can expect, so it's more descriptive than `id char(36)`. This also means I can CREATE TABLE ... AS SELECT uuid(), .... without the need for casting to varchar. It also removes the temptation of casting uuid to varbinary to achieve a compact representation.
>>>>>>>>>>
>>>>>>>>>> Thus I think it would be good to have them.
>>>>>>>>>>
>>>>>>>>>> Best
>>>>>>>>>> PF
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 28, 2021 at 5:57 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> The original reason why I added UUID to the spec was that I thought there would be opportunities to take advantage of UUIDs as unique values and to optimize the use of UUIDs. I was thinking about auto-increment ID fields and how we might do something similar in Iceberg.
>>>>>>>>>>>
>>>>>>>>>>> The reason we have thought about removing UUID is that there aren't as many opportunities to take advantage of UUIDs as I thought. My original assumption was that we could do things like bucket on UUID fields or assume that a UUID field has a high NDV. But that's not necessarily the case when a UUID field is a foreign key, only when it is used as an identifier or primary key. Before Jack added tracking for row identifier fields, we couldn't know that a UUID was unique in a table. As a result, we didn't invest in support for UUID.
>>>>>>>>>>>
>>>>>>>>>>> Quick aside: Now that row identifier fields are tracked, we can do some of these things with the row identifier fields. Engines can assume that the tuple of row identifier fields is unique in a table for join estimation. And engines can use row identifier fields in sort keys to ensure lots of partition split locations (this is really important for Spark).
>>>>>>>>>>>
>>>>>>>>>>> Coming back to UUIDs, the second reason to have a UUID type is still valid: it is better to represent UUIDs as fixed[16] than as 36-byte UTF-8 strings that are more than twice as large, or even worse UTF-16 strings that are 4x as large. Since UUIDs are likely to be used in joins, this could really help engines as long as they can keep the values as fixed-width binary.
>>>>>>>>>>>
>>>>>>>>>>> I could go either way on this. I think it is valuable to have a compact representation for UUIDs rather than using the string representation. But that will require investing in the type and building support in engines that won't take advantage of it. If Trino can use this, I think it may be worth keeping and investing in.
>>>>>>>>>>>
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 27, 2021 at 9:54 PM Jack Ye <yezhao...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, I agree with Jacques that fixed binary is what it is in the end. I think it is more about user experience, whether the conversion is done at the user side or the Iceberg and engine side. Many people just store UUID as a 36-byte string instead of a 16-byte binary, so with an explicit UUID type, Iceberg can optimize this common use case internally for users. There might be some other benefits I overlooked, but maybe the complication introduced by this type does not really justify the slightly better user experience. I am also on the fence about it.
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack Ye
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 27, 2021 at 7:54 PM Jacques Nadeau <jacquesnad...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> What specific arguments are there for it being a first-class type, besides it being available elsewhere? Is there some kind of optimization Iceberg or an engine could do if it was typed versus just a bucket of bits? Fixed width binary seems to cover the cases I see in terms of actual functionality in the Iceberg libraries or engines…
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 27, 2021 at 6:54 PM Yan Yan <yyany...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> One conversation I came across regarding UUID deprecation is in https://github.com/apache/iceberg/pull/1611
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Yan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 27, 2021 at 1:07 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Joshua,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I do not have a strong preference about the UUID type, but I would like to highlight that the type is handled inconsistently in Iceberg with different file formats. (See: https://github.com/apache/iceberg/issues/1881)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If we keep the type, it would be good to standardize the handling in every file format.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, 27 Jul 2021, 17:08 Joshua Howard, <joshthow...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> UUID is a current data type according to the Iceberg spec (https://iceberg.apache.org/spec/#primitive-types), but there seems to have been some discussion about removing it? I could not find the original discussion, but a reference to the discussion can be found here (https://github.com/trinodb/trino/issues/6663).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I generally agree with the consensus in the Trino issue to keep UUID in Iceberg. To summarize…
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - It makes sense to keep the type now that row identifiers are supported
>>>>>>>>>>>>>>>> - Some engines (Trino) have support for the UUID type
>>>>>>>>>>>>>>>> - Engines w/o support for UUID type can determine how to map
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Does anyone want to remove the type? If so, why?
>>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Tabular
>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular
>>>>>>
>>>>
>>>> --
>>>> Josh Howard
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>
> --
> Ryan Blue
> Tabular
>
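
For anyone checking the size arguments quoted above, here is a minimal Python sketch. It is not from the thread and uses only the standard library `uuid` and `ipaddress` modules. The helper name `bloat_percent` and the sample UUID value are made up for illustration. It compares the 16-byte binary form of a UUID with its 36-character string form and redoes the "bloat" arithmetic from the PS.

```python
import ipaddress
import uuid


def bloat_percent(text_size: int, binary_size: int) -> float:
    """Extra space the text form takes relative to the binary form, in percent.

    Hypothetical helper for illustration; it mirrors the 36/16 and 15/4
    arithmetic quoted in the thread.
    """
    return (text_size / binary_size - 1) * 100


# An arbitrary sample value; every UUID has the same sizes.
u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")

as_fixed16 = u.bytes                   # 16 bytes, what a fixed[16] would hold
as_utf8 = str(u).encode("utf-8")       # canonical 36-character string
as_utf16 = str(u).encode("utf-16-le")  # 2 bytes per character, no BOM

print(len(as_fixed16), len(as_utf8), len(as_utf16))
print(bloat_percent(len(as_utf8), len(as_fixed16)))

# The string form can always be recovered from the 16-byte form.
assert uuid.UUID(bytes=as_fixed16) == u
assert str(uuid.UUID(bytes=as_fixed16)) == str(u)

# Longest possible dotted-quad IPv4 string versus its 4-byte packed form.
ip = ipaddress.IPv4Address("255.255.255.255")
print(len(ip.packed), len(str(ip)))
print(bloat_percent(len(str(ip)), len(ip.packed)))
```

The 15-byte IPv4 figure is the worst case; shorter addresses such as `10.0.0.1` take fewer bytes as strings, so the overhead for IP addresses varies, while for UUIDs it is constant.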