I think it's a good idea to start adding timestamp types with nanosecond precision. I've heard this a few times lately and having a column of nanos as a work-around isn't a great solution. I'm much more skeptical that we should allow millis though. That just introduces more for engines to implement and there isn't significant value compared to using micros. That said, it isn't that much larger of a change to support another so if there is strong support I probably wouldn't oppose it.
For primary keys, I agree with Renjie. The standard NULL != NULL behavior makes it difficult to have a key that contains NULL and to have expectations of uniform behavior across engines. I don't see how the work-arounds are relevant to whether optional fields are allowed in a key. Most big data engines don't enforce primary key constraints because a uniqueness guarantee would be expensive and confusing (e.g. why is INSERT running a self-join?), regardless of whether the values can be optional. I don't think it's because they'd need to choose whether to use SQL 3-valued boolean logic or implicit null-safe equality. On Thu, Aug 24, 2023 at 9:29 AM Jacob Marble <jacobmar...@influxdata.com> wrote: > wrt 2) Agreed, NULL != NULL is standard. The human interpretation is NULL > = "unknown". However, exceptions are not uncommon. > > - MySQL docs state "The NULL value means “no data.” > <https://dev.mysql.com/doc/refman/8.1/en/null-values.html> ". > - SQL Server accommodates (NULL = NULL) == TRUE via SET ANSI_NULLS OFF > <https://learn.microsoft.com/en-us/sql/t-sql/statements/set-ansi-nulls-transact-sql?view=sql-server-ver16> > . > - Snowflake's external table Iceberg feature does not complain when I have > created schemas with optional identifier fields. This makes sense because > Snowflake > doesn't enforce primary key constraints > <https://docs.snowflake.com/en/sql-reference/constraints-overview#supported-constraint-types>, > even though it enforces NOT NULL constraints. > - Databricks doesn't enforce primary key constraints > <https://docs.databricks.com/en/tables/constraints.html#declare-primary-key-and-foreign-key-relationships> > . > - InfluxDB allows tags (identifier columns) to be NULL or "missing". (I am > an employee of InfluxData.) > > I might argue that, with analytical / data warehouse use cases, > identifying columns do not have a common interpretation. Indeed, a > query/compute engine may have its own reasons for handling primary > > On Wed, Aug 23, 2023 at 7:57 PM Renjie Liu <liurenjie2...@gmail.com> > wrote: > >> +1 for 1). >> >> >> >> For 2), I don’t think allowing optional field in identifier field would >> be a good idea. If I understand correctly, identifier fields is quite >> similar to primary key in relation database. In standard sql standard, NULL >> != NULL. If optional field is allowed, then two rows (1, NULL), (1, NULL) >> have exactly same value while they are not equal. The reason why float, >> double can’t be contained in primary key is similar. >> >> >> >> *From: *Jacob Marble <jacobmar...@influxdata.com> >> *Date: *Thursday, August 24, 2023 at 04:18 >> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org> >> *Subject: *two proposed spec changes >> >> Good afternoon, >> >> >> >> I would like to propose two changes to the Iceberg spec: >> >> >> >> 1) *Primitive types time, timestamp, timestamptz gain property >> "precision",* with three possible values: millis, micros, nanos >> (borrowing the list from Parquet >> <https://github.com/apache/parquet-format/blob/apache-parquet-format-2.9.0/LogicalTypes.md#timestamp>). >> The stringified type names would be extended to time[nanos], >> timestamp[millis], timestamptz[micros], allowing for easy fallback to >> micros whenever the suffix is not present. >> >> >> >> For this proposal, here is a diff >> <https://github.com/apache/iceberg/compare/master...jacobmarble:apache-iceberg:jgm-time-units> >> demonstrating the idea just a bit. >> >> >> >> 2) *Identifier fields allowed to be optional.* From the spec "it is the >> responsibility of processing engines or data providers to enforce" which >> means that any such provider could limit the use of optional identifiers, >> just as they may limit particular data types or file formats. >> >> To be clear, the spec currently reads "Float, double, and optional fields >> cannot be used as identifier fields and a nested field cannot be used as an >> identifier field if it is nested in an optional struct, to avoid null >> values in identifiers." and I propose "Float and double fields cannot be >> used as identifier fields." >> >> >> >> - What do people think of these two proposed changes? >> >> - What can I do next? >> >> >> >> The spec mentions v3 >> <https://github.com/apache/iceberg/blob/9df8ddb05428cf3d7145bc5cf4a130de36dbb96a/format/spec.md#version-3>; >> is there a plan for a v3 release yet? I saw a conversation about enabling >> v2 by default, so I assume v3 is a ways off yet. >> >> -- >> >> Jacob Marble >> >> 🇺🇸 🇺🇦 >> > > > -- > Jacob Marble > 🇺🇸 🇺🇦 > -- Ryan Blue Tabular