wrt 2) Agreed, NULL != NULL is standard. The human interpretation is NULL = "unknown". However, exceptions are not uncommon.
- MySQL docs state "The NULL value means “no data.” <https://dev.mysql.com/doc/refman/8.1/en/null-values.html> ". - SQL Server accommodates (NULL = NULL) == TRUE via SET ANSI_NULLS OFF <https://learn.microsoft.com/en-us/sql/t-sql/statements/set-ansi-nulls-transact-sql?view=sql-server-ver16> . - Snowflake's external table Iceberg feature does not complain when I have created schemas with optional identifier fields. This makes sense because Snowflake doesn't enforce primary key constraints <https://docs.snowflake.com/en/sql-reference/constraints-overview#supported-constraint-types>, even though it enforces NOT NULL constraints. - Databricks doesn't enforce primary key constraints <https://docs.databricks.com/en/tables/constraints.html#declare-primary-key-and-foreign-key-relationships> . - InfluxDB allows tags (identifier columns) to be NULL or "missing". (I am an employee of InfluxData.) I might argue that, with analytical / data warehouse use cases, identifying columns do not have a common interpretation. Indeed, a query/compute engine may have its own reasons for handling primary On Wed, Aug 23, 2023 at 7:57 PM Renjie Liu <liurenjie2...@gmail.com> wrote: > +1 for 1). > > > > For 2), I don’t think allowing optional field in identifier field would be > a good idea. If I understand correctly, identifier fields is quite similar > to primary key in relation database. In standard sql standard, NULL != > NULL. If optional field is allowed, then two rows (1, NULL), (1, NULL) have > exactly same value while they are not equal. The reason why float, double > can’t be contained in primary key is similar. > > > > *From: *Jacob Marble <jacobmar...@influxdata.com> > *Date: *Thursday, August 24, 2023 at 04:18 > *To: *dev@iceberg.apache.org <dev@iceberg.apache.org> > *Subject: *two proposed spec changes > > Good afternoon, > > > > I would like to propose two changes to the Iceberg spec: > > > > 1) *Primitive types time, timestamp, timestamptz gain property > "precision",* with three possible values: millis, micros, nanos > (borrowing the list from Parquet > <https://github.com/apache/parquet-format/blob/apache-parquet-format-2.9.0/LogicalTypes.md#timestamp>). > The stringified type names would be extended to time[nanos], > timestamp[millis], timestamptz[micros], allowing for easy fallback to > micros whenever the suffix is not present. > > > > For this proposal, here is a diff > <https://github.com/apache/iceberg/compare/master...jacobmarble:apache-iceberg:jgm-time-units> > demonstrating the idea just a bit. > > > > 2) *Identifier fields allowed to be optional.* From the spec "it is the > responsibility of processing engines or data providers to enforce" which > means that any such provider could limit the use of optional identifiers, > just as they may limit particular data types or file formats. > > To be clear, the spec currently reads "Float, double, and optional fields > cannot be used as identifier fields and a nested field cannot be used as an > identifier field if it is nested in an optional struct, to avoid null > values in identifiers." and I propose "Float and double fields cannot be > used as identifier fields." > > > > - What do people think of these two proposed changes? > > - What can I do next? > > > > The spec mentions v3 > <https://github.com/apache/iceberg/blob/9df8ddb05428cf3d7145bc5cf4a130de36dbb96a/format/spec.md#version-3>; > is there a plan for a v3 release yet? I saw a conversation about enabling > v2 by default, so I assume v3 is a ways off yet. > > -- > > Jacob Marble > > 🇺🇸 🇺🇦 > -- Jacob Marble 🇺🇸 🇺🇦