Re: two proposed spec changes

Jacob Marble Thu, 24 Aug 2023 09:29:54 -0700

Regarding 2): agreed, NULL != NULL is standard SQL behavior, and the usual
human interpretation is NULL = "unknown". However, exceptions are not uncommon.


- MySQL docs state "The NULL value means “no data.”"
  <https://dev.mysql.com/doc/refman/8.1/en/null-values.html>
- SQL Server accommodates (NULL = NULL) == TRUE via SET ANSI_NULLS OFF
  <https://learn.microsoft.com/en-us/sql/t-sql/statements/set-ansi-nulls-transact-sql?view=sql-server-ver16>.
- Snowflake's external table Iceberg feature does not complain when I
  create schemas with optional identifier fields. This makes sense because
  Snowflake doesn't enforce primary key constraints
  <https://docs.snowflake.com/en/sql-reference/constraints-overview#supported-constraint-types>,
  even though it enforces NOT NULL constraints.
- Databricks doesn't enforce primary key constraints
  <https://docs.databricks.com/en/tables/constraints.html#declare-primary-key-and-foreign-key-relationships>.
- InfluxDB allows tags (identifier columns) to be NULL or "missing". (I am
  an employee of InfluxData.)
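
A minimal sketch of the standard behavior, using SQLite through Python's
sqlite3 module. SQLite is my choice for illustration only (it is not one of
the engines listed above), and the function name is hypothetical. Comparing
NULL with = or <> yields NULL rather than TRUE or FALSE, while IS NULL
yields 1:

```python
import sqlite3


def null_comparisons():
    """Return how SQLite evaluates a few NULL comparisons.

    Hypothetical helper for illustration; other engines (or engine
    settings such as SQL Server's ANSI_NULLS) may behave differently.
    """
    conn = sqlite3.connect(":memory:")
    try:
        row = conn.execute(
            "SELECT NULL = NULL, NULL <> NULL, NULL IS NULL"
        ).fetchone()
    finally:
        conn.close()
    # SQL NULL comes back as Python None; SQLite booleans come back as 0/1.
    return {"eq": row[0], "ne": row[1], "is_null": row[2]}


result = null_comparisons()
print(result)  # {'eq': None, 'ne': None, 'is_null': 1}
```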

I might argue that, with analytical / data warehouse use cases, identifying
columns do not have a common interpretation. Indeed, a query/compute engine
may have its own reasons for handling primary

On Wed, Aug 23, 2023 at 7:57 PM Renjie Liu <liurenjie2...@gmail.com> wrote:

> +1 for 1).
>
>
>
> For 2), I don’t think allowing optional fields as identifier fields would be
> a good idea. If I understand correctly, identifier fields are quite similar
> to a primary key in a relational database. In standard SQL, NULL !=
> NULL. If optional fields are allowed, then two rows (1, NULL), (1, NULL) have
> exactly the same values while they are not equal. The reason why float and
> double can’t be contained in a primary key is similar.
>
>
>
> *From: *Jacob Marble <jacobmar...@influxdata.com>
> *Date: *Thursday, August 24, 2023 at 04:18
> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
> *Subject: *two proposed spec changes
>
> Good afternoon,
>
>
>
> I would like to propose two changes to the Iceberg spec:
>
>
>
> 1) *Primitive types time, timestamp, timestamptz gain property
> "precision",* with three possible values: millis, micros, nanos
> (borrowing the list from Parquet
> <https://github.com/apache/parquet-format/blob/apache-parquet-format-2.9.0/LogicalTypes.md#timestamp>).
> The stringified type names would be extended to time[nanos],
> timestamp[millis], timestamptz[micros], allowing for easy fallback to
> micros whenever the suffix is not present.
>
>
>
> For this proposal, here is a diff
> <https://github.com/apache/iceberg/compare/master...jacobmarble:apache-iceberg:jgm-time-units>
> demonstrating the idea just a bit.
>
>
>
> 2) *Identifier fields allowed to be optional.* From the spec "it is the
> responsibility of processing engines or data providers to enforce" which
> means that any such provider could limit the use of optional identifiers,
> just as they may limit particular data types or file formats.
>
> To be clear, the spec currently reads "Float, double, and optional fields
> cannot be used as identifier fields and a nested field cannot be used as an
> identifier field if it is nested in an optional struct, to avoid null
> values in identifiers." and I propose "Float and double fields cannot be
> used as identifier fields."
>
>
>
> - What do people think of these two proposed changes?
>
> - What can I do next?
>
>
>
> The spec mentions v3
> <https://github.com/apache/iceberg/blob/9df8ddb05428cf3d7145bc5cf4a130de36dbb96a/format/spec.md#version-3>;
> is there a plan for a v3 release yet? I saw a conversation about enabling
> v2 by default, so I assume v3 is a ways off yet.
>
> --
>
> Jacob Marble
>
> 🇺🇸 🇺🇦
>
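
For proposal 1) quoted above, the fallback to micros could be sketched as
follows. This is only an illustration: parse_time_type and the regular
expression are hypothetical and are not taken from the Iceberg codebase or
the linked diff.

```python
import re

# Hypothetical pattern: a base type name with an optional [precision] suffix.
_TIME_TYPE = re.compile(
    r"(time|timestamp|timestamptz)(?:\[(millis|micros|nanos)\])?"
)


def parse_time_type(name):
    """Split a stringified type name into (base type, precision).

    When the suffix is absent, precision falls back to "micros",
    as described in the proposal.
    """
    match = _TIME_TYPE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognized time type: {name!r}")
    base, precision = match.groups()
    return base, precision or "micros"


print(parse_time_type("time[nanos]"))  # ('time', 'nanos')
print(parse_time_type("timestamptz"))  # ('timestamptz', 'micros')
```

Under this sketch, existing type strings without a suffix keep their current
meaning, so only readers that encounter a suffix need to understand it.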


-- 
Jacob Marble
🇺🇸 🇺🇦
