Re: two proposed spec changes

Ryan Blue Thu, 24 Aug 2023 17:26:47 -0700

I think it's a good idea to start adding timestamp types with nanosecond
precision. I've heard this a few times lately and having a column of nanos
as a work-around isn't a great solution. I'm much more skeptical that we
should allow millis though. That just introduces more for engines to
implement and there isn't significant value compared to using micros. That
said, it isn't that much larger of a change to support another unit, so if
there is strong support I probably wouldn't oppose it.
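
To make the naming scheme from the proposal concrete, here is a minimal Python sketch of parsing the stringified type names (time[nanos], timestamp[millis], timestamptz[micros]) with a fallback to micros when the suffix is absent. The function name and regular expression are assumptions made for illustration; none of this is an existing Iceberg API.

```python
import re
from typing import Tuple

# Hypothetical pattern for the proposed names: a base type with an optional
# "[precision]" suffix. The three precisions are the ones listed in the proposal.
_TIME_TYPE = re.compile(r"(time|timestamp|timestamptz)(?:\[(millis|micros|nanos)\])?")


def parse_time_type(name: str) -> Tuple[str, str]:
    """Split a type name into (base type, precision), defaulting to micros."""
    match = _TIME_TYPE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognized time type: {name!r}")
    base, precision = match.groups()
    # No suffix means the existing microsecond behavior.
    return base, precision or "micros"


print(parse_time_type("timestamp"))
print(parse_time_type("time[nanos]"))
```

Because a missing suffix maps to micros, type strings written before such a change would keep their current meaning.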


For primary keys, I agree with Renjie. The standard NULL != NULL behavior
makes it difficult to have a key that contains NULL and to have
expectations of uniform behavior across engines. I don't see how the
work-arounds are relevant to whether optional fields are allowed in a key.
Most big data engines don't enforce primary key constraints because a
uniqueness guarantee would be expensive and confusing (e.g. why is INSERT
running a self-join?), regardless of whether the values can be optional. I
don't think it's because they'd need to choose whether to use SQL 3-valued
boolean logic or implicit null-safe equality.
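
As a rough illustration of the two options, here is a small Python sketch that compares the rows (1, NULL) and (1, NULL) under SQL 3-valued equality and under null-safe equality. None stands in for NULL, and the helper names are invented for this example; they are not taken from any engine.

```python
from typing import Any, Callable, Optional, Sequence


def sql_eq(a: Any, b: Any) -> Optional[bool]:
    """SQL 3-valued equality: a comparison involving NULL is unknown (None)."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a: Any, b: Any) -> bool:
    """Null-safe equality (like IS NOT DISTINCT FROM): NULL matches NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def keys_match(left: Sequence[Any], right: Sequence[Any],
               eq: Callable[[Any, Any], Optional[bool]]) -> bool:
    """Keys match only when every field comparison is definitely true."""
    return all(eq(a, b) is True for a, b in zip(left, right))


row_a = (1, None)
row_b = (1, None)
print(keys_match(row_a, row_b, sql_eq))        # False: NULL = NULL is unknown
print(keys_match(row_a, row_b, null_safe_eq))  # True: NULLs treated as equal
```

Under the first comparison the two rows never collide on the key, while under the second they do, which is the inconsistency across engines described above.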

On Thu, Aug 24, 2023 at 9:29 AM Jacob Marble <[email protected]>
wrote:

> wrt 2) Agreed, NULL != NULL is standard. The human interpretation is NULL
> = "unknown". However, exceptions are not uncommon.
>
> - MySQL docs state "The NULL value means “no data.”
> <https://dev.mysql.com/doc/refman/8.1/en/null-values.html> ".
> - SQL Server accommodates (NULL = NULL) == TRUE via SET ANSI_NULLS OFF
> <https://learn.microsoft.com/en-us/sql/t-sql/statements/set-ansi-nulls-transact-sql?view=sql-server-ver16>
> .
> - Snowflake's external table Iceberg feature does not complain when I have
> created schemas with optional identifier fields. This makes sense because
> Snowflake doesn't enforce primary key constraints
> <https://docs.snowflake.com/en/sql-reference/constraints-overview#supported-constraint-types>,
> even though it enforces NOT NULL constraints.
> - Databricks doesn't enforce primary key constraints
> <https://docs.databricks.com/en/tables/constraints.html#declare-primary-key-and-foreign-key-relationships>
> .
> - InfluxDB allows tags (identifier columns) to be NULL or "missing". (I am
> an employee of InfluxData.)
>
> I might argue that, with analytical / data warehouse use cases,
> identifying columns do not have a common interpretation. Indeed, a
> query/compute engine may have its own reasons for handling primary
>
> On Wed, Aug 23, 2023 at 7:57 PM Renjie Liu <[email protected]>
> wrote:
>
>> +1 for 1).
>>
>>
>>
>> For 2), I don’t think allowing optional fields as identifier fields would
>> be a good idea. If I understand correctly, identifier fields are quite
>> similar to primary keys in a relational database. In standard SQL, NULL
>> != NULL. If optional fields are allowed, then two rows (1, NULL), (1, NULL)
>> have exactly the same values yet are not equal. The reason why float and
>> double can’t be used in a primary key is similar.
>>
>>
>>
>> *From: *Jacob Marble <[email protected]>
>> *Date: *Thursday, August 24, 2023 at 04:18
>> *To: *[email protected] <[email protected]>
>> *Subject: *two proposed spec changes
>>
>> Good afternoon,
>>
>>
>>
>> I would like to propose two changes to the Iceberg spec:
>>
>>
>>
>> 1) *Primitive types time, timestamp, timestamptz gain property
>> "precision",* with three possible values: millis, micros, nanos
>> (borrowing the list from Parquet
>> <https://github.com/apache/parquet-format/blob/apache-parquet-format-2.9.0/LogicalTypes.md#timestamp>).
>> The stringified type names would be extended to time[nanos],
>> timestamp[millis], timestamptz[micros], allowing for easy fallback to
>> micros whenever the suffix is not present.
>>
>>
>>
>> For this proposal, here is a diff
>> <https://github.com/apache/iceberg/compare/master...jacobmarble:apache-iceberg:jgm-time-units>
>> demonstrating the idea just a bit.
>>
>>
>>
>> 2) *Identifier fields allowed to be optional.* From the spec "it is the
>> responsibility of processing engines or data providers to enforce" which
>> means that any such provider could limit the use of optional identifiers,
>> just as they may limit particular data types or file formats.
>>
>> To be clear, the spec currently reads "Float, double, and optional fields
>> cannot be used as identifier fields and a nested field cannot be used as an
>> identifier field if it is nested in an optional struct, to avoid null
>> values in identifiers." and I propose "Float and double fields cannot be
>> used as identifier fields."
>>
>>
>>
>> - What do people think of these two proposed changes?
>>
>> - What can I do next?
>>
>>
>>
>> The spec mentions v3
>> <https://github.com/apache/iceberg/blob/9df8ddb05428cf3d7145bc5cf4a130de36dbb96a/format/spec.md#version-3>;
>> is there a plan for a v3 release yet? I saw a conversation about enabling
>> v2 by default, so I assume v3 is a ways off yet.
>>
>> --
>>
>> Jacob Marble
>>
>> 🇺🇸 🇺🇦
>>
>
>
> --
> Jacob Marble
> 🇺🇸 🇺🇦
>


-- 
Ryan Blue
Tabular
