Re: two proposed spec changes

Jacob Marble Thu, 24 Aug 2023 17:40:42 -0700

On Thu, Aug 24, 2023 at 5:28 PM Ryan Blue <b...@tabular.io> wrote:

> I think it's a good idea to start adding timestamp types with nanosecond
> precision. I've heard this a few times lately and having a column of nanos
> as a work-around isn't a great solution. I'm much more skeptical that we
> should allow millis though. That just introduces more for engines to
> implement and there isn't significant value compared to using micros. That
> said, it isn't that much larger of a change to support another so if there
> is strong support I probably wouldn't oppose it.
>


InfluxDB needs nanosecond timestamp; we don't need millis, and we don't
need types time or timestamptz. However, it seemed not much effort to fill
the blanks that others might need later.

A funny wrinkle I learned: Parquet stores TIME(MILLIS) as int32, and all
other TIME and TIMESTAMP values as int64.


> For primary keys, I agree with Renjie. The standard NULL != NULL behavior
> makes it difficult to have a key that contains NULL and to have
> expectations of uniform behavior across engines. I don't see how the
> work-arounds are relevant to whether optional fields are allowed in a key.
> Most big data engines don't enforce primary key constraints because a
> uniqueness guarantee would be expensive and confusing (e.g. why is INSERT
> running a self-join?), regardless of whether the values can be optional. I
> don't think it's because they'd need to choose whether to use SQL 3-valued
> boolean logic or implicit null-safe equality.
>

I simply meant to show that there isn't much of a standard in how primary
keys are handled in big data engines. I'll work with my team to consider
work-arounds.

On Thu, Aug 24, 2023 at 9:29 AM Jacob Marble <jacobmar...@influxdata.com>
> wrote:
>
>> wrt 2) Agreed, NULL != NULL is standard. The human interpretation is NULL
>> = "unknown". However, exceptions are not uncommon.
>>
>> - MySQL docs state "The NULL value means “no data.”
>> <https://dev.mysql.com/doc/refman/8.1/en/null-values.html> ".
>> - SQL Server accommodates (NULL = NULL) == TRUE via SET ANSI_NULLS OFF
>> <https://learn.microsoft.com/en-us/sql/t-sql/statements/set-ansi-nulls-transact-sql?view=sql-server-ver16>
>> .
>> - Snowflake's external table Iceberg feature does not complain when I
>> have created schemas with optional identifier fields. This makes sense
>> because Snowflake doesn't enforce primary key constraints
>> <https://docs.snowflake.com/en/sql-reference/constraints-overview#supported-constraint-types>,
>> even though it enforces NOT NULL constraints.
>> - Databricks doesn't enforce primary key constraints
>> <https://docs.databricks.com/en/tables/constraints.html#declare-primary-key-and-foreign-key-relationships>
>> .
>> - InfluxDB allows tags (identifier columns) to be NULL or "missing". (I
>> am an employee of InfluxData.)
>>
>> I might argue that, with analytical / data warehouse use cases,
>> identifying columns do not have a common interpretation. Indeed, a
>> query/compute engine may have its own reasons for handling primary
>>
>> On Wed, Aug 23, 2023 at 7:57 PM Renjie Liu <liurenjie2...@gmail.com>
>> wrote:
>>
>>> +1 for 1).
>>>
>>>
>>>
>>> For 2), I don’t think allowing optional field in identifier field would
>>> be a good idea. If I understand correctly, identifier fields is quite
>>> similar to primary key in relation database. In standard sql standard, NULL
>>> != NULL. If optional field is allowed, then two rows (1, NULL), (1, NULL)
>>> have exactly same value while they are not equal. The reason why float,
>>> double can’t be contained in primary key is similar.
>>>
>>>
>>>
>>> *From: *Jacob Marble <jacobmar...@influxdata.com>
>>> *Date: *Thursday, August 24, 2023 at 04:18
>>> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org>
>>> *Subject: *two proposed spec changes
>>>
>>> Good afternoon,
>>>
>>>
>>>
>>> I would like to propose two changes to the Iceberg spec:
>>>
>>>
>>>
>>> 1) *Primitive types time, timestamp, timestamptz gain property
>>> "precision",* with three possible values: millis, micros, nanos
>>> (borrowing the list from Parquet
>>> <https://github.com/apache/parquet-format/blob/apache-parquet-format-2.9.0/LogicalTypes.md#timestamp>).
>>> The stringified type names would be extended to time[nanos],
>>> timestamp[millis], timestamptz[micros], allowing for easy fallback to
>>> micros whenever the suffix is not present.
>>>
>>>
>>>
>>> For this proposal, here is a diff
>>> <https://github.com/apache/iceberg/compare/master...jacobmarble:apache-iceberg:jgm-time-units>
>>> demonstrating the idea just a bit.
>>>
>>>
>>>
>>> 2) *Identifier fields allowed to be optional.* From the spec "it is the
>>> responsibility of processing engines or data providers to enforce" which
>>> means that any such provider could limit the use of optional identifiers,
>>> just as they may limit particular data types or file formats.
>>>
>>> To be clear, the spec currently reads "Float, double, and optional
>>> fields cannot be used as identifier fields and a nested field cannot be
>>> used as an identifier field if it is nested in an optional struct, to avoid
>>> null values in identifiers." and I propose "Float and double fields cannot
>>> be used as identifier fields."
>>>
>>>
>>>
>>> - What do people think of these two proposed changes?
>>>
>>> - What can I do next?
>>>
>>>
>>>
>>> The spec mentions v3
>>> <https://github.com/apache/iceberg/blob/9df8ddb05428cf3d7145bc5cf4a130de36dbb96a/format/spec.md#version-3>;
>>> is there a plan for a v3 release yet? I saw a conversation about enabling
>>> v2 by default, so I assume v3 is a ways off yet.
>>>
>>> --
>>>
>>> Jacob Marble
>>>
>>> 🇺🇸 🇺🇦
>>>
>>
>>
>> --
>> Jacob Marble
>> 🇺🇸 🇺🇦
>>
>
>
> --
> Ryan Blue
> Tabular
>


-- 
Jacob Marble
🇺🇸 🇺🇦

Re: two proposed spec changes

Reply via email to