On Thu, Aug 24, 2023 at 5:28 PM Ryan Blue <b...@tabular.io> wrote: > I think it's a good idea to start adding timestamp types with nanosecond > precision. I've heard this a few times lately and having a column of nanos > as a work-around isn't a great solution. I'm much more skeptical that we > should allow millis though. That just introduces more for engines to > implement and there isn't significant value compared to using micros. That > said, it isn't that much larger of a change to support another so if there > is strong support I probably wouldn't oppose it. >
InfluxDB needs nanosecond timestamp; we don't need millis, and we don't need types time or timestamptz. However, it seemed not much effort to fill the blanks that others might need later. A funny wrinkle I learned: Parquet stores TIME(MILLIS) as int32, and all other TIME and TIMESTAMP values as int64. > For primary keys, I agree with Renjie. The standard NULL != NULL behavior > makes it difficult to have a key that contains NULL and to have > expectations of uniform behavior across engines. I don't see how the > work-arounds are relevant to whether optional fields are allowed in a key. > Most big data engines don't enforce primary key constraints because a > uniqueness guarantee would be expensive and confusing (e.g. why is INSERT > running a self-join?), regardless of whether the values can be optional. I > don't think it's because they'd need to choose whether to use SQL 3-valued > boolean logic or implicit null-safe equality. > I simply meant to show that there isn't much of a standard in how primary keys are handled in big data engines. I'll work with my team to consider work-arounds. On Thu, Aug 24, 2023 at 9:29 AM Jacob Marble <jacobmar...@influxdata.com> > wrote: > >> wrt 2) Agreed, NULL != NULL is standard. The human interpretation is NULL >> = "unknown". However, exceptions are not uncommon. >> >> - MySQL docs state "The NULL value means “no data.” >> <https://dev.mysql.com/doc/refman/8.1/en/null-values.html> ". >> - SQL Server accommodates (NULL = NULL) == TRUE via SET ANSI_NULLS OFF >> <https://learn.microsoft.com/en-us/sql/t-sql/statements/set-ansi-nulls-transact-sql?view=sql-server-ver16> >> . >> - Snowflake's external table Iceberg feature does not complain when I >> have created schemas with optional identifier fields. This makes sense >> because Snowflake doesn't enforce primary key constraints >> <https://docs.snowflake.com/en/sql-reference/constraints-overview#supported-constraint-types>, >> even though it enforces NOT NULL constraints. >> - Databricks doesn't enforce primary key constraints >> <https://docs.databricks.com/en/tables/constraints.html#declare-primary-key-and-foreign-key-relationships> >> . >> - InfluxDB allows tags (identifier columns) to be NULL or "missing". (I >> am an employee of InfluxData.) >> >> I might argue that, with analytical / data warehouse use cases, >> identifying columns do not have a common interpretation. Indeed, a >> query/compute engine may have its own reasons for handling primary >> >> On Wed, Aug 23, 2023 at 7:57 PM Renjie Liu <liurenjie2...@gmail.com> >> wrote: >> >>> +1 for 1). >>> >>> >>> >>> For 2), I don’t think allowing optional field in identifier field would >>> be a good idea. If I understand correctly, identifier fields is quite >>> similar to primary key in relation database. In standard sql standard, NULL >>> != NULL. If optional field is allowed, then two rows (1, NULL), (1, NULL) >>> have exactly same value while they are not equal. The reason why float, >>> double can’t be contained in primary key is similar. >>> >>> >>> >>> *From: *Jacob Marble <jacobmar...@influxdata.com> >>> *Date: *Thursday, August 24, 2023 at 04:18 >>> *To: *dev@iceberg.apache.org <dev@iceberg.apache.org> >>> *Subject: *two proposed spec changes >>> >>> Good afternoon, >>> >>> >>> >>> I would like to propose two changes to the Iceberg spec: >>> >>> >>> >>> 1) *Primitive types time, timestamp, timestamptz gain property >>> "precision",* with three possible values: millis, micros, nanos >>> (borrowing the list from Parquet >>> <https://github.com/apache/parquet-format/blob/apache-parquet-format-2.9.0/LogicalTypes.md#timestamp>). >>> The stringified type names would be extended to time[nanos], >>> timestamp[millis], timestamptz[micros], allowing for easy fallback to >>> micros whenever the suffix is not present. >>> >>> >>> >>> For this proposal, here is a diff >>> <https://github.com/apache/iceberg/compare/master...jacobmarble:apache-iceberg:jgm-time-units> >>> demonstrating the idea just a bit. >>> >>> >>> >>> 2) *Identifier fields allowed to be optional.* From the spec "it is the >>> responsibility of processing engines or data providers to enforce" which >>> means that any such provider could limit the use of optional identifiers, >>> just as they may limit particular data types or file formats. >>> >>> To be clear, the spec currently reads "Float, double, and optional >>> fields cannot be used as identifier fields and a nested field cannot be >>> used as an identifier field if it is nested in an optional struct, to avoid >>> null values in identifiers." and I propose "Float and double fields cannot >>> be used as identifier fields." >>> >>> >>> >>> - What do people think of these two proposed changes? >>> >>> - What can I do next? >>> >>> >>> >>> The spec mentions v3 >>> <https://github.com/apache/iceberg/blob/9df8ddb05428cf3d7145bc5cf4a130de36dbb96a/format/spec.md#version-3>; >>> is there a plan for a v3 release yet? I saw a conversation about enabling >>> v2 by default, so I assume v3 is a ways off yet. >>> >>> -- >>> >>> Jacob Marble >>> >>> 🇺🇸 🇺🇦 >>> >> >> >> -- >> Jacob Marble >> 🇺🇸 🇺🇦 >> > > > -- > Ryan Blue > Tabular > -- Jacob Marble 🇺🇸 🇺🇦