Hi Joana,

Here are my thoughts, which are by no means a definitive answer.

> 1. Given that variant can store any data type (both structured and
> primitive), I'm unclear when unknown would be preferred as similar
> behavior could be achieved by adding nullable variant columns? It seems
> like variant could handle most schema evolution scenarios. Are there
> specific situations where unknown is the better choice?

I think the point of the type is to not impose on a system the need to use a nullable variant column if it can't infer the type. The variant type has more overhead and can't easily be narrowed to other types solely through a metadata operation (whereas a NullType can easily be widened to any type as a metadata operation).

The null type is generally meant for moving from schema-less systems to ones with a schema, for example a CSV file that has an empty value for every field in a particular column. I think Parquet's description of its analogous type [1] is a good illustration:

"Sometimes when discovering the schema of existing data, values are always null and the physical type can't be determined. This annotation signals the case where the physical type was guessed from all null values."

That being said, I don't think it is necessarily a bad idea if a system wants to use nullable variants for this use case.

> 2. Also, is unknown intended for explicit use in DDL? Meaning, should users
> write DDL like:

In general, I don't think there is much of a use case for allowing users to set this through DDL, other than perhaps cloning it from an existing table. As you pointed out, someone wishing to keep their options open is likely better off using variant, or a type that can be widened later.

There are probably multiple ways of handling evolution, but here are two workable alternatives (I don't think these belong in the Iceberg spec):

1. Automatically evolve the schema based on the first inserted non-null value for the column.
2. Block insertions that try to insert non-null values into the column until the user explicitly alters the column to a specific type.

Cheers,
Micah

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L330

On Tue, Nov 18, 2025 at 4:45 AM Joana Hrotkó <[email protected]> wrote:

> Hi Iceberg Community,
>
> I'm working with Iceberg v3 and trying to understand the practical use
> cases for the unknown type, especially in relation to the variant type.
>
> The variant type handles both semi-structured data (JSON, nested
> objects/arrays) and primitive types (strings, integers, booleans, dates,
> timestamps, etc.) with efficient binary encoding. It supports schema
> evolution and provides good query performance.
>
> The unknown type is described as being for "evolving schemas without
> forcing immediate resolution" and must always default to null.
>
> 1. Given that variant can store any data type (both structured and
> primitive), I'm unclear when unknown would be preferred as similar
> behavior could be achieved by adding nullable variant columns? It seems
> like variant could handle most schema evolution scenarios. Are there
> specific situations where unknown is the better choice?
>
> 2. Also, is unknown intended for explicit use in DDL? Meaning, should
> users write DDL like:
>
> CREATE TABLE foo (col1 unknown)
> ALTER TABLE foo ADD COLUMN col2 unknown
>
> Or is unknown an internal type that engines use automatically during
> schema evolution?
>
> Cheers,
>
> Joana Hrotkó
