Thanks for updating the PR, Fokko! If we are okay with the approach of failing these unsupported cases, then I think we need to talk about when to fail. The PR currently changes what is allowed by Iceberg, but I think that Iceberg should allow these cases (which work with Avro) and should fail when converting the Iceberg schema to Parquet. That way if/when we add support in Parquet, we don't need to update what is allowed globally.
On Fri, Aug 8, 2025 at 6:14 AM Fokko Driesprong <fo...@apache.org> wrote: > Hey everyone, > > Thanks for jumping in here. > > To reiterate Bart, Having a UnknownType in a StructType, or a nested > Struct is not an issue AFAIK, except when the struct solely contains > Unknowns. > > Another scenario I would be concerned about is a table that contains only >> UnknownType top-level fields. What will happen in such a scenario? Will the >> Parquet format tolerate zero-column files? I think it's likely that even if >> Parquet-java supports this, it might be an untested and unsupported corner >> case in many of the alternative Parquet readers. > > > I don't think this is supported, and it will also lead to issues with > nullability, since you can't tell if the struct, or the nested UnknownType > is null. Also, Parquet-Java disallows *writing* empty structs > <https://github.com/apache/parquet-java/blob/6ac9b29a5c3eb879fd16cc79df52237edaa5e6c6/parquet-column/src/main/java/org/apache/parquet/schema/TypeUtil.java#L27> > . > > I've updated the PR <https://github.com/apache/iceberg/pull/13445> to > disallow for: > > - Having structs with solely UnknownType(s) > - Using UnknownType as the Element in a ListType > - Using the UnknownType as the Value in a MapType > > The logic in the current PR is needed to project the missing column, which > is needed anyway for the files that were written prior to when the > UnknownType was added. As Ryan suggested we can revisit this later on to > extend this. > > Kind regards, > Fokko Driesprong > > > > > > > > Op do 7 aug 2025 om 21:36 schreef Ryan Blue <rdb...@gmail.com>: > >> I think it's reasonable to fail in cases where the underlying format >> can't represent a type, like the element of a list. We can go back and fix >> this by adding support for using Parquet's UNKNOWN type annotation >> <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unknown-always-null>, >> but my original concern about it was that it introduces the need to choose >> an underlying physical type and I didn't want that to cause future >> problems. Maybe we should just standardize that as fixed[0]. >> >> On Mon, Jul 28, 2025 at 1:13 AM Bart Samwel <b...@databricks.com.invalid> >> wrote: >> >>> On Sat, Jul 26, 2025 at 6:09 PM Kevin Liu <kevinjq...@apache.org> wrote: >>> >>>> > My initial idea was to disallow the use of UnknownType as the >>>> element in ListType and not allow the UnknownType as either a Key or Value >>>> of a MapType. Any thoughts or concerns? >>>> >>>> That makes sense. I would also include `StructType` here too. >>>> `StructType` is another "complex type" (extends NestedType >>>> <https://github.com/apache/iceberg/blob/360f87326d4ccf67512a0240e529035801d9db2b/api/src/main/java/org/apache/iceberg/types/Types.java#L1001>) >>>> just like `ListType` and `MapType`. >>>> This will make `unknown` the first primitive type to not be allowed as >>>> part of another complex type. >>>> >>> >>> Do you mean to forbid `UnknownType` inside `StructType`? I'm afraid that >>> would undermine the orthogonality of the system. A common use of StructType >>> is to store entire rows. If StructType cannot contain elements that are >>> UnknownType but top-level rows can, then you can no longer store an >>> arbitrary top-level row inside a StructType. >>> >>> Unfortunately UnknownType in struct does have some issues. In >>> particular, if it's not stored, then IIUIC you can have issues with structs >>> containing only UnknownType fields -- they will look empty to Parquet, and >>> my understanding is that that isn't allowed. For orthogonality it would >>> have been better to actually store the unknown type, even if it's just as a >>> series of "this is NULL" bits. Omitting these fields in storage seems like >>> a convenient hack that leads to all sorts of surprising corner cases... >>> >>> >>> On Sat, Jul 26, 2025 at 5:43 AM Fokko Driesprong <fo...@apache.org> >>>> wrote: >>>> >>>>> Hi everyone, >>>>> >>>>> Recently I took a stab at implementing reading UknownType >>>>> <https://github.com/apache/iceberg/pull/13445> in the Java >>>>> implementation. I thought it would make sense to add this to the reference >>>>> implementation first. However, I ran into a limitation with the current >>>>> definition in the spec: >>>>> >>>>> Must be optional with null defaults; not stored in data files >>>>> >>>>> >>>>> One obvious limitation is that it cannot be the key of a MapType, as >>>>> it has to be not-null. It can't be stored either as the value of a >>>>> MapType since there is no easy way to store just the key without >>>>> doing awkward things, such as writing just the keys as a list. >>>>> >>>>> My initial idea was to disallow the use of UnknownType as the element >>>>> in ListType and not allow the UnknownType as either a Key or Value of >>>>> a MapType. Any thoughts or concerns? >>>>> >>>>> Kind regards from Belgium, >>>>> Fokko >>>>> >>>>