Hey everyone,

Thanks for jumping in here.

To reiterate Bart, Having a UnknownType in a StructType, or a nested Struct
is not an issue AFAIK, except when the struct solely contains Unknowns.

Another scenario I would be concerned about is a table that contains only
> UnknownType top-level fields. What will happen in such a scenario? Will the
> Parquet format tolerate zero-column files? I think it's likely that even if
> Parquet-java supports this, it might be an untested and unsupported corner
> case in many of the alternative Parquet readers.


I don't think this is supported, and it will also lead to issues with
nullability, since you can't tell if the struct, or the nested UnknownType
is null. Also, Parquet-Java disallows *writing* empty structs
<https://github.com/apache/parquet-java/blob/6ac9b29a5c3eb879fd16cc79df52237edaa5e6c6/parquet-column/src/main/java/org/apache/parquet/schema/TypeUtil.java#L27>
.

I've updated the PR <https://github.com/apache/iceberg/pull/13445> to
disallow for:

   - Having structs with solely UnknownType(s)
   - Using UnknownType as the Element in a ListType
   - Using the UnknownType as the Value in a MapType

The logic in the current PR is needed to project the missing column, which
is needed anyway for the files that were written prior to when the
UnknownType was added. As Ryan suggested we can revisit this later on to
extend this.

Kind regards,
Fokko Driesprong







Op do 7 aug 2025 om 21:36 schreef Ryan Blue <rdb...@gmail.com>:

> I think it's reasonable to fail in cases where the underlying format can't
> represent a type, like the element of a list. We can go back and fix this
> by adding support for using Parquet's UNKNOWN type annotation
> <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unknown-always-null>,
> but my original concern about it was that it introduces the need to choose
> an underlying physical type and I didn't want that to cause future
> problems. Maybe we should just standardize that as fixed[0].
>
> On Mon, Jul 28, 2025 at 1:13 AM Bart Samwel <b...@databricks.com.invalid>
> wrote:
>
>> On Sat, Jul 26, 2025 at 6:09 PM Kevin Liu <kevinjq...@apache.org> wrote:
>>
>>> > My initial idea was to disallow the use of UnknownType as the element
>>> in ListType and not allow the UnknownType as either a Key or Value of a
>>> MapType. Any thoughts or concerns?
>>>
>>> That makes sense. I would also include `StructType` here too.
>>> `StructType` is another  "complex type" (extends NestedType
>>> <https://github.com/apache/iceberg/blob/360f87326d4ccf67512a0240e529035801d9db2b/api/src/main/java/org/apache/iceberg/types/Types.java#L1001>)
>>> just like `ListType` and `MapType`.
>>> This will make `unknown` the first primitive type to not be allowed as
>>> part of another complex type.
>>>
>>
>> Do you mean to forbid `UnknownType` inside `StructType`? I'm afraid that
>> would undermine the orthogonality of the system. A common use of StructType
>> is to store entire rows. If StructType cannot contain elements that are
>> UnknownType but top-level rows can, then you can no longer store an
>> arbitrary top-level row inside a StructType.
>>
>> Unfortunately UnknownType in struct does have some issues. In particular,
>> if it's not stored, then IIUIC you can have issues with structs containing
>> only UnknownType fields -- they will look empty to Parquet, and my
>> understanding is that that isn't allowed. For orthogonality it would have
>> been better to actually store the unknown type, even if it's just as a
>> series of "this is NULL" bits. Omitting these fields in storage seems like
>> a convenient hack that leads to all sorts of surprising corner cases...
>>
>>
>> On Sat, Jul 26, 2025 at 5:43 AM Fokko Driesprong <fo...@apache.org>
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Recently I took a stab at implementing reading UknownType
>>>> <https://github.com/apache/iceberg/pull/13445> in the Java
>>>> implementation. I thought it would make sense to add this to the reference
>>>> implementation first. However, I ran into a limitation with the current
>>>> definition in the spec:
>>>>
>>>> Must be optional with null defaults; not stored in data files
>>>>
>>>>
>>>> One obvious limitation is that it cannot be the key of a MapType, as
>>>> it has to be not-null. It can't be stored either as the value of a
>>>> MapType since there is no easy way to store just the key without doing
>>>> awkward things, such as writing just the keys as a list.
>>>>
>>>> My initial idea was to disallow the use of UnknownType as the element
>>>> in ListType and not allow the UnknownType as either a Key or Value of
>>>> a MapType. Any thoughts or concerns?
>>>>
>>>> Kind regards from Belgium,
>>>> Fokko
>>>>
>>>

Reply via email to