Thanks for updating the PR, Fokko!

If we are okay with the approach of failing these unsupported cases, then I
think we need to talk about when to fail. The PR currently changes what is
allowed by Iceberg, but I think that Iceberg should allow these cases
(which work with Avro) and should fail when converting the Iceberg schema
to Parquet. That way if/when we add support in Parquet, we don't need to
update what is allowed globally.

On Fri, Aug 8, 2025 at 6:14 AM Fokko Driesprong <fo...@apache.org> wrote:

> Hey everyone,
>
> Thanks for jumping in here.
>
> To reiterate Bart, Having a UnknownType in a StructType, or a nested
> Struct is not an issue AFAIK, except when the struct solely contains
> Unknowns.
>
> Another scenario I would be concerned about is a table that contains only
>> UnknownType top-level fields. What will happen in such a scenario? Will the
>> Parquet format tolerate zero-column files? I think it's likely that even if
>> Parquet-java supports this, it might be an untested and unsupported corner
>> case in many of the alternative Parquet readers.
>
>
> I don't think this is supported, and it will also lead to issues with
> nullability, since you can't tell if the struct, or the nested UnknownType
> is null. Also, Parquet-Java disallows *writing* empty structs
> <https://github.com/apache/parquet-java/blob/6ac9b29a5c3eb879fd16cc79df52237edaa5e6c6/parquet-column/src/main/java/org/apache/parquet/schema/TypeUtil.java#L27>
> .
>
> I've updated the PR <https://github.com/apache/iceberg/pull/13445> to
> disallow for:
>
>    - Having structs with solely UnknownType(s)
>    - Using UnknownType as the Element in a ListType
>    - Using the UnknownType as the Value in a MapType
>
> The logic in the current PR is needed to project the missing column, which
> is needed anyway for the files that were written prior to when the
> UnknownType was added. As Ryan suggested we can revisit this later on to
> extend this.
>
> Kind regards,
> Fokko Driesprong
>
>
>
>
>
>
>
> Op do 7 aug 2025 om 21:36 schreef Ryan Blue <rdb...@gmail.com>:
>
>> I think it's reasonable to fail in cases where the underlying format
>> can't represent a type, like the element of a list. We can go back and fix
>> this by adding support for using Parquet's UNKNOWN type annotation
>> <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unknown-always-null>,
>> but my original concern about it was that it introduces the need to choose
>> an underlying physical type and I didn't want that to cause future
>> problems. Maybe we should just standardize that as fixed[0].
>>
>> On Mon, Jul 28, 2025 at 1:13 AM Bart Samwel <b...@databricks.com.invalid>
>> wrote:
>>
>>> On Sat, Jul 26, 2025 at 6:09 PM Kevin Liu <kevinjq...@apache.org> wrote:
>>>
>>>> > My initial idea was to disallow the use of UnknownType as the
>>>> element in ListType and not allow the UnknownType as either a Key or Value
>>>> of a MapType. Any thoughts or concerns?
>>>>
>>>> That makes sense. I would also include `StructType` here too.
>>>> `StructType` is another  "complex type" (extends NestedType
>>>> <https://github.com/apache/iceberg/blob/360f87326d4ccf67512a0240e529035801d9db2b/api/src/main/java/org/apache/iceberg/types/Types.java#L1001>)
>>>> just like `ListType` and `MapType`.
>>>> This will make `unknown` the first primitive type to not be allowed as
>>>> part of another complex type.
>>>>
>>>
>>> Do you mean to forbid `UnknownType` inside `StructType`? I'm afraid that
>>> would undermine the orthogonality of the system. A common use of StructType
>>> is to store entire rows. If StructType cannot contain elements that are
>>> UnknownType but top-level rows can, then you can no longer store an
>>> arbitrary top-level row inside a StructType.
>>>
>>> Unfortunately UnknownType in struct does have some issues. In
>>> particular, if it's not stored, then IIUIC you can have issues with structs
>>> containing only UnknownType fields -- they will look empty to Parquet, and
>>> my understanding is that that isn't allowed. For orthogonality it would
>>> have been better to actually store the unknown type, even if it's just as a
>>> series of "this is NULL" bits. Omitting these fields in storage seems like
>>> a convenient hack that leads to all sorts of surprising corner cases...
>>>
>>>
>>> On Sat, Jul 26, 2025 at 5:43 AM Fokko Driesprong <fo...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Recently I took a stab at implementing reading UknownType
>>>>> <https://github.com/apache/iceberg/pull/13445> in the Java
>>>>> implementation. I thought it would make sense to add this to the reference
>>>>> implementation first. However, I ran into a limitation with the current
>>>>> definition in the spec:
>>>>>
>>>>> Must be optional with null defaults; not stored in data files
>>>>>
>>>>>
>>>>> One obvious limitation is that it cannot be the key of a MapType, as
>>>>> it has to be not-null. It can't be stored either as the value of a
>>>>> MapType since there is no easy way to store just the key without
>>>>> doing awkward things, such as writing just the keys as a list.
>>>>>
>>>>> My initial idea was to disallow the use of UnknownType as the element
>>>>> in ListType and not allow the UnknownType as either a Key or Value of
>>>>> a MapType. Any thoughts or concerns?
>>>>>
>>>>> Kind regards from Belgium,
>>>>> Fokko
>>>>>
>>>>

Reply via email to