Just to update this thread, we have agreed internally to use INT in the
struct schema corresponding to union types. The reasons are two-fold:
(1) There is uncertainty around whether TINYINT will make it into
Iceberg, and we want to stick to the spec.
(2) Avro does not support TINYINT either, so this issue arises
independently of whether Iceberg supports TINYINT or not.
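
For concreteness, here is a rough sketch of what that struct schema could
look like for a union such as union<int, string>, written against
Iceberg's Java type API. The field names ("tag", "field0", "field1") and
the field IDs are purely illustrative, not a finalized spec:

  import org.apache.iceberg.types.Types;

  public class UnionToStructSketch {
    public static void main(String[] args) {
      // "tag" records which branch of the union is populated. It is an
      // INT rather than a TINYINT, since TINYINT is not in the Iceberg
      // spec (and Avro does not support it either).
      Types.StructType unionAsStruct = Types.StructType.of(
          Types.NestedField.required(1, "tag", Types.IntegerType.get()),
          // one optional field per union branch; only the branch selected
          // by "tag" is expected to be non-null
          Types.NestedField.optional(2, "field0", Types.IntegerType.get()),
          Types.NestedField.optional(3, "field1", Types.StringType.get()));
      System.out.println(unionAsStruct);
    }
  }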

That said, we had preliminary discussions with Dain Sundstrom and David
Phillips on the Trino representation of the schema, and we agree it is
directionally okay to change TINYINT to INT in the Trino version of the
conversion [1].

I will start another thread to gather people's opinions on converting
union types to structs with that schema, and if we agree, we can start
another discussion with the Spark community.

Thanks,
Walaa.

[1]  https://github.com/trinodb/trino/pull/3483

On Fri, Dec 3, 2021 at 11:49 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Also, I wanted to add another observation that is on the flip side of the
> initial argument. Some data formats like Avro do not support TINYINT
> either. So even if Trino uses TINYINT in the struct schema, when the table
> is written back to Avro, TINYINT will not be used. This supports the side
> of the argument that Trino should use INT in
> https://github.com/trinodb/trino/pull/3483 to minimize incompatibility
> issues. Thoughts?
>
> On Thu, Dec 2, 2021 at 5:28 PM Walaa Eldin Moustafa <wa.moust...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I wanted to bring up the topic of supporting TINYINT and SMALLINT data
>> types in Iceberg to the dev mailing list. We have started some discussion
>> on the ASF Iceberg channel [1], and agreed to bring more visibility to the
>> topic here.
>>
>> This came up while we were implementing the equivalent of Trino's union
>> to struct conversion [2] in our internal fork of Iceberg, where we ran
>> into the issue that the output struct schema contains a TINYINT field.
>> I am starting this thread to get people's opinions on adding TINYINT
>> and SMALLINT types to Iceberg. The reasons for the change are:
>>
>> * TINYINT (and SMALLINT) are supported by the Hive table format, and
>> consequently by engines like Trino and Spark. Adding them to Iceberg can
>> help with Iceberg adoption and make it easier for people to convert
>> Hive tables to Iceberg tables.
>>
>> * We have observed some regressions when trying to convert a table with
>> TINYINT to INT, especially with views that are defined on the TINYINT table
>> (see the Slack thread for details).
>>
>> * We want to converge the community on one form of transforming union
>> types to structs. Currently, the Trino Hive connector uses TINYINT in
>> the output schema, so we want to lock down a type there. Hopefully, that
>> becomes a standard across engines and IRs. We are doing the same
>> conversion in Coral [3], but are hoping to wrap up the discussion here
>> before proceeding.
>>
>> * The more systems disagree on this conversion (TINYINT vs. INT), the
>> more challenging it will be to fix this problem. The schema can also
>> leak transitively across tables, making it more difficult to retract,
>> so now is the best time to decide on this.
>>
>> That said, I am first hoping to get alignment on supporting TINYINT
>> and SMALLINT in Iceberg, since this is a more fundamental issue and has
>> its own justifications. If we agree to support TINYINT, the union type
>> standardization discussion will be simpler since we will not expect
>> Trino to change.
>>
>> Thanks,
>> Walaa.
>>
>> [1] https://the-asf.slack.com/archives/CF01LKV9S/p1637642089008400
>> [2] https://github.com/trinodb/trino/pull/3483
>> [3] https://github.com/linkedin/coral/pull/192
>>
>
