Re: Vendor integration strategy

OpenInx Thu, 09 Dec 2021 04:29:55 -0800

Thanks Jack for bringing this up, and thanks Ryan for sharing your point.

> Getting a minimal set of transitive dependencies, relocating the classes
> that they pull in to avoid conflicts, and tracking licensing is a huge
> amount of work that has so far been done or validated by a very small set
> of people.

I did the iceberg-flink-runtime packaging work before. At that time, I
needed to search through all of that module's dependencies, pick out all
the licenses and notices, and relocate all the common packages. Yes, it's a
huge amount of work. But I think great open source software should solve
these abstract common problems. Recall that we discussed whether we need to
support multiple versions of the same engine in Apache Iceberg. I remember
Ryan saying at the time that if we do not solve this problem in the
official Apache Iceberg repo, every user has to solve these multi-version
compatibility problems manually. That is the kind of abstract common
problem I mean. This is why I am very pleased to devote my bandwidth to
multiple-version support, although I initially voted in the opposite
direction.

Back to this vendor bundle runtime jar issue: it's still the same kind of
abstract common problem. If we don't solve it, everyone who wants to access
Iceberg tables on Aliyun needs to build their own bundle runtime jar to
make this work. We may argue that it's the vendor's duty to provide a
vendor bundle SDK (similar to the AWS bundle SDK), but I don't think every
vendor that wants to integrate with Apache Iceberg provides one. I checked
the Aliyun client SDKs: only the Aliyun object storage service provides an
SDK package [1], and it's a zip package containing all the individual
dependencies, which means we would still need to load the individual
dependencies one by one for Flink/Hive. This makes it costly for users to
access Iceberg tables, and may eventually cause users to give up on
Iceberg.

As for the legal or license issues, I checked all the transitive
dependencies of iceberg-aliyun [2]; all of them are Apache license friendly
and allowed to be redistributed. In my understanding, it should not be a
problem. Besides, the Apache Hadoop release already includes the Aliyun OSS
SDK, which I think serves as an example.

[1]. https://www.alibabacloud.com/help/en/doc-detail/32009.html
[2]. https://github.com/apache/iceberg/pull/3684

On Thu, Dec 9, 2021 at 12:31 AM Ryan Blue <b...@tabular.io> wrote:

> The main problem with creating runtime Jars is transitive dependencies.
> Getting a minimal set of transitive dependencies, relocating the classes
> that they pull in to avoid conflicts, and tracking licensing is a huge
> amount of work that has so far been done or validated by a very small set
> of people.
>
> In addition, it is easy to make mistakes here. Updating a dependency can
> inadvertently pull in extra transitive dependencies that have incompatible
> licenses, aren't relocated, or otherwise cause significant license or
> runtime problems.
>
> We currently support runtime Jars for engines because it would otherwise
> be very difficult for people to use Iceberg. I don't think that same logic
> applies to vendor bundles. So the main question is: why are we doing this
> in Iceberg? Couldn't this integration be provided as a third-party Jar? The
> FileIO API is quite stable. And while I think it makes sense to have the
> implementations in Iceberg for maintenance, I don't think that it makes
> sense to provide a runtime Jar.
>
> I could be convinced otherwise, but I'm skeptical.
>
> Ryan
>
> On Tue, Dec 7, 2021 at 7:52 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> As we are adding Aliyun as a new vendor integration in the upcoming
>> release, we are discussing the strategy we should take to integrate the
>> iceberg-aliyun package with all the engine runtimes.
>>
>> For some background, we had some discussions about this topic when
>> releasing Nessie and AWS modules in
>> https://github.com/apache/iceberg/issues/1887. In summary:
>>
>> 1. The iceberg-<vendor> package is always added to the engine runtimes to
>> avoid the need for users to load them manually.
>> 2. Use 1MB as a threshold. If the total size of the vendor's dependencies
>> is less than 1MB, just include it in engine runtime. Otherwise the vendor
>> dependencies are marked as provided and not bundled in the runtime jar.
>>
>> However, Aliyun is proposing a different approach, which:
>> 1. Does not include the vendor package in engine runtime
>> 2. Has an additional iceberg-<vendor>-runtime package that bundles all
>> the vendor dependencies, so users just need to specify one additional jar
>> to use the vendor
>>
>> AWS did not choose the approach proposed by Aliyun because AWS users
>> usually maintain their own version of AWS SDK and would like to upgrade
>> them independent of the AWS SDK version used by Iceberg. Although currently
>> it takes more effort for users to specify all the compile-only
>> dependencies, compute vendor services like AWS EMR are going to offer all
>> the jars directly in the classpath to avoid such need in the very near
>> future, and EMR will maintain their AWS SDK version upgrade independently.
>>
>> But the approach proposed by Aliyun seems to fit the use case of Aliyun
>> users better. For more context, please read
>> https://github.com/apache/iceberg/pull/3270 for the discussion between
>> me and Openinx and https://github.com/apache/iceberg/pull/3684 for the
>> approach proposed.
>>
>> I think we should consolidate the vendor integration strategy going
>> forward. It could be we support both approaches, or just choose one
>> approach going forward. It would be great if people with similar experience
>> or need could provide some insights.
>>
>> Best,
>> Jack Ye
>>
>>
>>
>
> --
> Ryan Blue
> Tabular
>
