Thanks all for the response, much appreciated.

> That said, I'd love to hear from more people on this. I think it would be
> great to drop support, but I don't know how many people still use it. Is
> upgrading Hadoop a good reason to drop support for an engine? Hadoop seems
> like a minor concern to me unless it is blocking something.


I noticed that we needed to bump Hadoop when we wanted to upgrade to
Parquet 1.13.0 <https://github.com/apache/iceberg/pull/7301>. It would be
nice to get this in since it allows for removing a workaround from the
Iceberg codebase (see PR for details).

> Netflix is still on Spark-2.4.4 with Iceberg-0.9. We are actively migrating
> to Spark-3.x and Iceberg 1.1 (or later). I do not anticipate us
> using Spark-2.4.4 with newer versions of Iceberg (>0.9).


For Spark 2.4, Iceberg releases up to 1.2.1 are available:
https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-spark-2.4

> As for the Hadoop upgrade, I think that could be problematic for us if
> there's any non-backwards compatible API change required at compile time
> since we're still running a 2.8.x version.


Thanks for raising this. I took some time today to dig into this. There is
an effort to upgrade Hadoop <https://github.com/apache/iceberg/pull/5024> in
Iceberg, but that's stuck on incompatibilities with Tez. Unfortunately, Parquet
1.13.0
<https://github.com/apache/iceberg/actions/runs/4740904793/jobs/8417296190?pr=7301>
doesn't compile against Hadoop 2.8.5, and bringing back support for Hadoop
2.8.x is going to be hard <https://github.com/apache/parquet-mr/pull/1075>. For
Parquet, I've created a PR to run the CI against Hadoop 2.9.2
<https://github.com/apache/parquet-mr/pull/1076> so we know when we're
breaking compatibility.

TLDR: It looks like if we want to upgrade Parquet, and other libraries in
the future, we need to drop Hadoop 2. I'm hesitant to do that right now
because we might exclude users that are still on older versions of Hadoop
(such as Airbnb). Spark has announced that Hadoop 2 support will be dropped
in Spark 3.5 <https://lists.apache.org/thread/vr6bx2bmkgo4mhdspjm9g29h2c3lmrrz>.
I'll create a PR for removing Spark 2.4 shortly because I see a consensus
for removing that.

Kind regards,
Fokko

On Wed, 19 Apr 2023 at 19:02, Anton Okolnychyi
<aokolnyc...@apple.com.invalid> wrote:

> Yes, yes, yes!
>
> - Anton
>
> On Apr 19, 2023, at 8:17 AM, Ryan Blue <b...@tabular.io> wrote:
>
> Sounds like we have consensus for removing Spark 2.4.
>
> Thanks, everyone!
>
> On Wed, Apr 19, 2023 at 12:36 AM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
>
>> +1,
>> Spark-2.4 has reached EOL (
>> https://lists.apache.org/thread/tdk7r5gx3nwrds3fg7qmp5h2jnqgc6tb and
>> https://spark.apache.org/versioning-policy.html)
>>
>> Thanks,
>> Ajantha
>>
>> On Wed, Apr 19, 2023 at 3:52 AM Edgar Rodriguez <
>> edgar.rodrig...@airbnb.com.invalid> wrote:
>>
>>> I'm generally +1 on dropping Spark 2.4 - mostly everyone is moving to
>>> Spark 3.x, if not already moved.
>>>
>>> As for the Hadoop upgrade, I think that could be problematic for us if
>>> there's any non-backwards compatible API change required at compile time
>>> since we're still running a 2.8.x version.
>>>
>>> Cheers,
>>>
>>> On Mon, Apr 17, 2023 at 3:50 PM Steve Zhang <
>>> hongyue_zh...@apple.com.invalid> wrote:
>>>
>>>> +1 for dropping Spark 2.4 support and we can clean up doc as well such
>>>> as https://iceberg.apache.org/docs/latest/spark-queries/#spark-24
>>>>
>>>> Thanks,
>>>> Steve Zhang
>>>>
>>>>
>>>>
>>>> On Apr 13, 2023, at 12:53 PM, Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>> +1 for dropping 2.4 support
>>>>
>>>>
>>>>
>>>
>>> --
>>> Edgar R
>>> Data Warehouse Infrastructure
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
>
>
