Yes, it's yours. I added you (Steve Loughran <ste...@apache.org>) as BCC on the first email, Steve. :)
Dongjoon.

On Tue, Oct 4, 2022 at 6:24 AM Steve Loughran <ste...@cloudera.com> wrote:

> that sounds suspiciously like something I'd write :)
>
> the move to java8 happened in HADOOP-11858; 3.0.0.
>
> HADOOP-16219, "[JDK8] Set minimum version of Hadoop 2 to JDK 8", has been
> open since 2019 and I just closed it as WONTFIX.
>
> Most of the big production hadoop 2 clusters use java7, because that is
> what they were deployed with, and if you are upgrading java versions then
> you'd want to upgrade to a java8 version of guava (with fixes) and a java8
> version of jackson (with fixes), and at that point "upgrade the cluster"
> becomes the strategy.
>
> If spark (or parquet, or avro, or ORC) claims to work on hadoop-2, then it's
> not enough to set hadoop.version in the build; it needs full integration
> testing with all those long-out-of-date transitive dependencies. And who
> does that? Nobody.
>
> Does still claiming to support hadoop-2 cause any problems? Yes, because
> it forces anything which wants to use more recent APIs either to play
> reflection games (SparkHadoopUtil.createFile()...), to keep branch-3-only
> source trees (spark-hadoop-cloud), or to stay stuck on older
> classes/methods for no real benefit. Finally: what are you going to do if
> someone actually files a bug related to spark 3.3.1 on hadoop 2.8.1? Is
> anyone really going to care?
>
> Where this really frustrates me is that the libraries used downstream,
> which worry about java11 and java17 compatibility etc., still set
> hadoop.version to 2.10, even though it blocks them from basic
> improvements, such as skipping a HEAD request whenever they open a file
> on abfs, s3a or gcs (AVRO-3594). Which transitively hurts iceberg,
> because it uses avro for its manifests, doesn't it?
>
> As for the cutting-edge stuff: anyone at ApacheCon reading this email on
> Oct 4 should attend the talk "Hadoop Vectored IO", where Mukund Thakur will
> be presenting the results of hive using the vectored IO version of ORC and
> seeing a 10-20% reduction in the overall runtime of TPC-DS benchmarks
> (300G). That doesn't need hive changes, just a build of ORC using the new
> API for async/parallel fetch of stripes. The parquet support with spark
> benchmarks is still a WIP, but I would expect to see similar numbers, and
> again, no changes to spark, just parquet.
>
> And as the JMH microbenchmarks against the raw local FS show a 20x speedup
> in reads (async fetch into direct buffers), anyone running spark on a
> laptop should see some speedups too.
>
> Cloudera can ship this stuff internally. But the ASF projects are all
> stuck in time because of the belief that building against branch-2 makes
> sense. And it is transitive: hive's requirements hold back iceberg, for
> example. (See also PARQUET-2173 ...)
>
> If you want your applications to work better, especially in cloud, you
> should not just be running on a modern version of hadoop (and java11+,
> ideally); you *and your libraries* should be using the newer APIs to work
> with the data.
>
> Finally, note that while that scatter/gather read call will only be in
> 3.3.5, we are doing a shim lib to offer the API to apps on older builds;
> it'll use readFully() to do the reads, just as the default implementation
> on all filesystems does on hadoop 3.3.5. See
> https://github.com/steveloughran/fs-api-shim ; it will become a hadoop
> extension lib. One which will not run on hadoop-2, but on 3.2.x+ only.
> Obviously
>
> steve
>
> On Tue, 4 Oct 2022 at 04:16, Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> I'm wondering if the following Apache Spark Hadoop2 binary distribution
>> is still used by someone in the community or not. If it's not used or
>> not useful, we may remove it from the Apache Spark 3.4.0 release.
>>
>> https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz
>>
>> Here is the background of this question.
>> Since Apache Spark 2.2.0 (SPARK-19493, SPARK-19550), the Apache Spark
>> community has been building and releasing with Java 8 only.
>> I believe that user applications also use Java 8+ these days.
>> Recently, I received the following message from the Hadoop PMC.
>>
>> > "if you really want to claim hadoop 2.x compatibility, then you have to
>> > be building against java 7". Otherwise a lot of people with hadoop 2.x
>> > clusters won't be able to run your code. If your projects are java8+
>> > only, then they are implicitly hadoop 3.1+, no matter what you use
>> > in your build. Hence: no need for branch-2 branches except
>> > to complicate your build/test/release processes [1]
>>
>> If the Hadoop2 binary distribution is no longer used as of today,
>> or is incomplete somewhere due to the Java 8 build, the following three
>> existing alternative Hadoop 3 binary distributions could be
>> a better official solution for old Hadoop 2 clusters.
>>
>> 1) Scala 2.12 and without-hadoop distribution
>> 2) Scala 2.12 and Hadoop 3 distribution
>> 3) Scala 2.13 and Hadoop 3 distribution
>>
>> In short, is there anyone who is using the Apache Spark 3.3.0 Hadoop2
>> binary distribution?
>>
>> Dongjoon
>>
>> [1]
>> https://issues.apache.org/jira/browse/ORC-1251?focusedCommentId=17608247&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17608247
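To make the vectored IO discussion above concrete, here is a minimal
client-side sketch against the Hadoop 3.3.5 API (the FileRange/readVectored
names come from the 3.3.5 vectored read work; the path, offsets and lengths
below are invented for illustration):

    import java.nio.ByteBuffer;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileRange;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class VectoredReadSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical input; any path readable through a 3.3.5+ client.
        Path path = new Path("s3a://some-bucket/some-file.orc");
        FileSystem fs = path.getFileSystem(new Configuration());

        // Declare all wanted ranges up front (offsets/lengths made up here);
        // stores may fetch them in parallel and coalesce nearby ranges.
        List<FileRange> ranges = Arrays.asList(
            FileRange.createFileRange(0, 16_384),
            FileRange.createFileRange(1_048_576, 65_536));

        try (FSDataInputStream in = fs.open(path)) {
          // One call starts all the reads; buffers come from the allocator.
          in.readVectored(ranges, ByteBuffer::allocateDirect);

          // Each range exposes a CompletableFuture holding its data.
          for (FileRange range : ranges) {
            ByteBuffer data = range.getData().get();
            System.out.printf("offset %d: %d bytes%n",
                range.getOffset(), data.remaining());
          }
        }
      }
    }

On releases without the API, the shim approach Steve describes would issue a
blocking readFully() per range instead, which is also what the default
readVectored() implementation on hadoop 3.3.5 does for filesystems that have
no custom implementation.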