that sounds suspiciously like something I'd write :)

the move to java8 happened in HADOOP-11858, for hadoop 3.0.0

HADOOP-16219, "[JDK8] Set minimum version of Hadoop 2 to JDK 8", had been
open since 2019; I just closed it as WONTFIX.

Most of the big production hadoop 2 clusters use java7, because that is
what they were deployed with. If you are upgrading java versions then
you'd want to move to a java8 version of guava (with fixes), a java8
version of jackson (with fixes), and at that point "upgrade the cluster"
becomes the strategy.

If spark (or parquet, or avro, or ORC) claims to work on hadoop-2, then it's
not enough to set hadoop.version in the build; that claim needs full
integration testing with all those long-out-of-date transitive dependencies.
And who does that? Nobody.


Does still claiming to support hadoop-2 cause any problems? Yes, because it
forces anything which wants to use more recent APIs either to play
reflection games (SparkHadoopUtil.createFile()...), keep branch-3-only
source trees (spark-hadoop-cloud), or stay stuck using older
classes/methods for no real benefit. Finally: what are you going to do if
someone actually files a bug related to spark 3.3.1 on hadoop 2.8.1? Is
anyone really going to care?
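
For anyone who hasn't had the pleasure, "reflection games" means probing for
a hadoop-3 method at runtime and falling back when it isn't there. A minimal
sketch of the pattern -a hypothetical helper, not the actual SparkHadoopUtil
code:

    import java.lang.reflect.Method;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Hypothetical helper: probe for the hadoop-3 createFile() builder via reflection. */
    public final class CreateFileShim {

      public static FSDataOutputStream create(FileSystem fs, Path path) throws Exception {
        try {
          // hadoop 3.x: FileSystem.createFile(Path) returns a builder
          Method createFile = FileSystem.class.getMethod("createFile", Path.class);
          Object builder = createFile.invoke(fs, path);
          // the concrete builder class may not be public, hence setAccessible()
          Method build = builder.getClass().getMethod("build");
          build.setAccessible(true);
          return (FSDataOutputStream) build.invoke(builder);
        } catch (NoSuchMethodException e) {
          // hadoop 2.x: fall back to the classic create() call
          return fs.create(path);
        }
      }
    }

It compiles against hadoop-2 jars and probes for the hadoop-3 API at runtime;
every caller pays for it in readability and testability.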

Where this really frustrates me is the downstream libraries which worry
about java11 and java17 compatibility yet still set hadoop.version to
2.10, even though it blocks them from basic improvements, such as skipping
a HEAD request whenever they open a file on abfs, s3a or gcs (AVRO-3594).
Which transitively hurts iceberg, because it uses avro for its manifests,
doesn't it?
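
To be concrete: the hadoop 3.3+ openFile() builder lets the caller hand over
the FileStatus it already has from a directory listing, so s3a/abfs/gcs don't
need to issue their own HEAD just to learn the file length. A rough sketch,
assuming a hadoop 3.3.x classpath:

    import java.util.concurrent.CompletableFuture;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;

    /** Sketch: open a file with the 3.3+ builder API, reusing a FileStatus already in hand. */
    public final class OpenFileSample {

      public static FSDataInputStream open(FileSystem fs, FileStatus status) throws Exception {
        CompletableFuture<FSDataInputStream> future = fs.openFile(status.getPath())
            .withFileStatus(status)   // object stores can skip their own HEAD probe
            .build();                 // the open may now complete asynchronously
        return future.get();
      }
    }

That is exactly the kind of call a library pinned to hadoop.version=2.10
cannot compile against.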

As for the cutting edge stuff... anyone at ApacheCon reading this email on
oct 4 should attend the talk "Hadoop Vectored IO", where Mukund Thakur will
be presenting the results of hive using the vectored IO version of ORC and
seeing a 10-20% reduction in the overall runtime of TPCDH benchmarks
(300G). That doesn't need hive changes, just a build of ORC using the new
API for async/parallel fetch of stripes. The parquet support with spark
benchmarks is still a WiP, but I would expect to see similar numbers, and
again, no changes to spark, just parquet.
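
For anyone curious what the call looks like, here is a rough sketch against
the 3.3.5 API -the ranges and the buffer allocator are made up for
illustration:

    import java.nio.ByteBuffer;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileRange;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Sketch of the hadoop 3.3.5 vectored read API: ask for several ranges in one call. */
    public final class VectoredReadSample {

      public static void readStripes(FileSystem fs, Path path) throws Exception {
        List<FileRange> ranges = Arrays.asList(
            FileRange.createFileRange(0, 4096),            // e.g. a stripe footer/index
            FileRange.createFileRange(1_048_576, 65_536)); // another stripe; offsets invented
        try (FSDataInputStream in = fs.open(path)) {
          // the filesystem may coalesce ranges and fetch them asynchronously, in parallel
          in.readVectored(ranges, ByteBuffer::allocateDirect);
          for (FileRange range : ranges) {
            ByteBuffer data = range.getData().get();       // each range completes on its own
            // ... hand the buffer to the ORC/parquet decoder ...
          }
        }
      }
    }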

And as the JMH microbenchmarks against the raw local FS show a 20x speedup
in reads (async fetch into direct buffers), anyone running spark on a
laptop should see some speedups too.

Cloudera can ship this stuff internally. But the ASF projects are all stuck
in time because of the belief that building against branch-2 makes sense.
And it is transitive: Hive's requirements hold back iceberg, for example
(see also PARQUET-2173...).

If you want your applications to work better, especially in cloud, you
should not just be running on a modern version of hadoop (and java11+,
ideally), you *and your libraries* should be using the newer APIs to work
with the data.

Finally, note that while that scatter/gather read call will only be in 3.3.5,
we are doing a shim lib to offer the API to apps on older builds -it'll use
readFully() to do the reads, just as the default implementation on all
filesystems does in hadoop 3.3.5. See
https://github.com/steveloughran/fs-api-shim ; it will become a hadoop
extension lib. One which will not run on hadoop-2, but 3.2.x+ only.
Obviously.
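
The fallback itself is nothing clever -roughly the sketch below, using the
3.3.5 FileRange class as a stand-in for whatever range type the shim ends up
exposing; this is the idea, not the shim's actual code:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.function.IntFunction;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileRange;

    /** Sketch: emulate readVectored() on older hadoop 3.x with one readFully() per range. */
    public final class VectoredFallback {

      public static void readRanges(FSDataInputStream in,
          List<? extends FileRange> ranges,
          IntFunction<ByteBuffer> allocate) {
        for (FileRange range : ranges) {
          CompletableFuture<ByteBuffer> result = new CompletableFuture<>();
          range.setData(result);
          try {
            byte[] bytes = new byte[range.getLength()];
            in.readFully(range.getOffset(), bytes);        // one positioned read per range
            ByteBuffer buffer = allocate.apply(range.getLength());
            buffer.put(bytes);
            buffer.flip();
            result.complete(buffer);
          } catch (IOException e) {
            result.completeExceptionally(e);
          }
        }
      }
    }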

steve

On Tue, 4 Oct 2022 at 04:16, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Hi, All.
>
> I'm wondering if the following Apache Spark Hadoop2 Binary Distribution
> is still used by someone in the community or not. If it's not used or not
> useful,
> we may remove it from Apache Spark 3.4.0 release.
>
>
> https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz
>
> Here is the background of this question.
> Since Apache Spark 2.2.0 (SPARK-19493, SPARK-19550), the Apache
> Spark community has been building and releasing with Java 8 only.
> I believe that the user applications also use Java8+ in these days.
> Recently, I received the following message from the Hadoop PMC.
>
>   > "if you really want to claim hadoop 2.x compatibility, then you have to
>   > be building against java 7". Otherwise a lot of people with hadoop 2.x
>   > clusters won't be able to run your code. If your projects are java8+
>   > only, then they are implicitly hadoop 3.1+, no matter what you use
>   > in your build. Hence: no need for branch-2 branches except
>   > to complicate your build/test/release processes [1]
>
> If Hadoop2 binary distribution is no longer used as of today,
> or incomplete somewhere due to Java 8 building, the following three
> existing alternative Hadoop 3 binary distributions could be
> the better official solution for old Hadoop 2 clusters.
>
>     1) Scala 2.12 and without-hadoop distribution
>     2) Scala 2.12 and Hadoop 3 distribution
>     3) Scala 2.13 and Hadoop 3 distribution
>
> In short, is there anyone who is using Apache Spark 3.3.0 Hadoop2 Binary
> distribution?
>
> Dongjoon
>
> [1]
> https://issues.apache.org/jira/browse/ORC-1251?focusedCommentId=17608247&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17608247
>
