Thank you for your feedback and support, YangJie and Steve.

For internally built Hadoop clusters, I believe an internally built Spark
distribution with the corresponding custom Hadoop client will be a better
solution than Apache Spark with the Apache Hadoop 2.7.4 client, since it
carries the full set of internal changes.

I opened a PR for Apache Spark 3.4.0 that references this thread.

    SPARK-40651 Drop Hadoop2 binary distribution from release process
    https://github.com/apache/spark/pull/38099

Dongjoon.

On 2022/10/04 19:32:52 Dongjoon Hyun wrote:
> Yes, it's yours. I added you (Steve Loughran <ste...@apache.org>) as BCC at
> the first email, Steve. :)
> 
> Dongjoon.
> 
> On Tue, Oct 4, 2022 at 6:24 AM Steve Loughran <ste...@cloudera.com> wrote:
> 
> >
> > that sounds suspiciously like something I'd write :)
> >
> > the move to java8 happened in HADOOP-11858 (hadoop 3.0.0)
> >
> > HADOOP-16219, "[JDK8] Set minimum version of Hadoop 2 to JDK 8", had been
> > open since 2019, and I just closed it as WONTFIX.
> >
> > Most of the big production hadoop 2 clusters use java7, because that is
> > what they were deployed with; if you are upgrading java versions then
> > you'd want to upgrade to a java8 version of guava (with fixes), a java8
> > version of jackson (with fixes), and at that point "upgrade the cluster"
> > becomes the strategy.
> >
> > If spark (or parquet, or avro, or ORC) claims to work on hadoop-2, then
> > it's not enough to set hadoop.version in the build; it needs full
> > integration testing with all those long-out-of-date transitive
> > dependencies. And who does that? Nobody.
> >
> >
> > Does still claiming to support hadoop-2 cause any problems? Yes, because
> > it forces anything which wants to use more recent APIs either to play
> > reflection games (SparkHadoopUtil.createFile()...), to have branch-3-only
> > source trees (spark-hadoop-cloud), or to stay stuck using older
> > classes/methods for no real benefit. Finally: what are you going to do if
> > someone actually files a bug related to spark 3.3.1 on hadoop 2.8.1? Is
> > anyone really going to care?
> >
> > Where this really frustrates me is in the libraries used downstream
> > which worry about java11 and java17 compatibility yet still set
> > hadoop.version to 2.10, even though it blocks them from basic
> > improvements, such as skipping a HEAD request whenever they open a file
> > on abfs, s3a or gcs (AVRO-3594). That transitively hurts iceberg,
> > because it uses avro for its manifests, doesn't it?
> >
> > As for the cutting edge stuff... anyone at ApacheCon reading this email
> > on oct 4 should attend the talk "Hadoop Vectored IO", where Mukund Thakur
> > will be presenting the results of hive using the vectored IO version of
> > ORC and seeing a 10-20% reduction in the overall runtime of TPCDH
> > benchmarks (300G). That doesn't need hive changes, just a build of ORC
> > using the new API for async/parallel fetch of stripes. The parquet
> > support with spark benchmarks is still a WIP, but I would expect to see
> > similar numbers, and again, no changes to spark, just parquet.
> >
> > And as the JMH microbenchmarks against the raw local FS show a 20x speedup
> > in reads (async fetch into direct buffers), anyone running spark on a
> > laptop should see some speedups too.
> >
> > Cloudera can ship this stuff internally. But the ASF projects are all
> > stuck in time because of the belief that building against branch-2 makes
> > sense. And it is transitive: Hive's requirements hold back iceberg, for
> > example. (See also PARQUET-2173, ...)
> >
> > If you want your applications to work better, especially in cloud, you
> > should not just be running on a modern version of hadoop (and java11+,
> > ideally), you *and your libraries* should be using the newer APIs to work
> > with the data.
> >
> > Finally, note that while that scatter/gather read call will only be in
> > 3.3.5, we are doing a shim lib to offer the API to apps on older builds
> > -it'll use readFully() to do the reads, just as the default
> > implementation on all filesystems does on hadoop 3.3.5. See
> > https://github.com/steveloughran/fs-api-shim ; it will become a hadoop
> > extension lib, one which will not run on hadoop-2, but 3.2.x+ only,
> > obviously.
> >
> > steve
> >
> > On Tue, 4 Oct 2022 at 04:16, Dongjoon Hyun <dongjoon.h...@gmail.com>
> > wrote:
> >
> >> Hi, All.
> >>
> >> I'm wondering whether the following Apache Spark Hadoop2 binary
> >> distribution is still used by anyone in the community. If it is not
> >> used or not useful, we may remove it from the Apache Spark 3.4.0
> >> release.
> >>
> >>
> >> https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz
> >>
> >> Here is the background of this question.
> >> Since Apache Spark 2.2.0 (SPARK-19493, SPARK-19550), the Apache
> >> Spark community has been building and releasing with Java 8 only.
> >> I believe that user applications also use Java 8+ these days.
> >> Recently, I received the following message from the Hadoop PMC.
> >>
> >>   > "if you really want to claim hadoop 2.x compatibility, then you have
> >> to
> >>   > be building against java 7". Otherwise a lot of people with hadoop 2.x
> >>   > clusters won't be able to run your code. If your projects are java8+
> >>   > only, then they are implicitly hadoop 3.1+, no matter what you use
> >>   > in your build. Hence: no need for branch-2 branches except
> >>   > to complicate your build/test/release processes [1]
> >>
> >> If the Hadoop2 binary distribution is no longer used as of today,
> >> or is incomplete somewhere due to being built with Java 8, the
> >> following three existing alternative binary distributions could be
> >> the better official solution for old Hadoop 2 clusters.
> >>
> >>     1) Scala 2.12 and without-hadoop distribution
> >>     2) Scala 2.12 and Hadoop 3 distribution
> >>     3) Scala 2.13 and Hadoop 3 distribution
> >>
> >> In short, is there anyone who is using Apache Spark 3.3.0 Hadoop2 Binary
> >> distribution?
> >>
> >> Dongjoon
> >>
> >> [1]
> >> https://issues.apache.org/jira/browse/ORC-1251?focusedCommentId=17608247&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17608247
> >>
> >
> 
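For readers following along, the "reflection games" Steve alludes to (around
SparkHadoopUtil.createFile()) follow a common pattern: probe the classpath for a
Hadoop 3 class and fall back to the older API when it is absent. The sketch below
is only an illustration of that pattern, not Spark's actual code; the class name
it probes for exists in Hadoop 3, but the probe class itself is hypothetical.

```java
// Hedged sketch of classpath probing, as used to support both Hadoop 2 and
// Hadoop 3 from a single source tree. HadoopApiProbe is an illustrative name,
// not part of Spark.
public class HadoopApiProbe {

    // Returns true if the Hadoop 3 builder-style createFile() API appears
    // to be available on the classpath.
    static boolean hasBuilderCreateFile() {
        try {
            // FSDataOutputStreamBuilder only exists in Hadoop 3.x.
            Class.forName("org.apache.hadoop.fs.FSDataOutputStreamBuilder");
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (hasBuilderCreateFile()) {
            System.out.println("hadoop3: builder-style createFile available");
        } else {
            System.out.println("hadoop2: falling back to FileSystem.create()");
        }
    }
}
```

The cost of this pattern is exactly what the thread describes: every new Hadoop
API needs another probe and fallback path, which is why dropping the Hadoop 2
target simplifies the code.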

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
