Thank you for your feedback and support, YangJie and Steve. For internally-built Hadoop clusters, I believe an internally-built Spark distribution against the corresponding custom Hadoop will be a better solution than Apache Spark with the Apache Hadoop 2.7.4 client, because it carries the full set of internal changes.
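
For example, something like this (illustrative only: "2.7.4-internal" is a placeholder for whatever the internal Hadoop release is called, and the profile list should match the target environment) would produce such a distribution:

    # hypothetical internal release; profiles per the Spark 3.3 build docs
    ./dev/make-distribution.sh --name internal-hadoop --tgz \
        -Pyarn -Phive -Phive-thriftserver \
        -Phadoop-2 -Dhadoop.version=2.7.4-internal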
I opened a PR to make this thread visible in Apache Spark 3.4.0.

SPARK-40651 Drop Hadoop2 binary distribution from release process
https://github.com/apache/spark/pull/38099

Dongjoon.

On 2022/10/04 19:32:52 Dongjoon Hyun wrote:
> Yes, it's yours. I added you (Steve Loughran <ste...@apache.org>) as BCC at
> the first email, Steve. :)
>
> Dongjoon.
>
> On Tue, Oct 4, 2022 at 6:24 AM Steve Loughran <ste...@cloudera.com> wrote:
>
> > That sounds suspiciously like something I'd write :)
> >
> > The move to java8 happened in HADOOP-11858; 3.0.0.
> >
> > HADOOP-16219, "[JDK8] Set minimum version of Hadoop 2 to JDK 8", has been
> > open since 2019, and I just closed it as WONTFIX.
> >
> > Most of the big production hadoop 2 clusters use java7, because that is
> > what they were deployed with, and if you are upgrading java versions then
> > you'd want to upgrade to a java8 version of guava (with fixes), a java8
> > version of jackson (with fixes), and at that point "upgrade the cluster"
> > becomes the strategy.
> >
> > If spark (or parquet, or avro, or ORC) claims to work on hadoop-2, then
> > it's not enough to set hadoop.version in the build; it needs full
> > integration testing with all those long-out-of-date transitive
> > dependencies. And who does that? Nobody.
> >
> > Does still claiming to support hadoop-2 cause any problems? Yes, because
> > it forces anything which wants to use more recent APIs either to play
> > reflection games (SparkHadoopUtil.createFile()...), to keep branch-3-only
> > source trees (spark-hadoop-cloud), or to stay stuck using older
> > classes/methods for no real benefit. Finally: what are you going to do if
> > someone actually files a bug related to spark 3.3.1 on hadoop 2.8.1? Is
> > anyone really going to care?
> >
> > Where this really frustrates me is that the libraries used downstream,
> > which worry about java11 and java17 compatibility etc., still set
> > hadoop.version to 2.10, even though it blocks them from basic
> > improvements, such as skipping a HEAD request whenever they open a file
> > on abfs, s3a or gcs (AVRO-3594). That transitively hurts iceberg, because
> > it uses avro for its manifests, doesn't it?
> >
> > As for the cutting-edge stuff... anyone at ApacheCon reading this email
> > on Oct 4 should attend the talk "Hadoop Vectored IO", where Mukund Thakur
> > will be presenting the results of hive using the vectored IO version of
> > ORC and seeing a 10-20% reduction in the overall runtime of TPCDH
> > benchmarks (300G). That doesn't need hive changes, just a build of ORC
> > using the new API for async/parallel fetch of stripes. The parquet
> > support with spark benchmarks is still a WIP, but I would expect to see
> > similar numbers, and again, no changes to spark, just parquet.
> >
> > And as the JMH microbenchmarks against the raw local FS show a 20x
> > speedup in reads (async fetch into direct buffers), anyone running spark
> > on a laptop should see some speedups too.
> >
> > Cloudera can ship this stuff internally. But the ASF projects are all
> > stuck in time because of the belief that building against branch-2 makes
> > sense. And it is transitive: hive's requirements hold back iceberg, for
> > example. (See also PARQUET-2173...)
> >
> > If you want your applications to work better, especially in cloud, you
> > should not just be running on a modern version of hadoop (and java11+,
> > ideally); you *and your libraries* should be using the newer APIs to work
> > with the data.
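> >
> > A quick sketch of what using those newer APIs looks like (typed into the
> > mail and untested, so treat it as illustrative: the openFile() builder is
> > the hadoop 3.3.0 API, readVectored() the 3.3.5 one):
> >
> >     import java.nio.ByteBuffer;
> >     import java.util.Arrays;
> >     import java.util.List;
> >     import org.apache.hadoop.conf.Configuration;
> >     import org.apache.hadoop.fs.FSDataInputStream;
> >     import org.apache.hadoop.fs.FileRange;
> >     import org.apache.hadoop.fs.FileStatus;
> >     import org.apache.hadoop.fs.FileSystem;
> >     import org.apache.hadoop.fs.Path;
> >
> >     public class VectoredReadDemo {
> >       public static void main(String[] args) throws Exception {
> >         Path path = new Path(args[0]);
> >         FileSystem fs = path.getFileSystem(new Configuration());
> >
> >         // open with a known FileStatus: lets abfs/s3a/gcs skip the
> >         // HEAD probe on open (the AVRO-3594 point above)
> >         FileStatus st = fs.getFileStatus(path); // or reuse a listing entry
> >         try (FSDataInputStream in = fs.openFile(path)
> >             .withFileStatus(st)
> >             .build().get()) {
> >
> >           // scatter/gather: issue all ranges in one call, then collect
> >           // each buffer as its read completes
> >           // (example offsets; a real reader would use file metadata)
> >           List<FileRange> ranges = Arrays.asList(
> >               FileRange.createFileRange(0, 4096),
> >               FileRange.createFileRange(1 << 20, 4096));
> >           in.readVectored(ranges, ByteBuffer::allocateDirect);
> >           for (FileRange r : ranges) {
> >             ByteBuffer data = r.getData().get();
> >             System.out.println("read " + data.remaining() + " bytes");
> >           }
> >         }
> >       }
> >     }
> >
> > None of that compiles against branch-2, which is rather the point.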
> >
> > Finally, note that while that scatter/gather read call will only be in
> > 3.3.5, we are doing a shim lib to offer the API to apps on older builds;
> > it'll use readFully() to do the reads, just as the default implementation
> > on all filesystems does on hadoop 3.3.5. See
> > https://github.com/steveloughran/fs-api-shim ; it will become a hadoop
> > extension lib. One which will not run on hadoop-2, but 3.2.x+ only.
> > Obviously.
> >
> > steve
> >
> > On Tue, 4 Oct 2022 at 04:16, Dongjoon Hyun <dongjoon.h...@gmail.com>
> > wrote:
> >
> >> Hi, All.
> >>
> >> I'm wondering if the following Apache Spark Hadoop2 binary distribution
> >> is still used by someone in the community or not. If it's not used or
> >> not useful, we may remove it from the Apache Spark 3.4.0 release.
> >>
> >> https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz
> >>
> >> Here is the background of this question.
> >> Since Apache Spark 2.2.0 (SPARK-19493, SPARK-19550), the Apache
> >> Spark community has been building and releasing with Java 8 only.
> >> I believe that user applications also use Java 8+ these days.
> >> Recently, I received the following message from the Hadoop PMC.
> >>
> >> > "If you really want to claim hadoop 2.x compatibility, then you have
> >> > to be building against java 7." Otherwise a lot of people with hadoop
> >> > 2.x clusters won't be able to run your code. If your projects are
> >> > java8+ only, then they are implicitly hadoop 3.1+, no matter what you
> >> > use in your build. Hence: no need for branch-2 branches except to
> >> > complicate your build/test/release processes. [1]
> >>
> >> If the Hadoop2 binary distribution is no longer used as of today, or is
> >> incomplete somewhere due to the Java 8 build, the following three
> >> existing alternative Hadoop 3 binary distributions could be the better
> >> official solution for old Hadoop 2 clusters.
> >>
> >> 1) Scala 2.12 and without-hadoop distribution
> >> 2) Scala 2.12 and Hadoop 3 distribution
> >> 3) Scala 2.13 and Hadoop 3 distribution
> >>
> >> In short, is there anyone who is using the Apache Spark 3.3.0 Hadoop2
> >> binary distribution?
> >>
> >> Dongjoon
> >>
> >> [1]
> >> https://issues.apache.org/jira/browse/ORC-1251?focusedCommentId=17608247&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17608247