Yes, it's yours. I added you (Steve Loughran <ste...@apache.org>) as BCC on the first email, Steve. :)
Dongjoon.

On Tue, Oct 4, 2022 at 6:24 AM Steve Loughran <ste...@cloudera.com> wrote:

> that sounds suspiciously like something I'd write :)
>
> the move to java8 happened in HADOOP-11858; 3.0.0.
>
> HADOOP-16219, "[JDK8] Set minimum version of Hadoop 2 to JDK 8", has been
> open since 2019 and I just closed it as WONTFIX.
>
> Most of the big production hadoop 2 clusters use java7, because that is
> what they were deployed with, and if you are upgrading java versions then
> you'd want to upgrade to a java8 version of guava (with fixes) and a java8
> version of jackson (with fixes), and at that point "upgrade the cluster"
> becomes the strategy.
>
> If spark (or parquet, or avro, or ORC) claims to work on hadoop-2, then it's
> not enough to set hadoop.version in the build; it needs full integration
> testing with all those long-out-of-date transitive dependencies. And who
> does that? Nobody.
>
> Does still claiming to support hadoop-2 cause any problems? Yes, because
> it forces anything which wants to use more recent APIs either to play
> reflection games (SparkHadoopUtil.createFile()...), to keep branch-3-only
> source trees (spark-hadoop-cloud), or to stay stuck on older
> classes/methods for no real benefit. Finally: what are you going to do if
> someone actually files a bug related to spark 3.3.1 on hadoop 2.8.1? Is
> anyone really going to care?
>
> Where this really frustrates me is that the libraries used downstream,
> which worry about java11 and java17 compatibility etc., still set
> hadoop.version to 2.10, even though it blocks them from basic
> improvements, such as skipping a HEAD request whenever they open a file
> on abfs, s3a or gcs (AVRO-3594). Which transitively hurts iceberg,
> because it uses avro for its manifests, doesn't it?
>
> As for the cutting-edge stuff: anyone at ApacheCon reading this email on
> Oct 4 should attend the talk "Hadoop Vectored IO", where Mukund Thakur will
> be presenting the results of hive using the vectored IO version of ORC and
> seeing a 10-20% reduction in the overall runtime of TPC-DS benchmarks
> (300G). That doesn't need hive changes, just a build of ORC using the new
> API for async/parallel fetch of stripes. The parquet support with spark
> benchmarks is still a WIP, but I would expect to see similar numbers, and
> again, no changes to spark, just parquet.
>
> And as the JMH microbenchmarks against the raw local FS show a 20x speedup
> in reads (async fetch into direct buffers), anyone running spark on a
> laptop should see some speedups too.
>
> Cloudera can ship this stuff internally. But the ASF projects are all
> stuck in time because of the belief that building against branch-2 makes
> sense. And it is transitive: hive's requirements hold back iceberg, for
> example. (See also PARQUET-2173 ...)
>
> If you want your applications to work better, especially in cloud, you
> should not just be running on a modern version of hadoop (and java11+,
> ideally); you *and your libraries* should be using the newer APIs to work
> with the data.
>
> Finally, note that while that scatter/gather read call will only be in
> 3.3.5, we are doing a shim lib to offer the API to apps on older builds;
> it'll use readFully() to do the reads, just as the default implementation
> on all filesystems does on hadoop 3.3.5. See
> https://github.com/steveloughran/fs-api-shim ; it will become a hadoop
> extension lib. One which will not run on hadoop-2, but on 3.2.x+ only.
> Obviously
>
> steve
>
> On Tue, 4 Oct 2022 at 04:16, Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Hi, All.
>>
>> I'm wondering if the following Apache Spark Hadoop2 binary distribution
>> is still used by someone in the community or not. If it's not used or
>> not useful, we may remove it from the Apache Spark 3.4.0 release.
>>
>> https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz
>>
>> Here is the background of this question.
>> Since Apache Spark 2.2.0 (SPARK-19493, SPARK-19550), the Apache Spark
>> community has been building and releasing with Java 8 only.
>> I believe that user applications also use Java 8+ these days.
>> Recently, I received the following message from the Hadoop PMC.
>>
>> > "if you really want to claim hadoop 2.x compatibility, then you have to
>> > be building against java 7". Otherwise a lot of people with hadoop 2.x
>> > clusters won't be able to run your code. If your projects are java8+
>> > only, then they are implicitly hadoop 3.1+, no matter what you use
>> > in your build. Hence: no need for branch-2 branches except
>> > to complicate your build/test/release processes [1]
>>
>> If the Hadoop2 binary distribution is no longer used as of today,
>> or is incomplete somewhere due to the Java 8 build, the following three
>> existing alternative Hadoop 3 binary distributions could be
>> a better official solution for old Hadoop 2 clusters.
>>
>> 1) Scala 2.12 and without-hadoop distribution
>> 2) Scala 2.12 and Hadoop 3 distribution
>> 3) Scala 2.13 and Hadoop 3 distribution
>>
>> In short, is there anyone who is using the Apache Spark 3.3.0 Hadoop2
>> binary distribution?
>>
>> Dongjoon
>>
>> [1]
>> https://issues.apache.org/jira/browse/ORC-1251?focusedCommentId=17608247&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17608247
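To make the vectored IO discussion above concrete, here is a minimal
client-side sketch against the Hadoop 3.3.5 API (the FileRange/readVectored
names come from the 3.3.5 vectored read work; the path, offsets and lengths
below are invented for illustration):

    import java.nio.ByteBuffer;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileRange;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class VectoredReadSketch {
      public static void main(String[] args) throws Exception {
        // Hypothetical input; any path readable through a 3.3.5+ client.
        Path path = new Path("s3a://some-bucket/some-file.orc");
        FileSystem fs = path.getFileSystem(new Configuration());

        // Declare all wanted ranges up front (offsets/lengths made up here);
        // stores may fetch them in parallel and coalesce nearby ranges.
        List<FileRange> ranges = Arrays.asList(
            FileRange.createFileRange(0, 16_384),
            FileRange.createFileRange(1_048_576, 65_536));

        try (FSDataInputStream in = fs.open(path)) {
          // One call starts all the reads; buffers come from the allocator.
          in.readVectored(ranges, ByteBuffer::allocateDirect);

          // Each range exposes a CompletableFuture holding its data.
          for (FileRange range : ranges) {
            ByteBuffer data = range.getData().get();
            System.out.printf("offset %d: %d bytes%n",
                range.getOffset(), data.remaining());
          }
        }
      }
    }

On releases without the API, the shim approach Steve describes would issue a
blocking readFully() per range instead, which is also what the default
readVectored() implementation on hadoop 3.3.5 does for filesystems that have
no custom implementation.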