Hi all,

Since Spark 3.2, we have supported Hadoop 3.3.1, but the profile names
*hadoop-3.2* (and *hadoop-2.7*) no longer match the actual Hadoop versions.
So we made a change in https://github.com/apache/spark/pull/34715
Starting from Spark 3.3, the Hadoop profiles are *hadoop-2* and *hadoop-3*,
and the default profile is *hadoop-3*.
Profile changes

*hadoop-2.7* changed to *hadoop-2*
*hadoop-3.2* changed to *hadoop-3*
Release tar file

Spark-3.3.0 with profile hadoop-3: *spark-3.3.0-bin-hadoop3.tgz*
Spark-3.3.0 with profile hadoop-2: *spark-3.3.0-bin-hadoop2.tgz*

For comparison, the Spark 3.2.0 release tar file was, for example,
*spark-3.2.0-bin-hadoop3.2.tgz*.
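If you want to confirm which Hadoop version a downloaded distribution bundles, one way (a sketch; the exact tarball and directory names below assume the hadoop-3 build of Spark 3.3.0) is to inspect the Hadoop jars shipped in its jars/ directory:

```shell
# Extract the release tarball (name assumes the hadoop-3 build).
tar -xzf spark-3.3.0-bin-hadoop3.tgz

# The bundled Hadoop version is visible in the jar file names,
# e.g. a hadoop-client-api-3.3.x.jar under jars/.
ls spark-3.3.0-bin-hadoop3/jars/hadoop-*
```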
Pip install option changes

For PySpark with or without a specific Hadoop version, you can select the
build by setting the PYSPARK_HADOOP_VERSION environment variable, as below
(Hadoop 3):

PYSPARK_HADOOP_VERSION=3 pip install pyspark

For Hadoop 2:

PYSPARK_HADOOP_VERSION=2 pip install pyspark

Supported values in PYSPARK_HADOOP_VERSION are now:

   - without: Spark pre-built with user-provided Apache Hadoop.
   - 2: Spark pre-built for Apache Hadoop 2.
   - 3: Spark pre-built for Apache Hadoop 3.3 and later (default).
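For example, to install the build that expects a user-provided Hadoop, using the *without* value from the list above:

```shell
# Install PySpark pre-built with user-provided Apache Hadoop
# (you supply the Hadoop jars yourself at runtime).
PYSPARK_HADOOP_VERSION=without pip install pyspark
```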

Building Spark and Specifying the Hadoop Version
<https://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version-and-enabling-yarn>

You can specify the exact version of Hadoop to compile against through the
hadoop.version property.
Example:

./build/mvn -Pyarn -Dhadoop.version=3.3.0 -DskipTests clean package

or you can specify the *hadoop-3* profile explicitly:

./build/mvn -Pyarn -Phadoop-3 -Dhadoop.version=3.3.0 -DskipTests clean package

If you want to build with Hadoop 2.x, enable the *hadoop-2* profile:

./build/mvn -Phadoop-2 -Pyarn -Dhadoop.version=2.8.5 -DskipTests clean package

Notes

On the current master, Maven and SBT simply warn about and ignore the
non-existent profiles -Phadoop-2.7 and -Phadoop-3.2, so a build that still
passes them will silently fall back to the default Hadoop 3.
Please change your profiles to -Phadoop-2 or -Phadoop-3.
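If you are unsure which profiles your checkout actually defines, the Maven help plugin can list them (a sketch; run from the Spark source root):

```shell
# List every profile defined in the build; on master, hadoop-2 and
# hadoop-3 should appear, while hadoop-2.7 / hadoop-3.2 should not.
./build/mvn help:all-profiles | grep -i hadoop
```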
