Shall we start a new thread to discuss the bundled Hadoop version in PySpark? I don't have a strong opinion on changing the default, as users can still download the Hadoop 2.7 version.
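For reference, staying on the Hadoop 2.7 build regardless of what PyPI ships would look roughly like this. This is only a minimal sketch: the exact artifact names and URLs on the Apache download/archive pages may differ, so treat them as illustrative.

```
# Illustrative only: grab the prebuilt distribution for the Hadoop profile you want
# and run the pyspark that ships with it.
wget https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
tar -xzf spark-3.0.0-bin-hadoop2.7.tgz
export SPARK_HOME="$PWD/spark-3.0.0-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"
pyspark   # uses the Hadoop 2.7 jars bundled in this distribution
```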
On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> To Xiao.
> Why should Apache project releases be blocked by PyPI / CRAN? It's completely optional, isn't it?
>
> > let me repeat my opinion: the top priority is to provide two options for PyPI distribution
>
> IIRC, Apache Spark 3.0.0 failed to upload to CRAN, and this is not the first incident. Apache Spark already has a history of missed SparkR uploads. We don't say Spark 3.0.0 failed because of CRAN uploads or other non-Apache distribution channels. In short, non-Apache distribution channels cannot be a `blocker` for Apache project releases. We only do our best for the community.
>
> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is really irrelevant to this PR. If someone wants to do that and the PR is ready, why don't we do it in the `Apache Spark 3.0.1 timeline`? Why do we wait for December? Is there a reason why we need to wait?
>
> To Sean.
> Thanks!
>
> To Nicholas.
> Do you think `pip install pyspark` is version-agnostic? In the Python world, `pip install somepackage` fails frequently. In production, you should use `pip install somepackage==specificversion`. I don't think production pipelines use non-versioned Python package installations.
>
> The bottom line is that the PR doesn't change PyPI uploading; the as-is PR keeps Hadoop 2.7 for PySpark because of Xiao's comments. I don't think there is a blocker for that PR.
>
> Bests,
> Dongjoon.
>
> On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> To rephrase my earlier email, PyPI users would care about the bundled Hadoop version if they have a workflow that, in effect, looks something like this:
>>
>> ```
>> pip install pyspark
>> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
>> spark.read.parquet('s3a://...')
>> ```
>>
>> I agree that Hadoop 3 would be a better default (again, the s3a support is just much better). But to Xiao's point, if you are expecting Spark to work with a package like hadoop-aws that assumes an older version of Hadoop bundled with Spark, then changing the default may break your workflow.
>>
>> In the case of hadoop-aws the fix is simple: just bump hadoop-aws:2.7.7 to hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that would be more difficult to repair. 🤷♂️
>>
>> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> I'm also genuinely curious when PyPI users would care about the bundled Hadoop jars - do we even need two versions? That itself is extra complexity for end users.
>>> I do think Hadoop 3 is the better choice for the user who doesn't care, and better long term.
>>> OK, but let's at least move ahead with changing defaults.
>>>
>>> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <lix...@databricks.com> wrote:
>>> >
>>> > Hi, Dongjoon,
>>> >
>>> > Please do not misinterpret my point. I already clearly said "I do not know how to track the popularity of Hadoop 2 vs Hadoop 3."
>>> >
>>> > Also, let me repeat my opinion: the top priority is to provide two options for the PyPI distribution and let the end users choose the one they need, Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any breaking change, let us follow our protocol documented in https://spark.apache.org/versioning-policy.html.
>>> >
>>> > If you just want to change the Jenkins setup, I am OK with it. If you want to change the default distribution, we need more discussion in the community to reach an agreement.
>>> >
>>> > Thanks,
>>> >
>>> > Xiao
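P.S. To make Nicholas's example above concrete for the Hadoop 3 side, the pinned workflow he and Dongjoon describe would look roughly like this. It is only a sketch: pin to whichever release you actually run, and keep the hadoop-aws version in line with the Hadoop version bundled in that distribution (3.2.1 is taken from Nicholas's note, not verified here).

```
pip install pyspark==3.0.0                                # pin an exact version, as Dongjoon suggests
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.1     # match the bundled Hadoop version
spark.read.parquet('s3a://...')
```

The only moving part is keeping hadoop-aws consistent with whatever Hadoop the PyPI package bundles, which is exactly why the default matters to these users.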