I don't have a strong opinion on changing the default either, but I would slightly prefer to have the option to switch the Hadoop version in place first, just to stay on the safer side.
To be clear, what we're now mainly discussing is the timing of when to make Hadoop 3 the default, and which change has to land first, right? I can work on the option to switch the Hadoop version soon. I guess that will consolidate all the opinions here (?).

On Thu, 25 Jun 2020, 04:25 Andrew Melo, <andrew.m...@gmail.com> wrote:

> Hello,
>
> On Wed, Jun 24, 2020 at 2:13 PM Holden Karau <hol...@pigscanfly.ca> wrote:
> >
> > So I thought our theory for the pypi packages was it was for local developers, they really shouldn't care about the Hadoop version. If you're running on a production cluster you ideally pip install from the same release artifacts as your production cluster to match.
>
> That's certainly one use of pypi packages, but not the only one. In our case, we provide clusters for our users, with SPARK_CONF pre-configured with (e.g.) the master connection URL. But the analyses they're doing are their own and unique, so they work in their own personal python virtual environments. There are no "release artifacts" to publish, per se, since each user is working independently and can install whatever they'd like into their virtual environments.
>
> Cheers
> Andrew
>
> >
> > On Wed, Jun 24, 2020 at 12:11 PM Wenchen Fan <cloud0...@gmail.com> wrote:
> >>
> >> Shall we start a new thread to discuss the bundled Hadoop version in PySpark? I don't have a strong opinion on changing the default, as users can still download the Hadoop 2.7 version.
> >>
> >> On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> >>>
> >>> To Xiao.
> >>> Why should Apache project releases be blocked by PyPi / CRAN? It's completely optional, isn't it?
> >>>
> >>> > let me repeat my opinion: the top priority is to provide two options for PyPi distribution
> >>>
> >>> IIRC, Apache Spark 3.0.0 failed to upload to CRAN and this is not the first incident. Apache Spark already has a history of missing SparkR uploads. We don't say Spark 3.0.0 failed due to CRAN uploading or other non-Apache distribution channels. In short, non-Apache distribution channels cannot be a `blocker` for Apache project releases. We only do our best for the community.
> >>>
> >>> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is really irrelevant to this PR. If someone wants to do that and the PR is ready, why don't we do it in the `Apache Spark 3.0.1` timeline? Why do we wait for December? Is there a reason why we need to wait?
> >>>
> >>> To Sean.
> >>> Thanks!
> >>>
> >>> To Nicholas.
> >>> Do you think `pip install pyspark` is version-agnostic? In the Python world, `pip install somepackage` fails frequently. In production, you should use `pip install somepackage==specificversion`. I don't think a production pipeline has non-versioned Python package installation.
> >>>
> >>> The bottom line is that the PR doesn't change PyPi uploading, and the AS-IS PR keeps Hadoop 2.7 on PySpark due to Xiao's comments. I don't think there is a blocker for that PR.
> >>>
> >>> Bests,
> >>> Dongjoon.
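
(To make the pinning point above concrete, a minimal sketch of the difference is below; the version number is only an illustration, not a recommendation.)

```
# Unpinned: resolves to whatever the latest PySpark release is,
# including whichever Hadoop version that release happens to bundle.
pip install pyspark

# Pinned: a reproducible install of one specific release, which is what
# a production pipeline should generally use.
pip install pyspark==3.0.0
```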
> >>>
> >>> On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> >>>>
> >>>> To rephrase my earlier email, PyPI users would care about the bundled Hadoop version if they have a workflow that, in effect, looks something like this:
> >>>>
> >>>> ```
> >>>> pip install pyspark
> >>>> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
> >>>> spark.read.parquet('s3a://...')
> >>>> ```
> >>>>
> >>>> I agree that Hadoop 3 would be a better default (again, the s3a support is just much better). But to Xiao's point, if you are expecting Spark to work with some package like hadoop-aws that assumes an older version of Hadoop bundled with Spark, then changing the default may break your workflow.
> >>>>
> >>>> In the case of hadoop-aws the fix is simple--just bump hadoop-aws:2.7.7 to hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that would be more difficult to repair. 🤷‍♂️
> >>>>
> >>>> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sro...@gmail.com> wrote:
> >>>>>
> >>>>> I'm also genuinely curious when PyPI users would care about the bundled Hadoop jars - do we even need two versions? That itself is extra complexity for end users.
> >>>>> I do think Hadoop 3 is the better choice for the user who doesn't care, and better long term.
> >>>>> OK but let's at least move ahead with changing defaults.
> >>>>>
> >>>>> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <lix...@databricks.com> wrote:
> >>>>> >
> >>>>> > Hi, Dongjoon,
> >>>>> >
> >>>>> > Please do not misinterpret my point. I already clearly said "I do not know how to track the popularity of Hadoop 2 vs Hadoop 3."
> >>>>> >
> >>>>> > Also, let me repeat my opinion: the top priority is to provide two options for PyPi distribution and let the end users choose the ones they need: Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any breaking change, let us follow our protocol documented in https://spark.apache.org/versioning-policy.html.
> >>>>> >
> >>>>> > If you just want to change the Jenkins setup, I am OK with it. If you want to change the default distribution, we need more discussion in the community to reach an agreement.
> >>>>> >
> >>>>> > Thanks,
> >>>>> >
> >>>>> > Xiao
> >
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
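
(For reference, the repair Nicholas describes for the hadoop-aws case would look roughly like the sketch below if the PyPI package moved to a Hadoop 3.2 bundle; the exact hadoop-aws artifact version is an assumption and has to match whatever Hadoop version actually ships in the pyspark package.)

```
pip install pyspark
# hadoop-aws must match the Hadoop version bundled with the pyspark package;
# e.g. bump 2.7.7 -> 3.2.1 if the PyPI distribution moves to Hadoop 3.2.
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.1
spark.read.parquet('s3a://...')
```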