Shall we start a new thread to discuss the bundled Hadoop version in
PySpark? I don't have a strong opinion on changing the default, as users
can still download the Hadoop 2.7 version.
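
For anyone who wants to stay on Hadoop 2.7, that currently means grabbing the prebuilt convenience binary rather than the PyPI package. A rough sketch of that route (the archive URL follows the usual Apache layout; the exact mirror/path here is an assumption):

```
# Download the Spark 3.0.0 build that bundles Hadoop 2.7 and put it on the PATH
# (URL pattern assumed from the usual Apache archive layout).
curl -O https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
tar -xzf spark-3.0.0-bin-hadoop2.7.tgz
export SPARK_HOME="$PWD/spark-3.0.0-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"
```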

On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> To Xiao.
> Why should Apache project releases be blocked by PyPI / CRAN? Those
> channels are completely optional, aren't they?
>
>     > let me repeat my opinion: the top priority is to provide two
> options for PyPI distribution
>
> IIRC, Apache Spark 3.0.0 failed to upload to CRAN, and this is not the
> first such incident; Apache Spark already has a history of missed SparkR
> uploads. We don't say Spark 3.0.0 failed because of CRAN uploading or any
> other non-Apache distribution channel. In short, non-Apache distribution
> channels cannot be a `blocker` for Apache project releases. We only do our
> best for the community.
>
> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is really
> irrelevant to this PR. If someone wants to do that and the PR is ready, why
> don't we do it in the `Apache Spark 3.0.1` timeline? Why wait for
> December? Is there a reason we need to wait?
>
> To Sean
> Thanks!
>
> To Nicholas.
> Do you think `pip install pyspark` is version-agnostic? In the Python
> world, an unpinned `pip install somepackage` fails frequently. In
> production, you should use `pip install somepackage==specificversion`. I
> don't think production pipelines install non-versioned Python packages.
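>
> To make the pinning point concrete, a minimal sketch (the 3.0.0 version
> number is only an example, not a recommendation):
>
> ```
> # Unpinned: resolves to whatever release is latest at install time.
> pip install pyspark
>
> # Pinned: what a production pipeline would typically use.
> pip install pyspark==3.0.0
> ```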
>
> The bottom line is that the PR doesn't change PyPI uploading; the as-is PR
> keeps Hadoop 2.7 for PySpark per Xiao's comments. I don't think there is
> a blocker for that PR.
>
> Bests,
> Dongjoon.
>
> On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> To rephrase my earlier email, PyPI users would care about the bundled
>> Hadoop version if they have a workflow that, in effect, looks something
>> like this:
>>
>> ```
>> # from a shell: install PySpark and launch it with the S3A connector
>> pip install pyspark
>> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
>> # then, inside the PySpark shell:
>> spark.read.parquet('s3a://...')
>> ```
>>
>> I agree that Hadoop 3 would be a better default (again, the s3a support
>> is just much better). But to Xiao's point, if you are expecting Spark to
>> work with some package like hadoop-aws that assumes an older version of
>> Hadoop is bundled with Spark, then changing the default may break your
>> workflow.
>>
>> In the case of hadoop-aws, the fix is simple: just bump hadoop-aws:2.7.7
>> to hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that
>> would be more difficult to repair. 🤷‍♂️
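>>
>> For illustration, the adjusted workflow under a Hadoop 3.2 build would be
>> the same three steps with the package coordinate bumped (a sketch,
>> assuming hadoop-aws simply tracks the Hadoop version bundled with Spark):
>>
>> ```
>> pip install pyspark
>> # match hadoop-aws to the Hadoop version bundled with Spark
>> pyspark --packages org.apache.hadoop:hadoop-aws:3.2.1
>> # then, inside the PySpark shell:
>> spark.read.parquet('s3a://...')
>> ```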
>>
>> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> I'm also genuinely curious when PyPI users would care about the
>>> bundled Hadoop jars - do we even need two versions? That itself is
>>> extra complexity for end users.
>>> I do think Hadoop 3 is the better choice for the user who doesn't
>>> care, and better long term.
>>> OK, but let's at least move ahead with changing the defaults.
>>>
>>> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <lix...@databricks.com> wrote:
>>> >
>>> > Hi, Dongjoon,
>>> >
>>> > Please do not misinterpret my point. I already clearly said "I do not
>>> know how to track the popularity of Hadoop 2 vs Hadoop 3."
>>> >
>>> > Also, let me repeat my opinion: the top priority is to provide two
>>> options for PyPI distribution and let the end users choose the one they
>>> need: Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any
>>> breaking change, let us follow our protocol documented in
>>> https://spark.apache.org/versioning-policy.html.
>>> >
>>> > If you just want to change the Jenkins setup, I am OK with it. If you
>>> want to change the default distribution, we need more discussion in the
>>> community to reach an agreement.
>>> >
>>> >  Thanks,
>>> >
>>> > Xiao
>>> >
>>>
>>
