To Xiao.

Why should Apache project releases be blocked by PyPI / CRAN? It's completely optional, isn't it?
> let me repeat my opinion: the top priority is to provide two options for PyPi distribution

IIRC, Apache Spark 3.0.0 failed to upload to CRAN, and this is not the first incident. Apache Spark already has a history of missed SparkR uploads. We don't say Spark 3.0.0 failed because of CRAN uploading or other non-Apache distribution channels. In short, non-Apache distribution channels cannot be a `blocker` for Apache project releases. We only do our best for the community.

SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is really irrelevant to this PR. If someone wants to do that and the PR is ready, why don't we do it in the `Apache Spark 3.0.1` timeline? Why do we wait until December? Is there a reason why we need to wait?

To Sean.

Thanks!

To Nicholas.

Do you think `pip install pyspark` is version-agnostic? In the Python world, `pip install somepackage` fails frequently. In production, you should use `pip install somepackage==specificversion`. I don't think production pipelines use non-versioned Python package installations.

The bottom line is that the PR doesn't change PyPI uploading; the PR as it stands keeps Hadoop 2.7 for PySpark due to Xiao's comments. I don't think there is a blocker for that PR.

Bests,
Dongjoon.

On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> To rephrase my earlier email, PyPI users would care about the bundled Hadoop version if they have a workflow that, in effect, looks something like this:
>
> ```
> pip install pyspark
> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
> spark.read.parquet('s3a://...')
> ```
>
> I agree that Hadoop 3 would be a better default (again, the s3a support is just much better). But to Xiao's point, if you are expecting Spark to work with some package like hadoop-aws that assumes an older version of Hadoop bundled with Spark, then changing the default may break your workflow.
>
> In the case of hadoop-aws the fix is simple--just bump hadoop-aws:2.7.7 to hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that would be more difficult to repair. 🤷‍♂️
>
> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sro...@gmail.com> wrote:
>
>> I'm also genuinely curious when PyPI users would care about the bundled Hadoop jars - do we even need two versions? That itself is extra complexity for end users.
>> I do think Hadoop 3 is the better choice for the user who doesn't care, and better long term.
>> OK but let's at least move ahead with changing defaults.
>>
>> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <lix...@databricks.com> wrote:
>> >
>> > Hi, Dongjoon,
>> >
>> > Please do not misinterpret my point. I already clearly said "I do not know how to track the popularity of Hadoop 2 vs Hadoop 3."
>> >
>> > Also, let me repeat my opinion: the top priority is to provide two options for PyPi distribution and let the end users choose the ones they need. Hadoop 3.2 or Hadoop 2.7. In general, when we want to make any breaking change, let us follow our protocol documented in https://spark.apache.org/versioning-policy.html.
>> >
>> > If you just want to change the Jenkins setup, I am OK about it. If you want to change the default distribution, we need more discussions in the community for getting an agreement.
>> >
>> > Thanks,
>> >
>> > Xiao
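For concreteness, a version-pinned form of the workflow quoted above might look like the sketch below. The `pyspark==3.0.0` pin is illustrative only, and the example assumes a PyPI-published PySpark build that bundles Hadoop 3.2 (the SPARK-32017 variant), which is exactly what is still under discussion in this thread.

```
# Sketch only: pin the PySpark version, as production installs should, and
# match hadoop-aws to the Hadoop version bundled with that PySpark build.
# Assumes a hypothetical PyPI build of PySpark 3.0.0 bundling Hadoop 3.2.
pip install pyspark==3.0.0
pyspark --packages org.apache.hadoop:hadoop-aws:3.2.1

# then, inside the PySpark shell:
spark.read.parquet('s3a://...')
```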