Hello,

On Wed, Jun 24, 2020 at 2:13 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>
> So I thought our theory for the PyPI packages was that they were for local 
> development, where developers really shouldn't care about the Hadoop version. 
> If you're running on a production cluster, you ideally pip install from the 
> same release artifacts as your production cluster so the versions match.
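>
> For example (the version and URL below are purely illustrative), that could be 
> as simple as installing the pyspark artifact published with the same release 
> your cluster runs:
>
> ```
> # install the pyspark sdist that ships alongside the cluster's Spark release
> pip install https://archive.apache.org/dist/spark/spark-3.0.0/pyspark-3.0.0.tar.gz
> ```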

That's certainly one use of PyPI packages, but not the only one. In
our case, we provide clusters for our users, with SPARK_CONF
pre-configured with (e.g.) the master connection URL. But the analyses
they're doing are their own and unique, so they work in their own
personal Python virtual environments. There are no "release artifacts"
to publish, per se, since each user works independently and can
install whatever they'd like into their virtual environment.
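
Concretely (paths and versions below are only illustrative, and we point
SPARK_CONF_DIR at the conf we manage), a user's setup looks roughly like:

```
# each user works in their own virtual environment
python -m venv ~/envs/my-analysis
source ~/envs/my-analysis/bin/activate
pip install pyspark        # plus whatever analysis libraries they need

# the master URL and other cluster settings come from the pre-configured
# Spark conf, so user code never hard-codes cluster details
pyspark
```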

Cheers
Andrew

>
> On Wed, Jun 24, 2020 at 12:11 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>> Shall we start a new thread to discuss the bundled Hadoop version in 
>> PySpark? I don't have a strong opinion on changing the default, as users can 
>> still download the Hadoop 2.7 version.
>>
>> On Thu, Jun 25, 2020 at 2:23 AM Dongjoon Hyun <dongjoon.h...@gmail.com> 
>> wrote:
>>>
>>> To Xiao.
>>> Why should Apache project releases be blocked by PyPI / CRAN? It's 
>>> completely optional, isn't it?
>>>
>>>     > let me repeat my opinion: the top priority is to provide two options 
>>> for PyPI distribution
>>>
>>> IIRC, Apache Spark 3.0.0 failed to upload to CRAN, and this is not the first 
>>> such incident; Apache Spark already has a history of missed SparkR uploads. 
>>> We don't say Spark 3.0.0 failed because of CRAN uploading or other non-Apache 
>>> distribution channels. In short, non-Apache distribution channels cannot be 
>>> a `blocker` for Apache project releases. We can only do our best for the 
>>> community.
>>>
>>> SPARK-32017 (Make Pyspark Hadoop 3.2+ Variant available in PyPI) is really 
>>> irrelevant to this PR. If someone wants to do that and the PR is ready, why 
>>> don't we do it in the `Apache Spark 3.0.1` timeline? Why wait until December? 
>>> Is there a reason we need to wait?
>>>
>>> To Sean
>>> Thanks!
>>>
>>> To Nicholas.
>>> Do you think `pip install pyspark` is version-agnostic? In the Python 
>>> world, `pip install somepackage` fails frequently. In production, you 
>>> should use `pip install somepackage==specificversion`. I don't think 
>>> production pipelines rely on non-versioned Python package installation.
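>>>
>>> For example (the version number is just an illustration):
>>>
>>> ```
>>> # pin the exact PySpark version your pipeline was built and tested against
>>> pip install pyspark==3.0.0
>>> ```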
>>>
>>> The bottom line is that the PR doesn't change PyPI uploading; the as-is PR 
>>> keeps Hadoop 2.7 for PySpark due to Xiao's comments. I don't think there is 
>>> a blocker for that PR.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Wed, Jun 24, 2020 at 10:54 AM Nicholas Chammas 
>>> <nicholas.cham...@gmail.com> wrote:
>>>>
>>>> To rephrase my earlier email, PyPI users would care about the bundled 
>>>> Hadoop version if they have a workflow that, in effect, looks something 
>>>> like this:
>>>>
>>>> ```
>>>> pip install pyspark
>>>> pyspark --packages org.apache.hadoop:hadoop-aws:2.7.7
>>>> spark.read.parquet('s3a://...')
>>>> ```
>>>>
>>>> I agree that Hadoop 3 would be a better default (again, the s3a support is 
>>>> just much better). But to Xiao's point, if you are expecting Spark to work 
>>>> with some package like hadoop-aws that assumes an older version of Hadoop 
>>>> bundled with Spark, then changing the default may break your workflow.
>>>>
>>>> In the case of hadoop-aws, the fix is simple: just bump hadoop-aws:2.7.7 to 
>>>> hadoop-aws:3.2.1. But perhaps there are other PyPI-based workflows that 
>>>> would be more difficult to repair. 🤷‍♂️
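>>>>
>>>> For concreteness, the repaired workflow would simply be (assuming the PyPI 
>>>> package bundled Hadoop 3.2):
>>>>
>>>> ```
>>>> pip install pyspark
>>>> pyspark --packages org.apache.hadoop:hadoop-aws:3.2.1
>>>> spark.read.parquet('s3a://...')
>>>> ```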
>>>>
>>>> On Wed, Jun 24, 2020 at 1:44 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>> I'm also genuinely curious when PyPI users would care about the
>>>>> bundled Hadoop jars - do we even need two versions? That itself is
>>>>> extra complexity for end users.
>>>>> I do think Hadoop 3 is the better choice for the user who doesn't
>>>>> care, and the better choice long term.
>>>>> OK, but let's at least move ahead with changing the defaults.
>>>>>
>>>>> On Wed, Jun 24, 2020 at 12:38 PM Xiao Li <lix...@databricks.com> wrote:
>>>>> >
>>>>> > Hi, Dongjoon,
>>>>> >
>>>>> > Please do not misinterpret my point. I already clearly said "I do not 
>>>>> > know how to track the popularity of Hadoop 2 vs Hadoop 3."
>>>>> >
>>>>> > Also, let me repeat my opinion: the top priority is to provide two 
>>>>> > options for PyPI distribution and let the end users choose the one 
>>>>> > they need: Hadoop 3.2 or Hadoop 2.7. In general, when we want to make 
>>>>> > any breaking change, let us follow our protocol documented in 
>>>>> > https://spark.apache.org/versioning-policy.html.
>>>>> >
>>>>> > If you just want to change the Jenkins setup, I am OK with it. If you 
>>>>> > want to change the default distribution, we need more discussion in 
>>>>> > the community to reach an agreement.
>>>>> >
>>>>> >  Thanks,
>>>>> >
>>>>> > Xiao
>>>>> >
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
