I think this needs to be done consistently on all relevant pages, and my intent is to do that work in time for when it is first released. I started with the "Spark SQL, DataFrames and Datasets Guide" page, breaking the work up into multiple, scoped PRs. I should have made that clear before.

I think it's a great idea to have an umbrella JIRA for this to outline the full scope and track overall progress, and I'm happy to create it. I can't speak on behalf of all Scala users of course, but I don't think this change makes Scala appear to be a 2nd class citizen, just as I don't think of Python as a 2nd class citizen because it is not currently listed first. It does, however, recognize that Python is more broadly popular today.

Thanks,
Allan

On Thu, Feb 23, 2023 at 6:55 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Thank you all.
>
> Yes, attracting more Python users and being more Python user-friendly is always good.
>
> Basically, SPARK-42493 is proposing to introduce intentional inconsistency into the Apache Spark documentation.
>
> The inconsistency from SPARK-42493 might prompt the following questions from Python users first.
>
> - Why not the RDD pages, which are the heart of Apache Spark? Is Python not good for RDDs?
> - Why not the ML and Structured Streaming pages, when the DATA+AI Summit focuses heavily on ML?
>
> It also raises more questions for Scala users.
>
> - Is the Scala language stepping down to 2nd class citizen status?
> - What about Scala 3?
>
> Of course, I understand SPARK-42493 has specific scopes (SQL/Dataset/DataFrame) and didn't mean anything like the above at all. However, if SPARK-42493 is emphasized as "the first step" to introduce that inconsistency, I'm wondering:
>
> - What direction are we heading in?
> - What is the next target scope?
> - When will it be achieved (or completed)?
> - Or, is the goal to be permanently inconsistent in terms of the documentation?
>
> It's unclear even in the documentation-only scope. If we are expecting more and more subtasks during the Apache Spark 3.5 timeframe, shall we have an umbrella JIRA?
>
> Bests,
> Dongjoon.
>
> On Thu, Feb 23, 2023 at 6:15 PM Allan Folting <afolting...@gmail.com> wrote:
>
>> Thanks a lot for the questions and comments/feedback!
>>
>> To address your questions, Dongjoon, I do not intend for these documentation updates to be tied to the potential changes/suggestions you ask about.
>>
>> In other words, this proposal is only about adjusting the documentation to target the majority of people reading it - namely the large and growing number of Python users - and new users in particular, as they are often already familiar with and have a preference for Python when evaluating or starting to use Spark.
>>
>> While we may want to strengthen support for Python in other ways, I think such efforts should be tracked separately from this.
>>
>> Allan
>>
>> On Thu, Feb 23, 2023 at 1:44 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> If this is not just flip-flopping the document pages and involves other changes, then a proper impact analysis needs to be done to assess the effort involved. Personally I don't think it really matters.
>>>
>>> HTH
>>>
>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>
>>> On Thu, 23 Feb 2023 at 01:40, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>
>>>> > 1. Does this suggestion imply Python API implementation will be the new blocker in the future in terms of feature parity among languages? Until now, Python API feature parity was one of the audit items because it's not enforced. In other words, Scala and Java have been full-featured because they are the underlying main developer languages, while the Python/R/SQL environments were nice-to-haves.
>>>>
>>>> I think it wouldn't be treated as a blocker, but I do believe we have added all new features to the Python side for the last couple of releases. So, I wouldn't worry about this at this moment - we have been doing fine in terms of feature parity.
>>>>
>>>> > 2. Does this suggestion assume that the Python environment is always easier for users than Scala/Java? Given that we support Python 3.8 to 3.11, the support matrix for Python library dependencies is a problem for the Apache Spark community to solve in order to claim that. As we saw in SPARK-41454, the Python language has also historically introduced breaking changes for us, and we have many pinned Python library issues.
>>>>
>>>> Yes. In fact, regardless of this change, I do believe we should test more versions, etc., at least in scheduled jobs, like we do for JDK and Scala versions.
>>>>
>>>> FWIW, my take on this change is: people use Python and PySpark more (according to the chart and stats provided), so let's put those examples first :-).
>>>>
>>>> On Thu, 23 Feb 2023 at 10:27, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>>> I have two questions to clarify the scope and boundaries.
>>>>>
>>>>> 1. Does this suggestion imply Python API implementation will be the new blocker in the future in terms of feature parity among languages? Until now, Python API feature parity was one of the audit items because it's not enforced. In other words, Scala and Java have been full-featured because they are the underlying main developer languages, while the Python/R/SQL environments were nice-to-haves.
>>>>>
>>>>> 2. Does this suggestion assume that the Python environment is always easier for users than Scala/Java? Given that we support Python 3.8 to 3.11, the support matrix for Python library dependencies is a problem for the Apache Spark community to solve in order to claim that. As we saw in SPARK-41454, the Python language has also historically introduced breaking changes for us, and we have many pinned Python library issues.
>>>>>
>>>>> Changing documentation is easy, but I hope we can give clear communication and direction in this effort, because this is one of the most user-facing changes.
>>>>>
>>>>> Dongjoon.
>>>>>
>>>>> On Wed, Feb 22, 2023 at 5:26 PM 416161...@qq.com <ruife...@foxmail.com> wrote:
>>>>>
>>>>>> +1 LGTM
>>>>>>
>>>>>> ------------------------------
>>>>>> Ruifeng Zheng
>>>>>> ruife...@foxmail.com
>>>>>>
>>>>>> ------------------ Original ------------------
>>>>>> *From:* "Xinrong Meng" <xinrong.apa...@gmail.com>;
>>>>>> *Date:* Thu, Feb 23, 2023 09:17 AM
>>>>>> *To:* "Allan Folting" <afolting...@gmail.com>;
>>>>>> *Cc:* "dev" <dev@spark.apache.org>;
>>>>>> *Subject:* Re: [DISCUSS] Show Python code examples first in Spark documentation
>>>>>>
>>>>>> +1 Good idea!
>>>>>>
>>>>>> On Thu, Feb 23, 2023 at 7:41 AM Jack Goodson <jackagood...@gmail.com> wrote:
>>>>>>
>>>>>>> Good idea. At the company I work at, we discussed using Scala as our primary language because technically it is slightly stronger than Python, but we ultimately chose Python as it's easier to onboard other devs to our platform, and future hiring for the team etc. would be easier.
>>>>>>>
>>>>>>> On Thu, 23 Feb 2023 at 12:20 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 I like this idea too.
>>>>>>>>
>>>>>>>> On Thu, Feb 23, 2023 at 6:00 AM Allan Folting <afolting...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I would like to propose that we show Python code examples first in the Spark documentation where we have multiple programming language examples. An example is on the Quick Start page: https://spark.apache.org/docs/latest/quick-start.html
>>>>>>>>>
>>>>>>>>> I propose this change because Python has become more popular than the other languages supported in Apache Spark. There are a lot more users of Spark in Python than Scala today, and Python attracts a broader set of new users. For Python usage data, see https://www.tiobe.com/tiobe-index/ and https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava.
>>>>>>>>>
>>>>>>>>> Also, this change aligns with Python already being the first tab on our home page: https://spark.apache.org/
>>>>>>>>>
>>>>>>>>> Anyone who wants to use another language can still just click on the other tabs.
>>>>>>>>>
>>>>>>>>> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide page as a first step: https://github.com/apache/spark/pull/40087
>>>>>>>>>
>>>>>>>>> I would appreciate it if you could share your thoughts on this proposal.
>>>>>>>>>
>>>>>>>>> Thanks a lot,
>>>>>>>>> Allan Folting
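
For context, here is a minimal sketch of the kind of Python snippet this proposal would surface first on pages like the Quick Start guide. It is loosely modeled on that page; the app name and file path below are illustrative, not taken from the docs.

    # Hypothetical example of the Python tab that would be shown first
    # under this proposal (app name and file path are illustrative).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("QuickStartSketch").getOrCreate()

    # Read a text file into a DataFrame and run two simple actions.
    text_df = spark.read.text("README.md")
    print(text_df.count())   # number of lines in the file
    print(text_df.first())   # first line as a Row

    spark.stop()

Scala and Java users would still see the equivalent snippet under their own tabs; only the tab order would change.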