Re: [DISCUSS] Show Python code examples first in Spark documentation

Hyukjin Kwon Thu, 23 Feb 2023 23:01:43 -0800

That sounds good to have, especially since it would give users more
flexibility.
But I think it's slightly orthogonal to this proposal, which is more about
the default (before users take any action).



On Fri, 24 Feb 2023 at 15:35, Santosh Pingale <[email protected]>
wrote:

> Very interesting and user-focused discussion, thanks for the proposal.
>
> Would it be better to let users set a preference for the language they
> want to see first in the code examples? This preference can easily be
> stored on the browser side and used to decide the ordering. This is in
> line with the freedom users have with Spark today.
>
>
> On Fri, Feb 24, 2023, 4:46 AM Allan Folting <[email protected]> wrote:
>
>> I think this needs to be consistently done on all relevant pages and my
>> intent is to do that work in time for when it is first released.
>> I started with the "Spark SQL, DataFrames and Datasets Guide" page to
>> break it up into multiple, scoped PRs.
>> I should have made that clear before.
>>
>> I think it's a great idea to have an umbrella JIRA for this to outline
>> the full scope and track overall progress and I'm happy to create it.
>>
>> I can't speak on behalf of all Scala users, of course, but I don't think
>> this change makes Scala appear to be a second-class citizen, just as I
>> don't think of Python as a second-class citizen because it is not first
>> currently. It does recognize that Python is more broadly popular today.
>>
>> Thanks,
>> Allan
>>
>> On Thu, Feb 23, 2023 at 6:55 PM Dongjoon Hyun <[email protected]>
>> wrote:
>>
>>> Thank you all.
>>>
>>> Yes, attracting more Python users and being more Python user-friendly is
>>> always good.
>>>
>>> Basically, SPARK-42493 proposes introducing an intentional
>>> inconsistency into the Apache Spark documentation.
>>>
>>> The inconsistency from SPARK-42493 might first raise the following
>>> questions for Python users.
>>>
>>> - Why not the RDD pages, which are the heart of Apache Spark? Is Python
>>> not good with RDDs?
>>> - Why not the ML and Structured Streaming pages, when DATA+AI Summit
>>> focuses heavily on ML?
>>>
>>> Also, more questions for Scala users.
>>> - Is the Scala language being demoted to a second-class citizen?
>>> - What about Scala 3?
>>>
>>> Of course, I understand SPARK-42493 has a specific scope
>>> (SQL/Dataset/DataFrame) and didn't mean anything like the above at all.
>>> However, if SPARK-42493 is emphasized as "the first step" in introducing
>>> that inconsistency, I'm wondering:
>>> - What direction are we heading in?
>>> - What is the next target scope?
>>> - When will it be achieved (or completed)?
>>> - Or is the goal to be permanently inconsistent in terms of the
>>> documentation?
>>>
>>> It's unclear even in the documentation-only scope. If we are expecting
>>> more and more subtasks during the Apache Spark 3.5 timeframe, shall we
>>> have an umbrella JIRA?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Thu, Feb 23, 2023 at 6:15 PM Allan Folting <[email protected]>
>>> wrote:
>>>
>>>> Thanks a lot for the questions and comments/feedback!
>>>>
>>>> To address your questions, Dongjoon: I do not intend for these updates
>>>> to the documentation to be tied to the potential changes/suggestions you
>>>> ask about.
>>>>
>>>> In other words, this proposal is only about adjusting the documentation
>>>> to target the majority of people reading it - namely the large and growing
>>>> number of Python users - and new users in particular as they are often
>>>> already familiar with and have a preference for Python when evaluating or
>>>> starting to use Spark.
>>>>
>>>> While we may want to strengthen support for Python in other ways, I
>>>> think such efforts should be tracked separately from this.
>>>>
>>>> Allan
>>>>
>>>> On Thu, Feb 23, 2023 at 1:44 AM Mich Talebzadeh <
>>>> [email protected]> wrote:
>>>>
>>>>> If this is not just flip-flopping the documentation pages and involves
>>>>> other changes, then a proper impact analysis needs to be done to assess
>>>>> the effort involved. Personally, I don't think it really matters.
>>>>>
>>>>> HTH
>>>>>
>>>>> On Thu, 23 Feb 2023 at 01:40, Hyukjin Kwon <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> > 1. Does this suggestion imply Python API implementation will be the
>>>>>> new blocker in the future in terms of feature parity among languages?
>>>>>> Until now, Python API feature parity was one of the audit items because
>>>>>> it's not enforced. In other words, Scala and Java have been the
>>>>>> full-featured languages because they are the underlying main developer
>>>>>> languages, while the Python/R/SQL environments were nice-to-haves.
>>>>>>
>>>>>> I don't think it would be treated as a blocker, but I do believe we
>>>>>> have added all new features to the Python side for the last couple of
>>>>>> releases. So I wouldn't worry about this at the moment; we have been
>>>>>> doing fine in terms of feature parity.
>>>>>>
>>>>>> > 2. Does this suggestion assume that the Python environment is always
>>>>>> easier for users than Scala/Java? Given that we support Python 3.8 to
>>>>>> 3.11, the support matrix for Python library dependencies is a problem
>>>>>> for the Apache Spark community to solve in order to claim that. As we
>>>>>> say in SPARK-41454, the Python language has also historically introduced
>>>>>> breaking changes for us, and we have many `Pinned` Python library
>>>>>> issues.
>>>>>>
>>>>>> Yes. In fact, regardless of this change, I do believe we should test
>>>>>> more versions, etc., at least in scheduled jobs like the ones we run for
>>>>>> JDK and Scala versions.
>>>>>>
>>>>>>
>>>>>> FWIW, my take about this change is: people use Python and PySpark
>>>>>> more (according to the chart and stats provided) so let's put those
>>>>>> examples first :-).
>>>>>>
>>>>>>
>>>>>> On Thu, 23 Feb 2023 at 10:27, Dongjoon Hyun <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I have two questions to clarify the scope and boundaries.
>>>>>>>
>>>>>>> 1. Does this suggestion imply Python API implementation will be the
>>>>>>> new blocker in the future in terms of feature parity among languages?
>>>>>>> Until now, Python API feature parity was one of the audit items because
>>>>>>> it's not enforced. In other words, Scala and Java have been the
>>>>>>> full-featured languages because they are the underlying main developer
>>>>>>> languages, while the Python/R/SQL environments were nice-to-haves.
>>>>>>>
>>>>>>> 2. Does this suggestion assume that the Python environment is always
>>>>>>> easier for users than Scala/Java? Given that we support Python 3.8 to
>>>>>>> 3.11, the support matrix for Python library dependencies is a problem
>>>>>>> for the Apache Spark community to solve in order to claim that. As we
>>>>>>> say in SPARK-41454, the Python language has also historically introduced
>>>>>>> breaking changes for us, and we have many `Pinned` Python library
>>>>>>> issues.
>>>>>>>
>>>>>>> Changing documentation is easy, but I hope we can give clear
>>>>>>> communication and direction in this effort because this is one of the
>>>>>>> most user-facing changes.
>>>>>>>
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>> On Wed, Feb 22, 2023 at 5:26 PM [email protected] <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> +1 LGTM
>>>>>>>>
>>>>>>>> ------------------------------
>>>>>>>> Ruifeng Zheng
>>>>>>>> [email protected]
>>>>>>>>
>>>>>>>> ------------------ Original ------------------
>>>>>>>> *From:* "Xinrong Meng" <[email protected]>;
>>>>>>>> *Date:* Thu, Feb 23, 2023 09:17 AM
>>>>>>>> *To:* "Allan Folting"<[email protected]>;
>>>>>>>> *Cc:* "dev"<[email protected]>;
>>>>>>>> *Subject:* Re: [DISCUSS] Show Python code examples first in Spark
>>>>>>>> documentation
>>>>>>>>
>>>>>>>> +1 Good idea!
>>>>>>>>
>>>>>>>> On Thu, Feb 23, 2023 at 7:41 AM Jack Goodson <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Good idea. At the company I work at, we discussed using Scala as
>>>>>>>>> our primary language because technically it is slightly stronger than
>>>>>>>>> Python, but ultimately chose Python as it's easier for other devs to
>>>>>>>>> be onboarded to our platform, and future hiring for the team etc.
>>>>>>>>> would be easier.
>>>>>>>>>
>>>>>>>>> On Thu, 23 Feb 2023 at 12:20 PM, Hyukjin Kwon <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> +1 I like this idea too.
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 23, 2023 at 6:00 AM Allan Folting <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I would like to propose that we show Python code examples first
>>>>>>>>>>> in the Spark documentation where we have multiple programming
>>>>>>>>>>> language examples.
>>>>>>>>>>> An example is on the Quick Start page:
>>>>>>>>>>> https://spark.apache.org/docs/latest/quick-start.html
>>>>>>>>>>>
>>>>>>>>>>> I propose this change because Python has become more popular
>>>>>>>>>>> than the other languages supported in Apache Spark. There are a lot
>>>>>>>>>>> more users of Spark in Python than Scala today, and Python attracts a
>>>>>>>>>>> broader set of new users.
>>>>>>>>>>> For Python usage data, see https://www.tiobe.com/tiobe-index/ and
>>>>>>>>>>> https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava
>>>>>>>>>>>
>>>>>>>>>>> Also, this change aligns with Python already being the first tab
>>>>>>>>>>> on our home page:
>>>>>>>>>>> https://spark.apache.org/
>>>>>>>>>>>
>>>>>>>>>>> Anyone who wants to use another language can still just click on
>>>>>>>>>>> the other tabs.
>>>>>>>>>>>
>>>>>>>>>>> I created a draft PR for the Spark SQL, DataFrames and Datasets
>>>>>>>>>>> Guide page as a first step:
>>>>>>>>>>> https://github.com/apache/spark/pull/40087
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I would appreciate it if you could share your thoughts on this
>>>>>>>>>>> proposal.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>> Allan Folting
>>>>>>>>>>>
>>>>>>>>>>
