I think this needs to be done consistently on all relevant pages, and my intent is to do that work in time for when it is first released. I started with the "Spark SQL, DataFrames and Datasets Guide" page, breaking the work up into multiple, scoped PRs. I should have made that clear before.

I think it's a great idea to have an umbrella JIRA for this to outline the full scope and track overall progress, and I'm happy to create it. I can't speak on behalf of all Scala users of course, but I don't think this change makes Scala appear to be a 2nd class citizen, just as I don't think of Python as a 2nd class citizen because it is not currently listed first. It does, however, recognize that Python is more broadly popular today.

Thanks,
Allan

On Thu, Feb 23, 2023 at 6:55 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Thank you all.
>
> Yes, attracting more Python users and being more Python user-friendly is always good.
>
> Basically, SPARK-42493 is proposing to introduce intentional inconsistency into the Apache Spark documentation.
>
> The inconsistency from SPARK-42493 might prompt the following questions from Python users first.
>
> - Why not the RDD pages, which are the heart of Apache Spark? Is Python not good for RDDs?
> - Why not the ML and Structured Streaming pages, when the DATA+AI Summit focuses heavily on ML?
>
> It also raises more questions for Scala users.
>
> - Is the Scala language stepping down to 2nd class citizen status?
> - What about Scala 3?
>
> Of course, I understand SPARK-42493 has specific scopes (SQL/Dataset/DataFrame) and didn't mean anything like the above at all. However, if SPARK-42493 is emphasized as "the first step" to introduce that inconsistency, I'm wondering:
>
> - What direction are we heading in?
> - What is the next target scope?
> - When will it be achieved (or completed)?
> - Or, is the goal to be permanently inconsistent in terms of the documentation?
>
> It's unclear even in the documentation-only scope. If we are expecting more and more subtasks during the Apache Spark 3.5 timeframe, shall we have an umbrella JIRA?
>
> Bests,
> Dongjoon.
>
> On Thu, Feb 23, 2023 at 6:15 PM Allan Folting <afolting...@gmail.com> wrote:
>
>> Thanks a lot for the questions and comments/feedback!
>>
>> To address your questions, Dongjoon, I do not intend for these documentation updates to be tied to the potential changes/suggestions you ask about.
>>
>> In other words, this proposal is only about adjusting the documentation to target the majority of people reading it - namely the large and growing number of Python users - and new users in particular, as they are often already familiar with and have a preference for Python when evaluating or starting to use Spark.
>>
>> While we may want to strengthen support for Python in other ways, I think such efforts should be tracked separately from this.
>>
>> Allan
>>
>> On Thu, Feb 23, 2023 at 1:44 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> If this is not just flip-flopping the document pages and involves other changes, then a proper impact analysis needs to be done to assess the effort involved. Personally I don't think it really matters.
>>>
>>> HTH
>>>
>>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>>>
>>> On Thu, 23 Feb 2023 at 01:40, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>
>>>> > 1. Does this suggestion imply Python API implementation will be the new blocker in the future in terms of feature parity among languages? Until now, Python API feature parity was one of the audit items because it's not enforced. In other words, Scala and Java have been full-featured because they are the underlying main developer languages, while the Python/R/SQL environments were nice-to-haves.
>>>>
>>>> I think it wouldn't be treated as a blocker, but I do believe we have added all new features to the Python side for the last couple of releases. So, I wouldn't worry about this at this moment - we have been doing fine in terms of feature parity.
>>>>
>>>> > 2. Does this suggestion assume that the Python environment is always easier for users than Scala/Java? Given that we support Python 3.8 to 3.11, the support matrix for Python library dependencies is a problem for the Apache Spark community to solve in order to claim that. As we saw in SPARK-41454, the Python language has also historically introduced breaking changes for us, and we have many pinned Python library issues.
>>>>
>>>> Yes. In fact, regardless of this change, I do believe we should test more versions, etc., at least in scheduled jobs, like we do for JDK and Scala versions.
>>>>
>>>> FWIW, my take on this change is: people use Python and PySpark more (according to the chart and stats provided), so let's put those examples first :-).
>>>>
>>>> On Thu, 23 Feb 2023 at 10:27, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>
>>>>> I have two questions to clarify the scope and boundaries.
>>>>>
>>>>> 1. Does this suggestion imply Python API implementation will be the new blocker in the future in terms of feature parity among languages? Until now, Python API feature parity was one of the audit items because it's not enforced. In other words, Scala and Java have been full-featured because they are the underlying main developer languages, while the Python/R/SQL environments were nice-to-haves.
>>>>>
>>>>> 2. Does this suggestion assume that the Python environment is always easier for users than Scala/Java? Given that we support Python 3.8 to 3.11, the support matrix for Python library dependencies is a problem for the Apache Spark community to solve in order to claim that. As we saw in SPARK-41454, the Python language has also historically introduced breaking changes for us, and we have many pinned Python library issues.
>>>>>
>>>>> Changing documentation is easy, but I hope we can give clear communication and direction in this effort, because this is one of the most user-facing changes.
>>>>>
>>>>> Dongjoon.
>>>>>
>>>>> On Wed, Feb 22, 2023 at 5:26 PM 416161...@qq.com <ruife...@foxmail.com> wrote:
>>>>>
>>>>>> +1 LGTM
>>>>>>
>>>>>> ------------------------------
>>>>>> Ruifeng Zheng
>>>>>> ruife...@foxmail.com
>>>>>>
>>>>>> ------------------ Original ------------------
>>>>>> *From:* "Xinrong Meng" <xinrong.apa...@gmail.com>;
>>>>>> *Date:* Thu, Feb 23, 2023 09:17 AM
>>>>>> *To:* "Allan Folting" <afolting...@gmail.com>;
>>>>>> *Cc:* "dev" <dev@spark.apache.org>;
>>>>>> *Subject:* Re: [DISCUSS] Show Python code examples first in Spark documentation
>>>>>>
>>>>>> +1 Good idea!
>>>>>>
>>>>>> On Thu, Feb 23, 2023 at 7:41 AM Jack Goodson <jackagood...@gmail.com> wrote:
>>>>>>
>>>>>>> Good idea. At the company I work at, we discussed using Scala as our primary language because technically it is slightly stronger than Python, but we ultimately chose Python as it's easier to onboard other devs to our platform, and future hiring for the team etc. would be easier.
>>>>>>>
>>>>>>> On Thu, 23 Feb 2023 at 12:20 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>
>>>>>>>> +1 I like this idea too.
>>>>>>>>
>>>>>>>> On Thu, Feb 23, 2023 at 6:00 AM Allan Folting <afolting...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I would like to propose that we show Python code examples first in the Spark documentation where we have multiple programming language examples. An example is on the Quick Start page: https://spark.apache.org/docs/latest/quick-start.html
>>>>>>>>>
>>>>>>>>> I propose this change because Python has become more popular than the other languages supported in Apache Spark. There are a lot more users of Spark in Python than Scala today, and Python attracts a broader set of new users. For Python usage data, see https://www.tiobe.com/tiobe-index/ and https://insights.stackoverflow.com/trends?tags=r%2Cscala%2Cpython%2Cjava.
>>>>>>>>>
>>>>>>>>> Also, this change aligns with Python already being the first tab on our home page: https://spark.apache.org/
>>>>>>>>>
>>>>>>>>> Anyone who wants to use another language can still just click on the other tabs.
>>>>>>>>>
>>>>>>>>> I created a draft PR for the Spark SQL, DataFrames and Datasets Guide page as a first step: https://github.com/apache/spark/pull/40087
>>>>>>>>>
>>>>>>>>> I would appreciate it if you could share your thoughts on this proposal.
>>>>>>>>>
>>>>>>>>> Thanks a lot,
>>>>>>>>> Allan Folting
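
For context, here is a minimal sketch of the kind of Python snippet this proposal would surface first on pages like the Quick Start guide. It is loosely modeled on that page; the app name and file path below are illustrative, not taken from the docs.

    # Hypothetical example of the Python tab that would be shown first
    # under this proposal (app name and file path are illustrative).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("QuickStartSketch").getOrCreate()

    # Read a text file into a DataFrame and run two simple actions.
    text_df = spark.read.text("README.md")
    print(text_df.count())   # number of lines in the file
    print(text_df.first())   # first line as a Row

    spark.stop()

Scala and Java users would still see the equivalent snippet under their own tabs; only the tab order would change.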