I am +1 on this but as you guys mentioned, we should really be clear on how
to address different versions.

On Wed, 5 Jun 2024 at 18:27, Matthew Powers <matthewkevinpow...@gmail.com>
wrote:

> I am a huge fan of the Apache Spark docs and I regularly look at the
> analytics on this page
> <https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?period=day&date=yesterday&category=Dashboard_Dashboard&subcategory=1>
> to see how well they are doing.  Great work to everyone who's contributed
> to the docs over the years.
>
> We've been chipping away with some improvements over the past year and
> have made good progress.  For example, lots of the pages were missing
> canonical links.  A canonical link tells search engines which copy of a
> page is the authoritative one, which is essential for any site with
> duplicate content.  Versioned documentation sites have lots of duplicate
> pages, so adding these canonical links was important.  It wasn't easy to
> make this change, though.
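[Editor's illustration] A canonical link is just a tag in each page's `<head>`. As a hedged sketch (this is not the actual patch that was merged; the helper name and the choice of `latest` as the canonical target are assumptions for illustration), injecting one into a generated docs page could look like:

```python
import re

# Assumed canonical target: the "latest" copy of each page.
CANONICAL_BASE = "https://spark.apache.org/docs/latest/"


def add_canonical_link(html: str, page: str) -> str:
    """Insert a <link rel="canonical"> tag into a versioned docs page so
    search engines index only the canonical URL, not every versioned copy."""
    if 'rel="canonical"' in html:
        return html  # tag already present; leave the page untouched
    tag = f'<link rel="canonical" href="{CANONICAL_BASE}{page}" />'
    return re.sub(r"<head>", f"<head>\n  {tag}", html, count=1)
```

The idempotency check matters here: a docs build may process the same page more than once, and duplicate canonical tags would reintroduce the ambiguity the tag is meant to remove.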
>
> The current site is confusing Google a bit.  If you do a "spark rocksdb"
> Google search for example, you get the Spark 3.2 Structured Streaming
> Programming Guide as the first result (because Google isn't properly
> indexing the docs).  You need to Control+F and search for "rocksdb" to
> navigate to the relevant section which says: "As of Spark 3.2, we add a
> new built-in state store implementation...", which is what you'd expect
> in a versionless docs site in any case.
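[Editor's note] For context, the feature behind that search is enabled through a single config. A minimal sketch (the config key and provider class are as documented for Spark 3.2+; the `spark.conf.set` call is left commented out because it assumes a running PySpark session, which this sketch does not create):

```python
# Fully-qualified provider class for the RocksDB state store (Spark 3.2+),
# split across lines purely for readability.
ROCKSDB_PROVIDER = (
    "org.apache.spark.sql.execution.streaming.state."
    "RocksDBStateStoreProvider"
)

# In a live session one would set (requires pyspark, hence commented out):
# spark.conf.set("spark.sql.streaming.stateStore.providerClass", ROCKSDB_PROVIDER)
```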
>
> There are two different user experiences:
>
> * Option A: push Spark 3.1 Structured Streaming users to the Spark 3.1
> Structured Streaming Programming guide that doesn't mention RocksDB
> * Option B: push Spark Structured Streaming users to the latest Structured
> Streaming Programming Guide, which mentions RocksDB, but caveat that this
> feature was added in Spark 3.2
>
> I think Option B provides Spark 3.1 users a better experience overall.
> It's better to let users know they can access RocksDB by upgrading than
> hiding this info from them IMO.
>
> Now if we want Option A, then we'd need to give users a reasonable way to
> actually navigate to the Spark 3.1 docs.  From what I can tell, the only
> way to navigate from the latest Structured Streaming Programming Guide
> <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
> to a different version is by manually updating the URL.
>
> I was just skimming over the Structured Streaming Programming guide and
> noticing again how lots of the Python code snippets aren't PEP 8
> compliant.  It seems like our current docs publishing process would prevent
> us from improving the old docs pages.
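[Editor's illustration] To make the PEP 8 concern concrete, here is a hypothetical before/after, not a snippet taken from the guide. PySpark's own method names stay camelCase because they are part of the API; PEP 8 applies to the names and spacing we choose in examples:

```python
# Before (style common in older snippets): wordCounts=compute(lines)
# After: snake_case names, spaces around operators, 4-space indents.

def count_words(lines):
    """Plain-Python stand-in for a docs snippet, written in PEP 8 style."""
    word_counts = {}  # snake_case rather than wordCounts
    for line in lines:
        for word in line.split():
            word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts


print(count_words(["spark streams data", "spark scales"]))
# → {'spark': 2, 'streams': 1, 'data': 1, 'scales': 1}
```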
>
> In this conversation, let's make sure we distinguish between "programming
> guides" and "API documentation".  API docs should be versioned and there is
> no question there.  Programming guides are higher level conceptual
> overviews, like the Polars user guide <https://docs.pola.rs/>, and should
> be relevant across many versions.
>
> I would also like to point out that the current programming guides are not
> consistent:
>
> * The Structured Streaming programming guide
> <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
> is one giant page
> * The SQL programming guide
> <https://spark.apache.org/docs/latest/sql-programming-guide.html> is
> split across many pages
> * The PySpark programming guide
> <https://spark.apache.org/docs/latest/api/python/getting_started/index.html>
> takes you to a whole different URL structure and makes it so you can't even
> navigate to the other programming guides anymore
>
> I am looking forward to collaborating with the community and improving the
> docs to 1. delight existing users and 2. attract new users.  Docs are a
> "website problem" and we're big data people, but I'm confident we'll be
> able to work together and find a good path forward here.
>
>
> On Wed, Jun 5, 2024 at 3:22 PM Neil Ramaswamy <n...@ramaswamy.org> wrote:
>
>> Thanks all for the responses. Let me try to address everything.
>>
>> > the programming guides are also different between versions since
>> features are being added, configs are being added/ removed/ changed,
>> defaults are being changed etc.
>>
>> I agree that this is the case. But I think it's fine to mention what
>> version a feature is available in. In fact, I would argue that mentioning
>> an improvement that a version brings motivates users to upgrade more than
>> restricting docs improvements to new releases "to keep the community updating".
>> Users should upgrade to get a better Spark, not better Spark documentation.
>>
>> > having a programming guide that refers to features or API methods that
>> do not exist in that version is confusing and detrimental
>>
>> I don't think that we'd do this. Again, programming guides should teach
>> fundamentals that do not change version-to-version. TypeScript
>> <https://www.typescriptlang.org/docs/handbook/typescript-from-scratch.html>
>> (which has one of the best developer experiences and docs) does this
>> exceptionally well: its guides are refined, versionless pages, new
>> features are elaborated upon in release notes (analogous to our
>> version-specific docs), and the occasional version-specific caveat is
>> called out directly in the guides.
>>
>> I agree with Wenchen's 3 points. I don't think we need to say that they
>> *have* to go to the old page, but that if they want to, they can.
>>
>> Neil
>>
>> On Wed, Jun 5, 2024 at 12:04 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> I agree with the idea of a versionless programming guide. But one thing
>>> we need to make sure of is we give clear messages for things that are only
>>> available in a new version. My proposal is:
>>>
>>>    1. keep the old versions' programming guide unchanged. For example,
>>>    people can still access
>>>    https://spark.apache.org/docs/3.3.4/quick-start.html
>>>    2. In the new versionless programming guide, mention at the beginning
>>>    that users of Spark versions before 4.0 should go to the versioned doc
>>>    site to read the programming guide.
>>>    3. Revisit the programming guide of Spark 4.0 (compare it with the
>>>    one of 3.5), and adjust the content to mention version-specific changes
>>>    (API change, new features, etc.)
>>>
>>> Then we can have a versionless programming guide starting from Spark
>>> 4.0. We can also revisit programming guides of all versions and combine
>>> them into one with version-specific notes, but that's probably too much
>>> work.
>>>
>>> Any thoughts?
>>>
>>> Wenchen
>>>
>>> On Wed, Jun 5, 2024 at 1:39 AM Martin Andersson <
>>> martin.anders...@kambi.com> wrote:
>>>
>>>> While I have no practical knowledge of how documentation is maintained
>>>> in the spark project, I must agree with Nimrod. For users on older
>>>> versions, having a programming guide that refers to features or API methods
>>>> that do not exist in that version is confusing and detrimental.
>>>>
>>>> Surely there must be a better way to allow updating documentation more
>>>> often?
>>>>
>>>> Best Regards,
>>>> Martin
>>>>
>>>> ------------------------------
>>>> *From:* Nimrod Ofek <ofek.nim...@gmail.com>
>>>> *Sent:* Wednesday, June 5, 2024 08:26
>>>> *To:* Neil Ramaswamy <n...@ramaswamy.org>
>>>> *Cc:* Praveen Gattu <praveen.ga...@databricks.com.invalid>; dev <
>>>> dev@spark.apache.org>
>>>> *Subject:* Re: [DISCUSS] Versionless Spark Programming Guide Proposal
>>>>
>>>>
>>>> Hi Neil,
>>>>
>>>>
>>>> While you wrote that you don't mean the API docs (of course), the
>>>> programming guides also differ between versions, since features are
>>>> added, configs are added/removed/changed, defaults are changed, etc.
>>>>
>>>> I know of "backport hell" - which is why I wrote that once a version is
>>>> released it's frozen and the documentation will be updated for the new
>>>> version only.
>>>>
>>>> I think of it as facing forward and keeping older versions but focusing
>>>> on the new releases to keep the community updating.
>>>> While Spark has a support window of 18 months until EOL, we could have
>>>> a support cycle of only 6 months for documentation: there are no major
>>>> security concerns for documentation...
>>>>
>>>> Nimrod
>>>>
>>>> On Wed, Jun 5, 2024 at 08:28, Neil Ramaswamy <
>>>> n...@ramaswamy.org> wrote:
>>>>
>>>> Hi Nimrod,
>>>>
>>>> Quick clarification—my proposal will not touch API-specific
>>>> documentation for the specific reasons you mentioned (signatures, behavior,
>>>> etc.). It just aims to make the *programming guides *versionless.
>>>> Programming guides should teach fundamentals of Spark, and the fundamentals
>>>> of Spark should not change between releases.
>>>>
>>>> There are a few issues with updating documentation multiple times after
>>>> Spark releases. First, fixes that apply to all existing versions'
>>>> programming guides need backport PRs. For example, this change
>>>> <https://github.com/apache/spark/pull/46797/files> applies to all the
>>>> versions of the SS programming guide, but is likely to be fixed only in
>>>> Spark 4.0. Additionally, any such update within a Spark release will 
>>>> require
>>>> re-building the static sites in the spark repo, and copying those files to
>>>> spark-website via a commit in spark-website. Making a typo fix like the one
>>>> I linked would then require <number of versions we want to update> + 1 PRs,
>>>> as opposed to 1 PR in the versionless programming guide world.
>>>>
>>>> Neil
>>>>
>>>> On Tue, Jun 4, 2024 at 1:32 PM Nimrod Ofek <ofek.nim...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> While I think that the documentation needs a lot of improvement and
>>>> important details are missing, and detaching the documentation from the
>>>> main project can help us iterate faster on documentation-specific tasks,
>>>> I don't think we can or should move to versionless documentation.
>>>>
>>>> Documentation is version specific: parameters are added and removed,
>>>> new features are added, behaviours sometimes change etc.
>>>>
>>>> I think the documentation should be version-specific, but separate from
>>>> the Spark release cadence, and can be updated multiple times after a
>>>> Spark release.
>>>> The way I see it, the documentation should be updated only for the
>>>> latest version; some time before a new release, it should be archived,
>>>> and the updated documentation should reflect the new version.
>>>>
>>>> Thanks,
>>>> Nimrod
>>>>
>>>> On Tue, Jun 4, 2024 at 18:34, Praveen Gattu
>>>> <praveen.ga...@databricks.com.invalid> wrote:
>>>>
>>>> +1. This helps for greater velocity in improving docs. However, we
>>>> might still need a way to provide version-specific information, i.e.,
>>>> which features are available in which version, etc.
>>>>
>>>> On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy <n...@ramaswamy.org>
>>>> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I've written up a proposal to migrate all the Apache Spark programming
>>>> guides to be versionless. You can find the proposal here
>>>> <https://docs.google.com/document/d/1OqeQ71zZleUa1XRZrtaPDFnJ-gVJdGM80o42yJVg9zg/>.
>>>> Please leave comments, or reply in this DISCUSS thread.
>>>>
>>>> TLDR: by making the programming guides versionless, we can make updates
>>>> to them whenever we'd like, instead of at the Spark release cadence. This
>>>> increased update velocity will enable us to make gradual improvements,
>>>> including breaking up the Structured Streaming programming guide into
>>>> smaller sub-guides. The proposal does not break *any *existing URLs,
>>>> and it does not affect our versioned API docs in any way.
>>>>
>>>> Thanks!
>>>> Neil
>>>>
>>>>
>>>
