I am +1 on this but as you guys mentioned, we should really be clear on how to address different versions.
On Wed, 5 Jun 2024 at 18:27, Matthew Powers <matthewkevinpow...@gmail.com> wrote:

> I am a huge fan of the Apache Spark docs and I regularly look at the analytics on this page <https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?period=day&date=yesterday&category=Dashboard_Dashboard&subcategory=1> to see how well they are doing. Great work to everyone who has contributed to the docs over the years.
>
> We've been chipping away at some improvements over the past year and have made good progress. For example, lots of the pages were missing canonical links. A canonical link is a special type of link that is extremely important for any site that has duplicate content. Versioned documentation sites have lots of duplicate pages, so getting these canonical links added was important. It wasn't easy to make this change, though.
>
> The current site is confusing Google a bit. If you do a "spark rocksdb" Google search, for example, you get the Spark 3.2 Structured Streaming Programming Guide as the first result (because Google isn't properly indexing the docs). You then need to Control+F and search for "rocksdb" to navigate to the relevant section, which says: "As of Spark 3.2, we add a new built-in state store implementation...", which is what you'd expect in a versionless docs site in any case.
>
> There are two different user experiences:
>
> * Option A: push Spark 3.1 Structured Streaming users to the Spark 3.1 Structured Streaming Programming Guide, which doesn't mention RocksDB
> * Option B: push Spark Structured Streaming users to the latest Structured Streaming Programming Guide, which mentions RocksDB, but caveat that this feature was added in Spark 3.2
>
> I think Option B provides Spark 3.1 users a better experience overall. It's better to let users know they can access RocksDB by upgrading than to hide this info from them, IMO.
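[To make the canonical-link point concrete: a versioned docs site typically emits a `<link rel="canonical">` tag on every versioned page, pointing search engines at one preferred URL so duplicate pages don't compete in the index. The sketch below is illustrative only; the helper function is hypothetical and not part of the actual Spark docs build, though the URL layout mirrors spark.apache.org's.]

```python
# Illustrative sketch: derive the canonical <link> tag for a versioned docs
# page. This helper is hypothetical, not real Spark docs build code; the
# /docs/<version>/<page> layout is modeled on spark.apache.org.

def canonical_link(versioned_url: str) -> str:
    """Map a versioned docs URL (e.g. .../docs/3.2.0/page.html) to a
    canonical tag pointing at the .../docs/latest/ copy of the page."""
    prefix = "https://spark.apache.org/docs/"
    version_and_page = versioned_url.removeprefix(prefix)
    page = version_and_page.split("/", 1)[1]  # drop the version segment
    return f'<link rel="canonical" href="{prefix}latest/{page}" />'

print(canonical_link(
    "https://spark.apache.org/docs/3.2.0/structured-streaming-programming-guide.html"
))
# → <link rel="canonical" href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html" />
```

With tags like this on every versioned copy, search engines consolidate ranking signals onto the `latest` page instead of surfacing an arbitrary old version (like the Spark 3.2 result described above).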
> Now, if we want Option A, then we'd need to give users a reasonable way to actually navigate to the Spark 3.1 docs. From what I can tell, the only way to navigate from the latest Structured Streaming Programming Guide <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html> to a different version is by manually updating the URL.
>
> I was just skimming the Structured Streaming Programming Guide and noticed again how lots of the Python code snippets aren't PEP 8 compliant. It seems like our current docs publishing process would prevent us from improving the old docs pages.
>
> In this conversation, let's make sure we distinguish between "programming guides" and "API documentation". API docs should be versioned, and there is no question there. Programming guides are higher-level conceptual overviews, like the Polars user guide <https://docs.pola.rs/>, and should be relevant across many versions.
>
> I would also like to point out that the current programming guides are not consistent:
>
> * The Structured Streaming programming guide <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html> is one giant page
> * The SQL programming guide <https://spark.apache.org/docs/latest/sql-programming-guide.html> is split across many pages
> * The PySpark programming guide <https://spark.apache.org/docs/latest/api/python/getting_started/index.html> takes you to a whole different URL structure and makes it so you can't even navigate to the other programming guides anymore
>
> I am looking forward to collaborating with the community and improving the docs to 1. delight existing users and 2. attract new users. Docs are a "website problem" and we're big data people, but I'm confident we'll be able to work together and find a good path forward here.


On Wed, Jun 5, 2024 at 3:22 PM Neil Ramaswamy <n...@ramaswamy.org> wrote:

>> Thanks all for the responses.
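[To illustrate the PEP 8 point above: the cleanup being described looks something like the sketch below. Both functions are hypothetical examples written for this illustration, not snippets copied from the actual guide.]

```python
# Hypothetical "before" style sometimes seen in docs snippets: camelCase
# names, no whitespace around operators, multiple statements per line.
# This is not PEP 8 compliant.
def wordCount(lines):
    counts={}
    for line in lines:
        for word in line.split(): counts[word]=counts.get(word,0)+1
    return counts

# PEP 8 compliant version: snake_case names, spaces around operators,
# one statement per line.
def word_count(lines):
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

print(word_count(["spark rocksdb", "spark"]))
# → {'spark': 2, 'rocksdb': 1}
```

The behavior is identical; only the style changes, which is exactly the kind of fix that a versionless guide could land in one PR across all content.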
>> Let me try to address everything.
>>
>> > the programming guides are also different between versions since features are being added, configs are being added/removed/changed, defaults are being changed etc.
>>
>> I agree that this is the case. But I think it's fine to mention what version a feature is available in. In fact, I would argue that mentioning an improvement that a version brings motivates users to upgrade more than keeping docs improvements to "new releases to keep the community updating". Users should upgrade to get a better Spark, not better Spark documentation.
>>
>> > having a programming guide that refers to features or API methods that do not exist in that version is confusing and detrimental
>>
>> I don't think that we'd do this. Again, programming guides should teach fundamentals that do not change version-to-version. TypeScript <https://www.typescriptlang.org/docs/handbook/typescript-from-scratch.html> (which has one of the best DX's and docs) does this exceptionally well. Their guides are refined, versionless pages; new features are elaborated upon in release notes (analogous to our version-specific docs); and the occasional version-specific caveat is called out in the guides.
>>
>> I agree with Wenchen's 3 points. I don't think we need to say that users *have* to go to the old page, but that if they want to, they can.
>>
>> Neil
>>
>> On Wed, Jun 5, 2024 at 12:04 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> I agree with the idea of a versionless programming guide. But one thing we need to make sure of is that we give clear messages about things that are only available in a new version. My proposal is:
>>>
>>> 1. Keep the old versions' programming guides unchanged. For example, people can still access https://spark.apache.org/docs/3.3.4/quick-start.html
>>> 2. In the new versionless programming guide, mention at the beginning that for Spark versions before 4.0, readers should go to the versioned doc site to read the programming guide.
>>> 3. Revisit the programming guide of Spark 4.0 (compare it with the one for 3.5), and adjust the content to mention version-specific changes (API changes, new features, etc.).
>>>
>>> Then we can have a versionless programming guide starting from Spark 4.0. We could also revisit the programming guides of all versions and combine them into one with version-specific notes, but that's probably too much work.
>>>
>>> Any thoughts?
>>>
>>> Wenchen
>>>
>>> On Wed, Jun 5, 2024 at 1:39 AM Martin Andersson <martin.anders...@kambi.com> wrote:
>>>
>>>> While I have no practical knowledge of how documentation is maintained in the Spark project, I must agree with Nimrod. For users on older versions, having a programming guide that refers to features or API methods that do not exist in that version is confusing and detrimental.
>>>>
>>>> Surely there must be a better way to allow updating documentation more often?
>>>>
>>>> Best Regards,
>>>> Martin
>>>>
>>>> ------------------------------
>>>> *From:* Nimrod Ofek <ofek.nim...@gmail.com>
>>>> *Sent:* Wednesday, June 5, 2024 08:26
>>>> *To:* Neil Ramaswamy <n...@ramaswamy.org>
>>>> *Cc:* Praveen Gattu <praveen.ga...@databricks.com.invalid>; dev <dev@spark.apache.org>
>>>> *Subject:* Re: [DISCUSS] Versionless Spark Programming Guide Proposal
>>>>
>>>> Hi Neil,
>>>>
>>>> While you wrote that you don't mean the API docs (of course), the programming guides are also different between versions, since features are being added, configs are being added/removed/changed, defaults are being changed, etc.
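[As a concrete instance of the version-specific caveats being discussed: the RocksDB state store mentioned earlier in the thread shipped in Spark 3.2 and is enabled via the `spark.sql.streaming.stateStore.providerClass` config. A versionless guide could attach a "since 3.2" note to that section; the helper below just sketches that version gate and is purely illustrative, not part of any Spark API.]

```python
# Illustrative only: a versionless guide could annotate features with the
# minimum Spark version they require. This helper is hypothetical, not a
# Spark API.
#
# Per the Structured Streaming guide, enabling the RocksDB state store
# (Spark 3.2+) means setting:
#   spark.sql.streaming.stateStore.providerClass =
#     org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider

def supports_rocksdb_state_store(spark_version: str) -> bool:
    """Return True if this Spark version ships the built-in RocksDB
    state store provider (added in Spark 3.2)."""
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (3, 2)

print(supports_rocksdb_state_store("3.1.3"))  # a Spark 3.1 reader: False
print(supports_rocksdb_state_store("3.2.0"))  # a Spark 3.2 reader: True
```

The point of the sketch: a single versionless page can serve both readers, as long as the guide states the minimum version next to the feature, which is the "clear messages" requirement in the proposal above.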
>>>> I know of "backport hell", which is why I wrote that once a version is released it's frozen, and the documentation will be updated for the new version only.
>>>>
>>>> I think of it as facing forward and keeping older versions, but focusing on the new releases to keep the community updating. While Spark has a support window of 18 months until EOL, we could have only a 6-month support cycle until EOL for documentation; there are no major security concerns for documentation...
>>>>
>>>> Nimrod
>>>>
>>>> On Wed, Jun 5, 2024 at 08:28, Neil Ramaswamy <n...@ramaswamy.org> wrote:
>>>>
>>>> Hi Nimrod,
>>>>
>>>> Quick clarification: my proposal will not touch API-specific documentation, for the specific reasons you mentioned (signatures, behavior, etc.). It just aims to make the *programming guides* versionless. Programming guides should teach the fundamentals of Spark, and the fundamentals of Spark should not change between releases.
>>>>
>>>> There are a few issues with updating documentation multiple times after Spark releases. First, fixes that apply to all existing versions' programming guides need backport PRs. For example, this change <https://github.com/apache/spark/pull/46797/files> applies to all the versions of the SS programming guide, but is likely to be fixed only in Spark 4.0. Additionally, any such update within a Spark release would require re-building the static sites in the spark repo and copying those files to spark-website via a commit in spark-website. Making a typo fix like the one I linked would then require <number of versions we want to update> + 1 PRs, as opposed to 1 PR in the versionless programming guide world.
>>>> Neil
>>>>
>>>> On Tue, Jun 4, 2024 at 1:32 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> While I think the documentation needs a lot of improvement and is missing important details, and detaching the documentation from the main project could help us iterate faster on documentation-specific tasks, I don't think we can or should move to versionless documentation.
>>>>
>>>> Documentation is version-specific: parameters are added and removed, new features are added, behaviours sometimes change, etc.
>>>>
>>>> I think the documentation should be version-specific, but separate from the Spark release cadence, and it can be updated multiple times after a Spark release. The way I see it, the documentation should be updated only for the latest version; some time before a new release, it should be archived, and the updated documentation should reflect the new version.
>>>>
>>>> Thanks,
>>>> Nimrod
>>>>
>>>> On Tue, Jun 4, 2024 at 18:34, Praveen Gattu <praveen.ga...@databricks.com.invalid> wrote:
>>>>
>>>> +1. This helps achieve greater velocity in improving docs. However, we might still need a way to provide version-specific information, i.e. which features are available in which version.
>>>>
>>>> On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy <n...@ramaswamy.org> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I've written up a proposal to migrate all the Apache Spark programming guides to be versionless. You can find the proposal here <https://docs.google.com/document/d/1OqeQ71zZleUa1XRZrtaPDFnJ-gVJdGM80o42yJVg9zg/>. Please leave comments, or reply in this DISCUSS thread.
>>>>
>>>> TLDR: by making the programming guides versionless, we can make updates to them whenever we'd like, instead of at the Spark release cadence.
>>>> This increased update velocity will enable us to make gradual improvements, including breaking up the Structured Streaming programming guide into smaller sub-guides. The proposal does not break *any* existing URLs, and it does not affect our versioned API docs in any way.
>>>>
>>>> Thanks!
>>>> Neil