I think some of the issues raised here are not really common.
Examples should follow best practice.
It would be odd to have an example that exploits spark.sql.ansi.enabled=false 
to e.g. overflow an integer.
Instead an example that works with ansi mode will typically work perfectly fine 
in an older version, especially at the level of discussion here, which is 
towards starter guides.
What can happen of course is that best practice in a new version is different 
from best practice in an older version.
But even then we want to bias towards the new version to bring people along.
The old "workaround" and the new best practice can be shown with a disclaimer 
regarding the version they apply to. (I.e. we version WITHIN the page.)
Note that for e.g. builtin functions we already do this. We state when a 
function was introduced.
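To make the point concrete, here is a plain-Python sketch (hypothetical, not 
Spark code) of the overflow difference that spark.sql.ansi.enabled controls; 
the wrap-around branch emulates Java's 32-bit int arithmetic:

```python
# A plain-Python model (not Spark itself) of the semantic difference above:
# with spark.sql.ansi.enabled=false, adding two ints past 2^31 - 1 silently
# wraps around; with ANSI mode on, the same addition errors out.
INT_MAX = 2**31 - 1
INT_MIN = -(2**31)


def add_int(a: int, b: int, ansi_enabled: bool) -> int:
    """Emulate Spark's 4-byte integer addition under both modes."""
    result = a + b
    if INT_MIN <= result <= INT_MAX:
        return result
    if ansi_enabled:
        # ANSI mode: fail loudly instead of returning a wrapped value
        raise OverflowError("integer overflow")
    # Legacy mode: wrap around like Java's 32-bit int arithmetic
    return (result - INT_MIN) % 2**32 + INT_MIN


# An example that never overflows behaves identically in both modes,
# which is the point: good guide examples don't depend on the flag.
assert add_int(1, 2, ansi_enabled=True) == add_int(1, 2, ansi_enabled=False) == 3
```

An example written for ANSI mode thus runs unchanged on older defaults, as 
long as it stays away from the edge cases.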

IMHO the value of a unified doc tree cannot be overstated when it comes to 
searchability (SEO).


On Jun 11, 2024, at 11:37 AM, Wenchen Fan <cloud0...@gmail.com> wrote:

Shall we decouple these two decisions?

  *   Move the programming guide to the spark-website repo, to allow faster 
iterations and releases
  *   Make programming guide version-less

I think the downside of moving the programming guide to the spark-website repo 
is almost negligible: you may need to have PRs in both the Spark and 
spark-website repo for major features that need to be mentioned in the 
programming guide. The release process may need more steps to build the doc 
site.

We can have more discussions on version-less. Today we upload the full doc site 
for each maintenance release, which is a big waste as the content is almost the 
same with the previous maintenance release. As a result, git operations on the 
spark-website repo are quite slow today, as this repo is too big. I think we 
should at least have a single programming guide for each feature release.


On Tue, Jun 11, 2024 at 10:36 AM Neil Ramaswamy <n...@ramaswamy.org> wrote:
There are two issues and one main benefit that I see with versioned programming 
guides:

  *   Issue 1: We often retroactively realize that code snippets have bugs and 
explanations are confusing (see examples: 
dropDuplicates<https://github.com/apache/spark/pull/46797>,  
dropDuplicatesWithinWatermark<https://stackoverflow.com/questions/77512507/how-exactly-does-dropduplicateswithinwatermark-work>).
 Without backporting to older guides, I don't think that users can have, as 
Mridul says, "reasonable confidence that features, functionality and examples 
mentioned will work with that released Spark version". In this sense, I 
definitely disagree with Nimrod's position of "working on updated versions and 
not working with old versions anyway." To have confidence in versioned 
programming guides, we must have a system for backporting and re-releasing.
  *   Issue 2: If programming guides live in the Spark website, you now need 
maintenance releases in Spark to get those changes to production (i.e. 
spark-website). Historically, Spark does not create maintenance releases 
frequently, especially not just for a docs change. So, we'd need to break 
precedent (this will create potentially dozens of minor releases, far more than 
what we do today), and the person making docs changes needs to rebuild the docs 
site and create one PR in spark-website for every version they change. Fixing a 
code typo in 4 versions? You need 4 maintenance releases, and 4 more PRs.
  *   Benefit 1: versioned docs don't have to caveat what features are 
available in prose.

Personally, I think it's fine to caveat what features are available in prose. 
For the rare case where we have completely incompatible Spark code (which 
should be exceedingly rare), we can provide different code snippets. As Wenchen 
points out, if we do have 100 mutually incompatible versions, we have an issue, 
but the ANSI SQL default might be one of these rare examples.

(Note: version-specific commentary is already present in the Structured 
Streaming Programming Guide, our most 
popular<https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?idSite=40&period=day&date=yesterday&category=General_Actions&subcategory=General_Pages>
 guide. It flows nicely: for example, we talk about state, and then we say, 
"hey, if you have Spark 4.0, state is more easily debuggable because of the 
state reader." The prose focuses on the stable concept of state—which has been 
unchanged since 2.0.0—and then mentions a feature that can encourage upgrade.)

However, I do see one path forward with versioned guides: 1) guide changes do 
not constitute a maintenance release 2) we create an automation to allow us to 
backport docs changes to old branches 3) once merged in Spark, the automation 
rebuilds all the static sites and creates PRs in spark-website. The downside is 
that backport merge conflicts will force developers to backport changes 
themselves. While I do not want to sign up for that work, is this something 
people are more comfortable with?

Neil


On Tue, Jun 11, 2024 at 8:47 AM Wenchen Fan <cloud0...@gmail.com> wrote:
Just FYI, the Hive Language Manual is also version-less: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual

It's not a strong data point as this doc is not actively updated, but my 
personal feeling is that it's nice to see the history of a feature: when it was 
introduced, when it got changed, with JIRA ticket linked.

One potential issue is that if a feature has been changed 100 times in history, 
it's too verbose to document all 100 different behaviors for different 
versions. If that happens, I think we can make each major version have its own 
programming guide, assuming we won't change a feature 100 times in Spark 4 :)

On Mon, Jun 10, 2024 at 1:08 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
My personal opinion is that having the documents per version (current and 
previous), without fixing previous versions - just keeping each as a snapshot 
in time of the documentation as it was when the new version was released - 
should be good enough.

Because Neil would now like to change the documentation (personally I think 
it's very much needed and a great thing to do), there will be a big gap 
between the old documents and the new ones...
If, after rewriting and rearranging the documents, someone feels it would be 
beneficial to port the documentation back to some of the older versions as a 
one-time thing, that's possible as well of course...

I find this solution to be best of all worlds - versioned, so you can read 
documents which are relevant to the version you use (though I am in favour of 
working on updated versions and not working with old versions anyway), while 
the documentation can be updated many times, after the release and 
independently from the actual release of Spark.

I think that keeping one document to support all versions will soon become hard 
to read and understand with little benefit of having updated documentation for 
old versions.


Regarding SEO and deranking, afaik updating the documentation more frequently 
should only improve ranking so the latest documentation should always be ranked 
high in Google search, but maybe I'm missing something.

Nimrod



On Mon, Jun 10, 2024 at 21:25, Nicholas Chammas 
<nicholas.cham...@gmail.com> wrote:
I will let Neil and Matt clarify the details because I believe they understand 
the overall picture better. However, I would like to emphasize something that 
motivated this effort and which may be getting lost in the concerns about 
versioned vs. versionless docs.

The main problem is that some of the guides need major overhauls.

There are people like Neil who are interested in making significant 
contributions to the guides. What is holding them back is that major changes to 
the web docs can trigger wholesale deranking of our site by Google. Since 
versioned docs are tied to Spark releases, which are infrequent, that means 
potentially being nuked in the search rankings for months.

Versionless docs allow for rapid iteration on the guides, which can be driven 
in part by search rankings.

In other words, there is a problem chain here that leads to versionless docs:

1. Several guides need major improvements.
2. We cannot make such improvements because a) that would risk site deranking, 
and b) we are constrained by Spark's release schedule.
3. Versionless guides allow for incremental improvements, which addresses 
problems 2a and 2b.

This is my understanding of the big picture as described to me by Neil and 
Matt. I defer to them to elaborate on the details, especially in relation to 
Google site rankings. If this concern is not valid or not that serious, then we 
can just iterate slowly on the docs with Spark’s existing release schedule and 
there is less need for versionless docs.

Nick


On Jun 10, 2024, at 1:53 PM, Mridul Muralidharan <mri...@gmail.com> wrote:


Hi,

  Versioned documentation has the benefit that users can have reasonable 
confidence that features, functionality and examples mentioned will work with 
that released Spark version.
A versionless guide runs into potential issues with deprecation, behavioral 
changes and new features.

My concern is not just about features highlighting their supported versions, 
but also about examples that reference other features in Spark.

For example, consider the SQL differences between Hive QL and ANSI SQL when we 
flip the default in 4.0: we would have 4.x example snippets for some feature 
(say UDAF) that would not work on 3.x, and vice versa.
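As a toy model of one such difference (plain Python, not Spark; the function 
names are made up for this sketch): the cast behavior alone already splits one 
snippet into two, because the legacy default turns malformed input into NULL 
while the ANSI default raises an error.

```python
# Hypothetical plain-Python model of the cast difference between the legacy
# (Hive-compatible) default and the ANSI default: CAST('abc' AS INT) yields
# NULL under the former and raises an error under the latter.
from typing import Optional


def legacy_cast_int(s: str) -> Optional[int]:
    """Pre-4.0 default: malformed input becomes NULL (None here)."""
    try:
        return int(s)
    except ValueError:
        return None


def ansi_cast_int(s: str) -> int:
    """ANSI default: malformed input raises, like Spark's ANSI cast error."""
    return int(s)
```

A 3.x snippet relying on the NULL-producing behavior would fail outright on a 
4.0 default, which is exactly the versioning caveat Mridul describes.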

Regards,
Mridul


On Mon, Jun 10, 2024 at 12:03 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
I am +1 on this but as you guys mentioned, we should really be clear on how to 
address different versions.

On Wed, 5 Jun 2024 at 18:27, Matthew Powers 
<matthewkevinpow...@gmail.com> wrote:
I am a huge fan of the Apache Spark docs and I regularly look at the analytics 
on this 
page<https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?period=day&date=yesterday&category=Dashboard_Dashboard&subcategory=1>
 to see how well they are doing.  Great work to everyone that's contributed to 
the docs over the years.

We've been chipping away with some improvements over the past year and have 
made good progress.  For example, lots of the pages were missing canonical 
links.  Canonical links are a special type of link that are extremely important 
for any site that has duplicate content.  Versioned documentation sites have 
lots of duplicate pages, so getting these canonical links added was important.  
It wasn't really easy to make this change though.

The current site is confusing Google a bit.  If you do a "spark rocksdb" Google 
search for example, you get the Spark 3.2 Structured Streaming Programming 
Guide as the first result (because Google isn't properly indexing the docs).  
You need to Control+F and search for "rocksdb" to navigate to the relevant 
section which says: "As of Spark 3.2, we add a new built-in state store 
implementation...", which is what you'd expect in a versionless docs site in 
any case.
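The RocksDB caveat can even be expressed in one version-aware snippet rather 
than two separate guides. A sketch (the provider class name below is the one 
given in the Spark 3.2+ docs, used here illustratively; the helper function is 
hypothetical):

```python
# Sketch of a version caveat in code form: the RocksDB state store is enabled
# via a provider class that only exists in Spark 3.2+, so a version-aware
# helper can document both eras in one place.
ROCKSDB_PROVIDER = (
    "org.apache.spark.sql.execution.streaming.state."
    "RocksDBStateStoreProvider"
)


def streaming_state_conf(spark_version: tuple) -> dict:
    """Return state-store settings appropriate for the given Spark version."""
    if spark_version >= (3, 2):
        # Spark 3.2+: opt in to the RocksDB-backed state store
        return {"spark.sql.streaming.stateStore.providerClass": ROCKSDB_PROVIDER}
    # Before 3.2 only the default HDFS-backed state store exists,
    # so there is nothing to set.
    return {}
```

A versionless page can present exactly this shape: one explanation, with the 
3.2 boundary called out inline.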

There are two different user experiences:

* Option A: push Spark 3.1 Structured Streaming users to the Spark 3.1 
Structured Streaming Programming guide that doesn't mention RocksDB
* Option B: push Spark Structured Streaming users to the latest Structured 
Streaming Programming guide, which mentions RocksDB, but caveat that this 
feature was added in Spark 3.2

I think Option B provides Spark 3.1 users a better experience overall.  It's 
better to let users know they can access RocksDB by upgrading than hiding this 
info from them IMO.

Now if we want Option A, then we'd need to give users a reasonable way to 
actually navigate to the Spark 3.1 docs.  From what I can tell, the only way to 
navigate from the latest Structured Streaming Programming 
Guide<https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
 to a different version is by manually updating the URL.

I was just skimming over the Structured Streaming Programming guide and 
noticing again how lots of the Python code snippets aren't PEP 8 compliant.  It 
seems like our current docs publishing process would prevent us from improving 
the old docs pages.
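For instance (a hypothetical snippet, not taken verbatim from the guide), the 
cleanup is usually just naming and whitespace, which changes no behavior:

```python
# Guide-style snippet, not PEP 8 compliant (camelCase, no spaces around '='):
#
#     wordCounts=streamingWords.groupBy("word").count()
#
# A PEP 8 version of the same idea, written as a DataFrame-free stand-in so
# this example runs without Spark:
def count_words(words):
    """Count occurrences of each word, like groupBy("word").count()."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts
```

Fixes like this are exactly the kind of low-risk, high-volume edit that a 
versionless publishing process would make cheap.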

In this conversation, let's make sure we distinguish between "programming 
guides" and "API documentation".  API docs should be versioned and there is no 
question there.  Programming guides are higher level conceptual overviews, like 
the Polars user guide<https://docs.pola.rs/>, and should be relevant across 
many versions.

I would also like to point out that the current programming guides are not 
consistent:

* The Structured Streaming programming 
guide<https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
 is one giant page
* The SQL programming 
guide<https://spark.apache.org/docs/latest/sql-programming-guide.html> is split 
on many pages
* The PySpark programming 
guide<https://spark.apache.org/docs/latest/api/python/getting_started/index.html>
 takes you to a whole different URL structure and makes it so you can't even 
navigate to the other programming guides anymore

I am looking forward to collaborating with the community and improving the docs 
to 1. delight existing users and 2. attract new users.  Docs are a "website 
problem" and we're big data people, but I'm confident we'll be able to work 
together and find a good path forward here.


On Wed, Jun 5, 2024 at 3:22 PM Neil Ramaswamy <n...@ramaswamy.org> wrote:
Thanks all for the responses. Let me try to address everything.

> the programming guides are also different between versions since features are 
> being added, configs are being added/ removed/ changed, defaults are being 
> changed etc.

I agree that this is the case. But I think it's fine to mention what version a 
feature is available in. In fact, I would argue that mentioning an improvement 
that a version brings motivates users to upgrade more than keeping docs 
improvement to "new releases to keep the community updating". Users should 
upgrade to get a better Spark, not better Spark documentation.

> having a programming guide that refers to features or API methods that do 
> not exist in that version is confusing and detrimental

I don't think that we'd do this. Again, programming guides should teach 
fundamentals that do not change version-to-version. 
TypeScript<https://www.typescriptlang.org/docs/handbook/typescript-from-scratch.html>
 (which has one of the best developer experiences and docs) does this 
exceptionally well. Its guides are refined, versionless pages; new features 
are elaborated upon in release notes (analogous to our version-specific docs); 
and the occasional version-specific caveat is called out in the guides.

 I agree with Wenchen's 3 points. I don't think we need to say that they have 
to go to the old page, but that if they want to, they can.

Neil

On Wed, Jun 5, 2024 at 12:04 PM Wenchen Fan <cloud0...@gmail.com> wrote:
I agree with the idea of a versionless programming guide. But one thing we need 
to make sure of is we give clear messages for things that are only available in 
a new version. My proposal is:

  1.  keep the old versions' programming guide unchanged. For example, people 
can still access https://spark.apache.org/docs/3.3.4/quick-start.html
  2.  In the new versionless programming guide, we mention at the beginning 
that for Spark versions before 4.0, go to the versioned doc site to read the 
programming guide.
  3.  Revisit the programming guide of Spark 4.0 (compare it with the one of 
3.5), and adjust the content to mention version-specific changes (API change, 
new features, etc.)

Then we can have a versionless programming guide starting from Spark 4.0. We 
can also revisit programming guides of all versions and combine them into one 
with version-specific notes, but that's probably too much work.

Any thoughts?

Wenchen

On Wed, Jun 5, 2024 at 1:39 AM Martin Andersson 
<martin.anders...@kambi.com> wrote:
While I have no practical knowledge of how documentation is maintained in the 
Spark project, I must agree with Nimrod. For users on older versions, having a 
programming guide that refers to features or API methods that do not exist in 
that version is confusing and detrimental.

Surely there must be a better way to allow updating documentation more often?

Best Regards,
Martin

________________________________
From: Nimrod Ofek <ofek.nim...@gmail.com>
Sent: Wednesday, June 5, 2024 08:26
To: Neil Ramaswamy <n...@ramaswamy.org>
Cc: Praveen Gattu <praveen.ga...@databricks.com.invalid>; dev 
<dev@spark.apache.org>
Subject: Re: [DISCUSS] Versionless Spark Programming Guide Proposal



Hi Neil,


While you wrote you don't mean the api docs (of course), the programming guides 
are also different between versions since features are being added, configs are 
being added/ removed/ changed, defaults are being changed etc.

I know about "backport hell" - which is why I wrote that once a version is 
released it's frozen, and the documentation will be updated for the new 
version only.

I think of it as facing forward and keeping older versions but focusing on the 
new releases to keep the community updating.
While Spark has a support window of 18 months until EOL, we can have only a 
6-month support cycle until EOL for documentation - there are no major 
security concerns for documentation...

Nimrod

On Wed, Jun 5, 2024 at 08:28, Neil Ramaswamy <n...@ramaswamy.org> wrote:
Hi Nimrod,

Quick clarification—my proposal will not touch API-specific documentation for 
the specific reasons you mentioned (signatures, behavior, etc.). It just aims 
to make the programming guides versionless. Programming guides should teach 
fundamentals of Spark, and the fundamentals of Spark should not change between 
releases.

There are a few issues with updating documentation multiple times after Spark 
releases. First, fixes that apply to all existing versions' programming guides 
need backport PRs. For example, this 
change<https://github.com/apache/spark/pull/46797/files> applies to all the 
versions of the SS programming guide, but is likely to be fixed only in Spark 
4.0. Additionally, any such update within a Spark release will require 
re-building the static sites in the spark repo, and copying those files to 
spark-website via a commit in spark-website. Making a typo fix like the one I 
linked would then require <number of versions we want to update> + 1 PRs, as 
opposed to 1 PR in the versionless programming guide world.

Neil

On Tue, Jun 4, 2024 at 1:32 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
Hi,

While I think that the documentation needs a lot of improvement and important 
details are missing - and detaching the documentation from the main project can 
help iterating faster on documentation specific tasks, I don't think we can nor 
should move to versionless documentation.

Documentation is version specific: parameters are added and removed, new 
features are added, behaviours sometimes change etc.

I think the documentation should be version-specific - but separate from the 
Spark release cadence - and can be updated multiple times after a Spark 
release.
The way I see it, the documentation should be updated only for the latest 
version; some time before a new release, it should be archived, and the 
updated documentation should then reflect the new version.

Thanks,
Nimrod

On Tue, Jun 4, 2024 at 18:34, Praveen Gattu 
<praveen.ga...@databricks.com.invalid> wrote:
+1. This helps for greater velocity in improving docs. However, we might still 
need a way to provide version-specific information, i.e. which features are 
available in which version, etc.

On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy <n...@ramaswamy.org> wrote:
Hi all,

I've written up a proposal to migrate all the Apache Spark programming guides 
to be versionless. You can find the proposal 
here<https://docs.google.com/document/d/1OqeQ71zZleUa1XRZrtaPDFnJ-gVJdGM80o42yJVg9zg/>.
 Please leave comments, or reply in this DISCUSS thread.

TLDR: by making the programming guides versionless, we can make updates to them 
whenever we'd like, instead of at the Spark release cadence. This increased 
update velocity will enable us to make gradual improvements, including breaking 
up the Structured Streaming programming guide into smaller sub-guides. The 
proposal does not break any existing URLs, and it does not affect our versioned 
API docs in any way.

Thanks!
Neil

