I'd like to discuss the Spark SQL migration / upgrade guides in the Spark
documentation: these are valuable resources and I think we could increase
that value by making these docs easier to discover and by adding a bit more
structure to the existing content.

For folks who aren't familiar with these docs: the Spark docs have a "SQL
Migration Guide" which lists the deprecations and changes of behavior in
each release:

   - Latest published version:
   https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html
   - Master branch version (will become 3.0):
   https://github.com/apache/spark/blob/master/docs/sql-migration-guide-upgrade.md

A lot of community work went into crafting this doc and I really appreciate
those efforts.

This doc is a little hard to find, though, because it's not consistently
linked from the release notes pages: the 2.4.0 page links to it under
"Changes of Behavior" (
https://spark.apache.org/releases/spark-release-2-4-0.html#changes-of-behavior)
but subsequent maintenance releases do not link to it (
https://spark.apache.org/releases/spark-release-2-4-1.html). It's also not
well cross-linked from the rest of the Spark docs (e.g. the Overview doc,
the docs drop-down menus, etc.).

I'm also concerned that the doc may be overwhelming to end users (as
opposed to Spark developers):

   - *Entries aren't grouped by component*, so users need to read the
   entire document to spot changes relevant to their use of Spark (for
   example, PySpark changes are not grouped together).
   - *Entries aren't ordered by size / risk of change*, e.g. performance
   impact vs. loud behavior changes (stopping with an explicit exception) vs.
   silent behavior changes (e.g. changing the default rounding behavior; see
   the sketch after this list). If we assume limited reader attention, it may
   be important to order entries deliberately, putting the
   highest-expected-impact / lowest-organic-discoverability changes first.
   - *We don't link JIRAs*, forcing users to do their own archaeology to
   learn more about a specific change.
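
To make the loud vs. silent distinction concrete, here's a tiny, hypothetical
Python sketch (not an actual Spark change): the same value quietly rounds
differently under two modes, and nothing fails loudly to tell the user that
their results have shifted.

    # Hypothetical illustration only, not an actual Spark behavior change.
    # Swapping the default rounding mode changes results silently: no error,
    # no warning, just different numbers.
    from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

    value = Decimal("2.5")
    print(value.quantize(Decimal("1"), rounding=ROUND_HALF_UP))    # 3
    print(value.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))  # 2

A change like the first-to-second switch is exactly the kind of entry that
deserves top billing in the guide, since users won't discover it organically.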

The existing ML migration guide addresses some of these issues, so maybe we
can emulate it in the SQL guide:
https://spark.apache.org/docs/latest/ml-guide.html#migration-guide

I think that documentation clarity is especially important with Spark 3.0
around the corner: many folks will seek out this information when they
upgrade, so improving this guide can be a high-leverage, high-impact
activity.

What do folks think? Does anyone have examples from other projects which do
a notably good job of crafting release notes / migration guides? I'd be
glad to help with pre-release editing after we decide on a structure and
style.

Cheers,
Josh
