Let me just clarify (because that could be unclear) what my +1 was about.

I was not talking (and I believe Ryan was not talking either) about
removing the old docs, but about archiving them and serving them from
elsewhere (cloud storage).

I think moving to more shared HTML/JS/CSS is also a good idea to optimise
the size, but it can probably be handled separately, as a longer effort of
redesigning how the docs are built. By all means we could work on that too.

Maybe I jumped to conclusions, but the easiest, tactical solution (for the
most acute issue - size) is to move the old generated HTML docs out of the
"airflow-site" git repository and, in the "github_pages" branch, replace
them with redirects to the same files served from cloud storage (I believe
this is what Ryan hinted at).

Those redirects could be generated automatically for all historical
versions, and they would be tiny. We already do this for individual pages
when navigating between versions, and we could easily replace every
historical doc with something like:

<html><head><meta http-equiv="refresh" content="0;
url=https://new-archive-docs-airflow-url/airflow/version/document.url"/></head></html>

Low-tech and "legacy", surely, but it would solve the size problem
instantly. We currently have 115,148 such files; as redirect stubs they
would shrink to roughly ~20 MB in total, which is peanuts compared to the
17 GB (!) we have today.
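To illustrate, a minimal sketch of generating such stubs (the archive base
URL is the same placeholder as above, and the script itself is only my
assumption about how we could wire it, not an existing tool):

#!/usr/bin/env python3
# Sketch: replace every archived HTML page with a tiny meta-refresh stub
# pointing at the same path under the (hypothetical) archive storage URL.
from pathlib import Path

ARCHIVE_BASE = "https://new-archive-docs-airflow-url"  # placeholder, not a real bucket
STUB = '<html><head><meta http-equiv="refresh" content="0; url={url}"/></head></html>'

for page in Path("docs-archive").rglob("*.html"):
    rel = page.relative_to("docs-archive").as_posix()
    page.write_text(STUB.format(url=f"{ARCHIVE_BASE}/{rel}"))

Each stub is well under 200 bytes, which is where the ~20 MB estimate for
115,148 files comes from.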

We can also inject into the moved "storage" docs a header informing the
reader that this is old/archived documentation, with a single link to the
"live"/"stable" site for newer versions (which I believe is what sparked
Ryan's work). This could serve at least as a "quick" remediation for the
size issue, and it might let the current scheme keep working without an
ever-growing repo eating space in the build action. With such an automated
mechanism in place, we could periodically archive old docs - all without
changing our build process, simply keeping the old "past" docs elsewhere
(still accessible to users).
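A rough sketch of what that injection could look like (the banner markup,
wording, and the staging directory name are my assumptions, not anything we
have today):

#!/usr/bin/env python3
# Sketch: prepend an "archived version" banner right after <body> in each
# page before uploading it to storage. Markup and wording are assumptions.
import re
from pathlib import Path

BANNER = (
    '<div class="archived-banner">You are viewing archived documentation - '
    'see <a href="https://airflow.apache.org/docs/">the stable docs</a> '
    'for current versions.</div>'
)

for page in Path("archive-upload").rglob("*.html"):  # hypothetical staging dir
    html = page.read_text(errors="ignore")
    if "archived-banner" not in html:
        page.write_text(re.sub(r"(<body[^>]*>)", r"\1" + BANNER, html, count=1))

This could run as part of the same periodic job that moves old docs to
storage and regenerates the redirect stubs.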

Not much should change for users IMHO - if they go to an old version of the
docs or use old, archived URLs, they would end up seeing the same content
and navigation they see today (with the extra information that it is an old
version served from a different URL). When they hit an "old" URL they would
simply be redirected to the archived copy - the same HTML, hosted on cloud
storage, fully statically. We already do that with the existing "redirect"
mechanism.

In the meantime, someone could also work on the strategic solution -
changing the current build process - but that is, I think, a different,
much more complex step requiring a lot of effort. It could simply end up
regenerating whatever is left as "live" documentation (leaving the archived
docs intact).

That's at least what I see as a possible set of steps to take.

J.

On Thu, Oct 19, 2023 at 2:14 PM utkarsh sharma <utkarshar...@gmail.com>
wrote:

> Hey everyone,
>
> Thanks, Ryan, for starting the thread :)
>
> Big +1 for archiving docs older than 18 months. We can still make the older
> docs available in `rst` form.
>
> But eventually, we might run into this problem again because of the growing
> number of providers. I think the main reason for the issue is the generated
> static HTML pages and the way we serve them using GitHub Pages. The
> generated pages contain lots of common HTML (headers/navigation/breadcrumbs/
> footer etc.), CSS, and JS which is repeated for every provider and every
> version of that provider. If we had a more dynamic way (Django/Flask
> servers) of serving the documents, we could save all the space taken by the
> common HTML/CSS/JS.
>
> But the downsides of this approach are:
> 1. We would need to run a server.
> 2. It would also require changes in the existing docs build process to
> produce only partial HTML documents.
>
> Thanks,
> Utkarsh Sharma
>
> On Thu, Oct 19, 2023 at 4:08 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > Yes. Moving the old versions somewhere we can keep/archive the static
> > historical docs and publish them from there. What you proposed is exactly
> > the solution I thought might be best as well.
> >
> > It would be a great task to contribute to the stability of our docs
> > generation in the future.
> >
> > I don't think it's a matter of discussing in detail how to do it (18
> > months is a good start and you can parameterize it); it's a matter of
> > someone committing to it and simply doing it :).
> >
> > So yes, I personally am all for it, and if I understand correctly that
> > you are looking for agreement on doing it, big +1 from my side - happy to
> > help with providing access to our S3 buckets.
> >
> > J.
> >
> > On Thu, Oct 19, 2023 at 5:39 AM Ryan Hatter
> > <ryan.hat...@astronomer.io.invalid> wrote:
> >
> > > *tl;dr*
> > >
> > >    1. The GitHub Action for building docs is running out of space. I
> > >    think we should archive really old documentation for large packages
> > >    to cloud storage.
> > >    2. Contributing to and building Airflow docs is hard. We should
> > >    migrate to a framework, preferably one that uses markdown (although
> > >    I acknowledge rst -> md will be a massive overhaul).
> > >
> > > *Problem Summary*
> > > I recently set out to implement what I thought would be a
> > > straightforward feature: warn users when they are viewing documentation
> > > for non-current versions of Airflow and link them to the current/stable
> > > version <https://github.com/apache/airflow/pull/34639>. Jed pointed me
> > > to the airflow-site <https://github.com/apache/airflow-site> repo,
> > > which contains all of the archived docs (that is, documentation for
> > > non-current versions), and from there, I ran into a brick wall.
> > >
> > > I want to raise some concerns that I've developed after trying to
> > > contribute what feel like a couple of reasonably small docs updates:
> > >
> > >    1. airflow-site
> > >       1. Elad pointed out the problem posed by the sheer size of the
> > >       archived docs
> > >       <https://apache-airflow.slack.com/archives/CCPRP7943/p1697009000242369?thread_ts=1696973512.004229&cid=CCPRP7943>
> > >       (more on this later).
> > >       2. The airflow-site repo is confusing and rather poorly
> > >       documented.
> > >          1. Hugo (static site generator) exists, but appears to only be
> > >          used for the landing pages.
> > >          2. In order to view any documentation locally other than the
> > >          landing pages, you'll need to run the site.sh script and then
> > >          copy the output from one dir to another?
> > >       3. All of the archived docs are raw HTML, making migrating to a
> > >       static site generator a significant challenge, which makes it
> > >       difficult to prevent the archived docs from continuing to grow
> > >       and grow. Perhaps this is the wheel Khaleesi was referring to
> > >       <https://www.youtube.com/watch?v=J-rxmk6zPxA>?
> > >    2. airflow
> > >       1. Building Airflow docs is a challenge. It takes several minutes
> > >       and doesn't support auto-build, so the slightest issue could
> > >       require waiting again and again until the changes are just so. I
> > >       tried implementing sphinx-autobuild
> > >       <https://github.com/executablebooks/sphinx-autobuild> to no
> > >       avail.
> > >       2. Sphinx/reStructuredText has a steep learning curve.
> > >
> > > *The most acute issue: disk space*
> > > The size of the archived docs is causing the docs build GitHub Action
> > > to almost run out of space. From the "Build site" Action from a couple
> > > of weeks ago
> > > <https://github.com/apache/airflow-site/actions/runs/6419529645/job/17432628458>
> > > (expand the build site step, scroll all the way to the bottom, expand
> > > the `df -h` command), we can see the GitHub Action runner (or whatever
> > > it's called) is nearly running out of space:
> > >
> > > df -h
> > >   Filesystem      Size  Used Avail Use% Mounted on
> > >   /dev/root        84G   82G  2.1G  98% /
> > >
> > >
> > > The available space is down to 1.8G on the most recent Action
> > > <https://github.com/apache/airflow-site/actions/runs/6564727255/job/17831714176>.
> > > If we assume that trend is accurate, we have about two months before
> > > the Action runner runs out of disk space. Here's a breakdown of the
> > > space consumed by the 10 largest package documentation directories:
> > >
> > > du -h -d 1 docs-archive/ | sort -h -r
> > >  14G docs-archive/
> > > 4.0G docs-archive//apache-airflow-providers-google
> > > 3.2G docs-archive//apache-airflow
> > > 1.7G docs-archive//apache-airflow-providers-amazon
> > > 560M docs-archive//apache-airflow-providers-microsoft-azure
> > > 254M docs-archive//apache-airflow-providers-cncf-kubernetes
> > > 192M docs-archive//apache-airflow-providers-apache-hive
> > > 153M docs-archive//apache-airflow-providers-snowflake
> > > 139M docs-archive//apache-airflow-providers-databricks
> > > 104M docs-archive//apache-airflow-providers-docker
> > > 101M docs-archive//apache-airflow-providers-mysql
> > >
> > >
> > > *Proposed solution: Archive old docs HTML for large packages to cloud
> > > storage*
> > > I'm wondering if it would be reasonable to truly archive the docs for
> > > some of the older versions of these packages. Perhaps everything older
> > > than the last 18 months? Maybe we could drop the HTML in a blob storage
> > > bucket with instructions for building the docs if absolutely necessary?
> > >
> > > *Improving docs building moving forward*
> > > There's an open Issue
> > > <https://github.com/apache/airflow-site/issues/719> for migrating the
> > > docs to a framework, but it's not at all a straightforward task for the
> > > archived docs. I think that we should institute a policy of archiving
> > > old documentation to cloud storage after X time and use a framework for
> > > building docs in a scalable and sustainable way moving forward. Maybe
> > > we could chat with the Iceberg folks about how they moved from mkdocs
> > > to Hugo? <https://github.com/apache/iceberg/issues/3616>
> > >
> > >
> > > Shoutout to Utkarsh for helping me through all this!
> > >
> >
>
