Hey everyone,

Thanks, Ryan, for starting the thread :)

Big +1 for archiving docs older than 18 months. We can still make the older
docs available in `rst` form.
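For illustration, here's a rough sketch of what that archival step could
look like. Everything here is hypothetical (the bucket name, the cutoff
based on directory mtime, the layout assumption of one
`docs-archive/<provider>/<version>` dir per release), and the commands are
only printed, not executed:

```python
# Dry-run sketch: list provider-version doc dirs older than ~18 months and
# print the upload/cleanup commands we would run. The bucket name and the
# mtime-based cutoff are made up for illustration.
import time
from pathlib import Path

CUTOFF_SECONDS = 18 * 30 * 24 * 3600  # roughly 18 months
BUCKET = "s3://airflow-docs-archive"  # hypothetical bucket name


def stale_version_dirs(archive_dir, now=None):
    """Yield docs-archive/<provider>/<version> dirs older than the cutoff."""
    now = time.time() if now is None else now
    for provider in sorted(archive_dir.iterdir()):
        if not provider.is_dir():
            continue
        for version in sorted(provider.iterdir()):
            if version.is_dir() and now - version.stat().st_mtime > CUTOFF_SECONDS:
                yield version


if __name__ == "__main__" and Path("docs-archive").is_dir():
    for d in stale_version_dirs(Path("docs-archive")):
        # Print instead of executing: a real run would use `aws s3 sync`
        # and delete the local copy afterwards.
        print(f"aws s3 sync {d} {BUCKET}/{d.relative_to('docs-archive')}")
        print(f"rm -rf {d}")
```

A real implementation would key off release dates rather than mtimes, but
the shape would be the same.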

But eventually, we might run into this problem again because of the growing
number of providers. I think the main reason for this issue is the generated
static HTML pages and the way we serve them using GitHub Pages. The
generated pages contain a lot of common HTML
(headers/navigation/breadcrumbs/footer, etc.), CSS, and JS, which is
repeated for every provider and every version of that provider. If we had a
more dynamic way of serving the documents (a Django/Flask server), we could
save all the space taken up by that common HTML/CSS/JS.

But the downsides of this approach are:
1. We would need to run a server.
2. It would also require changes to the existing document build process so
that it produces only partial HTML documents.
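To make the idea concrete, here's a minimal sketch of that dynamic
approach. All the names here are hypothetical (the `partial-docs`
directory, the route, the layout string): the server stores only the
body fragment for each provider/version page and wraps it in the shared
header/navigation/footer at request time, so the common chrome lives in
exactly one place instead of being repeated in every generated page.

```python
# Hypothetical sketch: serve body-only HTML fragments wrapped in one shared
# layout, so the common header/nav/footer is stored exactly once.
from pathlib import Path

from flask import Flask, abort, render_template_string

app = Flask(__name__)

# Shared chrome, kept once for all providers and versions.
LAYOUT = """<html><head><title>{{ title }}</title></head>
<body><nav>shared navigation</nav>
{{ body | safe }}
<footer>shared footer</footer></body></html>"""

DOCS_ROOT = Path("partial-docs")  # hypothetical dir of body-only fragments


@app.route("/docs/<provider>/<version>/<page>")
def doc_page(provider, version, page):
    fragment = DOCS_ROOT / provider / version / f"{page}.html"
    if not fragment.is_file():
        abort(404)
    # Wrap the stored fragment in the shared layout at request time.
    return render_template_string(
        LAYOUT, title=f"{provider} {version}", body=fragment.read_text()
    )
```

The docs build would then only need to emit the per-page body fragments,
which is exactly the build-process change mentioned in downside 2.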

Thanks,
Utkarsh Sharma

On Thu, Oct 19, 2023 at 4:08 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Yes. Moving the old version to somewhere that we can keep/archive static
> historical versions of those historical docs and publish them from there.
> What you proposed is exactly the solution I thought might be best as well.
>
> It would be a great task to contribute to the stability of our docs
> generation in the future.
>
> I don't think it's a matter of discussing in detail how to do it (18 months
> is a good start and you can parameterize it); it's a matter of
> someone committing to it and simply doing it :).
>
> So yes I personally am all for it and if I understand correctly that you
> are looking for agreement on doing it, big +1 from my side - happy to help
> with providing access to our S3 buckets.
>
> J.
>
> On Thu, Oct 19, 2023 at 5:39 AM Ryan Hatter
> <ryan.hat...@astronomer.io.invalid> wrote:
>
> > *tl;dr*
> >
> >    1. The GitHub Action for building docs is running out of space. I
> >    think we should archive really old documentation for large packages
> >    to cloud storage.
> >    2. Contributing to and building Airflow docs is hard. We should
> >    migrate to a framework, preferably one that uses markdown (although
> >    I acknowledge rst -> md will be a massive overhaul).
> >
> > *Problem Summary*
> > I recently set out to implement what I thought would be a straightforward
> > feature: warn users when they are viewing documentation for non-current
> > versions of Airflow and link them to the current/stable version
> > <https://github.com/apache/airflow/pull/34639>. Jed pointed me to the
> > airflow-site <https://github.com/apache/airflow-site> repo, which
> > contains all of the archived docs (that is, documentation for
> > non-current versions), and from there, I ran into a brick wall.
> >
> > I want to raise some concerns that I've developed after trying to
> > contribute what feel like a couple of reasonably small docs updates:
> >
> >    1. airflow-site
> >       1. Elad pointed out the problem posed by the sheer size of
> >       archived docs
> >       <https://apache-airflow.slack.com/archives/CCPRP7943/p1697009000242369?thread_ts=1696973512.004229&cid=CCPRP7943>
> >       (more on this later).
> >       2. The airflow-site repo is confusing, and rather poorly
> >       documented.
> >          1. Hugo (a static site generator) exists, but appears to be
> >          used only for the landing pages.
> >          2. In order to view any documentation locally other than the
> >          landing pages, you'll need to run the site.sh script and then
> >          copy the output from one dir to another?
> >       3. All of the archived docs are raw HTML, making migrating to a
> >       static site generator a significant challenge, which makes it
> >       difficult to prevent the archived docs from continuing to grow
> >       and grow. Perhaps this is the wheel Khaleesi was referring to
> >       <https://www.youtube.com/watch?v=J-rxmk6zPxA>?
> >    2. airflow
> >       1. Building Airflow docs is a challenge. It takes several minutes
> >       and doesn't support auto-build, so the slightest issue could
> >       require waiting again and again until the changes are just so. I
> >       tried implementing sphinx-autobuild
> >       <https://github.com/executablebooks/sphinx-autobuild> to no
> >       avail.
> >       2. Sphinx/reStructuredText has a steep learning curve.
> >
> > *The most acute issue: disk space*
> > The size of the archived docs is causing the docs build GitHub Action
> > to almost run out of space. From the "Build site" Action from a couple
> > of weeks ago
> > <https://github.com/apache/airflow-site/actions/runs/6419529645/job/17432628458>
> > (expand the build site step, scroll all the way to the bottom, and
> > expand the `df -h` command), we can see the GitHub Action runner (or
> > whatever it's called) is nearly running out of space:
> >
> > df -h
> >   *Filesystem      Size  Used Avail Use% Mounted on*
> >   /dev/root        84G   82G  2.1G  98% /
> >
> >
> > The available space is down to 1.8G on the most recent Action
> > <https://github.com/apache/airflow-site/actions/runs/6564727255/job/17831714176>.
> > If we assume that trend is accurate, we have about two months before
> > the Action runner runs out of disk space. Here's a breakdown of the
> > space consumed by the 10 largest package documentation directories:
> >
> > du -h -d 1 docs-archive/ | sort -h -r
> > * 14G* docs-archive/
> > *4.0G* docs-archive//apache-airflow-providers-google
> > *3.2G* docs-archive//apache-airflow
> > *1.7G* docs-archive//apache-airflow-providers-amazon
> > *560M* docs-archive//apache-airflow-providers-microsoft-azure
> > *254M* docs-archive//apache-airflow-providers-cncf-kubernetes
> > *192M* docs-archive//apache-airflow-providers-apache-hive
> > *153M* docs-archive//apache-airflow-providers-snowflake
> > *139M* docs-archive//apache-airflow-providers-databricks
> > *104M* docs-archive//apache-airflow-providers-docker
> > *101M* docs-archive//apache-airflow-providers-mysql
> >
> >
> > *Proposed solution: archive old docs HTML for large packages to cloud
> > storage*
> > I'm wondering if it would be reasonable to truly archive the docs for
> > some of the older versions of these packages. Perhaps keep only the
> > last 18 months? Maybe we could drop the HTML in a blob storage bucket
> > with instructions for building the docs if absolutely necessary?
> >
> > *Improving docs building moving forward*
> > There's an open Issue
> > <https://github.com/apache/airflow-site/issues/719> for migrating the
> > docs to a framework, but it's not at all a straightforward task for the
> > archived docs. I think that we should institute a policy of archiving
> > old documentation to cloud storage after X time and use a framework for
> > building docs in a scalable and sustainable way moving forward. Maybe
> > we could chat with the Iceberg folks about how they moved from mkdocs
> > to Hugo <https://github.com/apache/iceberg/issues/3616>?
> >
> >
> > Shoutout to Utkarsh for helping me through all this!
> >
>
