This definitely sounds like something that needs doing sooner rather than
later.

While I'd love to help, I'm not very experienced in this area, so I might
not be able to propose the changes that need doing myself. But if someone
has a path forward on this, I can definitely contribute some time to help
out, given some guidance on what is needed.

--
Regards,
Aritra Basu

On Mon, Oct 23, 2023, 2:19 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> Some news here.
>
> I caught up with some infra changes that happened while I was travelling -
> and I have just (with https://github.com/apache/airflow-site/pull/879)
> switched the "airflow-site" building to the new, self-hosted "asf-runners".
> This is a new option that ASF infra has made available for ASF projects to
> test - rather than relying on "public runners", we can switch to self-hosted
> runners donated by Microsoft to the ASF. More info here:
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=ASF+Infra+provided+self-hosted+runners
>
> The most important result is that we now have a lot more "breathing space"
> for the docs building job. During the build we are using max 59% of the
> disk space - with 73GB used and 52GB free.
>
>  Filesystem      Size  Used Avail Use% Mounted on
>   overlay         124G   73G   52G  59% /
>
> This is - on one hand - good news (disk space is not an "acute" issue any
> more): anyone who would like to work on improving our docs building now has
> much more breathing space to do so.
> But - clearly - it might also mean that the incentive to work on it has
> decreased, because it "just works". That's the bad side of it, and I think
> that's not good, though the most I can do is reiterate Ryan's concerns and
> hope we will get someone committing to improving this.
>
> I would strongly encourage those who want to improve it to do so. As Ryan
> stated, contributing to our docs is more complex than it should be, and
> anyone who would like to contribute there is most welcome. I very much
> share all the points that Ryan made and I think we should welcome any
> efforts to make it better. The lack of incremental/auto-build support is
> especially troublesome for anyone who wants to contribute docs. Happy to
> help anyone who would like to take on the task.
>
> Still - if we would like to move old docs outside as a first step, I am
> happy to help anyone who would like to commit to doing it.
>
> J.
>
> On Fri, Oct 20, 2023 at 3:27 PM Pierre Jeambrun <pierrejb...@gmail.com>
> wrote:
>
> > +1 for moving archived docs outside of airflow-site.
> >
> > Even if that might mean a little more maintenance - in case we need to
> > propagate changes to all historical versions, we would have to handle two
> > repositories - that seems like a minor downside compared to the quality of
> > life improvement it would bring for airflow-site contributions.
> >
> > On Thu, Oct 19, 2023 at 16:11, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > > Let me just clarify (because that could be unclear) what my +1 was
> > > about.
> > >
> > > I was not talking (and I believe Ryan was not talking either) about
> > > removing the old docs but about archiving them and serving them from
> > > elsewhere (cloud storage).
> > >
> > > I think discussing a change to more shared HTML/JS/CSS is also a good
> > > idea to optimise it, but it could possibly be handled separately as a
> > > longer effort of redesigning how the docs are built. But by all means we
> > > could also work on that.
> > >
> > > Maybe I jumped to conclusions, but the easiest, tactical solution (for
> > > the most acute issue - size) is to simply move the old generated HTML
> > > docs out of the git repository of "airflow-site" and, in the
> > > "github_pages" branch, replace those pages with redirects to the files
> > > served from the cloud storage (and I believe this is what Ryan hinted
> > > at).
> > >
> > > Those redirects could be automatically generated for all historical
> > > versions and they will be small. We are already doing it for individual
> > > pages for navigating between versions, but we could easily replace all
> > > the historical docs with a stub like
> > > <html><head><meta http-equiv="refresh" content="0;
> > > url=https://new-archive-docs-airflow-url/airflow/version/document.url"/>
> > > </head></html>. Low-tech, surely, and "legacy", but it will solve the
> > > size problem instantly. We currently have 115,148 such files, which will
> > > go down to about ~20 MB of files - peanuts compared to the current 17GB
> > > (!) we have.
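> > >
> > > Just to illustrate (a rough sketch only - the "docs-archive" path and
> > > the base URL below are placeholders, nothing we have decided on),
> > > generating those stubs could look something like this:
> > >
> > >   # Sketch: replace every archived HTML page with a tiny meta-refresh
> > >   # stub pointing at the same path on the external storage.
> > >   from pathlib import Path
> > >
> > >   ARCHIVE_ROOT = Path("docs-archive")  # hypothetical local path
> > >   NEW_BASE_URL = "https://new-archive-docs-airflow-url"  # placeholder
> > >
> > >   STUB = (
> > >       '<html><head><meta http-equiv="refresh" '
> > >       'content="0; url={url}"/></head></html>'
> > >   )
> > >
> > >   for page in ARCHIVE_ROOT.rglob("*.html"):
> > >       relative = page.relative_to(ARCHIVE_ROOT).as_posix()
> > >       page.write_text(STUB.format(url=f"{NEW_BASE_URL}/{relative}"))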
> > >
> > > We can also inject into the moved "storage" docs a header that informs
> > > readers that this is old/archived documentation, with a single redirect
> > > to the "live"/"stable" site for newer versions of the docs (which I
> > > believe is what sparked Ryan's work). This can be done at least as the
> > > "quick" remediation for the size issue, and it might allow the current
> > > scheme to keep working without an ever-growing repo and without using up
> > > space in the build action. If we have such an automated mechanism in
> > > place, we could periodically archive old docs - all without changing our
> > > build process, simply keeping the old "past" docs elsewhere (still
> > > accessible for users).
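> > >
> > > Again only as an illustration - the banner text, the CSS class and the
> > > paths here are made up - the header injection could be a small script
> > > along these lines:
> > >
> > >   # Sketch: prepend an "archived docs" banner right after <body> in
> > >   # every archived HTML page before uploading it to storage.
> > >   from pathlib import Path
> > >
> > >   BANNER = (
> > >       '<div class="archived-docs-banner">You are viewing archived '
> > >       'documentation. See <a href="https://airflow.apache.org/docs/">'
> > >       'the stable docs</a> for the latest version.</div>'
> > >   )
> > >
> > >   for page in Path("docs-archive").rglob("*.html"):
> > >       html = page.read_text()
> > >       start = html.find("<body")
> > >       end = html.find(">", start) if start != -1 else -1
> > >       if end == -1 or BANNER in html:
> > >           continue
> > >       page.write_text(html[:end + 1] + BANNER + html[end + 1:])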
> > >
> > > Not much should change for the users IMHO - if they go to an old
> > > version of the docs or use old, archived URLs, they would end up seeing
> > > the same content/navigation they see today (with extra information that
> > > it's an old version served from a different URL). When they go to the
> > > "old" version of the documentation they would be redirected to the new
> > > location - the same HTML but hosted on cloud storage, fully statically.
> > > We already do that with the "redirect" mechanism.
> > >
> > > In the meantime, someone could also work on a strategic solution and
> > > change the current build process, but that is - I think - a different,
> > > much more complex step requiring a lot of effort. And it could simply
> > > end up regenerating whatever is left as "live" documentation (leaving
> > > the archived docs intact).
> > >
> > > That's at least what I see as a possible set of steps to take.
> > >
> > > J.
> > >
> > > On Thu, Oct 19, 2023 at 2:14 PM utkarsh sharma <utkarshar...@gmail.com>
> > > wrote:
> > >
> > > > Hey everyone,
> > > >
> > > > Thanks, Ryan, for starting the thread :)
> > > >
> > > > Big +1 for archiving docs older than 18 months. We can still make the
> > > > older docs available in `rst` form.
> > > >
> > > > But eventually we might run into this problem again because of the
> > > > growing number of providers. I think the main reason for this issue is
> > > > the generated static HTML pages and the way we serve them using GitHub
> > > > Pages. The generated pages have lots of common code
> > > > (headers/navigation/breadcrumbs/footer etc.) - HTML, CSS, JS - which is
> > > > repeated for every provider and every version of that provider. If we
> > > > had a more dynamic way (Django/Flask servers) of serving the documents,
> > > > we could save all the space taken by that common HTML/CSS/JS.
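> > > >
> > > > To illustrate the idea (just a sketch, not a concrete proposal - the
> > > > "partials" directory, the layout and the route are assumptions about
> > > > how such a setup could look), a small Flask app could wrap
> > > > build-generated partial pages in one shared layout:
> > > >
> > > >   # Sketch: serve body-only partial HTML pages inside a single shared
> > > >   # layout, so headers/navigation/footers are not stored per page.
> > > >   from pathlib import Path
> > > >   from flask import Flask, abort, render_template_string
> > > >
> > > >   app = Flask(__name__)
> > > >   PARTIALS_DIR = Path("partials")  # hypothetical docs build output
> > > >
> > > >   LAYOUT = """
> > > >   <html>
> > > >     <head><title>{{ title }}</title></head>
> > > >     <body>
> > > >       <nav><!-- common navigation/breadcrumbs --></nav>
> > > >       {{ body|safe }}
> > > >       <footer><!-- common footer --></footer>
> > > >     </body>
> > > >   </html>
> > > >   """
> > > >
> > > >   @app.route("/docs/<path:page>")
> > > >   def render_doc(page):
> > > >       # no path sanitisation here - sketch only
> > > >       partial = PARTIALS_DIR / f"{page}.html"
> > > >       if not partial.is_file():
> > > >           abort(404)
> > > >       return render_template_string(
> > > >           LAYOUT, title=page, body=partial.read_text()
> > > >       )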
> > > >
> > > > But the downsides of this approach are:
> > > > 1. We would need to run a server.
> > > > 2. It would also require changes to the existing docs build process so
> > > > that it produces only partial HTML documents.
> > > >
> > > > Thanks,
> > > > Utkarsh Sharma
> > > >
> > > > On Thu, Oct 19, 2023 at 4:08 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> > > >
> > > > > Yes. Moving the old versions to somewhere we can keep/archive the
> > > > > static historical versions of those docs and publish them from there.
> > > > > What you proposed is exactly the solution I thought might be best as
> > > > > well.
> > > > >
> > > > > It would be a great task to contribute to the stability of our docs
> > > > > generation in the future.
> > > > >
> > > > > I don't think it's a matter of discussing in detail how to do it (18
> > > > > months is a good start and you can parameterize it); it's a matter of
> > > > > someone committing to it and simply doing it :).
> > > > >
> > > > > So yes, I personally am all for it, and if I understand correctly
> > > > > that you are looking for agreement on doing it, big +1 from my side -
> > > > > happy to help with providing access to our S3 buckets.
> > > > >
> > > > > J.
> > > > >
> > > > > On Thu, Oct 19, 2023 at 5:39 AM Ryan Hatter
> > > > > <ryan.hat...@astronomer.io.invalid> wrote:
> > > > >
> > > > > > *tl;dr*
> > > > > >
> > > > > >    1. The GitHub Action for building docs is running out of space. I
> > > > > >    think we should archive really old documentation for large
> > > > > >    packages to cloud storage.
> > > > > >    2. Contributing to and building Airflow docs is hard. We should
> > > > > >    migrate to a framework, preferably one that uses markdown
> > > > > >    (although I acknowledge rst -> md will be a massive overhaul).
> > > > > >
> > > > > > *Problem Summary*
> > > > > > I recently set out to implement what I thought would be a
> > > > > > straightforward feature: warn users when they are viewing
> > > > > > documentation for non-current versions of Airflow and link them to
> > > > > > the current/stable version
> > > > > > <https://github.com/apache/airflow/pull/34639>. Jed pointed me to
> > > > > > the airflow-site <https://github.com/apache/airflow-site> repo,
> > > > > > which contains all of the archived docs (that is, documentation for
> > > > > > non-current versions), and from there, I ran into a brick wall.
> > > > > >
> > > > > > I want to raise some concerns that I've developed after trying to
> > > > > > contribute what feel like a couple reasonably small docs updates:
> > > > > >
> > > > > >    1. airflow-site
> > > > > >       1. Elad pointed out the problem posed by the sheer size of
> > > > > >       archived docs
> > > > > >       <https://apache-airflow.slack.com/archives/CCPRP7943/p1697009000242369?thread_ts=1696973512.004229&cid=CCPRP7943>
> > > > > >       (more on this later).
> > > > > >       2. The airflow-site repo is confusing, and rather poorly
> > > > > >       documented.
> > > > > >          1. Hugo (static site generator) exists, but appears to
> > > > > >          only be used for the landing pages
> > > > > >          2. In order to view any documentation locally other than
> > > > > >          the landing pages, you'll need to run the site.sh script
> > > > > >          then copy the output from one dir to another?
> > > > > >       3. All of the archived docs are raw HTML, making migrating to
> > > > > >       a static site generator a significant challenge, which makes
> > > > > >       it difficult to prevent the archived docs from continuing to
> > > > > >       grow and grow. Perhaps this is the wheel Khaleesi was
> > > > > >       referring to <https://www.youtube.com/watch?v=J-rxmk6zPxA>?
> > > > > >    2. airflow
> > > > > >       1. Building Airflow docs is a challenge. It takes several
> > > > > >       minutes and doesn't support auto-build, so the slightest
> > > > > >       issue could require waiting again and again until the changes
> > > > > >       are just so. I tried implementing sphinx-autobuild
> > > > > >       <https://github.com/executablebooks/sphinx-autobuild> to no
> > > > > >       avail.
> > > > > >       2. Sphinx/restructured text has a steep learning curve.
> > > > > >
> > > > > > *The most acute issue: disk space*
> > > > > > The size of the archived docs is causing the docs build GitHub
> > > > > > Action to almost run out of space. From the "Build site" Action
> > > > > > from a couple weeks ago
> > > > > > <https://github.com/apache/airflow-site/actions/runs/6419529645/job/17432628458>
> > > > > > (expand the build site step, scroll all the way to the bottom,
> > > > > > expand the `df -h` command), we can see the GitHub Action runner
> > > > > > (or whatever it's called) is nearly running out of space:
> > > > > >
> > > > > > df -h
> > > > > >   *Filesystem      Size  Used Avail Use% Mounted on*
> > > > > >   /dev/root        84G   82G  2.1G  98% /
> > > > > >
> > > > > >
> > > > > > The available space is down to 1.8G on the most recent Action
> > > > > > <https://github.com/apache/airflow-site/actions/runs/6564727255/job/17831714176>.
> > > > > > If we assume that trend is accurate, we have about two months
> > > > > > before the Action runner runs out of disk space. Here's a breakdown
> > > > > > of the space consumed by the 10 largest package documentation
> > > > > > directories:
> > > > > >
> > > > > > du -h -d 1 docs-archive/ | sort -h -r
> > > > > > * 14G* docs-archive/
> > > > > > *4.0G* docs-archive//apache-airflow-providers-google
> > > > > > *3.2G* docs-archive//apache-airflow
> > > > > > *1.7G* docs-archive//apache-airflow-providers-amazon
> > > > > > *560M* docs-archive//apache-airflow-providers-microsoft-azure
> > > > > > *254M* docs-archive//apache-airflow-providers-cncf-kubernetes
> > > > > > *192M* docs-archive//apache-airflow-providers-apache-hive
> > > > > > *153M* docs-archive//apache-airflow-providers-snowflake
> > > > > > *139M* docs-archive//apache-airflow-providers-databricks
> > > > > > *104M* docs-archive//apache-airflow-providers-docker
> > > > > > *101M* docs-archive//apache-airflow-providers-mysql
> > > > > >
> > > > > >
> > > > > > *Proposed solution: Archive old docs HTML for large packages to
> > > > > > cloud storage*
> > > > > > I'm wondering if it would be reasonable to truly archive the docs
> > > > > > for some of the older versions of these packages - perhaps
> > > > > > everything except the last 18 months? Maybe we could drop the HTML
> > > > > > in a blob storage bucket with instructions for building the docs if
> > > > > > absolutely necessary?
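> > > > > >
> > > > > > If we went that way, the move itself could be fairly simple -
> > > > > > something like the boto3 sketch below (the bucket name and the
> > > > > > package/version in the example are placeholders, and in practice
> > > > > > `aws s3 sync` would probably do the same job):
> > > > > >
> > > > > >   # Sketch: upload one archived package/version directory to blob
> > > > > >   # storage so it can be dropped from the git repository.
> > > > > >   from pathlib import Path
> > > > > >   import boto3
> > > > > >
> > > > > >   s3 = boto3.client("s3")
> > > > > >   BUCKET = "airflow-archived-docs"  # placeholder bucket name
> > > > > >
> > > > > >   def archive_version(package: str, version: str) -> None:
> > > > > >       root = Path("docs-archive") / package / version
> > > > > >       for path in root.rglob("*"):
> > > > > >           if path.is_file():
> > > > > >               key = path.relative_to("docs-archive").as_posix()
> > > > > >               s3.upload_file(str(path), BUCKET, key)
> > > > > >
> > > > > >   archive_version("apache-airflow-providers-google", "1.0.0")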
> > > > > >
> > > > > > *Improving docs building moving forward*
> > > > > > There's an open Issue
> > > > > > <https://github.com/apache/airflow-site/issues/719> for migrating
> > > > > > the docs to a framework, but it's not at all a straightforward task
> > > > > > for the archived docs. I think that we should institute a policy of
> > > > > > archiving old documentation to cloud storage after X time and use a
> > > > > > framework for building docs in a scalable and sustainable way
> > > > > > moving forward. Maybe we could chat with iceberg folks about how
> > > > > > they moved from mkdocs to hugo
> > > > > > <https://github.com/apache/iceberg/issues/3616>?
> > > > > >
> > > > > >
> > > > > > Shoutout to Utkarsh for helping me through all this!
> > > > > >
> > > > >
> > > >
> > >
> >
>
