Let me just clarify what my +1 was about, because that could be unclear. I was not talking (and I believe Ryan was not talking either) about removing the old docs, but about archiving them and serving them from elsewhere (cloud storage).
I think discussing a move to more shared HTML/JS/CSS is also a good optimisation, but it can probably be handled separately, as a longer effort of redesigning how the docs are built. By all means, we could also work on that.

Maybe I jumped to conclusions, but the easiest, tactical solution for the most acute issue (size) is to move the old generated HTML docs out of the "airflow-site" git repository and, in the "github_pages" branch, replace each page with a redirect to the same file served from cloud storage (and I believe this is what Ryan hinted at). Those redirects could be generated automatically for all historical versions, and they would be small. We already do something similar for individual pages when navigating between versions, and we could easily replace every historical page with:

<html><head><meta http-equiv="refresh" content="0; url=https://new-archive-docs-airflow-url/airflow/version/document.url"/></head></html>

Low-tech and "legacy", surely, but it would solve the size problem instantly. We currently have 115,148 such files, which would shrink to about ~20 MB of redirect stubs - peanuts compared to the 17 GB (!) we have now.

We could also inject into the moved "storage" docs a header informing readers that they are looking at old/archived documentation, with a single link to the "live"/"stable" site for newer versions (which, I believe, is what sparked Ryan's work).

This can be done at least as a "quick" remediation for the size issue, and it might allow the current scheme to keep working without an ever-growing repository eating the build action's disk space. With such an automated mechanism in place, we could periodically archive old docs - all without changing our build process, simply keeping the old "past" docs elsewhere, still accessible to users.

Not much should change for users, IMHO. If they go to an old version of the docs or use old, archived URLs, they end up seeing the same content and navigation they see today (with the extra information that it is an old version served from a different URL). When they go to an "old" version of the documentation, they are redirected to the new location - the same HTML, but hosted on cloud storage, fully statically. We already do that with the "redirect" mechanism.

In the meantime, someone could also work on a strategic solution and change the current build process, but that is, I think, a different step - much more complex and requiring a lot of effort. It could simply end up regenerating whatever is left as "live" documentation, leaving the archived docs intact.
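To make the redirect generation concrete, here is a rough, untested sketch of how the stubs could be produced (the base URL is the same placeholder as in the snippet above, and the script and function names are made up for illustration). It walks docs-archive/ and overwrites every HTML page with a tiny meta-refresh stub pointing at the same path on cloud storage:

    #!/usr/bin/env python3
    # Rough, untested sketch (not an existing airflow-site script): replace
    # every archived HTML page with a ~150-byte meta-refresh stub pointing
    # at the copy previously uploaded, with the same layout, to cloud storage.
    from pathlib import Path

    # Placeholder URL - the real bucket/CDN address is still to be decided.
    ARCHIVE_BASE_URL = "https://new-archive-docs-airflow-url"

    STUB = (
        '<html><head><meta http-equiv="refresh" '
        'content="0; url={url}"/></head></html>'
    )

    def replace_with_redirect_stubs(docs_archive: Path) -> None:
        # Walk all archived pages and overwrite each one in place.
        for page in docs_archive.rglob("*.html"):
            relative = page.relative_to(docs_archive).as_posix()
            page.write_text(STUB.format(url=f"{ARCHIVE_BASE_URL}/{relative}"))

    if __name__ == "__main__":
        replace_with_redirect_stubs(Path("docs-archive"))

At roughly 150 bytes per stub times ~115k files, that is where my ~20 MB estimate above comes from.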
That's at least what I see as a possible set of steps to take.

J.

On Thu, Oct 19, 2023 at 2:14 PM utkarsh sharma <utkarshar...@gmail.com> wrote:

> Hey everyone,
>
> Thanks, Ryan, for starting the thread :)
>
> Big +1 for archiving docs older than 18 months. We can still make the older docs available in `rst` form.
>
> But eventually we might run into this problem again because of the growing number of providers. I think the main reason for this issue is the generated static HTML pages and the way we serve them from GitHub Pages. The generated pages have lots of common code - HTML (headers/navigation/breadcrumbs/footer etc.), CSS, JS - which is repeated for every provider and every version of that provider. If we had a more dynamic way (Django/Flask servers) of serving the documents, we could save all the space taken by that common HTML/CSS/JS.
>
> But the downsides of this approach are:
> 1. We need to have a server.
> 2. It also requires changes in the existing document build process to only produce partial HTML documents.
>
> Thanks,
> Utkarsh Sharma
>
> On Thu, Oct 19, 2023 at 4:08 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > Yes. Moving the old versions to somewhere we can keep/archive those static historical docs and publish them from there. What you proposed is exactly the solution I thought might be best as well.
> >
> > It would be a great task to contribute to the stability of our docs generation in the future.
> >
> > I don't think it's a matter of discussing in detail how to do it (18 months is a good start, and you can parameterize it). It's a matter of someone committing to it and simply doing it :).
> >
> > So yes, I personally am all for it, and if I understand correctly that you are looking for agreement on doing it, big +1 from my side - happy to help with providing access to our S3 buckets.
> >
> > J.
> >
> > On Thu, Oct 19, 2023 at 5:39 AM Ryan Hatter <ryan.hat...@astronomer.io.invalid> wrote:
> >
> > > *tl;dr*
> > >
> > > 1. The GitHub Action for building docs is running out of space. I think we should archive really old documentation for large packages to cloud storage.
> > > 2. Contributing to and building Airflow docs is hard. We should migrate to a framework, preferably one that uses markdown (although I acknowledge rst -> md would be a massive overhaul).
> > >
> > > *Problem Summary*
> > > I recently set out to implement what I thought would be a straightforward feature: warn users when they are viewing documentation for non-current versions of Airflow and link them to the current/stable version <https://github.com/apache/airflow/pull/34639>. Jed pointed me to the airflow-site <https://github.com/apache/airflow-site> repo, which contains all of the archived docs (that is, documentation for non-current versions), and from there, I ran into a brick wall.
> > >
> > > I want to raise some concerns that I've developed after trying to contribute what feel like a couple of reasonably small docs updates:
> > >
> > > 1. airflow-site
> > >    1. Elad pointed out the problem posed by the sheer size of the archived docs <https://apache-airflow.slack.com/archives/CCPRP7943/p1697009000242369?thread_ts=1696973512.004229&cid=CCPRP7943> (more on this later).
> > >    2. The airflow-site repo is confusing and rather poorly documented.
> > >       1. Hugo (a static site generator) exists, but appears to only be used for the landing pages.
> > >       2. In order to view any documentation locally other than the landing pages, you'll need to run the site.sh script and then copy the output from one dir to another?
> > >    3. All of the archived docs are raw HTML, making migrating to a static site generator a significant challenge, which makes it difficult to prevent the archived docs from growing and growing. Perhaps this is the wheel Khaleesi was referring to <https://www.youtube.com/watch?v=J-rxmk6zPxA>?
> > > 2. airflow
> > >    1. Building Airflow docs is a challenge. It takes several minutes and doesn't support auto-build, so the slightest issue could require waiting again and again until the changes are just so. I tried implementing sphinx-autobuild <https://github.com/executablebooks/sphinx-autobuild> to no avail.
> > >    2. Sphinx/reStructuredText has a steep learning curve.
> > >
> > > *The most acute issue: disk space*
> > > The size of the archived docs is causing the docs build GitHub Action to almost run out of space. From the "Build site" Action from a couple of weeks ago <https://github.com/apache/airflow-site/actions/runs/6419529645/job/17432628458> (expand the build site step, scroll all the way to the bottom, expand the `df -h` command), we can see the GitHub Action runner (or whatever it's called) is nearly running out of space:
> > >
> > > df -h
> > > Filesystem  Size  Used  Avail  Use%  Mounted on
> > > /dev/root    84G   82G   2.1G   98%  /
> > >
> > > The available space is down to 1.8G on the most recent Action <https://github.com/apache/airflow-site/actions/runs/6564727255/job/17831714176>. If we assume that trend is accurate, we have about two months before the Action runner runs out of disk space. Here's a breakdown of the space consumed by the 10 largest package documentation directories:
> > >
> > > du -h -d 1 docs-archive/ | sort -h -r
> > >  14G  docs-archive/
> > > 4.0G  docs-archive//apache-airflow-providers-google
> > > 3.2G  docs-archive//apache-airflow
> > > 1.7G  docs-archive//apache-airflow-providers-amazon
> > > 560M  docs-archive//apache-airflow-providers-microsoft-azure
> > > 254M  docs-archive//apache-airflow-providers-cncf-kubernetes
> > > 192M  docs-archive//apache-airflow-providers-apache-hive
> > > 153M  docs-archive//apache-airflow-providers-snowflake
> > > 139M  docs-archive//apache-airflow-providers-databricks
> > > 104M  docs-archive//apache-airflow-providers-docker
> > > 101M  docs-archive//apache-airflow-providers-mysql
> > >
> > > *Proposed solution: archive old docs HTML for large packages to cloud storage*
> > > I'm wondering if it would be reasonable to truly archive the docs for some of the older versions of these packages. Perhaps the last 18 months? Maybe we could drop the HTML in a blob storage bucket with instructions for building the docs if absolutely necessary?
> > >
> > > *Improving docs building moving forward*
> > > There's an open issue <https://github.com/apache/airflow-site/issues/719> for migrating the docs to a framework, but it's not at all a straightforward task for the archived docs. I think that we should institute a policy of archiving old documentation to cloud storage after X time and use a framework for building docs in a scalable and sustainable way moving forward. Maybe we could chat with the Iceberg folks about how they moved from mkdocs to hugo? <https://github.com/apache/iceberg/issues/3616>
> > >
> > > Shoutout to Utkarsh for helping me through all this!