I forgot to mention in my previous mail that I'm suggesting the above changes because storage is not the primary concern right now, but I'm happy to contribute either way. :)

On Tue, Oct 24, 2023 at 7:43 PM utkarsh sharma <utkarshar...@gmail.com> wrote:

> Hey everyone,
>
> I have a couple of tasks in mind, that might aid in reducing the efforts while working with docs. Right now tasks listed below are difficult to achieve.
>
> 1. Adding a warning based on a specific provider/version of a provider/range of providers. Which was also the task that Ryan was working on.
> 2. Altering a page layout or CSS for a specific provider.
>
> The issue while trying to achieve the above tasks is because of the pre-prepared static files we get as a final product of building documents with *breeze build-docs* in folder docs/_build. The files we get are self-sufficient to be hosted and they are really just used directly leaving no room for customization of any sort.
>
> My proposal would be to break down this process as follows:
>
> 1. We can prepare partial documents as part of *breeze build-docs* which are only responsible for providing HTML to be populated within the Body tag for a specific provider, and not the layout of the entire page.
> 2. We then copy partial static files to the Airflow-site repo within landing pages/site/layouts/docs. Where the layout of the page will be provided by `single.html`, a listing of all the providers will be provided by `list.html`, which are standard hugo <https://gohugo.io/about/what-is-hugo/> features. Also, using static files from `sphinx_airflow_theme` which lives in the same repo, makes the changes on the CSS easy.
> 3. We can then use Hugo to generate static <https://gohugo.io/getting-started/quick-start/#publish-the-site> files and push them to the `gh-pages` branch to publish them using GitHub pages.
>
> Doing the above changes will enable us to do the following:
>
> 1. Will give us more control to work on a specific provider/provider-version if we want by providing templates - https://gohugo.io/templates/lookup-order/
> 2. We will have a specific code to look at depending on the changes one intends to make, right now if you don't know the flow it's a bit difficult to pinpoint the code to change.
>    1. If we want to make changes to a specific provider's content we can do it in Airflow's repo docs/<provider>/*.rst file.
>    2. If we have a change that affects multiple providers or versions we can do it on Airflow Website's repo.
>
> Thanks,
> Utkarsh Sharma
>
> On Tue, Oct 24, 2023 at 3:45 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> So it looks like we have some helping hands and we need someone to lead it :) (just saying).
>>
>> On Tue, Oct 24, 2023 at 8:15 AM Amogh Desai <amoghdesai....@gmail.com> wrote:
>>
>> > +1 (non binding) from me on the thought of moving the older docs (~18 months seems ok) to an archive instead of the repository.
>> >
>> > Coming to the other problem of copying the built docs into airflow-site for releases, maybe we can fix that using a script? Open for thoughts here.
>> >
>> > I would be very happy to help when we start taking this forward, I have some experience in airflow-site and docs side as well. Feel free to reach out over email or slack :)
>> >
>> > Thanks & Regards,
>> > Amogh Desai
>> >
>> > On Mon, Oct 23, 2023 at 3:08 AM Aritra Basu <aritrabasu1...@gmail.com> wrote:
>> >
>> > > This definitely sounds like something that needs doing sooner rather than later.
>> > >
>> > > While I'd love to help, I'm not too experienced with this area so I might not be able to actually propose what changes need doing, but if someone has a path forward on this I can definitely contribute some time to help out given some guidance on what is needed.
>> > >
>> > > --
>> > > Regards,
>> > > Aritra Basu
>> > >
>> > > On Mon, Oct 23, 2023, 2:19 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>> > >
>> > > > Some news here.
>> > > >
>> > > > I caught up with some infra changes that happened while I was travelling - and I have just (with https://github.com/apache/airflow-site/pull/879) switched the "airflow-site" building to the new, self-hosted "asf-runners". This is a new option that ASF infra has given to test for the ASF projects - rather than relying on "public runners", we can switch to self-hosted runners donated by Microsoft to the ASF. More info here:
>> > > >
>> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=ASF+Infra+provided+self-hosted+runners
>> > > >
>> > > > The most important result is that we now have a lot more "breathing space" for the docs building job. During the build we are using max 59% of the disk space - with 73GB used and 52GB free.
>> > > >
>> > > > Filesystem      Size  Used Avail Use% Mounted on
>> > > > overlay         124G   73G   52G  59% /
>> > > >
>> > > > This is - on one hand - good news (disk space is not an "acute" issue any more), I think if someone would like to work on improving the docs building of ours, they have much more breathing space to do so.
>> > > > But - clearly - it might mean that the incentive to work on it decreased (because it "just works"). That's the bad effect of it. And I think it's not good, though the most I can do is to reiterate Ryan's concerns and hope we will get someone committing to improving this.
>> > > >
>> > > > I would strongly encourage those who want to improve it, to do so. I think - as Ryan stated - contributing to our docs is more complex than it should be and anyone who would like to contribute there is most welcome.
>> > > > I very much share all the points that Ryan made and I think we should welcome any efforts to make it better. The lack of incremental/auto-build support is especially troublesome for anyone who wants to contribute their docs. Happy to help anyone who would like to take on the task.
>> > > >
>> > > > Still - if we would like to move old docs outside as a first step, I am happy to help anyone who would like to commit to doing it.
>> > > >
>> > > > J.
>> > > >
>> > > > On Fri, Oct 20, 2023 at 3:27 PM Pierre Jeambrun <pierrejb...@gmail.com> wrote:
>> > > >
>> > > > > +1 for moving archived docs outside of airflow-site.
>> > > > >
>> > > > > Even if that might mean a little more maintenance in case we need to propagate changes to all historical versions, we would have to handle 2 repositories, but that seems like a minor downside compared to the quality of life improvement that it would bring for airflow-site contributions.
>> > > > >
>> > > > > On Thu, Oct 19, 2023 at 4:11 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>> > > > >
>> > > > > > Let me just clarify (because that could be unclear) what my +1 was about.
>> > > > > >
>> > > > > > I was not talking (and I believe Ryan was not talking either) about removing the old docs but about archiving them and serving from elsewhere (cloud storage).
>> > > > > >
>> > > > > > I think discussing changing to more shared HTML/JS/CSS is also a good idea to optimise it, but possibly can be handled separately as a longer effort of redesigning how the docs are built. But by all means we could also work on that.
>> > > > > >
>> > > > > > Maybe I jumped to conclusions, but the easiest, tactical solution (for the most acute issue - size) is we just move the old generated HTML docs from the git repository of "airflow-site" and in the "github_pages" branch we replace it with redirecting of those pages to the files served from the cloud storage (and I believe this is what Ryan hinted at).
>> > > > > >
>> > > > > > Those redirects could be automatically generated for all historical versions and they will be small. We are already doing it for individual pages for navigating between versions, but we could easily replace all the historical docs with "<html><head><meta http-equiv="refresh" content="0; url=https://new-archive-docs-airflow-url/airflow/version/document.url"/></head></html>". Low-tech, surely and "legacy", but it will solve the size problem instantly. We currently have 115,148 such files which will go down to about ~20 MB of files which is peanuts, compared to the current 17GB (!) we have.
>> > > > > >
>> > > > > > We can also inject into the moved "storage" docs, the header that informs that this is an old/archived documentation with single redirect to "live"/"stable" site for newer versions of docs (which I believe sparked Ryan's work). This can be done at least as the "quick" remediation for the size issue and something that might allow the current scheme to work without ever-growing repo/size and using space for the build action. If we have such an automated mechanism in place, we could periodically archive old docs.
>> > > > > > All that without changing the build process of ours and simply keep old "past" docs elsewhere (still accessible for users).
>> > > > > >
>> > > > > > Not much should change for the users IMHO - if they go to the old version of the docs or use old, archived URLs, they would end up seeing the same content/navigation they see today (with extra information it's an old version and served from a different URL).
>> > > > > > When they go to the "old" version of documentation they could be redirected to the new one - same HTML but hosted on cloud storage, fully statically. We already do that with "redirect" mechanism.
>> > > > > >
>> > > > > > In the meantime, someone could also work on a strategic solution - and changing the current build process, but this is - I think a different - and much more complex and requiring a lot of effort - step. And it could simply end up with regenerating whatever is left as "live" documentation (leaving the archive docs intact).
>> > > > > >
>> > > > > > That's at least what I see as a possible set of steps to take.
>> > > > > >
>> > > > > > J.
>> > > > > >
>> > > > > > On Thu, Oct 19, 2023 at 2:14 PM utkarsh sharma <utkarshar...@gmail.com> wrote:
>> > > > > >
>> > > > > > > Hey everyone,
>> > > > > > >
>> > > > > > > Thanks, Ryan for starting the thread :)
>> > > > > > >
>> > > > > > > Big +1 For archiving docs older than 18 months. We can still make the older docs available in `rst` doc form.
>> > > > > > >
>> > > > > > > But eventually, we might again run into this problem because of the growing no. of providers.
>> > > > > > > I think the main reason for this issue is the generated static HTML pages and the way we cater to them using GitHub Pages. The generated pages have lots of common code HTML(headers/navigation/breadcrumbs/footer etc.) CSS, JS which is repeated for every provider and every version of that provider. If we have a more dynamic way(Django/Flask Servers) of catering the documents we can save all the space for common HTML/CSS/JS.
>> > > > > > >
>> > > > > > > But the downsides of this approach are:
>> > > > > > > 1. We need to have a server
>> > > > > > > 2. Also require changes in the existing document build process to only produce partial HTML documents.
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > > Utkarsh Sharma
>> > > > > > >
>> > > > > > > On Thu, Oct 19, 2023 at 4:08 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>> > > > > > >
>> > > > > > > > Yes. Moving the old version to somewhere that we can keep/archive static historical versions of those historical docs and publish them from there. What you proposed is exactly the solution I thought might be best as well.
>> > > > > > > >
>> > > > > > > > It would be a great task to contribute to the stability of our docs generation in the future.
>> > > > > > > >
>> > > > > > > > I don't think it's a matter of discussing in detail how to do it (18 months is a good start and you can parameterize it), It's the matter of someone committing to it and doing it simply :).
>> > > > > > > >
>> > > > > > > > So yes I personally am all for it and if I understand correctly that you are looking for agreement on doing it, big +1 from my side - happy to help with providing access to our S3 buckets.
>> > > > > > > >
>> > > > > > > > J.
>> > > > > > > >
>> > > > > > > > On Thu, Oct 19, 2023 at 5:39 AM Ryan Hatter <ryan.hat...@astronomer.io.invalid> wrote:
>> > > > > > > >
>> > > > > > > > > *tl;dr*
>> > > > > > > > >
>> > > > > > > > > 1. The GitHub Action for building docs is running out of space. I think we should archive really old documentation for large packages to cloud storage.
>> > > > > > > > > 2. Contributing to and building Airflow docs is hard. We should migrate to a framework, preferably one that uses markdown (although I acknowledge rst -> md will be a massive overhaul).
>> > > > > > > > >
>> > > > > > > > > *Problem Summary*
>> > > > > > > > > I recently set out to implement what I thought would be a straightforward feature: warn users when they are viewing documentation for non-current versions of Airflow and link them to the current/stable version <https://github.com/apache/airflow/pull/34639>. Jed pointed me to the airflow-site <https://github.com/apache/airflow-site> repo, which contains all of the archived docs (that is, documentation for non-current versions), and from there, I ran into a brick wall.
>> > > > > > > > >
>> > > > > > > > > I want to raise some concerns that I've developed after trying to contribute what feel like a couple reasonably small docs updates:
>> > > > > > > > >
>> > > > > > > > > 1. airflow-site
>> > > > > > > > >    1. Elad pointed out the problem posed by the sheer size of archived docs <https://apache-airflow.slack.com/archives/CCPRP7943/p1697009000242369?thread_ts=1696973512.004229&cid=CCPRP7943> (more on this later).
>> > > > > > > > >    2. The airflow-site repo is confusing, and rather poorly documented.
>> > > > > > > > >       1. Hugo (static site generator) exists, but appears to only be used for the landing pages
>> > > > > > > > >       2. In order to view any documentation locally other than the landing pages, you'll need to run the site.sh script then copy the output from one dir to another?
>> > > > > > > > >    3. All of the archived docs are raw HTML, making migrating to a static site generator a significant challenge, which makes it difficult to prevent the archived docs from continuing to grow and grow. Perhaps this is the wheel Khaleesi was referring to <https://www.youtube.com/watch?v=J-rxmk6zPxA>?
>> > > > > > > > > 2. airflow
>> > > > > > > > >    1. Building Airflow docs is a challenge.
>> > > > > > > > >       It takes several minutes and doesn't support auto-build, so the slightest issue could require waiting again and again until the changes are just so. I tried implementing sphinx-autobuild <https://github.com/executablebooks/sphinx-autobuild> to no avail.
>> > > > > > > > >    2. Sphinx/restructured text has a steep learning curve.
>> > > > > > > > >
>> > > > > > > > > *The most acute issue: disk space*
>> > > > > > > > > The size of the archived docs is causing the docs build GitHub Action to almost run out of space. From the "Build site" Action from a couple weeks ago <https://github.com/apache/airflow-site/actions/runs/6419529645/job/17432628458> (expand the build site step, scroll all the way to the bottom, expand the `df -h` command), we can see the GitHub Action runner (or whatever it's called) is nearly running out of space:
>> > > > > > > > >
>> > > > > > > > > df -h
>> > > > > > > > > *Filesystem      Size  Used Avail Use% Mounted on*
>> > > > > > > > > /dev/root        84G   82G  2.1G  98% /
>> > > > > > > > >
>> > > > > > > > > The available space is down to 1.8G on the most recent Action <https://github.com/apache/airflow-site/actions/runs/6564727255/job/17831714176>.
>> > > > > > > > > If we assume that trend is accurate, we have about two months before the Action runner runs out of disk space. Here's a breakdown of the space consumed by the 10 largest package documentation directories:
>> > > > > > > > >
>> > > > > > > > > du -h -d 1 docs-archive/ | sort -h -r
>> > > > > > > > > * 14G* docs-archive/
>> > > > > > > > > *4.0G* docs-archive//apache-airflow-providers-google
>> > > > > > > > > *3.2G* docs-archive//apache-airflow
>> > > > > > > > > *1.7G* docs-archive//apache-airflow-providers-amazon
>> > > > > > > > > *560M* docs-archive//apache-airflow-providers-microsoft-azure
>> > > > > > > > > *254M* docs-archive//apache-airflow-providers-cncf-kubernetes
>> > > > > > > > > *192M* docs-archive//apache-airflow-providers-apache-hive
>> > > > > > > > > *153M* docs-archive//apache-airflow-providers-snowflake
>> > > > > > > > > *139M* docs-archive//apache-airflow-providers-databricks
>> > > > > > > > > *104M* docs-archive//apache-airflow-providers-docker
>> > > > > > > > > *101M* docs-archive//apache-airflow-providers-mysql
>> > > > > > > > >
>> > > > > > > > > *Proposed solution: Archive old docs html for large packages to cloud storage*
>> > > > > > > > > I'm wondering if it would be reasonable to truly archive the docs for some of the older versions of these packages. Perhaps the last 18 months? Maybe we could drop the html in a blob storage bucket with instructions for building the docs if absolutely necessary?
>> > > > > > > > >
>> > > > > > > > > *Improving docs building moving forward*
>> > > > > > > > > There's an open Issue <https://github.com/apache/airflow-site/issues/719> for migrating the docs to a framework, but it's not at all a straightforward task for the archived docs. I think that we should institute a policy of archiving old documentation to cloud storage after X time and use a framework for building docs in a scalable and sustainable way moving forward. Maybe we could chat with iceberg folks about how they moved from mkdocs to hugo? <https://github.com/apache/iceberg/issues/3616>
>> > > > > > > > >
>> > > > > > > > > Shoutout to Utkarsh for helping me through all this!
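
For reference, the redirect-stub idea Jarek describes in the thread (replace each archived HTML page with a tiny meta-refresh page pointing at cloud storage) could be sketched roughly as below. This is only an illustration under assumptions, not how airflow-site actually does it: `ARCHIVE_BASE_URL`, `redirect_stub` and `write_redirects` are made-up names, the base URL is the placeholder used in the thread, and the script assumes the archived tree is mirrored under the same relative paths on the new host.

```python
from html import escape
from pathlib import Path
import tempfile

# Placeholder host taken from the thread; the real archive URL is not known.
ARCHIVE_BASE_URL = "https://new-archive-docs-airflow-url"


def redirect_stub(target_url: str) -> str:
    """Return a minimal HTML page that immediately redirects to target_url."""
    safe = escape(target_url, quote=True)  # keep the attribute value well-formed
    return (
        "<html><head>"
        f'<meta http-equiv="refresh" content="0; url={safe}"/>'
        "</head></html>\n"
    )


def write_redirects(docs_root: Path, out_root: Path,
                    base_url: str = ARCHIVE_BASE_URL) -> int:
    """Mirror every *.html file under docs_root as a redirect stub in out_root.

    Returns the number of stubs written. out_root is assumed (hypothetically)
    to be a separate directory, not nested inside docs_root.
    """
    count = 0
    for page in sorted(docs_root.rglob("*.html")):
        rel = page.relative_to(docs_root)
        stub = out_root / rel
        stub.parent.mkdir(parents=True, exist_ok=True)
        stub.write_text(redirect_stub(f"{base_url}/{rel.as_posix()}"),
                        encoding="utf-8")
        count += 1
    return count


if __name__ == "__main__":
    # Tiny demo on a throwaway directory instead of the real docs-archive/.
    with tempfile.TemporaryDirectory() as tmp:
        src, out = Path(tmp) / "docs-archive", Path(tmp) / "redirects"
        page = src / "apache-airflow" / "2.0.0" / "index.html"
        page.parent.mkdir(parents=True)
        page.write_text("<html>full page</html>", encoding="utf-8")
        print(write_redirects(src, out), "redirect stub(s) written")
```

Each stub is well under a kilobyte, which is in line with the estimate in the thread that the redirects for all historical pages would come to roughly 20 MB.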