Whoa! Is there any clear reason why 3.5 docs are so big? 1GB of docs / 10x jump seems crazy. Maybe we need to investigate and fix that also.
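One way to start that investigation is to break a version's docs down by subdirectory and compare two releases. The following is only a minimal sketch: the function names are hypothetical, it assumes a local checkout where each version directory (e.g. 3.4.3 and 3.5.1) is available on disk, and it sums apparent file sizes rather than disk blocks, so the numbers will not match `du` exactly.

```python
import os
from pathlib import Path


def dir_size(path):
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def size_breakdown(version_dir):
    """Map each immediate subdirectory of version_dir to its size, largest first."""
    sizes = {
        entry.name: dir_size(entry)
        for entry in Path(version_dir).iterdir()
        if entry.is_dir()
    }
    return dict(sorted(sizes.items(), key=lambda item: item[1], reverse=True))


def growth_factors(old, new):
    """Ratio new/old for every subdirectory present in both breakdowns."""
    return {
        name: new[name] / old[name]
        for name in new
        if name in old and old[name] > 0
    }
```

Calling `growth_factors(size_breakdown("3.4.3"), size_breakdown("3.5.1"))` would then show which subtree accounts for the jump.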
I take it that the problem is the size of the repo once it's cloned into the docker container. Removing the .html files helps with that, but then we don't have .html docs in the published site! We can generate them in the build process, but I presume it takes waaay too long to rebuild docs for every release every time.

I do support at *least* tarring up old .html docs from old releases (<3.0?) and making them available somehow on the site, so that they're accessible if needed.

Analytics says that page views for docs before 3.1 are quite minimal, probably hundreds of views this year at best vs 10M total views:
https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?idSite=40&period=year&date=2024-08-07&category=General_Actions&subcategory=General_Pages

On Thu, Aug 8, 2024 at 12:42 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> The culprit seems to be PySpark 3.5 documentation, which grows 11x at 3.5+
>
> $ du -h 3.4.3/api/python | tail -n1
> 84M 3.4.3/api/python
>
> $ du -h 3.5.1/api/python | tail -n1
> 943M 3.5.1/api/python
>
> Since we will generate big documents for 3.5.x, 4.0.0-preview, 4.0.x,
> 4.1.x, the proposed tarball idea sounds promising to me too.
>
> $ ls -alh 3.5.1.tgz
> -rw-r--r-- 1 dongjoon staff 103M Aug 8 10:22 3.5.1.tgz
>
> Specifically, shall we keep HTML files for only the latest version of live
> releases, e.g. 3.4.3, 3.5.1, and 4.0.0-preview1?
>
> In other words, all 0.x ~ 3.4.2 and 3.5.1 will be tarball files in the
> current status.
>
> Dongjoon.
>
>
> On Thu, Aug 8, 2024 at 10:01 AM Sean Owen <sro...@gmail.com> wrote:
>
>> I agree with 'archiving', but what does that mean? delete from the repo
>> and site?
>> While I really doubt people are looking for docs for, say, 0.5.0, it'd be
>> a big jump to totally remove it.
>>
>> What if we made a compressed tarball of old docs and put that in the
>> repo, linked to it, and removed the docs files for many old releases?
>> It's still in the repo and will be in the container when docs are built,
>> but, compressed would be much smaller.
>> That could buy a significant amount of time.
>>
>> On Thu, Aug 8, 2024 at 7:06 AM Kent Yao <y...@apache.org> wrote:
>>
>>> Hi dev,
>>>
>>> The current size of the spark-website repository is approximately 16GB,
>>> exceeding the storage limit of GitHub-hosted runners. The GitHub actions
>>> have been failing recently in the actions/checkout step caused by
>>> 'No space left on device' errors.
>>>
>>> Filesystem      Size  Used Avail Use% Mounted on
>>> overlay          73G   58G   16G  80% /
>>> tmpfs            64M     0   64M   0% /dev
>>> tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
>>> shm              64M     0   64M   0% /dev/shm
>>> /dev/root        73G   58G   16G  80% /__w
>>> tmpfs           1.6G  1.2M  1.6G   1% /run/docker.sock
>>> tmpfs           7.9G     0  7.9G   0% /proc/acpi
>>> tmpfs           7.9G     0  7.9G   0% /proc/scsi
>>> tmpfs           7.9G     0  7.9G   0% /sys/firmware
>>>
>>>
>>> The documentation for each version contributes the most volume. Since
>>> version 3.5.0, the documentation size has grown 3-4 times larger than the
>>> size of 3.4.x, with more than 1GB.
>>>
>>>
>>> 9.9M   ./0.6.0
>>> 10M    ./0.6.1
>>> 10M    ./0.6.2
>>> 15M    ./0.7.0
>>> 16M    ./0.7.2
>>> 16M    ./0.7.3
>>> 20M    ./0.8.0
>>> 20M    ./0.8.1
>>> 38M    ./0.9.0
>>> 38M    ./0.9.1
>>> 38M    ./0.9.2
>>> 36M    ./1.0.0
>>> 38M    ./1.0.1
>>> 38M    ./1.0.2
>>> 48M    ./1.1.0
>>> 48M    ./1.1.1
>>> 73M    ./1.2.0
>>> 73M    ./1.2.1
>>> 74M    ./1.2.2
>>> 69M    ./1.3.0
>>> 73M    ./1.3.1
>>> 68M    ./1.4.0
>>> 70M    ./1.4.1
>>> 80M    ./1.5.0
>>> 78M    ./1.5.1
>>> 78M    ./1.5.2
>>> 87M    ./1.6.0
>>> 87M    ./1.6.1
>>> 87M    ./1.6.2
>>> 86M    ./1.6.3
>>> 117M   ./2.0.0
>>> 119M   ./2.0.0-preview
>>> 118M   ./2.0.1
>>> 118M   ./2.0.2
>>> 121M   ./2.1.0
>>> 121M   ./2.1.1
>>> 122M   ./2.1.2
>>> 122M   ./2.1.3
>>> 130M   ./2.2.0
>>> 131M   ./2.2.1
>>> 132M   ./2.2.2
>>> 131M   ./2.2.3
>>> 141M   ./2.3.0
>>> 141M   ./2.3.1
>>> 141M   ./2.3.2
>>> 142M   ./2.3.3
>>> 142M   ./2.3.4
>>> 145M   ./2.4.0
>>> 146M   ./2.4.1
>>> 145M   ./2.4.2
>>> 144M   ./2.4.3
>>> 145M   ./2.4.4
>>> 143M   ./2.4.5
>>> 143M   ./2.4.6
>>> 143M   ./2.4.7
>>> 143M   ./2.4.8
>>> 197M   ./3.0.0
>>> 185M   ./3.0.0-preview
>>> 197M   ./3.0.0-preview2
>>> 198M   ./3.0.1
>>> 198M   ./3.0.2
>>> 205M   ./3.0.3
>>> 239M   ./3.1.1
>>> 239M   ./3.1.2
>>> 239M   ./3.1.3
>>> 840M   ./3.2.0
>>> 842M   ./3.2.1
>>> 282M   ./3.2.2
>>> 244M   ./3.2.3
>>> 282M   ./3.2.4
>>> 295M   ./3.3.0
>>> 297M   ./3.3.1
>>> 297M   ./3.3.2
>>> 297M   ./3.3.3
>>> 297M   ./3.3.4
>>> 314M   ./3.4.0
>>> 314M   ./3.4.1
>>> 328M   ./3.4.2
>>> 324M   ./3.4.3
>>> 1.1G   ./3.5.0
>>> 1.2G   ./3.5.1
>>> 1.1G   ./4.0.0-preview1
>>>
>>> I'm concerned about publishing the documentation for version 3.5.2
>>> to the asf-site. So, I have merged PR[2] to eliminate this potential
>>> blocker.
>>>
>>> Considering that the problem still exists, should we temporarily archive
>>> some of the outdated version documents? For example, only keep
>>> the latest version for each feature release in the asf-site branch. Or,
>>> do you have any other suggestions?
>>>
>>>
>>> Bests,
>>> Kent Yao
>>>
>>>
>>> [1]
>>> https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories
>>> [2] https://github.com/apache/spark-website/pull/543
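
The two ideas in this thread, keeping only the latest version of each feature release and replacing the rest with compressed tarballs, can be sketched roughly as below. This is an illustration under assumptions, not the project's actual tooling: the function names are hypothetical, the directory layout (one directory per version under a common docs root) is assumed, and the selection rule follows the "latest version for each feature release" wording rather than the "live releases only" variant.

```python
import re
import tarfile
from pathlib import Path

# Matches names such as 3.5.1, 3.0.0-preview, 4.0.0-preview1.
_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-preview(\d*))?$")


def version_key(name):
    """Sort key that orders previews before the final release of the same version."""
    m = _VERSION.match(name)
    if m is None:
        raise ValueError(f"unrecognized version directory: {name}")
    major, minor, patch = (int(m.group(i)) for i in (1, 2, 3))
    preview = m.group(4)  # None for a final release, "" or digits for a preview
    is_final = 1 if preview is None else 0
    preview_num = int(preview) if preview else 0
    return (major, minor, patch, is_final, preview_num)


def split_keep_and_archive(versions):
    """Keep the newest entry per major.minor line; everything else gets archived."""
    latest = {}
    for name in versions:
        key = version_key(name)
        feature = key[:2]
        if feature not in latest or key > version_key(latest[feature]):
            latest[feature] = name
    keep = set(latest.values())
    archive = sorted((v for v in versions if v not in keep), key=version_key)
    return sorted(keep, key=version_key), archive


def archive_version(docs_root, version, out_dir):
    """Write <out_dir>/<version>.tgz containing the version's docs directory."""
    src = Path(docs_root) / version
    out = Path(out_dir) / f"{version}.tgz"
    with tarfile.open(str(out), "w:gz") as tar:
        tar.add(str(src), arcname=version)
    return out
```

For scale, the 3.5.1 tarball quoted above came to 103M, against roughly 1.2G for the uncompressed 3.5.1 docs directory.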