Ya, I agree that we need to investigate what happened with the PySpark 3.5+ docs.

For old Spark docs, it seems to be negligible.

- All Spark 0.x docs: 231M
- All Spark 1.x docs: 1.3G
- All Spark 2.x docs: 3.4G

For example, the total size of all the old Spark docs above is less than that of the docs for the following four releases.

1.1G ./3.5.0
1.2G ./3.5.1
1.2G ./3.5.2 RC2
1.1G ./4.0.0-preview1

So, if we do start something, we had better focus on the latest docs first, in reverse order.

Dongjoon

On Thu, Aug 8, 2024 at 11:22 AM Sean Owen <sro...@gmail.com> wrote:

> Whoa! Is there any clear reason why 3.5 docs are so big? 1GB of docs / 10x
> jump seems crazy. Maybe we need to investigate and fix that also.
>
> I take it that the problem is the size of the repo once it's cloned into
> the docker container. Removing the .html files helps that, but, then we
> don't have .html docs in the published site!
> We can generate them in the build process, but I presume it's waaay too
> long to rebuild docs for every release every time.
>
> I do support at *least* tarring up old .html docs from old releases
> (<3.0?) and making them available somehow on the site, so that they're
> accessible if needed.
>
> Analytics says that page views for docs before 3.1 are quite minimal,
> probably hundreds of views this year at best vs 10M total views:
>
> https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?idSite=40&period=year&date=2024-08-07&category=General_Actions&subcategory=General_Pages
>
> On Thu, Aug 8, 2024 at 12:42 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> The culprit seems to be the PySpark 3.5 documentation, which grew 11x
>> at 3.5+
>>
>> $ du -h 3.4.3/api/python | tail -n1
>> 84M 3.4.3/api/python
>>
>> $ du -h 3.5.1/api/python | tail -n1
>> 943M 3.5.1/api/python
>>
>> Since we will generate big documents for 3.5.x, 4.0.0-preview, 4.0.x,
>> 4.1.x, the proposed tarball idea sounds promising to me too.
>>
>> $ ls -alh 3.5.1.tgz
>> -rw-r--r-- 1 dongjoon staff 103M Aug 8 10:22 3.5.1.tgz
>>
>> Specifically, shall we keep HTML files for only the latest version of
>> live releases, e.g. 3.4.3, 3.5.1, and 4.0.0-preview1?
>>
>> In other words, all 0.x ~ 3.4.2 and 3.5.1 will be tarball files in the
>> current status.
>>
>> Dongjoon.
>>
>>
>> On Thu, Aug 8, 2024 at 10:01 AM Sean Owen <sro...@gmail.com> wrote:
>>
>>> I agree with 'archiving', but what does that mean? delete from the repo
>>> and site?
>>> While I really doubt people are looking for docs for, say, 0.5.0, it'd
>>> be a big jump to totally remove it.
>>>
>>> What if we made a compressed tarball of old docs and put that in the
>>> repo, linked to it, and removed the docs files for many old releases?
>>> It's still in the repo and will be in the container when docs are built,
>>> but, compressed would be much smaller.
>>> That could buy a significant amount of time.
>>>
>>> On Thu, Aug 8, 2024 at 7:06 AM Kent Yao <y...@apache.org> wrote:
>>>
>>>> Hi dev,
>>>>
>>>> The current size of the spark-website repository is approximately 16GB,
>>>> exceeding the storage limit of GitHub-hosted runners. The GitHub actions
>>>> have been failing recently in the actions/checkout step caused by
>>>> 'No space left on device' errors.
>>>>
>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>> overlay          73G   58G   16G  80% /
>>>> tmpfs            64M     0   64M   0% /dev
>>>> tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
>>>> shm              64M     0   64M   0% /dev/shm
>>>> /dev/root        73G   58G   16G  80% /__w
>>>> tmpfs           1.6G  1.2M  1.6G   1% /run/docker.sock
>>>> tmpfs           7.9G     0  7.9G   0% /proc/acpi
>>>> tmpfs           7.9G     0  7.9G   0% /proc/scsi
>>>> tmpfs           7.9G     0  7.9G   0% /sys/firmware
>>>>
>>>>
>>>> The documentation for each version contributes the most volume. Since
>>>> version 3.5.0, the documentation size has grown 3-4 times larger than the
>>>> size of 3.4.x, with more than 1GB.
>>>>
>>>>
>>>> 9.9M ./0.6.0
>>>> 10M ./0.6.1
>>>> 10M ./0.6.2
>>>> 15M ./0.7.0
>>>> 16M ./0.7.2
>>>> 16M ./0.7.3
>>>> 20M ./0.8.0
>>>> 20M ./0.8.1
>>>> 38M ./0.9.0
>>>> 38M ./0.9.1
>>>> 38M ./0.9.2
>>>> 36M ./1.0.0
>>>> 38M ./1.0.1
>>>> 38M ./1.0.2
>>>> 48M ./1.1.0
>>>> 48M ./1.1.1
>>>> 73M ./1.2.0
>>>> 73M ./1.2.1
>>>> 74M ./1.2.2
>>>> 69M ./1.3.0
>>>> 73M ./1.3.1
>>>> 68M ./1.4.0
>>>> 70M ./1.4.1
>>>> 80M ./1.5.0
>>>> 78M ./1.5.1
>>>> 78M ./1.5.2
>>>> 87M ./1.6.0
>>>> 87M ./1.6.1
>>>> 87M ./1.6.2
>>>> 86M ./1.6.3
>>>> 117M ./2.0.0
>>>> 119M ./2.0.0-preview
>>>> 118M ./2.0.1
>>>> 118M ./2.0.2
>>>> 121M ./2.1.0
>>>> 121M ./2.1.1
>>>> 122M ./2.1.2
>>>> 122M ./2.1.3
>>>> 130M ./2.2.0
>>>> 131M ./2.2.1
>>>> 132M ./2.2.2
>>>> 131M ./2.2.3
>>>> 141M ./2.3.0
>>>> 141M ./2.3.1
>>>> 141M ./2.3.2
>>>> 142M ./2.3.3
>>>> 142M ./2.3.4
>>>> 145M ./2.4.0
>>>> 146M ./2.4.1
>>>> 145M ./2.4.2
>>>> 144M ./2.4.3
>>>> 145M ./2.4.4
>>>> 143M ./2.4.5
>>>> 143M ./2.4.6
>>>> 143M ./2.4.7
>>>> 143M ./2.4.8
>>>> 197M ./3.0.0
>>>> 185M ./3.0.0-preview
>>>> 197M ./3.0.0-preview2
>>>> 198M ./3.0.1
>>>> 198M ./3.0.2
>>>> 205M ./3.0.3
>>>> 239M ./3.1.1
>>>> 239M ./3.1.2
>>>> 239M ./3.1.3
>>>> 840M ./3.2.0
>>>> 842M ./3.2.1
>>>> 282M ./3.2.2
>>>> 244M ./3.2.3
>>>> 282M ./3.2.4
>>>> 295M ./3.3.0
>>>> 297M ./3.3.1
>>>> 297M ./3.3.2
>>>> 297M ./3.3.3
>>>> 297M ./3.3.4
>>>> 314M ./3.4.0
>>>> 314M ./3.4.1
>>>> 328M ./3.4.2
>>>> 324M ./3.4.3
>>>> 1.1G ./3.5.0
>>>> 1.2G ./3.5.1
>>>> 1.1G ./4.0.0-preview1
>>>>
>>>> I'm concerned about publishing the documentation for version 3.5.2
>>>> to the asf-site. So, I have merged PR[2] to eliminate this potential
>>>> blocker.
>>>>
>>>> Considering that the problem still exists, should we temporarily archive
>>>> some of the outdated version documents? For example, only keep
>>>> the latest version for each feature release in the asf-site branch. Or,
>>>> do you have any other suggestions?
>>>>
>>>>
>>>> Bests,
>>>> Kent Yao
>>>>
>>>>
>>>> [1] https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories
>>>> [2] https://github.com/apache/spark-website/pull/543
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
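
For anyone who wants to try the tarball idea from this thread locally, here is a minimal Python sketch. It is not the procedure the project adopted. It only illustrates measuring one version directory and packing it into a `<version>.tgz`, as in the `du` and `3.5.1.tgz` examples quoted above. The helper names `dir_size_bytes` and `archive_docs` and the tiny fake `0.6.0` tree in the demo are hypothetical, introduced purely for illustration. Note that the size helper sums apparent file sizes, so its numbers will differ somewhat from `du -h`, which reports disk blocks.

```python
import os
import tarfile
import tempfile
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the apparent sizes of regular files under root.

    Assumption: symlinks are skipped so nothing is counted twice.
    This is not identical to `du`, which counts allocated disk blocks.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():
                total += path.stat().st_size
    return total


def archive_docs(version_dir: Path, dest_dir: Path) -> Path:
    """Write <dest_dir>/<version>.tgz containing version_dir; return its path.

    Hypothetical helper: the original directory is left in place, so the
    caller decides whether and when to delete the uncompressed HTML.
    """
    archive = dest_dir / f"{version_dir.name}.tgz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the version name.
        tar.add(version_dir, arcname=version_dir.name)
    return archive


if __name__ == "__main__":
    # Demo on a throwaway tree; a real run would point at the docs checkout.
    with tempfile.TemporaryDirectory() as tmp:
        docs = Path(tmp) / "docs"
        version = docs / "0.6.0"
        (version / "api").mkdir(parents=True)
        (version / "api" / "index.html").write_text("<html></html>" * 500)

        before = dir_size_bytes(version)
        tgz = archive_docs(version, docs)
        print(f"{version.name}: {before} bytes of HTML, "
              f"{tgz.stat().st_size} bytes as {tgz.name}")
```

Keeping the archive step separate from any deletion means the compressed size can be compared against the original before the HTML is removed from the branch.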