The culprit seems to be the PySpark documentation, which grew about 11x in 3.5+.

$ du -h 3.4.3/api/python | tail -n1
84M     3.4.3/api/python
$ du -h 3.5.1/api/python | tail -n1
943M    3.5.1/api/python

Since we will generate big documents for 3.5.x, 4.0.0-preview, 4.0.x, and 4.1.x,
the proposed tarball idea sounds promising to me too.

$ ls -alh 3.5.1.tgz
-rw-r--r--  1 dongjoon  staff   103M Aug  8 10:22 3.5.1.tgz

Specifically, shall we keep HTML files for only the latest version of live
releases, e.g. 3.4.3, 3.5.1, and 4.0.0-preview1?
In other words, all 0.x ~ 3.4.2 and 3.5.0 will be tarball files in the current
status.

Dongjoon.

On Thu, Aug 8, 2024 at 10:01 AM Sean Owen <sro...@gmail.com> wrote:

> I agree with 'archiving', but what does that mean? delete from the repo
> and site?
> While I really doubt people are looking for docs for, say, 0.5.0, it'd be
> a big jump to totally remove it.
>
> What if we made a compressed tarball of old docs and put that in the repo,
> linked to it, and removed the docs files for many old releases?
> It's still in the repo and will be in the container when docs are built,
> but, compressed would be much smaller.
> That could buy a significant amount of time.
>
> On Thu, Aug 8, 2024 at 7:06 AM Kent Yao <y...@apache.org> wrote:
>
>> Hi dev,
>>
>> The current size of the spark-website repository is approximately 16GB,
>> exceeding the storage limit of GitHub-hosted runners. The GitHub actions
>> have been failing recently in the actions/checkout step caused by
>> 'No space left on device' errors.
>>
>> Filesystem      Size  Used Avail Use% Mounted on
>> overlay          73G   58G   16G  80% /
>> tmpfs            64M     0   64M   0% /dev
>> tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
>> shm              64M     0   64M   0% /dev/shm
>> /dev/root        73G   58G   16G  80% /__w
>> tmpfs           1.6G  1.2M  1.6G   1% /run/docker.sock
>> tmpfs           7.9G     0  7.9G   0% /proc/acpi
>> tmpfs           7.9G     0  7.9G   0% /proc/scsi
>> tmpfs           7.9G     0  7.9G   0% /sys/firmware
>>
>>
>> The documentation for each version contributes the most volume.
>> Since version 3.5.0, the documentation size has grown 3-4 times larger
>> than the size of 3.4.x, with more than 1GB.
>>
>>
>> 9.9M    ./0.6.0
>> 10M     ./0.6.1
>> 10M     ./0.6.2
>> 15M     ./0.7.0
>> 16M     ./0.7.2
>> 16M     ./0.7.3
>> 20M     ./0.8.0
>> 20M     ./0.8.1
>> 38M     ./0.9.0
>> 38M     ./0.9.1
>> 38M     ./0.9.2
>> 36M     ./1.0.0
>> 38M     ./1.0.1
>> 38M     ./1.0.2
>> 48M     ./1.1.0
>> 48M     ./1.1.1
>> 73M     ./1.2.0
>> 73M     ./1.2.1
>> 74M     ./1.2.2
>> 69M     ./1.3.0
>> 73M     ./1.3.1
>> 68M     ./1.4.0
>> 70M     ./1.4.1
>> 80M     ./1.5.0
>> 78M     ./1.5.1
>> 78M     ./1.5.2
>> 87M     ./1.6.0
>> 87M     ./1.6.1
>> 87M     ./1.6.2
>> 86M     ./1.6.3
>> 117M    ./2.0.0
>> 119M    ./2.0.0-preview
>> 118M    ./2.0.1
>> 118M    ./2.0.2
>> 121M    ./2.1.0
>> 121M    ./2.1.1
>> 122M    ./2.1.2
>> 122M    ./2.1.3
>> 130M    ./2.2.0
>> 131M    ./2.2.1
>> 132M    ./2.2.2
>> 131M    ./2.2.3
>> 141M    ./2.3.0
>> 141M    ./2.3.1
>> 141M    ./2.3.2
>> 142M    ./2.3.3
>> 142M    ./2.3.4
>> 145M    ./2.4.0
>> 146M    ./2.4.1
>> 145M    ./2.4.2
>> 144M    ./2.4.3
>> 145M    ./2.4.4
>> 143M    ./2.4.5
>> 143M    ./2.4.6
>> 143M    ./2.4.7
>> 143M    ./2.4.8
>> 197M    ./3.0.0
>> 185M    ./3.0.0-preview
>> 197M    ./3.0.0-preview2
>> 198M    ./3.0.1
>> 198M    ./3.0.2
>> 205M    ./3.0.3
>> 239M    ./3.1.1
>> 239M    ./3.1.2
>> 239M    ./3.1.3
>> 840M    ./3.2.0
>> 842M    ./3.2.1
>> 282M    ./3.2.2
>> 244M    ./3.2.3
>> 282M    ./3.2.4
>> 295M    ./3.3.0
>> 297M    ./3.3.1
>> 297M    ./3.3.2
>> 297M    ./3.3.3
>> 297M    ./3.3.4
>> 314M    ./3.4.0
>> 314M    ./3.4.1
>> 328M    ./3.4.2
>> 324M    ./3.4.3
>> 1.1G    ./3.5.0
>> 1.2G    ./3.5.1
>> 1.1G    ./4.0.0-preview1
>>
>> I'm concerned about publishing the documentation for version 3.5.2
>> to the asf-site. So, I have merged PR[2] to eliminate this potential
>> blocker.
>>
>> Considering that the problem still exists, should we temporarily archive
>> some of the outdated version documents? For example, only keep
>> the latest version for each feature release in the asf-site branch. Or,
>> do you have any other suggestions?
>>
>>
>> Bests,
>> Kent Yao
>>
>>
>> [1]
>> https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories
>> [2] https://github.com/apache/spark-website/pull/543
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
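
For illustration only, here is a minimal Python sketch of the keep/archive split discussed in this thread: keep HTML for the newest version of each live release line and turn every other version directory into a compressed tarball. The function names (parse_version, split_keep_archive, archive_version), the set of live lines, and the "<version>.tgz" naming are assumptions made for this sketch, not an existing spark-website script. It deliberately does not delete anything; removing the original directories is left to the caller.

```python
import re
import tarfile
from collections import defaultdict
from pathlib import Path


def parse_version(name):
    """Turn '3.5.1' or '4.0.0-preview1' into a sortable key, or None if it is not a version."""
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:-(.+))?", name)
    if m is None:
        return None
    major, minor, patch, suffix = m.groups()
    # A final release (no suffix) sorts after any preview of the same version.
    return (int(major), int(minor), int(patch), suffix is None, suffix or "")


def split_keep_archive(names, live_lines):
    """Return (keep, archive); keep only the newest version of each live (major, minor) line."""
    by_line = defaultdict(list)
    for name in names:
        key = parse_version(name)
        if key is not None:
            by_line[key[:2]].append((key, name))

    keep, archive = [], []
    for line, versions in by_line.items():
        versions.sort()
        newest = versions[-1][1]
        for _, name in versions:
            if line in live_lines and name == newest:
                keep.append(name)
            else:
                archive.append(name)
    return sorted(keep, key=parse_version), sorted(archive, key=parse_version)


def archive_version(docs_root, name):
    """Write <docs_root>/<name>.tgz from <docs_root>/<name>; the source directory is left in place."""
    src = Path(docs_root) / name
    dest = Path(docs_root) / f"{name}.tgz"
    with tarfile.open(dest, "w:gz") as tar:
        tar.add(src, arcname=name)
    return dest


if __name__ == "__main__":
    # Hypothetical inputs: a few directory names from the listing above,
    # and 3.4 / 3.5 / 4.0 assumed to be the live release lines.
    names = ["0.6.0", "3.4.2", "3.4.3", "3.5.0", "3.5.1", "4.0.0-preview1"]
    live = {(3, 4), (3, 5), (4, 0)}
    keep, archive = split_keep_archive(names, live)
    print("keep as HTML:", keep)
    print("archive as tarball:", archive)
```

Each name in the archive list could then be passed to archive_version(docs_root, name), where docs_root is wherever the versioned docs directories live in a local checkout.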