The culprit seems to be the PySpark documentation, which grew about 11x
in 3.5+.

$ du -h 3.4.3/api/python | tail -n1
 84M 3.4.3/api/python

$ du -h 3.5.1/api/python | tail -n1
943M 3.5.1/api/python

Since we will generate big documents for 3.5.x, 4.0.0-preview, 4.0.x,
4.1.x, the proposed tarball idea sounds promising to me too.

$ ls -alh 3.5.1.tgz
-rw-r--r--  1 dongjoon  staff   103M Aug  8 10:22 3.5.1.tgz
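
For reference, here is a minimal sketch of how one version directory could be packed and verified before the expanded HTML is removed. It is not an agreed procedure: the temporary directory, the generated sample pages, and the choice of 3.5.0 as the directory name are placeholders for illustration only.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-in for a docs checkout; a real run would start in the
# directory that holds the per-version folders.
workdir=$(mktemp -d)
mkdir -p "$workdir/3.5.0/api/python"

# Generate repetitive HTML-like sample pages (placeholder content only).
i=0
while [ "$i" -lt 200 ]; do
  printf '<html><body><p>page %s</p></body></html>\n' "$i" \
    > "$workdir/3.5.0/api/python/page$i.html"
  i=$((i + 1))
done

# Pack the version directory into a gzip-compressed tarball next to it.
tar -czf "$workdir/3.5.0.tgz" -C "$workdir" 3.5.0

# Confirm the archive is readable and count the HTML entries it contains
# before deleting anything.
entries=$(tar -tzf "$workdir/3.5.0.tgz" | grep -c '\.html$')
echo "archived HTML files: $entries"

# Compare the expanded size with the compressed size, then drop the
# expanded copy.
du -sh "$workdir/3.5.0" "$workdir/3.5.0.tgz"
rm -rf "$workdir/3.5.0"
```

Listing the archive before the `rm -rf` is a cheap guard against deleting docs whose tarball was truncated or never written.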

Specifically, shall we keep HTML files for only the latest version of live
releases, e.g. 3.4.3, 3.5.1, and 4.0.0-preview1?

In other words, all of 0.x ~ 3.4.2 and 3.5.0 would become tarball files
as things stand today.
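
A rough sketch of that selection rule, assuming GNU `sort -V` is available for version ordering. The `live` list of feature lines and the sample `versions` list are assumptions for illustration, not an official list of supported releases.

```shell
#!/bin/sh
set -eu

# Sample version directory names (a subset of the listing in this thread).
versions="3.3.4 3.4.2 3.4.3 3.5.0 3.5.1 4.0.0-preview1"

# Assumption: feature lines treated as "live" for the purpose of this sketch.
live="3.4 3.5 4.0"

# Sort by version, then remember the last (highest) entry seen for each
# live major.minor line.
keep=$(printf '%s\n' $versions | sort -V |
  awk -F. -v live="$live" '
    BEGIN { n = split(live, a, " "); for (i = 1; i <= n; i++) is_live[a[i]] = 1 }
    { key = $1 "." $2; if (key in is_live) latest[key] = $0 }
    END { for (k in latest) print latest[k] }' |
  sort -V)

# Every version that is not in the keep list is a tarball candidate.
archive=$(printf '%s\n' $versions | grep -v -x -F -e "$keep" | sort -V)

echo "keep as HTML:"
printf '  %s\n' $keep
echo "archive as tarball:"
printf '  %s\n' $archive
```

Keeping the live lines as an explicit list avoids guessing end-of-life status from the directory names alone.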

Dongjoon.


On Thu, Aug 8, 2024 at 10:01 AM Sean Owen <sro...@gmail.com> wrote:

> I agree with 'archiving', but what does that mean? delete from the repo
> and site?
> While I really doubt people are looking for docs for, say, 0.5.0, it'd be
> a big jump to totally remove it.
>
> What if we made a compressed tarball of old docs and put that in the repo,
> linked to it, and removed the docs files for many old releases?
> It's still in the repo and will be in the container when docs are built,
> but, compressed would be much smaller.
> That could buy a significant amount of time.
>
> On Thu, Aug 8, 2024 at 7:06 AM Kent Yao <y...@apache.org> wrote:
>
>> Hi dev,
>>
>> The current size of the spark-website repository is approximately 16GB,
>> exceeding the storage limit of GitHub-hosted runners[1]. The GitHub
>> Actions runs have been failing recently in the actions/checkout step
>> with 'No space left on device' errors.
>>
>> Filesystem      Size  Used Avail Use% Mounted on
>> overlay          73G   58G   16G  80% /
>> tmpfs            64M     0   64M   0% /dev
>> tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
>> shm              64M     0   64M   0% /dev/shm
>> /dev/root        73G   58G   16G  80% /__w
>> tmpfs           1.6G  1.2M  1.6G   1% /run/docker.sock
>> tmpfs           7.9G     0  7.9G   0% /proc/acpi
>> tmpfs           7.9G     0  7.9G   0% /proc/scsi
>> tmpfs           7.9G     0  7.9G   0% /sys/firmware
>>
>>
>> The documentation for each version contributes the most volume. Since
>> version 3.5.0, the documentation size has grown to 3-4 times the size
>> of 3.4.x, exceeding 1GB per version.
>>
>>
>> 9.9M ./0.6.0
>>  10M ./0.6.1
>>  10M ./0.6.2
>>  15M ./0.7.0
>>  16M ./0.7.2
>>  16M ./0.7.3
>>  20M ./0.8.0
>>  20M ./0.8.1
>>  38M ./0.9.0
>>  38M ./0.9.1
>>  38M ./0.9.2
>>  36M ./1.0.0
>>  38M ./1.0.1
>>  38M ./1.0.2
>>  48M ./1.1.0
>>  48M ./1.1.1
>>  73M ./1.2.0
>>  73M ./1.2.1
>>  74M ./1.2.2
>>  69M ./1.3.0
>>  73M ./1.3.1
>>  68M ./1.4.0
>>  70M ./1.4.1
>>  80M ./1.5.0
>>  78M ./1.5.1
>>  78M ./1.5.2
>>  87M ./1.6.0
>>  87M ./1.6.1
>>  87M ./1.6.2
>>  86M ./1.6.3
>> 117M ./2.0.0
>> 119M ./2.0.0-preview
>> 118M ./2.0.1
>> 118M ./2.0.2
>> 121M ./2.1.0
>> 121M ./2.1.1
>> 122M ./2.1.2
>> 122M ./2.1.3
>> 130M ./2.2.0
>> 131M ./2.2.1
>> 132M ./2.2.2
>> 131M ./2.2.3
>> 141M ./2.3.0
>> 141M ./2.3.1
>> 141M ./2.3.2
>> 142M ./2.3.3
>> 142M ./2.3.4
>> 145M ./2.4.0
>> 146M ./2.4.1
>> 145M ./2.4.2
>> 144M ./2.4.3
>> 145M ./2.4.4
>> 143M ./2.4.5
>> 143M ./2.4.6
>> 143M ./2.4.7
>> 143M ./2.4.8
>> 197M ./3.0.0
>> 185M ./3.0.0-preview
>> 197M ./3.0.0-preview2
>> 198M ./3.0.1
>> 198M ./3.0.2
>> 205M ./3.0.3
>> 239M ./3.1.1
>> 239M ./3.1.2
>> 239M ./3.1.3
>> 840M ./3.2.0
>> 842M ./3.2.1
>> 282M ./3.2.2
>> 244M ./3.2.3
>> 282M ./3.2.4
>> 295M ./3.3.0
>> 297M ./3.3.1
>> 297M ./3.3.2
>> 297M ./3.3.3
>> 297M ./3.3.4
>> 314M ./3.4.0
>> 314M ./3.4.1
>> 328M ./3.4.2
>> 324M ./3.4.3
>> 1.1G ./3.5.0
>> 1.2G ./3.5.1
>> 1.1G ./4.0.0-preview1
>>
>> I'm concerned about publishing the documentation for version 3.5.2
>> to the asf-site. So, I have merged PR[2] to eliminate this potential
>> blocker.
>>
>> Considering that the problem still exists, should we temporarily archive
>> some of the outdated version documents? For example, we could keep only
>> the latest version for each feature release in the asf-site branch. Or
>> do you have any other suggestions?
>>
>>
>> Bests,
>> Kent Yao
>>
>>
>> [1]
>> https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories
>> [2] https://github.com/apache/spark-website/pull/543
>>
>>
>>
