Whoa! Is there any clear reason why the 3.5 docs are so big? 1GB of docs, a
10x jump, seems crazy. Maybe we need to investigate and fix that too.
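
One rough way to start investigating: list the largest first-level
subdirectories of a docs tree, then compare a 3.4.x tree with a 3.5.x one.
This is only a sketch; the function name largest_subdirs is made up for
illustration, and it has not been run against the actual repo.

```shell
# Sketch only: print the N largest first-level subdirectories of a tree.
# "largest_subdirs" is a hypothetical helper name, not an existing tool.
largest_subdirs() {
  local root="$1" count="${2:-10}"
  # du -k reports sizes in 1K blocks; -d 1 limits output to one level.
  # The first line of output is the total for the root directory itself.
  du -k -d 1 "$root" | sort -rn | head -n "$count"
}
```

Running `largest_subdirs 3.4.3/api/python` and
`largest_subdirs 3.5.1/api/python` side by side should show which subtree
accounts for the growth.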

I take it that the problem is the size of the repo once it's cloned into
the docker container. Removing the .html files helps with that, but then we
don't have .html docs in the published site!
We could generate them in the build process, but I presume it would take
waaay too long to rebuild the docs for every release every time.

I do support at *least* tarring up old .html docs from old releases (<3.0?)
and making them available somehow on the site, so that they're accessible
if needed.
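
A minimal sketch of what that could look like. The directory layout (one
subdirectory per version), the archive location, the function name, and the
"major version < 3" cutoff are all assumptions for illustration, not
anything decided in this thread.

```shell
# Sketch only: tar up docs for old major versions and remove the originals.
# archive_old_docs, its arguments, and the cutoff are hypothetical.
archive_old_docs() {
  local docs_dir="$1" archive_dir="$2" cutoff_major="${3:-3}"
  local dir version major
  mkdir -p "$archive_dir" || return 1
  for dir in "$docs_dir"/*/; do
    [ -d "$dir" ] || continue
    version="$(basename "$dir")"
    major="${version%%.*}"
    # Skip directories that are not numeric versions (e.g. "latest").
    case "$major" in ''|*[!0-9]*) continue ;; esac
    if [ "$major" -lt "$cutoff_major" ]; then
      # Delete the expanded docs only if the tarball was written successfully.
      tar -czf "$archive_dir/$version.tgz" -C "$docs_dir" "$version" \
        && rm -rf "${dir%/}"
    fi
  done
}
```

For example, `archive_old_docs site/docs site/docs-archive 3` would leave
3.x and later untouched and replace each older version directory with a
`<version>.tgz` file that the site could link to (both paths here are
placeholders).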

Analytics says that page views for docs before 3.1 are quite minimal,
probably hundreds of views this year at best vs 10M total views:
https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?idSite=40&period=year&date=2024-08-07&category=General_Actions&subcategory=General_Pages

On Thu, Aug 8, 2024 at 12:42 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
wrote:

> The culprit seems to be the PySpark documentation, which has grown 11x
> as of 3.5+.
>
> $ du -h 3.4.3/api/python | tail -n1
>  84M 3.4.3/api/python
>
> $ du -h 3.5.1/api/python | tail -n1
> 943M 3.5.1/api/python
>
> Since we will generate big documents for 3.5.x, 4.0.0-preview, 4.0.x,
> 4.1.x, the proposed tarball idea sounds promising to me too.
>
> $ ls -alh 3.5.1.tgz
> -rw-r--r--  1 dongjoon  staff   103M Aug  8 10:22 3.5.1.tgz
>
> Specifically, shall we keep HTML files for only the latest version of live
> releases, e.g. 3.4.3, 3.5.1, and 4.0.0-preview1?
>
> In other words, all of 0.x ~ 3.4.2 and 3.5.0 would become tarball files
> as things currently stand.
>
> Dongjoon.
>
>
> On Thu, Aug 8, 2024 at 10:01 AM Sean Owen <sro...@gmail.com> wrote:
>
>> I agree with 'archiving', but what does that mean? delete from the repo
>> and site?
>> While I really doubt people are looking for docs for, say, 0.5.0, it'd be
>> a big jump to totally remove it.
>>
>> What if we made a compressed tarball of old docs and put that in the
>> repo, linked to it, and removed the docs files for many old releases?
>> It's still in the repo and will be in the container when docs are built,
>> but, compressed would be much smaller.
>> That could buy a significant amount of time.
>>
>> On Thu, Aug 8, 2024 at 7:06 AM Kent Yao <y...@apache.org> wrote:
>>
>>> Hi dev,
>>>
>>> The current size of the spark-website repository is approximately 16GB,
>>> exceeding the storage limit of GitHub-hosted runners [1]. The GitHub
>>> Actions have been failing recently in the actions/checkout step with
>>> 'No space left on device' errors.
>>>
>>> Filesystem      Size  Used Avail Use% Mounted on
>>> overlay          73G   58G   16G  80% /
>>> tmpfs            64M     0   64M   0% /dev
>>> tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
>>> shm              64M     0   64M   0% /dev/shm
>>> /dev/root        73G   58G   16G  80% /__w
>>> tmpfs           1.6G  1.2M  1.6G   1% /run/docker.sock
>>> tmpfs           7.9G     0  7.9G   0% /proc/acpi
>>> tmpfs           7.9G     0  7.9G   0% /proc/scsi
>>> tmpfs           7.9G     0  7.9G   0% /sys/firmware
>>>
>>>
>>> The documentation for each version contributes the most volume. Since
>>> version 3.5.0, the documentation size has grown to 3-4 times the size
>>> of 3.4.x, at more than 1GB per version.
>>>
>>>
>>> 9.9M ./0.6.0
>>>  10M ./0.6.1
>>>  10M ./0.6.2
>>>  15M ./0.7.0
>>>  16M ./0.7.2
>>>  16M ./0.7.3
>>>  20M ./0.8.0
>>>  20M ./0.8.1
>>>  38M ./0.9.0
>>>  38M ./0.9.1
>>>  38M ./0.9.2
>>>  36M ./1.0.0
>>>  38M ./1.0.1
>>>  38M ./1.0.2
>>>  48M ./1.1.0
>>>  48M ./1.1.1
>>>  73M ./1.2.0
>>>  73M ./1.2.1
>>>  74M ./1.2.2
>>>  69M ./1.3.0
>>>  73M ./1.3.1
>>>  68M ./1.4.0
>>>  70M ./1.4.1
>>>  80M ./1.5.0
>>>  78M ./1.5.1
>>>  78M ./1.5.2
>>>  87M ./1.6.0
>>>  87M ./1.6.1
>>>  87M ./1.6.2
>>>  86M ./1.6.3
>>> 117M ./2.0.0
>>> 119M ./2.0.0-preview
>>> 118M ./2.0.1
>>> 118M ./2.0.2
>>> 121M ./2.1.0
>>> 121M ./2.1.1
>>> 122M ./2.1.2
>>> 122M ./2.1.3
>>> 130M ./2.2.0
>>> 131M ./2.2.1
>>> 132M ./2.2.2
>>> 131M ./2.2.3
>>> 141M ./2.3.0
>>> 141M ./2.3.1
>>> 141M ./2.3.2
>>> 142M ./2.3.3
>>> 142M ./2.3.4
>>> 145M ./2.4.0
>>> 146M ./2.4.1
>>> 145M ./2.4.2
>>> 144M ./2.4.3
>>> 145M ./2.4.4
>>> 143M ./2.4.5
>>> 143M ./2.4.6
>>> 143M ./2.4.7
>>> 143M ./2.4.8
>>> 197M ./3.0.0
>>> 185M ./3.0.0-preview
>>> 197M ./3.0.0-preview2
>>> 198M ./3.0.1
>>> 198M ./3.0.2
>>> 205M ./3.0.3
>>> 239M ./3.1.1
>>> 239M ./3.1.2
>>> 239M ./3.1.3
>>> 840M ./3.2.0
>>> 842M ./3.2.1
>>> 282M ./3.2.2
>>> 244M ./3.2.3
>>> 282M ./3.2.4
>>> 295M ./3.3.0
>>> 297M ./3.3.1
>>> 297M ./3.3.2
>>> 297M ./3.3.3
>>> 297M ./3.3.4
>>> 314M ./3.4.0
>>> 314M ./3.4.1
>>> 328M ./3.4.2
>>> 324M ./3.4.3
>>> 1.1G ./3.5.0
>>> 1.2G ./3.5.1
>>> 1.1G ./4.0.0-preview1
>>>
>>> I'm concerned about publishing the documentation for version 3.5.2
>>> to the asf-site. So, I have merged PR[2] to eliminate this potential
>>> blocker.
>>>
>>> Considering that the problem still exists, should we temporarily archive
>>> the documentation for some outdated versions? For example, we could keep
>>> only the latest version of each feature release in the asf-site branch.
>>> Or do you have any other suggestions?
>>>
>>>
>>> Bests,
>>> Kent Yao
>>>
>>>
>>> [1]
>>> https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories
>>> [2] https://github.com/apache/spark-website/pull/543
>>>
