Yes, I agree that we need to investigate what happened to the PySpark docs in 3.5+.

For the old Spark docs, the size seems to be negligible.

- All Spark 0.x docs:  231M
- All Spark 1.x docs: 1.3G
- All Spark 2.x docs: 3.4G

For example, the total size of all the old Spark docs above (about 4.9G) is
comparable to that of the following 4 releases' docs (about 4.6G).

1.1G ./3.5.0
1.2G ./3.5.1
1.2G ./3.5.2 RC2
1.1G ./4.0.0-preview1
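
As a rough check on this comparison, the rounded totals above can be summed
with a small shell helper. This is only a sketch: `sum_gib` is a hypothetical
helper name, it assumes `du -h` style suffixes (M, G) in binary units, and the
rounding already present in the inputs carries through to the result.

```shell
# Sum du -h style sizes (e.g. 231M, 1.3G) and print the total in GiB.
# sum_gib is a hypothetical helper; only the M and G suffixes are handled.
sum_gib() {
  printf '%s\n' "$@" | LC_ALL=C awk '
    /G$/ { total += substr($0, 1, length($0) - 1) }
    /M$/ { total += substr($0, 1, length($0) - 1) / 1024 }
    END  { printf "%.1f\n", total }'
}

old_total=$(sum_gib 231M 1.3G 3.4G)          # all 0.x, 1.x, 2.x docs
recent_total=$(sum_gib 1.1G 1.2G 1.2G 1.1G)  # 3.5.0, 3.5.1, 3.5.2 RC2, 4.0.0-preview1
echo "old: ${old_total}G, recent: ${recent_total}G"
# prints: old: 4.9G, recent: 4.6G
```

By these rounded figures, four recent releases take about as much space as
every 0.x, 1.x, and 2.x release combined.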

So, if we do start something, we had better focus on the latest docs first
and work backwards through the releases in reverse order.
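
One way to get such a list, newest release first, might look like the sketch
below. It assumes GNU `sort` (for `-V`, version sort) and one directory per
release under a common parent; `docs_newest_first` is a hypothetical helper
name, and the `site/docs` path in the usage comment is an assumption about the
checkout layout.

```shell
# List per-version doc directories with their sizes (KiB), newest version first.
# Assumes GNU sort for -V; the directory passed in is up to the caller.
docs_newest_first() {
  tab=$(printf '\t')
  # du -sk prints "<KiB><TAB><path>"; sort on the path field as a version string.
  du -sk "$1"/*/ | sort -t "$tab" -k2,2 -rV
}

# Hypothetical usage, from the root of a spark-website checkout:
# docs_newest_first site/docs | head -n 5
```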

Dongjoon

On Thu, Aug 8, 2024 at 11:22 AM Sean Owen <sro...@gmail.com> wrote:

> Whoa! Is there any clear reason why 3.5 docs are so big? 1GB of docs / 10x
> jump seems crazy. Maybe we need to investigate and fix that also.
>
> I take it that the problem is the size of the repo once it's cloned into
> the docker container. Removing the .html files helps that, but, then we
> don't have .html docs in the published site!
> We can generate them in the build process, but I presume it's waaay too
> long to rebuild docs for every release every time.
>
> I do support at *least* tarring up old .html docs from old releases
> (<3.0?) and making them available somehow on the site, so that they're
> accessible if needed.
>
> Analytics says that page views for docs before 3.1 are quite minimal,
> probably hundreds of views this year at best vs 10M total views:
>
> https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?idSite=40&period=year&date=2024-08-07&category=General_Actions&subcategory=General_Pages
>
> On Thu, Aug 8, 2024 at 12:42 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> The culprit seems to be the PySpark documentation, which grew about 11x
>> in 3.5+
>>
>> $ du -h 3.4.3/api/python | tail -n1
>>  84M 3.4.3/api/python
>>
>> $ du -h 3.5.1/api/python | tail -n1
>> 943M 3.5.1/api/python
>>
>> Since we will generate big documents for 3.5.x, 4.0.0-preview, 4.0.x,
>> 4.1.x, the proposed tarball idea sounds promising to me too.
>>
>> $ ls -alh 3.5.1.tgz
>> -rw-r--r--  1 dongjoon  staff   103M Aug  8 10:22 3.5.1.tgz
>>
>> Specifically, shall we keep HTML files for only the latest version of
>> live releases, e.g. 3.4.3, 3.5.1, and 4.0.0-preview1?
>>
>> In other words, all of 0.x ~ 3.4.2 and 3.5.0 would become tarball files in
>> the current status.
>>
>> Dongjoon.
>>
>>
>> On Thu, Aug 8, 2024 at 10:01 AM Sean Owen <sro...@gmail.com> wrote:
>>
>>> I agree with 'archiving', but what does that mean? delete from the repo
>>> and site?
>>> While I really doubt people are looking for docs for, say, 0.5.0, it'd
>>> be a big jump to totally remove it.
>>>
>>> What if we made a compressed tarball of old docs and put that in the
>>> repo, linked to it, and removed the docs files for many old releases?
>>> It's still in the repo and will be in the container when docs are built,
>>> but, compressed would be much smaller.
>>> That could buy a significant amount of time.
>>>
>>> On Thu, Aug 8, 2024 at 7:06 AM Kent Yao <y...@apache.org> wrote:
>>>
>>>> Hi dev,
>>>>
>>>> The current size of the spark-website repository is approximately 16GB,
>>>> exceeding the storage limit of GitHub-hosted runners [1]. The GitHub
>>>> Actions jobs have been failing recently in the actions/checkout step
>>>> with 'No space left on device' errors.
>>>>
>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>> overlay          73G   58G   16G  80% /
>>>> tmpfs            64M     0   64M   0% /dev
>>>> tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
>>>> shm              64M     0   64M   0% /dev/shm
>>>> /dev/root        73G   58G   16G  80% /__w
>>>> tmpfs           1.6G  1.2M  1.6G   1% /run/docker.sock
>>>> tmpfs           7.9G     0  7.9G   0% /proc/acpi
>>>> tmpfs           7.9G     0  7.9G   0% /proc/scsi
>>>> tmpfs           7.9G     0  7.9G   0% /sys/firmware
>>>>
>>>>
>>>> The documentation for each version contributes the most volume. Since
>>>> version 3.5.0, the documentation has grown to 3-4 times the size of
>>>> 3.4.x, exceeding 1GB per release.
>>>>
>>>>
>>>> 9.9M ./0.6.0
>>>>  10M ./0.6.1
>>>>  10M ./0.6.2
>>>>  15M ./0.7.0
>>>>  16M ./0.7.2
>>>>  16M ./0.7.3
>>>>  20M ./0.8.0
>>>>  20M ./0.8.1
>>>>  38M ./0.9.0
>>>>  38M ./0.9.1
>>>>  38M ./0.9.2
>>>>  36M ./1.0.0
>>>>  38M ./1.0.1
>>>>  38M ./1.0.2
>>>>  48M ./1.1.0
>>>>  48M ./1.1.1
>>>>  73M ./1.2.0
>>>>  73M ./1.2.1
>>>>  74M ./1.2.2
>>>>  69M ./1.3.0
>>>>  73M ./1.3.1
>>>>  68M ./1.4.0
>>>>  70M ./1.4.1
>>>>  80M ./1.5.0
>>>>  78M ./1.5.1
>>>>  78M ./1.5.2
>>>>  87M ./1.6.0
>>>>  87M ./1.6.1
>>>>  87M ./1.6.2
>>>>  86M ./1.6.3
>>>> 117M ./2.0.0
>>>> 119M ./2.0.0-preview
>>>> 118M ./2.0.1
>>>> 118M ./2.0.2
>>>> 121M ./2.1.0
>>>> 121M ./2.1.1
>>>> 122M ./2.1.2
>>>> 122M ./2.1.3
>>>> 130M ./2.2.0
>>>> 131M ./2.2.1
>>>> 132M ./2.2.2
>>>> 131M ./2.2.3
>>>> 141M ./2.3.0
>>>> 141M ./2.3.1
>>>> 141M ./2.3.2
>>>> 142M ./2.3.3
>>>> 142M ./2.3.4
>>>> 145M ./2.4.0
>>>> 146M ./2.4.1
>>>> 145M ./2.4.2
>>>> 144M ./2.4.3
>>>> 145M ./2.4.4
>>>> 143M ./2.4.5
>>>> 143M ./2.4.6
>>>> 143M ./2.4.7
>>>> 143M ./2.4.8
>>>> 197M ./3.0.0
>>>> 185M ./3.0.0-preview
>>>> 197M ./3.0.0-preview2
>>>> 198M ./3.0.1
>>>> 198M ./3.0.2
>>>> 205M ./3.0.3
>>>> 239M ./3.1.1
>>>> 239M ./3.1.2
>>>> 239M ./3.1.3
>>>> 840M ./3.2.0
>>>> 842M ./3.2.1
>>>> 282M ./3.2.2
>>>> 244M ./3.2.3
>>>> 282M ./3.2.4
>>>> 295M ./3.3.0
>>>> 297M ./3.3.1
>>>> 297M ./3.3.2
>>>> 297M ./3.3.3
>>>> 297M ./3.3.4
>>>> 314M ./3.4.0
>>>> 314M ./3.4.1
>>>> 328M ./3.4.2
>>>> 324M ./3.4.3
>>>> 1.1G ./3.5.0
>>>> 1.2G ./3.5.1
>>>> 1.1G ./4.0.0-preview1
>>>>
>>>> I'm concerned about publishing the documentation for version 3.5.2
>>>> to the asf-site. So, I have merged PR[2] to eliminate this potential
>>>> blocker.
>>>>
>>>> Considering that the problem still exists, should we temporarily archive
>>>> some of the outdated version documents? For example, we could keep only
>>>> the latest version for each feature release in the asf-site branch. Or
>>>> do you have any other suggestions?
>>>>
>>>>
>>>> Bests,
>>>> Kent Yao
>>>>
>>>>
>>>> [1]
>>>> https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories
>>>> [2] https://github.com/apache/spark-website/pull/543
>>>>