Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Kent Yao
Hi Ruifeng, We have already ported those scripts
On Fri, Aug 9, 2024 at 12:41, Ruifeng Zheng wrote:
> Hi Kent, I remember that we have some scripts to free disk space in the spark repo,
> maybe we can reuse them for spark-website.
>
> On Fri, Aug 9, 2024 at 9:57 AM Sean Owen wrote:
>
>> I don't think that's the issue

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Ruifeng Zheng
Hi Kent, I remember that we have some scripts to free disk space in the spark repo, maybe we can reuse them for spark-website.
On Fri, Aug 9, 2024 at 9:57 AM Sean Owen wrote:
> I don't think that's the issue - it's the size of what is cloned into a
> container during the GitHub Actions runs. Doesn't matter

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen
I don't think that's the issue - it's the size of what is cloned into a container during the GitHub Actions runs. Doesn't matter how it is stored. They are not large files, either.
On Thu, Aug 8, 2024, 4:34 PM Mich Talebzadeh wrote:
>
> Maybe you should look into deploying GitHub Large File Stor

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Mich Talebzadeh
Maybe you should look into deploying GitHub Large File Storage (LFS). If applicable, store large documentation files in LFS to reduce the repository size. HTH Mich Talebzadeh, Ar

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Dongjoon Hyun
Ya, I agree that we need to investigate what happened to the PySpark 3.5+ docs. For old Spark docs, it seems to be negligible.
- All Spark 0.x docs: 231M
- All Spark 1.x docs: 1.3G
- All Spark 2.x docs: 3.4G
For example, the total size of all the old Spark docs above is less than the following 4 releas

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen
That seems a little bit too much to me. I could see people still on a recent version that just want to see docs or compare/contrast docs for changes. Removing the versions that seem to have ~0 traffic would remove, it seems, like 80% of the .html files (and replace them with a compressed archive

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen
Whoa! Is there any clear reason why 3.5 docs are so big? 1GB of docs / 10x jump seems crazy. Maybe we need to investigate and fix that also. I take it that the problem is the size of the repo once it's cloned into the docker container. Removing the .html files helps that, but, then we don't have .

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Dongjoon Hyun
The culprit seems to be the PySpark 3.5 documentation, which grew about 11x at 3.5+
$ du -h 3.4.3/api/python | tail -n1
84M 3.4.3/api/python
$ du -h 3.5.1/api/python | tail -n1
943M 3.5.1/api/python
Since we will generate big documents for 3.5.x, 4.0.0-preview, 4.0.x, 4.1.x, the proposed tarball id
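
The `du` comparison quoted in this message can be reproduced programmatically. The following is a minimal sketch, not anything from the thread itself: the helper names `dir_size_bytes` and `growth_factor` are hypothetical, and the sketch sums apparent file sizes, whereas `du` reports disk blocks, so absolute numbers may differ slightly from the figures above.

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum the apparent sizes of all regular files under root.

    Hypothetical helper: `du` counts disk blocks, so its totals can
    differ somewhat from this byte count.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a linked file is not counted twice.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def growth_factor(old_size: float, new_size: float) -> float:
    """Return how many times larger new_size is than old_size."""
    return new_size / old_size


# Figures quoted in the message (84M for 3.4.3, 943M for 3.5.1).
print(round(growth_factor(84, 943), 1))  # 11.2
```

To compare two checked-out doc trees, one could call `growth_factor(dir_size_bytes("3.4.3/api/python"), dir_size_bytes("3.5.1/api/python"))` from the root of a local spark-website clone.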

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Wenchen Fan
It makes sense to me to only keep the doc files for the latest maintenance release, i.e., remove the docs for 3.5.0 and keep only 3.5.1.
On Thu, Aug 8, 2024 at 8:06 PM Kent Yao wrote:
> Hi dev,
>
> The current size of the spark-website repository is approximately 16GB,
> exceeding the storage lim

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Nicholas Chammas
How big of a change would it be to have the repo only contain the Markdown source and not the rendered HTML (which should perhaps be moved to an object store)?
> On Aug 8, 2024, at 8:06 AM, Kent Yao wrote:
>
> Hi dev,
>
> The current size of the spark-website repository is approximately 16GB

Re: Spark website repo size hits the storage limit of GitHub-hosted runners

2024-08-08 Thread Sean Owen
I agree with 'archiving', but what does that mean? Delete from the repo and site? While I really doubt people are looking for docs for, say, 0.5.0, it'd be a big jump to totally remove it. What if we made a compressed tarball of old docs and put that in the repo, linked to it, and removed the docs
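
The compressed-tarball idea floated in this message could look roughly like the sketch below. It is an illustration under stated assumptions, not the approach the project adopted: `archive_docs` is a hypothetical helper, the directory layout (one folder per version) is assumed, and the sketch deliberately does not delete the original docs.

```python
import os
import tarfile


def archive_docs(version_dir: str, out_path: str) -> list[str]:
    """Pack one version's docs into a gzip-compressed tarball.

    Hypothetical helper: returns the member names stored in the archive
    so the caller can verify it before removing anything from the repo.
    """
    # Store paths relative to the version folder, e.g. "0.5.0/index.html".
    arcname = os.path.basename(os.path.normpath(version_dir))
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(version_dir, arcname=arcname)
    # Re-open the archive to list what was actually written.
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()
```

Checking the returned member list against the source tree before removing the rendered files keeps the deletion step separate and reversible.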