Hi,

I'll start with a disclaimer: I am mostly a Java / Scala developer, so I am
not that well versed in Python best practices.
Having said that, here are some thoughts I have on the subject - I hope
they make sense :)

   1. I think we need to differentiate between code and dependencies for
   testing purposes, code and dependencies for internal use (tools, build,
   etc.), and the actual code that ships to users - like PySpark itself. These
   sets of dependencies should be kept separate, so that test and tooling
   dependencies never constrain end users in which packages and versions
   they can use.
   2. As a follow-up to the previous note: the actual Python code that runs
   on the driver (or on the Connect server, depending on the deployment) has
   a big impact on Python users - and since shading is not a practice in
   Python (unlike in JVM languages), we should keep our dependency footprint
   as small as possible, so we don't impose restrictions on our users.
   3. We should evaluate a way to avoid conflicts between test and
   production dependencies.
   For instance, test dependencies can be derived from the list of regular
   dependencies plus test-only additions - to make sure we are testing what
   we are actually shipping.
   That means we should probably have a script that deletes requirements.txt
   and regenerates it from source files as needed (for instance
   generalRequirements.txt, testRequirements.txt, or something like that) -
   see the first sketch after this list.
   4. Python dependencies should not be installed directly on the local
   machine. One approach I know of is using PyCharm with a Docker-based
   remote interpreter - to make sure no locally installed packages affect
   which runs pass and which ones fail (e.g.
   
https://www.jetbrains.com/help/pycharm/using-docker-as-a-remote-interpreter.html
   )
   5. The Docker build should live in one central location for all
   purposes, with other Dockerfiles layered on top of a common base.
   This means the test images should be based on the regular Docker image,
   and Python requirements should not be written directly into many
   different Dockerfiles but in one place - or at least in a
   requirements.txt file used by all of them (see the Dockerfile sketch
   after this list).
   6. The same scripts should be used locally and within the GitHub Actions
   build pipelines - only that guarantees that what we test and run locally
   and what we publish and let others build are exactly the same.
   7. Consider using Multi-Release JAR Files ( https://openjdk.org/jeps/238
   ) - a feature added in Java 9 to "extend the JAR file format to allow
   multiple, Java-release-specific versions of class files to coexist in a
   single archive".
   In short, it lets you ship multiple implementations of the same class
   for several different Java versions.
   This feature is meant to help libraries adopt new language features
   while keeping backward compatibility - so if there is a new API that
   could improve performance, but that we avoid using because we still
   support older versions of Java, this helps mitigate it: we can provide
   an implementation using the new API for users on newer JDKs while
   keeping the old one for older Java versions (see the JAR layout sketch
   after this list).
   8. Of course, another option is to simply support only the latest LTS
   JDK version in each release. I know some are wary of this option, but
   since Spark applications are usually self-contained and not used as a
   library within some other Java project, I think it is a viable option
   that would let us adopt new features as they become available and fit -
   for instance Virtual Threads (which can enable running more threads per
   machine, and can provide better parallelism for I/O-intensive and
   network operations - see the sketch after this list), the *Vector API
   <https://openjdk.org/jeps/489>* - which can boost performance in a
   similar way to what Databricks' Photon and the Velox library do, just
   directly in Java rather than C++ - Ahead-of-Time Class Loading & Linking
   <https://openjdk.org/jeps/483> for faster startup times, Value Objects
   <https://openjdk.org/jeps/8277163>, FFM <https://openjdk.org/jeps/454>
   instead of JNI, and many more.
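
To make point 3 concrete, here is a minimal sketch of how the requirements
could be derived with pip-tools (the file names are just examples, not
files that exist in the repo today):

    # requirements.in - abstract production dependencies, hand-maintained
    numpy>=1.21
    pandas>=2.0

    # requirements-test.in - test-only additions, constrained by the
    # pinned production set so the two can never drift apart
    -c requirements.txt
    pytest
    coverage

    $ pip-compile requirements.in -o requirements.txt
    $ pip-compile requirements-test.in -o requirements-test.txt

This way the test environment is guaranteed to be a superset of what we
actually ship, pinned to the exact same versions.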
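
Similarly, a sketch of the Docker layering from point 5 (image and file
names are illustrative):

    # Dockerfile.base - the single source of truth for dependencies
    FROM ubuntu:22.04
    RUN apt-get update && apt-get install -y python3-pip
    COPY requirements.txt /tmp/requirements.txt
    RUN pip3 install -r /tmp/requirements.txt

    # Dockerfile.test - layered on top, only adds the test extras
    FROM spark-base:latest
    COPY requirements-test.txt /tmp/requirements-test.txt
    RUN pip3 install -r /tmp/requirements-test.txt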
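
For point 7, the layout of a Multi-Release JAR looks roughly like this
(the class name is made up for illustration):

    META-INF/MANIFEST.MF                    (contains: Multi-Release: true)
    org/apache/spark/util/SomeUtil.class    (baseline implementation)
    META-INF/versions/21/org/apache/spark/util/SomeUtil.class  (Java 21 variant)

    $ jar --create --file spark-something.jar \
          -C target/classes . \
          --release 21 -C target/classes-java21 .

The runtime picks the most specific matching version, so users on Java 21
transparently get the new implementation while older JDKs fall back to the
baseline class.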
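
And as a taste of point 8, this is what an I/O-bound fan-out could look
like with virtual threads (just a sketch, not actual Spark code):

    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class VirtualThreadFanOut {
        public static void main(String[] args) throws Exception {
            // One cheap virtual thread per task instead of a sized platform pool.
            try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                List<Callable<String>> fetches = List.of(
                        () -> fetch("block-1"),
                        () -> fetch("block-2"),
                        () -> fetch("block-3"));
                // Blocking in a virtual thread is cheap, so we can simply wait.
                for (Future<String> result : executor.invokeAll(fetches)) {
                    System.out.println(result.get());
                }
            }
        }

        // Stand-in for a blocking I/O call such as a remote block fetch.
        private static String fetch(String blockId) throws InterruptedException {
            Thread.sleep(100); // simulate network latency
            return "fetched " + blockId;
        }
    }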

Is there a document that shows what the current release managers do to
actually build and release a version? Step by step?

Thanks,
Nimrod


On Tue, Feb 4, 2025 at 6:31 PM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> I still believe that the way to solve this is by splitting our Python
> build requirements into two:
>
> 1. *Abstract dependencies*: These capture the most open/flexible set of
> dependencies for the project. They are posted to PyPI.
> 2. *Concrete build dependencies*: These are derived automatically from
> the abstract dependencies. The dependencies and transitive dependencies are
> fully enumerated and pinned to specific versions. We use and reference a
> single set of concrete build dependencies across GitHub Actions, Docker,
> and local test environments.
>
> All modern Python packaging approaches follow this pattern. The abstract
> dependencies go in your pyproject.toml and the concrete dependencies go in
> a lock file.
>
> Adopting modern Python packaging tooling (like uv, Poetry, or Hatch) might
> be too big of a change for us right now, which is why when I last tried
> to do this <https://github.com/apache/spark/pull/27928> I used pip-tools
> <https://github.com/jazzband/pip-tools>, which lets us stick to plain pip
> but adopt this modern pattern.
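>
> For illustration, the split looks roughly like this (the contents and
> versions below are just examples):
>
>     # pyproject.toml - abstract dependencies, published to PyPI
>     [project]
>     dependencies = ["numpy>=1.21", "pandas>=2.0"]
>
>     $ pip-compile pyproject.toml -o requirements-locked.txt
>
>     # requirements-locked.txt - generated, fully pinned, referenced by
>     # GitHub Actions, Docker, and local test environments alike
>     numpy==1.26.4
>     pandas==2.2.3
>     python-dateutil==2.9.0.post0   # transitive, via pandas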
>
> I’m willing to take another stab at this, but I believe it needs buy-in
> from Hyukjin, who was opposed to the idea last we discussed it.
>
> > My understanding is that, in the PySpark CI we do not use fixed Python
> library versions as we want to test with the latest library versions as
> soon as possible.
>
> This is my understanding too, but I believe testing against unpinned
> dependencies causes us so much wasted time as we play whack-a-mole with
> build problems. And every problem eventually gets solved by pinning a
> dependency, but because we are not pinning them in a consistent or
> automated way, we end up with a single library being specified and pinned
> to different versions across 10+ files
> <https://lists.apache.org/thread/hrs8kw31163v7tydjwm9cx5yktpvdjnj>.
>
> I don’t think whatever benefit we are getting from this approach outweighs
> this cost in complexity and management overhead.
>
> Nick
>
>
> On Feb 4, 2025, at 10:30 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
> + @Hyukjin Kwon <gurwls...@gmail.com>
>
> My understanding is that, in the PySpark CI we do not use fixed Python
> library versions as we want to test with the latest library versions as
> soon as possible. However, the release scripts use fixed Python library
> versions to make sure it's stable. This means that for almost every major
> release we need to update the release scripts to sync the Python library
> versions with the CI, as the PySpark code or doc generation code may not be
> compatible with the old versions after 6 months.
>
> It would be better if we automate this process, but I don't have a good
> idea now.
>
> On Tue, Feb 4, 2025 at 6:32 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am trying to revive this thread - to work towards a better release
>> process, and making sure we have no conflicts in the used artifacts like
>> nicholas.cham...@gmail.com mentioned.
>> @Wenchen Fan <cloud0...@gmail.com> - can you please clarify - you state
>> that the release scripts are using a different build and Docker than Github
>> Actions.
>> The release scripts are releasing the artifacts that are actually being
>> used... What are the other ones, which are created by GitHub Actions today,
>> used for? Only testing?
>>
>> Personally, I believe that "release is king" - meaning that what is
>> actually being used by all the users is the "correct" build, and we should
>> align ourselves to it.
>>
>> What do you think are the needed next steps for us to take in order to
>> make the release process fully automated and simple?
>>
>> Thanks,
>> Nimrod
>>
>>
>> On Mon, May 13, 2024 at 2:31 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Hi Nicholas,
>>>
>>> Thanks for your help! I'm definitely interested in participating in this
>>> unification work. Let me know how I can help.
>>>
>>> Wenchen
>>>
>>> On Mon, May 13, 2024 at 1:41 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Re: unification
>>>>
>>>> We also have a long-standing problem with how we manage Python
>>>> dependencies, something I’ve tried (unsuccessfully
>>>> <https://github.com/apache/spark/pull/27928>) to fix in the past.
>>>>
>>>> Consider, for example, how many separate places this numpy dependency
>>>> is installed:
>>>>
>>>> 1.
>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
>>>> 2.
>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
>>>> 3.
>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
>>>> 4.
>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
>>>> 5.
>>>> https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
>>>> 6.
>>>> https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
>>>> 7.
>>>> https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
>>>> 8.
>>>> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
>>>> 9.
>>>> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
>>>> 10.
>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
>>>> 11.
>>>> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
>>>> 12.
>>>> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92
>>>>
>>>> None of those installations reference a unified version requirement, so
>>>> naturally they are inconsistent across all these different lines. Some say
>>>> `>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In
>>>> several cases there is no version requirement specified at all.
>>>>
>>>> I’m interested in trying again to fix this problem, but it needs to be
>>>> in collaboration with a committer since I cannot fully test the release
>>>> scripts. (This testing gap is what doomed my last attempt at fixing this
>>>> problem.)
>>>>
>>>> Nick
>>>>
>>>>
>>>> On May 13, 2024, at 12:18 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>> After finishing the 4.0.0-preview1 RC1, I have more experience with
>>>> this topic now.
>>>>
>>>> In fact, the main job of the release process: building packages and
>>>> documents, is tested in Github Action jobs. However, the way we test them
>>>> is different from what we do in the release scripts.
>>>>
>>>> 1. the execution environment is different:
>>>> The release scripts define the execution environment with this
>>>> Dockerfile:
>>>> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
>>>> However, Github Action jobs use a different Dockerfile:
>>>> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
>>>> We should figure out a way to unify it. The docker image for the
>>>> release process needs to set up more things so it may not be viable to use
>>>> a single Dockerfile for both.
>>>>
>>>> 2. the execution code is different. Use building documents as an
>>>> example:
>>>> The release scripts:
>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
>>>> The Github Action job:
>>>> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
>>>> I don't know which one is more correct, but we should definitely unify
>>>> them.
>>>>
>>>> It's better if we can run the release scripts as Github Action jobs,
>>>> but I think it's more important to do the unification now.
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>>
>>>> On Fri, May 10, 2024 at 12:34 AM Hussein Awala <huss...@awala.fr>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I can answer some of your common questions with other Apache projects.
>>>>>
>>>>> > Who currently has permissions for Github actions? Is there a
>>>>> specific owner for that today or a different volunteer each time?
>>>>>
>>>>> The Apache organization owns Github Actions, and committers
>>>>> (contributors with write permissions) can retrigger/cancel a Github 
>>>>> Actions
>>>>> workflow, but Github Actions runners are managed by the Apache infra team.
>>>>>
>>>>> > What are the current limits of GitHub Actions, who set them - and
>>>>> what is the process to change those (if possible at all, but I presume not
>>>>> all Apache projects have the same limits)?
>>>>>
>>>>> For limits, I don't think there is any significant limit, especially
>>>>> since the Apache organization has 900 donated runners used by its 
>>>>> projects,
>>>>> and there is an initiative from the Infra team to add self-hosted runners
>>>>> running on Kubernetes (document
>>>>> <https://cwiki.apache.org/confluence/display/INFRA/ASF+Infra+provided+self-hosted+runners>
>>>>> ).
>>>>>
>>>>> > Where should the artifacts be stored?
>>>>>
>>>>> Usually, we use Maven for jars, DockerHub for Docker images, and
>>>>> Github cache for workflow cache. But we can use Github artifacts to store
>>>>> any kind of package (even Docker images in the ghcr), which is fully
>>>>> accepted by Apache policies. Also if the project has a cloud account (AWS,
>>>>> GCP, Azure, ...), a bucket can be used to store some of the packages.
>>>>>
>>>>>
>>>>>  > Who should be permitted to sign a version - and what is the process
>>>>> for that?
>>>>>
>>>>> The Apache documentation is clear about this, by default only PMC
>>>>> members can be release managers, but we can contact the infra team to add
>>>>> one of the committers as a release manager (document
>>>>> <https://infra.apache.org/release-publishing.html#releasemanager>).
>>>>> The process of creating a new version is described in this document
>>>>> <https://www.apache.org/legal/release-policy.html#policy>.
>>>>>
>>>>>
>>>>> On Thu, May 9, 2024 at 10:45 AM Nimrod Ofek <ofek.nim...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Following the conversation started with Spark 4.0.0 release, this is
>>>>>> a thread to discuss improvements to our release processes.
>>>>>>
>>>>>> I'll Start by raising some questions that probably should have
>>>>>> answers to start the discussion:
>>>>>>
>>>>>>
>>>>>>    1. What is currently running in GitHub Actions?
>>>>>>    2. Who currently has permissions for Github actions? Is there a
>>>>>>    specific owner for that today or a different volunteer each time?
>>>>>>    3. What are the current limits of GitHub Actions, who set them -
>>>>>>    and what is the process to change those (if possible at all, but I 
>>>>>> presume
>>>>>>    not all Apache projects have the same limits)?
>>>>>>    4. What versions should we support as an output for the build?
>>>>>>    5. Where should the artifacts be stored?
>>>>>>    6. What should be the output? only tar or also a docker image
>>>>>>    published somewhere?
>>>>>>    7. Do we want to have a release on fixed dates or a manual
>>>>>>    release upon request?
>>>>>>    8. Who should be permitted to sign a version - and what is the
>>>>>>    process for that?
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>> Nimrod
>>>>>>
>>>>>
>>>>
>
