Your first several points align with what I explained for Python regarding abstract vs. concrete dependencies.
As I noted, the blocker for progress on reorganizing and cleaning up our Python dependencies in this way is committer alignment.

> On Feb 6, 2025, at 9:30 AM, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
> Hi,
>
> I'll start with a disclaimer: I am mostly a Java / Scala developer, so I am not that well versed in Python best practices. Having said that, here are some thoughts I have about the subject; hope they make sense :)
>
> I think we need to differentiate between code and dependencies for testing, code and dependencies for internal use (tools, build, etc.), and the actual code that ships to users, like PySpark itself. Their dependencies should be kept separate, since test and tooling dependencies should not constrain end users in which packages, and which versions, they can use.
>
> Following on from that, the actual Python code that runs on the driver (or on the Connect server, depending on the deployment) has a big impact on users who use Python. Since shading is not a practice in Python (unlike in JVM languages), we should strive to use as few dependencies as possible, so that we don't impose restrictions on our users.
>
> We should evaluate a way to avoid conflicts between test and production dependencies. For instance, the test dependencies could be "calculated" from the list of regular dependencies plus a list of test-only dependencies, to make sure we are testing what we are actually shipping. That means we should probably have a script that deletes requirements.txt and regenerates it from source files as needed (for instance generalRequirements.txt, testRequirements.txt, or something like that).
>
> Python dependencies should not be installed locally. One approach I know of is using a Docker-based interpreter in PyCharm, so that no locally installed packages affect which runs succeed and which fail (e.g. https://www.jetbrains.com/help/pycharm/using-docker-as-a-remote-interpreter.html).
>
> The Docker build should live in one central location for all purposes, with other Dockerfiles layered on top of that base Dockerfile. This means the test images should be based on the regular Docker image, and Python requirements should not be written directly into many different Dockerfiles but in one place, or at least in a requirements.txt file used by all of them.
>
> The same scripts should be used locally and within the GitHub Actions build pipelines, because only that will ensure that what we test and run locally and what we publish and let others build have exactly the same quality and stay consistent.
>
> Consider using Multi-Release JAR Files (https://openjdk.org/jeps/238), a feature added in Java 9 to "Extend the JAR file format to allow multiple, Java-release-specific versions of class files to coexist in a single archive". In short, it lets you have multiple implementations of the same class for several different Java versions. The feature is meant to help libraries adopt new language features more easily while keeping backward compatibility: if there is a new implementation or feature that could improve performance but that we avoid using because we still support older versions of Java, this helps mitigate that by letting us ship an implementation that uses the new API for users on newer JDK versions while keeping the old implementation for older Java versions.
>
> Of course, another option is to just support the latest LTS JDK version in each release. I know some are wary of this option, but since Spark applications are usually self-contained and not used as a library within some other Java project, I think it is also viable, and it would let us use newer features as they become available and fit. For instance: Virtual Threads (which can enable us to run more threads per machine and provide better parallelism for I/O-intensive and network operations), the Vector API <https://openjdk.org/jeps/489> (which can boost performance in a similar way to what Databricks' Photon and the Velox library do, just directly within Java rather than C++), Ahead-of-Time Class Loading & Linking <https://openjdk.org/jeps/483> for faster startup times, Value Objects <https://openjdk.org/jeps/8277163>, FFM (the Foreign Function & Memory API) <https://openjdk.org/jeps/454> instead of JNI, and many more.
>
> Is there a document that shows, step by step, what the current release managers do to actually build and release a version?
>
> Thanks,
> Nimrod
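To make the requirements-generation idea above concrete, here is a minimal Python sketch of the kind of script Nimrod describes, regenerating requirements.txt from separate general and test-only lists. The input file names (generalRequirements.txt, testRequirements.txt) come from his suggestion and the merge logic is only illustrative; no such script exists in the repo today.

    #!/usr/bin/env python3
    # Hypothetical sketch: regenerate requirements.txt from separate dependency
    # lists (general vs. test-only), so the tested set is always derived from
    # the shipped set rather than maintained by hand.
    from pathlib import Path

    SOURCES = ["generalRequirements.txt", "testRequirements.txt"]  # names from the suggestion above
    OUTPUT = Path("requirements.txt")

    def read_requirements(path: Path) -> list[str]:
        """Return non-empty, non-comment requirement lines from a file."""
        lines = []
        for raw in path.read_text().splitlines():
            line = raw.strip()
            if line and not line.startswith("#"):
                lines.append(line)
        return lines

    def main() -> None:
        merged: list[str] = []
        for name in SOURCES:
            merged.append(f"# --- from {name} ---")
            merged.extend(read_requirements(Path(name)))
        # Always overwrite: requirements.txt is generated output, never hand-edited.
        OUTPUT.write_text("\n".join(merged) + "\n")
        print(f"Wrote {len(merged)} lines to {OUTPUT}")

    if __name__ == "__main__":
        main()

A lock-file tool (like the pip-tools approach discussed below) could replace the hand-rolled merge, but the shape of the workflow is the same: requirements.txt becomes generated output rather than a file edited in many places.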
> On Tue, Feb 4, 2025 at 6:31 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> I still believe that the way to solve this is by splitting our Python build requirements into two:
>>
>> 1. Abstract dependencies: These capture the most open/flexible set of dependencies for the project. They are posted to PyPI.
>> 2. Concrete build dependencies: These are derived automatically from the abstract dependencies. The dependencies and transitive dependencies are fully enumerated and pinned to specific versions. We use and reference a single set of concrete build dependencies across GitHub Actions, Docker, and local test environments.
>>
>> All modern Python packaging approaches follow this pattern. The abstract dependencies go in your pyproject.toml and the concrete dependencies go in a lock file.
>>
>> Adopting modern Python packaging tooling (like uv, Poetry, or Hatch) might be too big of a change for us right now, which is why, when I last tried to do this <https://github.com/apache/spark/pull/27928>, I used pip-tools <https://github.com/jazzband/pip-tools>, which lets us stick to plain pip but adopt this modern pattern.
>>
>> I'm willing to take another stab at this, but I believe it needs buy-in from Hyukjin, who was opposed to the idea last we discussed it.
>>
>> > My understanding is that, in the PySpark CI we do not use fixed Python library versions as we want to test with the latest library versions as soon as possible.
>>
>> This is my understanding too, but I believe testing against unpinned dependencies costs us a lot of wasted time as we play whack-a-mole with build problems. Every problem eventually gets solved by pinning a dependency, but because we are not pinning them in a consistent or automated way, we end up with a single library being specified and pinned to different versions across 10+ files <https://lists.apache.org/thread/hrs8kw31163v7tydjwm9cx5yktpvdjnj>.
>>
>> I don't think whatever benefit we get from this approach outweighs the cost in complexity and management overhead.
>>
>> Nick
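As a concrete illustration of the abstract-to-concrete derivation Nick describes, here is a minimal Python wrapper around pip-tools' pip-compile command. It assumes pip-tools is installed, and the file names are placeholders rather than paths that exist in the Spark repo today.

    #!/usr/bin/env python3
    # Sketch of deriving pinned "concrete" dependencies from an abstract list
    # with pip-tools. Assumes "pip install pip-tools"; file names are placeholders.
    import subprocess
    import sys

    ABSTRACT = "requirements.in"          # abstract, loosely constrained dependencies
    CONCRETE = "requirements-pinned.txt"  # fully enumerated, pinned lock file

    def compile_lock_file() -> None:
        # pip-compile resolves the abstract requirements (including transitive
        # dependencies) and writes an exhaustively pinned output file.
        subprocess.run(["pip-compile", "--output-file", CONCRETE, ABSTRACT], check=True)

    if __name__ == "__main__":
        try:
            compile_lock_file()
        except FileNotFoundError:
            sys.exit("pip-compile not found; install pip-tools first.")

GitHub Actions, Docker images, and local environments would then all install from the same pinned file, and refreshing the pins becomes a single regeneration step instead of edits scattered across many files.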
>>
>>
>>> On Feb 4, 2025, at 10:30 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>> + @Hyukjin Kwon <gurwls...@gmail.com>
>>>
>>> My understanding is that, in the PySpark CI, we do not use fixed Python library versions because we want to test with the latest library versions as soon as possible. However, the release scripts use fixed Python library versions to make sure the release is stable. This means that for almost every major release we need to update the release scripts to sync the Python library versions with the CI, as the PySpark code or doc generation code may not be compatible with the old versions after six months.
>>>
>>> It would be better if we automated this process, but I don't have a good idea right now.
>>>
>>> On Tue, Feb 4, 2025 at 6:32 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>> Hi all,
>>>>
>>>> I am trying to revive this thread, to work towards a better release process and to make sure we have no conflicts in the artifacts we use, as nicholas.cham...@gmail.com mentioned.
>>>> @Wenchen Fan - can you please clarify: you state that the release scripts use a different build and Docker image than GitHub Actions. The release scripts produce the artifacts that are actually being used... What are the other ones, created by GitHub Actions today, used for? Only testing?
>>>>
>>>> Personally, I believe that "release is king", meaning that what is actually being used by all the users is the "correct" build, and we should align ourselves to it.
>>>>
>>>> What do you think are the next steps we need to take in order to make the release process fully automated and simple?
>>>>
>>>> Thanks,
>>>> Nimrod
>>>>
>>>>
>>>> On Mon, May 13, 2024 at 2:31 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>> Hi Nicholas,
>>>>>
>>>>> Thanks for your help! I'm definitely interested in participating in this unification work. Let me know how I can help.
>>>>>
>>>>> Wenchen
>>>>>
>>>>> On Mon, May 13, 2024 at 1:41 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>> Re: unification
>>>>>>
>>>>>> We also have a long-standing problem with how we manage Python dependencies, something I've tried (unsuccessfully <https://github.com/apache/spark/pull/27928>) to fix in the past.
>>>>>>
>>>>>> Consider, for example, how many separate places this numpy dependency is installed:
>>>>>>
>>>>>> 1. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
>>>>>> 2. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
>>>>>> 3. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
>>>>>> 4. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
>>>>>> 5. https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
>>>>>> 6. https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
>>>>>> 7. https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
>>>>>> 8. https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
>>>>>> 9. https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
>>>>>> 10. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
>>>>>> 11. https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
>>>>>> 12. https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92
>>>>>>
>>>>>> None of those installations reference a unified version requirement, so naturally they are inconsistent across all these different lines. Some say `>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In several cases there is no version requirement specified at all.
>>>>>>
>>>>>> I'm interested in trying again to fix this problem, but it needs to be done in collaboration with a committer since I cannot fully test the release scripts. (This testing gap is what doomed my last attempt at fixing this problem.)
>>>>>>
>>>>>> Nick
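Until those installations are consolidated, a small audit script could at least surface the drift Nick describes. The following is a hypothetical Python helper, not something that exists in the repo: it scans workflow files, Dockerfiles, and requirements files for numpy version specifiers and fails if they disagree.

    #!/usr/bin/env python3
    # Hypothetical audit helper: list every numpy version specifier across the
    # kinds of files cited above and flag inconsistencies.
    import re
    from pathlib import Path

    GLOBS = [
        ".github/workflows/*.yml",
        "dev/**/Dockerfile",
        "dev/requirements.txt",
        "dev/run-pip-tests",
    ]

    # Matches "numpy", "numpy>=1.21", "numpy==1.20.3", etc., while skipping
    # names like "numpydoc". A real check would also handle quoting and comments.
    NUMPY_SPEC = re.compile(r"numpy(?![\w-])(?P<spec>[=<>!~]=?[\w.]+)?")

    def find_specs(root: Path) -> dict[str, set[str]]:
        specs: dict[str, set[str]] = {}
        for pattern in GLOBS:
            for path in root.glob(pattern):
                for match in NUMPY_SPEC.finditer(path.read_text(errors="ignore")):
                    spec = "numpy" + (match.group("spec") or " (unpinned)")
                    specs.setdefault(spec, set()).add(str(path))
        return specs

    if __name__ == "__main__":
        found = find_specs(Path("."))
        for spec, files in sorted(found.items()):
            print(spec)
            for f in sorted(files):
                print(f"  {f}")
        if len(found) > 1:
            raise SystemExit("Inconsistent numpy requirements found.")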
>>>>>>
>>>>>>> On May 13, 2024, at 12:18 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>
>>>>>>> After finishing the 4.0.0-preview1 RC1, I have more experience with this topic now.
>>>>>>>
>>>>>>> In fact, the main job of the release process, building the packages and documentation, is tested in GitHub Actions jobs. However, the way we test it there is different from what we do in the release scripts.
>>>>>>>
>>>>>>> 1. The execution environment is different. The release scripts define the execution environment with this Dockerfile: https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile However, the GitHub Actions jobs use a different Dockerfile: https://github.com/apache/spark/blob/master/dev/infra/Dockerfile We should figure out a way to unify them. The Docker image for the release process needs to set up more things, so it may not be viable to use a single Dockerfile for both.
>>>>>>>
>>>>>>> 2. The execution code is different. Take building the documentation as an example. The release scripts: https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411 The GitHub Actions job: https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895 I don't know which one is more correct, but we should definitely unify them.
>>>>>>>
>>>>>>> It would be better if we could run the release scripts as GitHub Actions jobs, but I think it's more important to do the unification now.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Wenchen
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 10, 2024 at 12:34 AM Hussein Awala <huss...@awala.fr> wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I can answer some of your common questions based on other Apache projects.
>>>>>>>>
>>>>>>>> > Who currently has permissions for GitHub Actions? Is there a specific owner for that today, or a different volunteer each time?
>>>>>>>>
>>>>>>>> The Apache organization owns the GitHub Actions setup, and committers (contributors with write permissions) can retrigger or cancel a GitHub Actions workflow, but the GitHub Actions runners are managed by the Apache infra team.
>>>>>>>>
>>>>>>>> > What are the current limits of GitHub Actions, who sets them, and what is the process to change them (if possible at all, though I presume not all Apache projects have the same limits)?
>>>>>>>>
>>>>>>>> For limits, I don't think there is any significant constraint, especially since the Apache organization has 900 donated runners used by its projects, and there is an initiative from the Infra team to add self-hosted runners running on Kubernetes (document <https://cwiki.apache.org/confluence/display/INFRA/ASF+Infra+provided+self-hosted+runners>).
>>>>>>>>
>>>>>>>> > Where should the artifacts be stored?
>>>>>>>>
>>>>>>>> Usually we use Maven for jars, DockerHub for Docker images, and the GitHub cache for workflow caches. But we can use GitHub artifacts to store any kind of package (even Docker images, in GHCR), which is fully accepted by Apache policies. Also, if the project has a cloud account (AWS, GCP, Azure, ...), a bucket can be used to store some of the packages.
>>>>>>>>
>>>>>>>> > Who should be permitted to sign a version, and what is the process for that?
>>>>>>>>
>>>>>>>> The Apache documentation is clear about this: by default only PMC members can be release managers, but we can contact the infra team to add one of the committers as a release manager (document <https://infra.apache.org/release-publishing.html#releasemanager>). The process of creating a new version is described in this document <https://www.apache.org/legal/release-policy.html#policy>.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, May 9, 2024 at 10:45 AM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>>> Following the conversation started around the Spark 4.0.0 release, this is a thread to discuss improvements to our release processes.
>>>>>>>>>
>>>>>>>>> I'll start by raising some questions that probably need answers to get the discussion going:
>>>>>>>>>
>>>>>>>>> What is currently running in GitHub Actions?
>>>>>>>>> Who currently has permissions for GitHub Actions? Is there a specific owner for that today, or a different volunteer each time?
>>>>>>>>> What are the current limits of GitHub Actions, who sets them, and what is the process to change them (if possible at all, though I presume not all Apache projects have the same limits)?
>>>>>>>>> What versions should we support as an output of the build?
>>>>>>>>> Where should the artifacts be stored?
>>>>>>>>> What should the output be? Only a tar, or also a Docker image published somewhere?
>>>>>>>>> Do we want releases on fixed dates, or a manual release upon request?
>>>>>>>>> Who should be permitted to sign a version, and what is the process for that?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> Nimrod