In this context, I think the docker images are similar to the binaries rather than an extension. It's packaging the compiled distribution to save people the effort of building one themselves, akin to the binary releases or the PySpark package on PyPI.
For reference, this is the base dockerfile <https://github.com/apache-spark-on-k8s/spark/tree/branch-2.2-kubernetes/resource-managers/kubernetes/docker-minimal-bundle/src/main/docker/spark-base> for the main image that we intend to publish. It's not particularly complicated. The driver <https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/docker-minimal-bundle/src/main/docker/driver/Dockerfile> and executor <https://github.com/apache-spark-on-k8s/spark/blob/branch-2.2-kubernetes/resource-managers/kubernetes/docker-minimal-bundle/src/main/docker/executor/Dockerfile> images are based on that base image and only customize the CMD (any file/directory inclusions are extraneous and will be removed).

Is there only one way to build it? That's a bit harder to reason about. The base image, I'd argue, is likely always going to be built that way. For the driver and executor images, there may be cases where people want to customize them - for example, baking all of an application's dependencies into the image. In those cases, as long as our images are bare bones, they can use the spark-driver/spark-executor images we publish as the base and build their customization as a layer on top of them (see the sketch below).

I think the composability of docker images makes this a bit different from, say, debian packages. We can publish canonical images that serve as both a complete image for most Spark applications and a stable substrate to build customizations upon.
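To make the layering concrete, here is a rough sketch of what such a downstream customization could look like. It assumes we publish an executor image under a name like spark-executor:2.3.0; the image name, tag, jar name, and paths below are placeholders for illustration, not actual published artifacts.

    # Hypothetical Dockerfile for a user-customized executor image.
    # Base image name/tag and all paths are placeholders.
    FROM spark-executor:2.3.0

    # Bake application jars and extra dependencies in as an additional
    # layer on top of the bare-bones image we publish.
    COPY my-app-assembly.jar /opt/spark/jars/
    COPY extra-deps/ /opt/spark/jars/

    # No CMD override needed - the entrypoint/CMD from the published
    # executor image is inherited unchanged.

Because the published image stays bare bones, this kind of customization happens entirely downstream, without us having to anticipate every dependency-bundling scenario in the canonical images.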
On Wed, Nov 29, 2017 at 7:38 AM, Mark Hamstra <m...@clearstorydata.com> wrote:

> It's probably also worth considering whether there is only one,
> well-defined, correct way to create such an image or whether this is a
> reasonable avenue for customization. Part of why we don't do something
> like maintain and publish canonical Debian packages for Spark is because
> different organizations doing packaging and distribution of
> infrastructures or operating systems can reasonably want to do this in a
> custom (or non-customary) way. If there is really only one reasonable way
> to do a docker image, then my bias starts to tend more toward the Spark
> PMC taking on the responsibility to maintain and publish that image. If
> there is more than one way to do it and publishing a particular image is
> more just a convenience, then my bias tends more away from maintaining
> and publishing it.
>
> On Wed, Nov 29, 2017 at 5:14 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> Source code is the primary release; compiled binary releases are
>> conveniences that are also released. A docker image sounds fairly
>> different though. To the extent it's the standard delivery mechanism for
>> some artifact (think: pyspark on PyPI as well) that makes sense, but is
>> that the situation? If it's more of an extension or alternate
>> presentation of Spark components, that typically wouldn't be part of a
>> Spark release. The ones the PMC takes responsibility for maintaining
>> ought to be the core, critical means of distribution alone.
>>
>> On Wed, Nov 29, 2017 at 2:52 AM Anirudh Ramanathan <
>> ramanath...@google.com.invalid> wrote:
>>
>>> Hi all,
>>>
>>> We're all working towards the Kubernetes scheduler backend (full steam
>>> ahead!) that's targeted towards Spark 2.3. One of the questions that
>>> comes up often is docker images.
>>>
>>> While we're making available dockerfiles to allow people to create
>>> their own docker images from source, ideally, we'd want to publish
>>> official docker images as part of the release process.
>>>
>>> I understand that the ASF has procedure around this, and we would want
>>> to get that started to help us get these artifacts published by 2.3.
>>> I'd love to get a discussion around this started, and the thoughts of
>>> the community regarding this.
>>>
>>> --
>>> Thanks,
>>> Anirudh Ramanathan
>>>
>>
>

--
Anirudh Ramanathan