Re: Toward an "API" for spark images used by the Kubernetes back-end

Rob Vesse Thu, 22 Mar 2018 03:11:00 -0700

The difficulty with a custom Spark config is that you need to be careful that 
the Spark config the user provides does not conflict with the auto-generated 
portions of the Spark config necessary to make Spark on K8S work.  So part of 
any “API” definition might need to specify which Spark config is considered 
“managed” by the Kubernetes scheduler backend.
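
As a rough illustration of what a “managed” set could mean in practice, here is a minimal Python sketch that rejects user config touching backend-owned keys before merging. The key names, the prefix, and both function names are assumptions made for the example; they are not the list the Kubernetes backend actually manages.

```python
from typing import Dict, List

# ASSUMPTION: illustrative keys only; not the backend's real managed set.
MANAGED_KEYS = {"spark.master", "spark.driver.host"}
MANAGED_PREFIXES = ("spark.kubernetes.driver.pod.",)


def find_conflicts(user_conf: Dict[str, str]) -> List[str]:
    """Return the user-supplied keys that fall in the managed set, sorted."""
    return sorted(
        key for key in user_conf
        if key in MANAGED_KEYS or key.startswith(MANAGED_PREFIXES)
    )


def merge_conf(user_conf: Dict[str, str],
               generated_conf: Dict[str, str]) -> Dict[str, str]:
    """Merge user config with auto-generated config, failing on conflicts."""
    conflicts = find_conflicts(user_conf)
    if conflicts:
        raise ValueError("user config sets managed keys: " + ", ".join(conflicts))
    merged = dict(user_conf)
    merged.update(generated_conf)  # generated values are applied last
    return merged
```

Failing loudly is only one possible policy; silently overriding the user's value, or logging a warning, would be equally valid choices for the “API” to document.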


 

For more controlled (i.e. security-conscious) environments, allowing end users 
to provide custom images may be a non-starter, so the more we can do at the 
“API” level without customising the containers the better.  A practical example 
of this is managing Python dependencies.  One option we’re considering is having 
a base image with Anaconda included, projecting a Conda environment spec into 
the containers (via volume mounts), and having the container recreate that 
Conda environment on startup.  That won’t work for all possible environments, 
e.g. those that use non-standard Conda channels, but it would provide a lot of 
capability without customising the images.
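
To make the startup step concrete, here is a minimal Python sketch of what a container entrypoint helper might do: if a spec file has been projected into the container, recreate the environment with `conda env create`. The mount path, the environment name, and the function names are assumptions for illustration; nothing here is defined by Spark or by the reference images.

```python
import os
import subprocess
from typing import Callable, List

# ASSUMPTION: hypothetical mount point for the projected spec file.
DEFAULT_SPEC_PATH = "/etc/spark-conda/environment.yml"


def conda_create_command(spec_path: str, env_name: str) -> List[str]:
    """Build the conda invocation that recreates an environment from a spec."""
    return ["conda", "env", "create", "--file", spec_path, "--name", env_name]


def recreate_env(spec_path: str = DEFAULT_SPEC_PATH,
                 env_name: str = "spark-job",
                 run: Callable[[List[str]], object] = subprocess.check_call) -> bool:
    """Recreate the environment if a spec was mounted.

    Returns False (and does nothing) when no spec file is present, so the
    same image still works for jobs that need no extra Python dependencies.
    """
    if not os.path.isfile(spec_path):
        return False
    run(conda_create_command(spec_path, env_name))
    return True
```

The `run` parameter exists only so the command can be inspected without conda installed; in a real container it would be left at its default.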

 

Rob

 

From: Felix Cheung <[email protected]>
Date: Thursday, 22 March 2018 at 06:21
To: Holden Karau <[email protected]>, Erik Erlandson <[email protected]>
Cc: dev <[email protected]>
Subject: Re: Toward an "API" for spark images used by the Kubernetes back-end

 

I like being able to customize the docker image itself - but I realize this 
thread is more about “API” for the stock image.

 

Environment is nice. Probably we need a way to set custom spark config (as a 
file??)

 

 

From: Holden Karau <[email protected]>
Sent: Wednesday, March 21, 2018 10:44:20 PM
To: Erik Erlandson
Cc: dev
Subject: Re: Toward an "API" for spark images used by the Kubernetes back-end 

 

I’m glad this discussion is happening on dev@ :)

 

Personally I like customizing with shell env variables while rolling my own 
image, but documenting the expectations/usage of the variables is definitely 
needed before we can really call it an API.
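
One way to keep such documentation honest is to hold the variable list in a single place and both render the docs and validate the container environment from it. The Python sketch below shows the idea; the variable names and descriptions are hypothetical placeholders, not the variables the backend actually sets.

```python
from typing import List, Mapping, NamedTuple


class EnvVarSpec(NamedTuple):
    name: str
    required: bool
    description: str


# ASSUMPTION: hypothetical variable names, not the ones the backend sets.
IMAGE_ENV_API = [
    EnvVarSpec("EXAMPLE_SPARK_ROLE", True,
               "Which process to start: driver or executor."),
    EnvVarSpec("EXAMPLE_EXTRA_CLASSPATH", False,
               "Extra jars to put on the classpath."),
]


def missing_required(env: Mapping[str, str]) -> List[str]:
    """Names of required variables that are unset or empty in env."""
    return [spec.name for spec in IMAGE_ENV_API
            if spec.required and not env.get(spec.name)]


def render_docs() -> str:
    """One line of documentation per variable, in declaration order."""
    lines = []
    for spec in IMAGE_ENV_API:
        flag = "required" if spec.required else "optional"
        lines.append(f"{spec.name} ({flag}): {spec.description}")
    return "\n".join(lines)
```

A custom image's entrypoint could call `missing_required(os.environ)` and exit early with a clear message, which gives users who are not building from the reference Dockerfiles a quick check that they have honoured the contract.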

 

On the related question, I suspect two of the more “common” likely 
customizations are adding additional jars for bootstrapping fetching from a DFS, 
and similarly complicated Python dependencies (although given that Python 
support isn’t merged yet, it’s hard to say what exactly this would look like).

 

I could also see some vendors wanting to add some bootstrap/setup scripts to 
fetch keys or other things.

 

What other ways do folks foresee customizing their Spark docker containers? 

 

On Wed, Mar 21, 2018 at 5:04 PM Erik Erlandson <[email protected]> wrote:

During the review of the recent PR to remove use of the init_container from 
kube pods as created by the Kubernetes back-end, the topic of documenting the 
"API" for these container images also came up. What information does the 
back-end provide to these containers? In what form? What assumptions does the 
back-end make about the structure of these containers?  This information is 
important in a scenario where a user wants to create custom images, 
particularly if these are not based on the reference dockerfiles.

 

A related topic is deciding what such an API should look like.  For example, 
early incarnations were based more purely on environment variables, which could 
have advantages in terms of an API that is easy to describe in a document.  If 
we document the current API, should we annotate it as Experimental?  If not, 
does that effectively freeze the API?

 

We are interested in community input about possible customization use cases and 
opinions on possible API designs!

Cheers,

Erik

-- 

Twitter: https://twitter.com/holdenkarau
