Re: [SPARK-34738] issues w/k8s+minikube and PV tests

2021-04-15 Thread Rob Vesse
There’s at least one test (the persistent volumes one) that relies on Minikube-specific 
functionality. We run integration tests for our $dayjob Spark image builds using Docker 
for Desktop instead, and that one test fails for exactly that reason.  The test could be 
refactored, because I think it’s just adding a minimal Ceph cluster to the K8S cluster, 
which can in principle be done on any K8S cluster.
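
For reference, a sketch of pointing the official integration tests at a non-Minikube 
cluster; the script path and the --deploy-mode/--spark-tgz options are assumptions that 
should be checked against the integration tests README:

./resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh \
  --deploy-mode docker-desktop \
  --spark-tgz /path/to/spark-bin.tgz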

 

Rob

 

From: shane knapp ☠ 
Date: Wednesday, 14 April 2021 at 18:56
To: Frank Luo 
Cc: dev , Brian K Shiratsuki 
Subject: Re: [SPARK-34738] issues w/k8s+minikube and PV tests

 

On Wed, Apr 14, 2021 at 10:32 AM Frank Luo  wrote:

Is there any hard dependency on minikube (e.g. GPU settings)? kind 
(https://kind.sigs.k8s.io/) is a more stable and simpler k8s cluster environment on a 
single machine (it only requires Docker), and it has been widely used by k8s projects 
for testing.

 

there are no hard deps on minikube...  it installs happily and successfully 
runs every integration test except for persistent volumes.

 

i haven't tried kind yet, but my time is super limited on this and i'd rather 
not venture down another rabbit hole unless we absolutely have to.

 



Re: docker image distribution in Kubernetes cluster

2021-12-08 Thread Rob Vesse
So the point Khalid was trying to make is that there are legitimate reasons you 
might use different container images for the driver pod vs the executor pod.  
It has nothing to do with Docker versions.

 

Since the bulk of the actual work happens on the executors you may want additional 
libraries, tools or software in that image that your job code can call.  The same 
software may be entirely unnecessary on the driver, allowing you to use a smaller image 
for the driver than for the executors.

 

As a practical example, for an ML use case you might want the optional Intel MKL or 
OpenBLAS dependencies, which can significantly bloat the size of your container image 
(by hundreds of megabytes) and would only be needed by the executor pods.
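
For example, assuming the user publishes two images (the image names here are 
hypothetical), the submission might look roughly like:

spark-submit \
  --conf spark.kubernetes.driver.container.image=example.com/repo/spark-driver:v1.0.0 \
  --conf spark.kubernetes.executor.container.image=example.com/repo/spark-executor-mkl:v1.0.0 \
  ...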

 

Rob

 

From: Mich Talebzadeh 
Date: Wednesday, 8 December 2021 at 17:42
To: Khalid Mammadov 
Cc: "user @spark" , Spark dev list 
Subject: Re: docker image distribution in Kubernetes cluster

 

Thanks Khalid for your notes
 

I have not come across a use case where the docker version on the driver and 
executors needs to be different.

 

My thinking is that spark.kubernetes.executor.container.image is the correct reference, 
since in Kubernetes "container" is the correct terminology, and both the driver and 
executors are Spark-specific.

 

cheers

 

 


 


 

 

 

On Wed, 8 Dec 2021 at 11:47, Khalid Mammadov  wrote:

Hi Mitch

 

IMO, it's done to provide the most flexibility. Some users may want a limited/restricted 
version of the image, or one with additional software that they use on the executors 
during processing.

 

So, in your case you only need to provide the first one, since the other two configs 
will be copied from it.
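
For instance, a minimal sketch (reusing the ${IMAGEGCP} variable from the command quoted 
further down in this thread), providing only the shared image:

spark-submit --verbose \
  --conf spark.kubernetes.container.image=${IMAGEGCP} \
  ...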

 

Regards

Khalid

 

On Wed, 8 Dec 2021, 10:41 Mich Talebzadeh,  wrote:

Just a correction: the Spark 3.2 documentation states the following:

 

Property Name: spark.kubernetes.container.image
Default: (none)
Meaning: Container image to use for the Spark application. This is usually of the form 
example.com/repo/spark:v1.0.0. This configuration is required and must be provided by 
the user, unless explicit images are provided for each different container type.
Since: 2.3.0

Property Name: spark.kubernetes.driver.container.image
Default: (value of spark.kubernetes.container.image)
Meaning: Custom container image to use for the driver.
Since: 2.3.0

Property Name: spark.kubernetes.executor.container.image
Default: (value of spark.kubernetes.container.image)
Meaning: Custom container image to use for executors.
Since: 2.3.0
So both the driver and executor images are mapped to the container image by default. In 
my opinion they are redundant and will potentially add confusion, so should they be 
removed?

 


 


 

 

 

On Wed, 8 Dec 2021 at 10:15, Mich Talebzadeh  wrote:

Hi,

 

We have three conf parameters to distribute the docker image with spark-submit in a 
Kubernetes cluster.

 

These are

 

spark-submit --verbose \

  --conf spark.kubernetes.driver.docker.image=${IMAGEGCP} \

  --conf spark.kubernetes.executor.docker.image=${IMAGEGCP} \

  --conf spark.kubernetes.container.image=${IMAGEGCP} \

 

when the above is run, it shows

 

(spark.kubernetes.driver.docker.image,eu.gcr.io/axial-glow-224522/spark-py:3.1.1-scala_2.12-8-jre-slim-buster-addedpackages)

(spark.kubernetes.executor.docker.image,eu.gcr.io/axial-glow-224522/spark-py:3.1.1-scala_2.12-8-jre-slim-buster-addedpackages)

(spark.kubernetes.container.image,eu.gcr.io/axial-glow-224522/spark-py:3.1.1-scala_2.12-8-jre-slim-buster-addedpackages)

 

You will notice that I am using the same docker image for the driver, executor and 
container. In Spark 3.2 (actually in recent Spark versions), I cannot see any reference 
to the driver or executor properties. Are these deprecated? It appears that Spark still 
accepts them?

 

Thanks


 

 


 


 

 

 

 



Re: In Kubernetes Must specify the driver container image

2021-12-10 Thread Rob Vesse
Mich

 

I think you may just have a typo in your configuration.

 

These properties all have container in the name, e.g. 
spark.kubernetes.driver.container.image, BUT you seem to be replacing container 
with docker in your configuration files so Spark doesn’t recognise the property 
(i.e. you have spark.kubernetes.driver.docker.image which isn’t a valid 
property)
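
For reference, a corrected sketch of the earlier submission, reusing the ${IMAGEGCP} 
variable from the original command; only the property names change:

spark-submit --verbose \
  --conf spark.kubernetes.container.image=${IMAGEGCP} \
  --conf spark.kubernetes.driver.container.image=${IMAGEGCP} \
  --conf spark.kubernetes.executor.container.image=${IMAGEGCP} \
  ...

And per the earlier thread, the first property alone is sufficient, since the driver and 
executor images default to it.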

 

Hope this helps,

 

Rob

 

From: Mich Talebzadeh 
Date: Friday, 10 December 2021 at 08:57
To: Spark dev list 
Subject: In Kubernetes Must specify the driver container image

 


Hi,

 

In the Spark Kubernetes configuration documentation, it states:

 

spark.kubernetes.container.image, default None, meaning: Container image to use 
for the Spark application. This is usually of the form 
example.com/repo/spark:v1.0.0. This configuration is required and must be 
provided by the user, unless explicit images are provided for each different 
container type.

 

I interpret this as: if you specify both the driver and executor container images, then 
you don't need to specify the container image itself. However, if both the driver and 
executor images are provided with NO container image, the job fails.

 

Spark config:

(spark.kubernetes.driver.docker.image,eu.gcr.io/axial-glow-224522/spark-py:3.1.1-scala_2.12-8-jre-slim-buster-container)

(spark.kubernetes.executor.docker.image,eu.gcr.io/axial-glow-224522/spark-py:3.1.1-scala_2.12-8-jre-slim-buster-container)

 

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

21/12/10 08:24:03 INFO SparkKubernetesClientFactory: Auto-configuring K8S 
client using current context from users K8S config file

Exception in thread "main" org.apache.spark.SparkException: Must specify the 
driver container image

 

It sounds like, regardless, you still have to specify the container image explicitly.

 

HTH

 


 


 



JDK vs JRE in Docker Images

2019-04-17 Thread Rob Vesse
Folks

 

For those using the Kubernetes support and building custom images: are you using a JDK 
or a JRE in the container images?

 

Using a JRE saves a reasonable chunk of image size (about 50MB with our 
preferred Linux distro) but I didn’t want to make this change if there was a 
reason to have a JDK available.  Certainly the official project integration 
tests run just fine with a JRE-based image.

 

Currently the project's official Dockerfiles use openjdk:8-alpine as a base, which 
includes a full JDK, so I didn't know if that was intentional or just convenience?

 

Thanks,

 

Rob



FW: JDK vs JRE in Docker Images

2019-04-18 Thread Rob Vesse
Sean

Thanks for the pointers.

Janino specifically says it only requires a JRE - 
https://janino-compiler.github.io/janino/#requirements

As for scalac, I can't find a specific reference anywhere; it appears to be 
self-contained AFAICT.

Rob

On 17/04/2019, 18:56, "Sean Owen"  wrote:

I confess I don't know, but I don't think scalac or janino need javac
and related tools, and those are the only things that come to mind. If
the tests pass without a JDK, that's good evidence.

On Wed, Apr 17, 2019 at 8:49 AM Rob Vesse  wrote:
>
> Folks
>
>
>
> For those using the Kubernetes support and building custom images are 
you using a JDK or a JRE in the container images?
>
>
>
> Using a JRE saves a reasonable chunk of image size (about 50MB with 
our preferred Linux distro) but I didn’t want to make this change if there was 
a reason to have a JDK available.  Certainly the official project integration 
tests run just fine with a JRE based image
>
>
>
> Currently the projects official Docker files use openjdk:8-alpine as 
a base which includes a full JDK so didn’t know if that was intentional or just 
convenience?
>
>
>
> Thanks,
>
>
>
> Rob




Re: Spark build can't find javac

2019-04-30 Thread Rob Vesse
I have seen issues with some versions of the Scala Maven plugin auto-detecting 
the wrong JAVA_HOME when both a JRE and a JDK are present on the system.  Setting 
JAVA_HOME explicitly to a JDK skips the plugin's auto-detect logic and avoids 
the problem.
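
For example, a sketch (the JDK path is illustrative; point it at whichever JDK install 
you actually have):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
./build/mvn -DskipTests clean package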

 

This may be related - https://github.com/davidB/scala-maven-plugin/pull/227 and 
https://github.com/davidB/scala-maven-plugin/issues/221 

Rob

 

From: Sean Owen 
Date: Tuesday, 30 April 2019 at 00:18
To: Shmuel Blitz 
Cc: dev 
Subject: Re: Spark build can't find javac

 

Your JAVA_HOME is pointing to a JRE rather than JDK installation. Or you've 
actually installed the JRE. Only the JDK has javac, etc.

 

On Mon, Apr 29, 2019 at 4:36 PM Shmuel Blitz  
wrote:

Hi,

 

Trying to build Spark on Manjaro with OpenJDK version 1.8.0_212, and I'm 
getting the following error:

 

Cannot run program "/usr/lib/jvm/java-8-openjdk/jre/bin/javac": error=2, No 
such file or directory

> which javac

/usr/bin/javac

 

Only when I set JAVA_HOME as follows do I get it to run:

> export JAVA_HOME=/usr/lib/jvm/default

 

 

Any idea what the issue is?

-- 

Shmuel Blitz 
Data Analysis Team Leader 
Email: shmuel.bl...@similarweb.com 
www.similarweb.com 
 



Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-22 Thread Rob Vesse
The difficulty with a custom Spark config is that you need to be careful that 
the Spark config the user provides does not conflict with the auto-generated 
portions of the Spark config necessary to make Spark on K8S work.  So part of 
any “API” definition might need to be what Spark config is considered “managed” 
by the Kubernetes scheduler backend.

 

For more controlled environments - i.e. security conscious - allowing end users 
to provide custom images may be a non-starter, so the more we can do at the 
“API” level without customising the containers the better.  A practical example 
of this is managing Python dependencies: one option we’re considering is having 
a base image with Anaconda included, projecting a Conda environment spec into the 
containers (via volume mounts), and then having the container recreate that Conda 
environment on startup.  That won’t work for all possible environments, e.g. those 
that use non-standard Conda channels, but it would provide a lot of capability 
without customising the images.
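
As a rough sketch of what that startup step might look like (all names and paths here 
are assumptions, not part of any existing Spark image):

#!/bin/bash
# Recreate a Conda environment from a spec projected into the pod via a volume mount,
# then hand off to the image's normal entrypoint (/opt/entrypoint.sh is assumed here).
set -e
ENV_SPEC=${CONDA_ENV_SPEC:-/opt/conda-env/environment.yml}
if [ -f "$ENV_SPEC" ]; then
  conda env create --name job-env --file "$ENV_SPEC"
  source activate job-env
fi
exec /opt/entrypoint.sh "$@"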

 

Rob

 

From: Felix Cheung 
Date: Thursday, 22 March 2018 at 06:21
To: Holden Karau , Erik Erlandson 
Cc: dev 
Subject: Re: Toward an "API" for spark images used by the Kubernetes back-end

 

I like being able to customize the docker image itself - but I realize this 
thread is more about “API” for the stock image.

 

Environment is nice. Probably we need a way to set custom spark config (as a 
file??)

 

 

From: Holden Karau 
Sent: Wednesday, March 21, 2018 10:44:20 PM
To: Erik Erlandson
Cc: dev
Subject: Re: Toward an "API" for spark images used by the Kubernetes back-end 

 

I’m glad this discussion is happening on dev@ :)

 

Personally I like customizing with shell env variables when rolling my own image, but 
definitely documenting the expectations/usage of the variables is needed before we can 
really call it an API.

 

On the related question, I suspect two of the more “common” likely customizations are 
adding additional jars for bootstrapping fetching from a DFS & also similarly 
complicated Python dependencies (although given the Python support isn’t merged yet 
it’s hard to say what exactly this would look like).

 

I could also see some vendors wanting to add some bootstrap/setup scripts to 
fetch keys or other things.

 

What other ways do folks foresee customizing their Spark docker containers? 

 

On Wed, Mar 21, 2018 at 5:04 PM Erik Erlandson  wrote:

During the review of the recent PR to remove use of the init_container from 
kube pods as created by the Kubernetes back-end, the topic of documenting the 
"API" for these container images also came up. What information does the 
back-end provide to these containers? In what form? What assumptions does the 
back-end make about the structure of these containers?  This information is 
important in a scenario where a user wants to create custom images, 
particularly if these are not based on the reference dockerfiles.

 

A related topic is deciding what such an API should look like.  For example, 
early incarnations were based more purely on environment variables, which could 
have advantages in terms of an API that is easy to describe in a document.  If 
we document the current API, should we annotate it as Experimental?  If not, 
does that effectively freeze the API?

 

We are interested in community input about possible customization use cases and 
opinions on possible API designs!

Cheers,

Erik

-- 

Twitter: https://twitter.com/holdenkarau



Re: Build issues with apache-spark-on-k8s.

2018-03-29 Thread Rob Vesse
Kubernetes support was only added as an experimental feature in Spark 2.3.0

 

It does not exist in the Apache Spark branch-2.2

 

If you really must build for Spark 2.2 you will need to use 
branch-2.2-kubernetes from the apache-spark-on-k8s fork on GitHub
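
A rough sketch of building from the fork (repository and branch as described above; the 
build flags follow the pattern shown elsewhere in this thread):

git clone https://github.com/apache-spark-on-k8s/spark.git
cd spark
git checkout branch-2.2-kubernetes
build/mvn -Pkubernetes -DskipTests clean package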

 

Note that there are various functional and implementation differences between 
the fork and what is currently integrated into Spark so please ensure you refer 
to the official/fork documentation as appropriate

 

Rob

 

From: Atul Sowani 
Date: Thursday, 29 March 2018 at 11:27
To: Anirudh Ramanathan 
Cc: Lucas Kacher , dev 
Subject: Re: Build issues with apache-spark-on-k8s.

 

Thanks all for responding and helping me with the build issue. I tried building 
the code at git://github.com/apache/spark.git (master branch) in my ppc64le 
Ubuntu 16.04 VM and it failed. I tried building a specific branch (branch-2.2) 
using following command:

 

build/mvn -DskipTests -Pkubernetes clean package install

 

This builds successfully, but again I do not see the "dockerfiles" and "jars" 
directories anywhere. This behaviour is exactly the same as observed with the source 
code at https://github.com/apache-spark-on-k8s/spark

 

Any advice on how to proceed with this? As far as possible, I need to build v2.2.

 

Thanks,

Atul.

 

 

 

On Wed, Mar 28, 2018 at 8:06 PM, Anirudh Ramanathan  
wrote:

As Lucas said, those directories are generated and copied when you run a full 
maven build with the -Pkubernetes flag specified (or use instructions in  
https://spark.apache.org/docs/latest/building-spark.html#building-a-runnable-distribution).
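
For example, a sketch based on the linked instructions (check the docs for the 
authoritative flags):

./dev/make-distribution.sh --name custom-spark --tgz -Pkubernetes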

 

Also, using the Kubernetes integration in the  main Apache Spark project is 
recommended. The fork https://github.com/apache-spark-on-k8s/spark/ will be 
retired once we finish upstreaming all those features in Spark 2.4. 

 

 

On Wed, Mar 28, 2018, 6:42 AM Lucas Kacher  wrote:

Are you building on the fork or on the official release now? I built v2.3.0 
from source w/out issue. One thing I noticed is that I needed to run the 
build-image command from the bin which was placed in dist/ as opposed to the 
one in the repo (as that's how it copies the necessary targets).

(Failed to reply-all to the list).

 

On Wed, Mar 28, 2018 at 4:30 AM, Atul Sowani  wrote:

Hi,

 

I built apache-spark-on-k8s from source on Ubuntu 16.04 and it got built 
without errors. Next, I wanted to create docker images, so as explained at 
https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html I 
used sbin/build-push-docker-images.sh to create those. While using this script 
I came across 2 issues:

 

1. It references a "dockerfiles" directory which should be in "spark"; however, this 
directory is missing. I created a "dockerfiles" directory and copied the Dockerfiles 
from resource-managers/kubernetes/docker-minimal-bundle

 

2. The spark-base dockerfile expects some JAR files to be present in a directory called 
"jars" - this directory is missing. I tried rebuilding the code but this directory is 
not getting generated, if it is supposed to be.

 

My question is whether this is a genuine/known issue, or am I missing some build 
steps?

 

Thanks,

Atul.

 



 

-- 

Lucas Kacher
Senior Engineer
-
vsco.co

New York, NY

818.512.5239

 



[DISCUSS][K8S] Supporting advanced pod customisation

2018-09-19 Thread Rob Vesse
Hey all

 

For those following the K8S backend you are probably aware of SPARK-24434 [1] 
(and PR 22146 [2]) which proposes a mechanism to allow for advanced pod 
customisation via pod templates.  This is motivated by the fact that 
introducing additional Spark configuration properties for each aspect of pod 
specification a user might wish to customise was becoming unwieldy.

 

However I am concerned that the current implementation doesn’t go far enough 
and actually limits the utility of the proposed new feature.  The problem stems 
from the fact that the implementation simply uses the pod template as a base 
and then Spark attempts to build a pod spec on top of that.  As the code that 
does this doesn’t do any kind of validation or inspection of the incoming 
template it is possible to provide a template that causes Spark to generate an 
invalid pod spec ultimately causing the job to be rejected by Kubernetes.
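
For context, the PR drives this via configuration properties that point at a template 
file, roughly as follows (property names as proposed in the PR and subject to change 
before merge; the template paths are hypothetical):

spark-submit \
  --conf spark.kubernetes.driver.podTemplateFile=/path/to/driver-template.yaml \
  --conf spark.kubernetes.executor.podTemplateFile=/path/to/executor-template.yaml \
  ...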

 

Now clearly Spark code cannot attempt to account for every possible 
customisation that a user may attempt to make via pod templates nor should it 
be responsible for ensuring that the user doesn’t start from an invalid 
template in the first place.  However it seems like we could be more 
intelligent in how we build our pod specs to avoid generating invalid specs in 
cases where we have a clear use case for advanced customisation.  For example 
the current implementation does not allow users to customise the volumes used 
to back SPARK_LOCAL_DIRS to better suit the compute environment the K8S cluster 
is running on and trying to do so with a pod template will result in an invalid 
spec due to duplicate volumes.

 

I think there are a few ways the community could address this:

 
1. Status quo – provide the pod template feature as-is and simply tell users that 
certain customisations are never supported and may result in invalid pod specs
2. Provide the ability for advanced users to explicitly skip pod spec building 
steps they know interfere with their pod templates via configuration properties
3. Modify the pod spec building code to be aware of known desirable user 
customisation points and avoid generating invalid specs in those cases
 

Currently committers seem to be going for Option 1.  Personally I would like to 
see the community adopt Option 3, but I have already received considerable 
pushback when I proposed that in one of my PRs, hence the suggestion of the 
compromise Option 2.  Yes, this still has the possibility of ending up with 
invalid specs if users are over-zealous in the spec building steps they disable, 
but since this is a power user feature I think this is a risk power users 
would be willing to assume.  If we are going to provide features for power 
users we should avoid unnecessarily limiting the utility of those features.

 

What do other K8S folks think about this issue?

 

Thanks,

 

Rob

 

[1] https://issues.apache.org/jira/browse/SPARK-24434

[2] https://github.com/apache/spark/pull/22146

 



[DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-05 Thread Rob Vesse
Folks

 

One of the big limitations of the current Spark on K8S implementation is that 
it isn’t possible to use local dependencies (SPARK-23153 [1]) i.e. code, JARs, 
data etc that only live on the submission client.  This basically leaves end 
users with several options on how to actually run their Spark jobs under K8S:

 
1. Store local dependencies on some external distributed file system e.g. HDFS
2. Build custom images with their local dependencies
3. Mount local dependencies into volumes that are mounted by the K8S pods
 

In all cases the onus is on the end user to do the prep work.  Option 1 is 
unfortunately rare in the environments where we’re looking to deploy Spark, and Option 
2 tends to be a non-starter as many of our customers whitelist approved images, 
i.e. custom images are not permitted.

 

Option 3 is more workable but still requires users to provide a bunch of 
extra config options for simple cases, or to rely upon the 
pending pod template feature for complex cases.
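
To illustrate the amount of extra configuration Option 3 entails, mounting dependencies 
from a hostPath volume looks roughly like this (the volume name and paths are 
hypothetical; the exact property keys should be checked against the Spark on K8S volume 
documentation):

spark-submit \
  --conf spark.kubernetes.driver.volumes.hostPath.deps.mount.path=/opt/spark-deps \
  --conf spark.kubernetes.driver.volumes.hostPath.deps.options.path=/opt/spark-deps \
  --conf spark.kubernetes.executor.volumes.hostPath.deps.mount.path=/opt/spark-deps \
  --conf spark.kubernetes.executor.volumes.hostPath.deps.options.path=/opt/spark-deps \
  ...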

 

Ideally this would all just be handled automatically for users in the way that 
all other resource managers do it.  The K8S backend even did this at one point in 
the downstream fork, but after a long discussion [2] this got dropped in favour 
of using Spark standard mechanisms i.e. spark-submit.  Unfortunately this 
apparently was never followed through upon, as it doesn’t work with master as of 
today.  Moreover I am unclear how this would work in the case of Spark on K8S 
cluster mode where the driver itself is inside a pod, since the spark-submit 
mechanism is based upon copying from the driver's filesystem to the executors 
via a file server on the driver; if the driver is inside a pod it won’t be able 
to see local files on the submission client.  I think this may work out of the 
box with client mode but I haven’t dug into that enough to verify yet.

 

I would like to start work on addressing this problem but to be honest I am 
unclear where to start with this.  It seems using the standard spark-submit 
mechanism is the way to go but I’m not sure how to get around the driver pod 
issue.  I would appreciate any pointers from folks who’ve looked at this 
previously on how and where to start on this.

 

Cheers,

 

Rob

 

[1] https://issues.apache.org/jira/browse/SPARK-23153

[2] 
https://lists.apache.org/thread.html/82b4ae9a2eb5ddeb3f7240ebf154f06f19b830f8b3120038e5d687a1@%3Cdev.spark.apache.org%3E



Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-08 Thread Rob Vesse
Folks, thanks for all the great input. Responding to various points raised:

 

Marcelo/Yinan/Felix – 

 

Yes, client mode will work.  The main JAR will be automatically distributed and 
--jars/--files specified dependencies are also distributed, though for --files 
user code needs to use the appropriate Spark APIs to resolve the actual path, 
i.e. SparkFiles.get().

 

However client mode can be awkward if you want to mix spark-submit distribution 
with mounting dependencies via volumes since you may need to ensure that 
dependencies appear at the same path both on the local submission client and 
when mounted into the executors.  This mainly applies to the case where user 
code does not use SparkFiles.get() and simply tries to access the path directly.

 

Marcelo/Stavros – 

 

Yes I did give the other resource managers too much credit.  From my past 
experience with Mesos and Standalone I had thought this wasn’t an issue but 
going back and looking at what we did for both of those it appears we were 
entirely reliant on the shared file system (whether HDFS, NFS or other POSIX 
compliant filesystems e.g. Lustre).

 

Since connectivity back to the client is a potential stumbling block for 
cluster mode, I wonder if it would be better to think in reverse, i.e. rather 
than having the driver pull from the client, have the client push to the driver 
pod?

 

You can do this manually yourself via kubectl cp, so it should be possible to 
do this programmatically since it looks like this is just a tar piped into a 
kubectl exec.  This would keep the relevant logic in the Kubernetes-specific 
client, which may or may not be desirable depending on whether we’re looking to 
just fix this for K8S or more generally.  Of course there is probably a fair 
bit of complexity in making this work, but does that sound like something worth 
exploring?
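
For illustration, the manual equivalent (the pod name, namespace and paths here are 
hypothetical):

kubectl cp ./my-deps.jar default/my-spark-driver-pod:/opt/spark/work-dir/my-deps.jar
# which is roughly what kubectl cp does under the hood:
tar cf - my-deps.jar | kubectl exec -i my-spark-driver-pod -- tar xf - -C /opt/spark/work-dir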

 

I hadn’t really considered the HA aspect; a first step would be to get the 
basics working and then look at HA.  Although if the above theoretical approach 
is practical, that could simply be part of restarting the driver.

 

Rob

 

 

From: Felix Cheung 
Date: Sunday, 7 October 2018 at 23:00
To: Yinan Li , Stavros Kontopoulos 

Cc: Rob Vesse , dev 
Subject: Re: [DISCUSS][K8S] Local dependencies with Kubernetes

 

Jars and libraries only accessible locally at the driver is fairly limited? 
Don’t you want the same on all executors?

 

 

 

From: Yinan Li 
Sent: Friday, October 5, 2018 11:25 AM
To: Stavros Kontopoulos
Cc: rve...@dotnetrdf.org; dev
Subject: Re: [DISCUSS][K8S] Local dependencies with Kubernetes 

 

> Just to be clear: in client mode things work right? (Although I'm not
really familiar with how client mode works in k8s - never tried it.) 

 

If the driver runs on the submission client machine, yes, it should just work. 
If the driver runs in a pod, however, it faces the same problem as in cluster 
mode.

 

Yinan

 

On Fri, Oct 5, 2018 at 11:06 AM Stavros Kontopoulos 
 wrote:

@Marcelo is correct. Mesos does not have something similar. Only YARN does, due 
to the distributed cache thing. 

I have described most of the above in the JIRA; there are also some other 
options.

 

Best,

Stavros

 

On Fri, Oct 5, 2018 at 8:28 PM, Marcelo Vanzin  
wrote:

On Fri, Oct 5, 2018 at 7:54 AM Rob Vesse  wrote:
> Ideally this would all just be handled automatically for users in the way 
> that all other resource managers do

I think you're giving other resource managers too much credit. In
cluster mode, only YARN really distributes local dependencies, because
YARN has that feature (its distributed cache) and Spark just uses it.

Standalone doesn't do it (see SPARK-4160) and I don't remember seeing
anything similar on the Mesos side.

There are things that could be done; e.g. if you have HDFS you could
do a restricted version of what YARN does (upload files to HDFS, and
change the "spark.jars" and "spark.files" URLs to point to HDFS
instead). Or you could turn the submission client into a file server
that the cluster-mode driver downloads files from - although that
requires connectivity from the driver back to the client.
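
(For illustration, a rough sketch of that HDFS-based workaround, with hypothetical
paths:

hdfs dfs -put ./my-deps.jar /spark-deps/my-deps.jar
spark-submit --deploy-mode cluster \
  --jars hdfs:///spark-deps/my-deps.jar \
  ...)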

Neither is great, but better than not having that feature.

Just to be clear: in client mode things work right? (Although I'm not
really familiar with how client mode works in k8s - never tried it.)

-- 
Marcelo




 



Re: [DISCUSS][K8S] Local dependencies with Kubernetes

2018-10-08 Thread Rob Vesse
Well yes.  However the submission client is already able to monitor the driver 
pod status, so it can see when it is up and running.  And couldn’t we potentially 
modify the K8S entry points, e.g. KubernetesClientApplication, that run inside the 
driver pods to wait for dependencies to be uploaded?

 

I guess at this stage I am just throwing ideas out there and trying to figure 
out what’s practical/reasonable

 

Rob

 

From: Yinan Li 
Date: Monday, 8 October 2018 at 17:36
To: Rob Vesse 
Cc: dev 
Subject: Re: [DISCUSS][K8S] Local dependencies with Kubernetes

 

However, the pod must be up and running for this to work. So if you want to use 
this to upload dependencies to the driver pod, the driver pod must already be 
up and running. So you may not even have a chance to upload the dependencies at 
this point.



Re: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

2018-10-16 Thread Rob Vesse
Right now the Kerberos support for Spark on K8S is only on master AFAICT i.e. 
the feature is not present on branch-2.4 

 

Therefore I don’t see any point in adding the tests into branch-2.4 unless the 
plan is to also merge the Kerberos support to branch-2.4

 

Rob

 

From: Erik Erlandson 
Date: Tuesday, 16 October 2018 at 16:47
To: dev 
Subject: [DISCUSS][K8S][TESTS] Include Kerberos integration tests for Spark 2.4

 

I'd like to propose including integration testing for Kerberos on the Spark 2.4 
release:

https://github.com/apache/spark/pull/22608

 

Arguments in favor:

1) it improves testing coverage on a feature important for integrating with 
HDFS deployments

2) its intersection with existing code is small - it consists primarily of new 
testing code, with a bit of refactoring into 'main' and 'test' sub-trees. These 
new tests appear stable.

3) Spark 2.4 is still in RC, with outstanding correctness issues.

 

The argument 'against' that I'm aware of would be the relatively large size of 
the PR. I believe this is considered above, but am soliciting community 
feedback before committing.

Cheers,

Erik