Thanks for the in-depth explanation. These methods would require us to architect our Server around Spark, but it is actually designed to be independent of the ML implementation. SparkML is an important algorithm source, to be sure, but so are TensorFlow and non-Spark Python libraries, among others. So Spark stays at arm's length in a microservices pattern. Doing this with access to job status and management is why Livy and the (Spark) Job Server exist. To us the ideal is treating Spark like a compute server that responds to a service API for job submission and control.
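To make that concrete, here is a rough sketch of what we mean by submitting through a service API, using Livy's batch REST endpoint. The Livy host, jar location, and class name are hypothetical stand-ins for your own deployment:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LivySubmit {
  def main(args: Array[String]): Unit = {
    // Hypothetical Livy endpoint; adjust for your deployment.
    val livyUrl = "http://livy.internal:8998/batches"

    // Livy's batch API takes the application jar, main class, and
    // Spark conf as a JSON document. Paths and names here are placeholders.
    val payload =
      """{
        |  "file": "hdfs:///apps/ml-server/train-job.jar",
        |  "className": "com.example.TrainJob",
        |  "conf": { "spark.executor.instances": "4" }
        |}""".stripMargin

    val request = HttpRequest.newBuilder(URI.create(livyUrl))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(payload))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())

    // Livy responds with a batch id that can be polled at /batches/{id}/state,
    // which is the job tracking and management piece we keep referring to.
    println(response.body())
  }
}

The point is that the server holds no Spark dependency in-process; it only speaks REST, and the job master behind Livy can change without touching our code.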
None of the above is solved by Spark on k8s. Further, we find that the Spark Programmatic API does not support deploy mode = "cluster". This means we have to take a simple part of our code and partition it into new jars just to get spark-submit to work. To help with job tracking and management when you are not using the Programmatic API, we look to Livy. If you ask our opinion of spark-submit, we'd (selfishly) say it hides architectural issues that should be solved in the Spark Programmatic API, but the popularity of spark-submit is leading the community to avoid these issues, or simply not see or care about them. I guess we'll see if Spark behind Livy gives us what we want.

Maybe this is unusual, but we see Spark as a service, not an integral platform. We also see Kubernetes as very important but optional: useful for HA or when you want to scale horizontally, basically when vertical scaling is not sufficient. Vertical scaling is more cost effective, so Docker Compose is a nice solution for simpler, Kubernetes-less deployments. So if we are agnostic about the job master and communicate through Livy, we are back to orchestrating services with Docker and Kubernetes. If k8s becomes a super-duper job master, great! But it doesn't solve today's question.

From: Matt Cheah <mch...@palantir.com>
Reply: Matt Cheah <mch...@palantir.com>
Date: July 1, 2019 at 5:14:05 PM
To: Pat Ferrel <p...@occamsmachete.com>, user@spark.apache.org
Subject: Re: k8s orchestrating Spark service

> We'd like to deploy Spark Workers/Executors and Master (whatever master is easiest to talk about since we really don't care) in pods as we do with the other services we use. Replace Spark Master with k8s if you insist. How do the executors get deployed?

When running Spark against Kubernetes natively, the Spark library handles requesting executors from the API server. So presumably one would only need to know how to start the driver in the cluster: maybe spark-operator, spark-submit, or just starting the pod and making a Spark context in client mode with the right parameters. From there, the Spark scheduler code knows how to interface with the API server and request executor pods according to the resource requests configured in the app.

> We have a Machine Learning Server. It submits various jobs through the Spark Scala API. The Server is run in a pod deployed from a chart by k8s. It later uses the Spark API to submit jobs. I guess we find spark-submit to be a roadblock to our use of Spark and the k8s support is fine, but how do you run our Driver and Executors considering that the Driver is part of the Server process?

It depends on how the server runs the jobs:

- If each job is meant to be a separate forked driver pod / process: The ML server code can use the SparkLauncher API <https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/SparkLauncher.html> and configure the Spark driver through that API. Set the master to point to the Kubernetes API server and set the parameters for credentials according to your setup. SparkLauncher is a thin layer on top of spark-submit; a Spark distribution has to be packaged with the ML server image, and SparkLauncher would point to the spark-submit script in said distribution. [See the sketch just after this message.]

- If all jobs run inside the same driver, that being the ML server: One has to start the ML server with the right parameters to point to the Kubernetes master. Since the ML server is a driver, one has the option to use spark-submit or SparkLauncher to deploy the ML server itself. Alternatively, one can use a custom script to start the ML server, and then the ML server process has to create a SparkContext object parameterized against the Kubernetes server in question. [A client-mode sketch appears further down the thread.]

I hope this helps!

-Matt Cheah
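A minimal sketch of the first option above, the SparkLauncher path. The Spark home, master URL, image name, jar path, and class name are hypothetical; the jar uses the local:// scheme because, in cluster mode on k8s, the application resource must already be inside the container image:

import org.apache.spark.launcher.SparkLauncher

object ForkedDriverSubmit {
  def main(args: Array[String]): Unit = {
    // SparkLauncher wraps the spark-submit script, so a Spark distribution
    // must be packaged into the ML server image for setSparkHome to point at.
    val handle = new SparkLauncher()
      .setSparkHome("/opt/spark")
      .setMaster("k8s://https://kubernetes.default.svc:443")
      .setDeployMode("cluster")
      .setAppResource("local:///opt/app/train-job.jar")
      .setMainClass("com.example.TrainJob")
      .setConf("spark.executor.instances", "4")
      .setConf("spark.kubernetes.container.image", "example/spark:2.4.3")
      .startApplication() // returns a SparkAppHandle for monitoring

    // SparkAppHandle exposes coarse state transitions (SUBMITTED, RUNNING, ...),
    // which gives the forking server a handle on job status without Livy.
    println(s"State: ${handle.getState}")
  }
}

Each call to startApplication() forks a new driver pod, so job lifecycles stay isolated from the server process.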
*From: *Pat Ferrel <p...@occamsmachete.com>
*Date: *Monday, July 1, 2019 at 5:05 PM
*To: *"user@spark.apache.org" <user@spark.apache.org>, Matt Cheah <mch...@palantir.com>
*Subject: *Re: k8s orchestrating Spark service

We have a Machine Learning Server. It submits various jobs through the Spark Scala API. The Server is run in a pod deployed from a chart by k8s. It later uses the Spark API to submit jobs. I guess we find spark-submit to be a roadblock to our use of Spark, and the k8s support is fine, but how do you run our Driver and Executors considering that the Driver is part of the Server process?

Maybe we are talking past each other with some mistaken assumptions (on my part perhaps).

From: Pat Ferrel <p...@occamsmachete.com>
Reply: Pat Ferrel <p...@occamsmachete.com>
Date: July 1, 2019 at 4:57:20 PM
To: user@spark.apache.org, Matt Cheah <mch...@palantir.com>
Subject: Re: k8s orchestrating Spark service

k8s as master would be nice, but it doesn't solve the problem of running the full cluster and is an orthogonal issue. We'd like to deploy Spark Workers/Executors and Master (whatever master is easiest to talk about since we really don't care) in pods as we do with the other services we use. Replace Spark Master with k8s if you insist. How do the executors get deployed?

We have our own containers that almost work for 2.3.3. We have used this before with older Spark, so we are reasonably sure it makes sense. We just wonder if our own image builds and charts are the best starting point. Does anyone have something they like?

From: Matt Cheah <mch...@palantir.com>
Reply: Matt Cheah <mch...@palantir.com>
Date: July 1, 2019 at 4:45:55 PM
To: Pat Ferrel <p...@occamsmachete.com>, user@spark.apache.org
Subject: Re: k8s orchestrating Spark service

Sorry, I don't quite follow. Why use the Spark standalone cluster as an in-between layer when one can just deploy the Spark application directly inside the Helm chart? I'm curious as to what the use case is, since I'm wondering if there's something we can improve with respect to the native integration with Kubernetes here. Deploying on Spark standalone mode in Kubernetes is, to my understanding, meant to be superseded by the native integration introduced in Spark 2.4.

*From: *Pat Ferrel <p...@occamsmachete.com>
*Date: *Monday, July 1, 2019 at 4:40 PM
*To: *"user@spark.apache.org" <user@spark.apache.org>, Matt Cheah <mch...@palantir.com>
*Subject: *Re: k8s orchestrating Spark service

Thanks Matt,

Actually I can't use spark-submit. We submit the Driver programmatically through the API. But this is not the issue, and using k8s as the master is also not the issue; though you may be right about it being easier, it doesn't quite get to the heart of it. We want to orchestrate a bunch of services including Spark. The rest work; we are asking if anyone has seen a good starting point for adding Spark as a k8s-managed service.
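A minimal sketch of the second option from Matt's message above, where the server process is itself the driver and builds a client-mode session against the Kubernetes API server. The master URL, service names, port, and image are hypothetical; the one real constraint is that spark.driver.host must resolve to the server's pod (typically via a headless service) so executors can connect back:

import org.apache.spark.sql.SparkSession

object MLServerContext {
  def main(args: Array[String]): Unit = {
    // Client mode: the driver runs in this process; only executors are
    // launched as pods by the Spark scheduler.
    val spark = SparkSession.builder()
      .appName("ml-server")
      .master("k8s://https://kubernetes.default.svc:443")
      .config("spark.submit.deployMode", "client")
      .config("spark.executor.instances", "4")
      .config("spark.kubernetes.container.image", "example/spark:2.4.3")
      .config("spark.driver.host", "ml-server.default.svc") // must be resolvable from executor pods
      .config("spark.driver.port", "7078")
      .getOrCreate()

    // The Spark scheduler now requests executor pods from the API server directly.
    println(spark.sparkContext.parallelize(1 to 100).sum())
  }
}

This is the path that avoids spark-submit entirely: a custom entrypoint starts the server, and the session above is created in-process.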
From: Matt Cheah <mch...@palantir.com>
Reply: Matt Cheah <mch...@palantir.com>
Date: July 1, 2019 at 3:26:20 PM
To: Pat Ferrel <p...@occamsmachete.com>, user@spark.apache.org
Subject: Re: k8s orchestrating Spark service

I would recommend looking into Spark's native support for running on Kubernetes. One can just start the application against Kubernetes directly, using spark-submit in cluster mode or starting the Spark context with the right parameters in client mode. See https://spark.apache.org/docs/latest/running-on-kubernetes.html

I would think that building Helm around this architecture of running Spark applications would be easier than running a Spark standalone cluster. But admittedly I'm not very familiar with the Helm technology; we just use spark-submit.

-Matt Cheah

*From: *Pat Ferrel <p...@occamsmachete.com>
*Date: *Sunday, June 30, 2019 at 12:55 PM
*To: *"user@spark.apache.org" <user@spark.apache.org>
*Subject: *k8s orchestrating Spark service

We're trying to set up a system that includes Spark. The rest of the services have good Docker containers and Helm charts to start from. Spark, on the other hand, is proving difficult. We forked a container and have tried to create our own chart, but are having several problems with this. So back to the community…

Can anyone recommend a Docker container + Helm chart for use with Kubernetes to orchestrate:

- Spark standalone Master
- several Spark Workers/Executors

This is not a request to use k8s to orchestrate Spark jobs, but the service cluster itself.

Thanks
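For completeness, a rough sketch of the native cluster-mode submission Matt recommends, shelled out from JVM code to fit the programmatic setting discussed in this thread. The flags follow the running-on-kubernetes docs; the binary path, jar, class, and image names are hypothetical:

import scala.sys.process._

object NativeSubmit {
  def main(args: Array[String]): Unit = {
    // With --deploy-mode cluster, the driver itself runs in a pod; the
    // local:// scheme means the jar is already baked into the image.
    val submit = Seq(
      "/opt/spark/bin/spark-submit",
      "--master", "k8s://https://kubernetes.default.svc:443",
      "--deploy-mode", "cluster",
      "--name", "train-job",
      "--class", "com.example.TrainJob",
      "--conf", "spark.executor.instances=4",
      "--conf", "spark.kubernetes.container.image=example/spark:2.4.3",
      "local:///opt/app/train-job.jar"
    )

    // Exit code 0 means the submission was accepted; executor pods are then
    // requested from the API server by the driver, not by this process.
    val exitCode = submit.!
    println(s"spark-submit exited with $exitCode")
  }
}

In practice SparkLauncher (sketched earlier) wraps exactly this invocation, which is why the two approaches carry the same packaging requirements.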