Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 3:00 PM, Anirudh Ramanathan wrote:
> We can start by getting a PR going perhaps, and start augmenting the
> integration testing to ensure that there are no surprises - with/without
> credentials, accessing GCS, S3 etc. as well.
> When we get enough confidence and test coverage…
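
The credential/object-store scenarios mentioned above can be exercised from a plain Spark config; a minimal sketch, assuming hadoop-aws is on the classpath (the bucket, jar, and env-var names are placeholders, not from the thread):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical smoke test: pull an application dependency from S3 at startup,
// once with credentials set and once without, to see how failures surface.
val spark = SparkSession.builder()
  .appName("remote-deps-smoke-test")
  .master("local[2]")
  .config("spark.jars", "s3a://example-bucket/app-dep.jar")                  // placeholder
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))    // omit to test the "without credentials" case
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()
```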

Re: Kubernetes: why use init containers?

2018-01-10 Thread Andrew Ash
It seems we have two standard practices for resource distribution in place here:
- the Spark way is that the application (Spark) distributes the resources *during* app execution, and does this by exposing files/jars on an HTTP server on the driver (or pre-staged elsewhere), and executors downloading…
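
For reference, the "Spark way" described above is visible in the public API: the driver registers files/jars and executors fetch them while the job runs. A small sketch (paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Driver registers resources; Spark serves them and executors download them
// lazily during execution.
val sc = new SparkContext(new SparkConf().setAppName("resource-distribution-demo").setMaster("local[2]"))

sc.addFile("/path/on/driver/lookup-table.csv")   // placeholder path
sc.addJar("/path/on/driver/extra-udfs.jar")      // placeholder path

sc.parallelize(1 to 4).foreach { _ =>
  // On each executor, SparkFiles.get resolves the locally downloaded copy.
  val localPath = SparkFiles.get("lookup-table.csv")
  println(s"Executor sees file at $localPath")
}
```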

Re: Kubernetes: why use init containers?

2018-01-10 Thread Anirudh Ramanathan
Thanks for this discussion everyone. It has been very useful in getting an overall understanding here. I think in general, the consensus is that this change doesn't introduce behavioral changes, and it's definitely an advantage to reuse the constructs that Spark provides to us. Moving on to a different…

Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 2:51 PM, Matt Cheah wrote:
> those sidecars may perform side effects that are undesirable if the main
> Spark application failed because dependencies weren’t available

If the contract is that the Spark driver pod does not have an init container, and the driver handles its…

Re: Kubernetes: why use init containers?

2018-01-10 Thread Matt Cheah
With regards to separation of concerns, there’s a fringe use case here – if more than one main container is on the pod, then none of them will run if the init-containers fail. A user can have a Pod Preset that attaches more sidecar containers to the driver and/or executors. In that case, those sidecars may perform side effects that are undesirable if the main Spark application failed because dependencies weren’t available…

Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 2:30 PM, Yinan Li wrote:
> 1. Retries of init-containers are automatically supported by k8s through pod
> restart policies. For this point, sorry I'm not sure how spark-submit
> achieves this.

Great, add that feature to spark-submit, everybody benefits, not just k8s.

> 2. …
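
Adding retries to spark-submit's dependency download could be little more than a wrapper around the existing fetch logic; a generic sketch, not Spark's actual code (the helper names and URL are made up):

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object RetryDemo extends App {
  // Generic retry-with-backoff wrapper of the kind suggested in the reply.
  @tailrec
  def withRetries[T](attemptsLeft: Int, backoffMs: Long)(op: () => T): T =
    Try(op()) match {
      case Success(value) => value
      case Failure(_) if attemptsLeft > 1 =>
        Thread.sleep(backoffMs)
        withRetries(attemptsLeft - 1, backoffMs * 2)(op)   // exponential backoff
      case Failure(e) => throw e
    }

  // Stand-in for whatever download logic spark-submit already has.
  def downloadDependency(uri: String): Unit =
    println(s"pretending to download $uri")

  withRetries(attemptsLeft = 3, backoffMs = 1000) { () =>
    downloadDependency("https://repo.example.com/app-dep.jar")
  }
}
```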

Re: Kubernetes: why use init containers?

2018-01-10 Thread Yinan Li
> Sorry, but what are those again? So far all the benefits are already
> provided by spark-submit...

1. Retries of init-containers are automatically supported by k8s through pod restart policies. For this point, sorry I'm not sure how spark-submit achieves this.
2. The ability to use credentials t…

Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 2:16 PM, Yinan Li wrote:
> but we can not rule out the benefits init-containers bring either.

Sorry, but what are those again? So far all the benefits are already provided by spark-submit...

> Again, I would suggest we look at this more thoroughly post 2.3.

Actually, one…

Re: Kubernetes: why use init containers?

2018-01-10 Thread Yinan Li
> 1500 less lines of code trump all of the arguments given so far for
> what the init container might be a good idea.

We can also reduce the number of lines of code by simply refactoring the code in such a way that a lot of code can be shared between configuration of the main container and that of the init container…
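
A sketch of the kind of sharing being proposed, using the fabric8 kubernetes-client builder API that the Kubernetes backend is built on (method names are from memory and the container names/images are placeholders, so treat this as illustrative only):

```scala
import io.fabric8.kubernetes.api.model.{Container, ContainerBuilder}

// One helper configures a container; the main container and the init-container
// both reuse it instead of duplicating the builder calls.
def baseContainer(name: String, image: String, sparkConfDir: String): Container =
  new ContainerBuilder()
    .withName(name)
    .withImage(image)
    .addNewEnv().withName("SPARK_CONF_DIR").withValue(sparkConfDir).endEnv()
    .build()

val driverContainer = baseContainer("spark-driver", "example/spark:2.3.0", "/opt/spark/conf")
val initContainer   = baseContainer("spark-init",   "example/spark-init:2.3.0", "/opt/spark/conf")
```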

Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 2:00 PM, Yinan Li wrote:
> I want to re-iterate on one point, that the init-container achieves a clear
> separation between preparing an application and actually running the
> application. It's a guarantee provided by the K8s admission control and
> scheduling components that if the init-container fails, the main container won't be…

Re: Kubernetes: why use init containers?

2018-01-10 Thread Yinan Li
I want to re-iterate on one point, that the init-container achieves a clear separation between preparing an application and actually running the application. It's a guarantee provided by the K8s admission control and scheduling components that if the init-container fails, the main container won't be…
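
The ordering guarantee comes straight from the pod spec: the kubelet runs everything under initContainers to completion before any main container starts. A minimal sketch with the fabric8 kubernetes-client builder (method names are from memory; names, images, and the command are placeholders):

```scala
import io.fabric8.kubernetes.api.model.PodBuilder

// If spark-init keeps failing, the kubelet never starts spark-driver.
val driverPod = new PodBuilder()
  .withNewMetadata().withName("spark-driver-pod").endMetadata()
  .withNewSpec()
    .withRestartPolicy("Never")
    .addNewInitContainer()
      .withName("spark-init")
      .withImage("example/spark-init:2.3.0")
      .withCommand("sh", "-c", "download-dependencies.sh")   // placeholder command
    .endInitContainer()
    .addNewContainer()
      .withName("spark-driver")
      .withImage("example/spark:2.3.0")
    .endContainer()
  .endSpec()
  .build()
```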

Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 1:47 PM, Matt Cheah wrote:
>> With a config value set by the submission code, like what I'm doing to
>> prevent client mode submission in my p.o.c.?
>
> The contract for what determines the appropriate scheduler backend to
> instantiate is then going to be different in Kubernetes versus the other cluster managers…

Re: Kubernetes: why use init containers?

2018-01-10 Thread Matt Cheah
> With a config value set by the submission code, like what I'm doing to
> prevent client mode submission in my p.o.c.?

The contract for what determines the appropriate scheduler backend to instantiate is then going to be different in Kubernetes versus the other cluster managers. The cluster ma…

Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 1:33 PM, Matt Cheah wrote:
> If we use spark-submit in client mode from the driver container, how do we
> handle needing to switch between a cluster-mode scheduler backend and a
> client-mode scheduler backend in the future?

With a config value set by the submission code, like what I'm doing to prevent client mode submission in my p.o.c.?
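
What "a config value set by the submission code" could look like in practice; a hypothetical sketch (the property name and the backend descriptions are made up for illustration, not the actual proposal):

```scala
import org.apache.spark.SparkConf

// Hypothetical: the in-cluster submission client stamps a property on the
// driver's conf, and driver-side code reads it to decide which scheduler
// backend flavor to create.
val conf = new SparkConf()

val submittedInCluster =
  conf.getBoolean("spark.kubernetes.submittedInClusterMode", defaultValue = false)  // illustrative key

val backendDescription =
  if (submittedInCluster) "cluster-mode Kubernetes scheduler backend"
  else "client-mode Kubernetes scheduler backend"

println(s"Would instantiate: $backendDescription")
```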

Re: Kubernetes: why use init containers?

2018-01-10 Thread Matt Cheah
If we use spark-submit in client mode from the driver container, how do we handle needing to switch between a cluster-mode scheduler backend and a client-mode scheduler backend in the future? Something else re: client mode accessibility – if we make client mode accessible to users even if it’s…

Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 1:10 PM, Matt Cheah wrote:
> I’d imagine this is a reason why YARN hasn’t went with using spark-submit
> from the application master...

I wouldn't use YARN as a template to follow when writing a new backend. A lot of the reason why the YARN backend works the way it does is…

Re: Kubernetes: why use init containers?

2018-01-10 Thread Matt Cheah
A crucial point here is considering whether we want to have a separate scheduler backend code path for client mode versus cluster mode. If we need such a separation in the code paths, it would be difficult to make it possible to run spark-submit in client mode from the driver container. We discussed…

Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-10 Thread Dongjoon Hyun
Hi, All. Vectorized ORC Reader is now supported in Apache Spark 2.3. https://issues.apache.org/jira/browse/SPARK-16060 It has been a long journey. From now on, Spark can read ORC files faster without a feature penalty. Thank you for all your support, especially Wenchen Fan. It's done by two com…
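
To try the new reader out, the relevant session settings (as I understand the Spark 2.3 configuration; check your version's docs for the defaults) look roughly like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-vectorized-demo").master("local[2]").getOrCreate()

// Use the native ORC implementation and its vectorized reader.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

// Path is a placeholder; flat schemas of atomic types benefit most.
val df = spark.read.orc("/tmp/example-data.orc")
df.show()
```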

[build system] currently experiencing git timeouts when building

2018-01-10 Thread shane knapp
i just noticed we're starting to see the once-yearly rash of git timeouts when building. i'll be looking into this today... i'm at our lab retreat, so my attention will be divided during the day but i will report back here once i have some more information. in the meantime, if your jobs have a…

Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On a side note, while it's great that you guys have meetings to discuss things related to the project, it's general Apache practice to discuss these things on the mailing list - or at the very least send detailed info about what was discussed in these meetings to the mailing list. Not everybody can attend…

spark streaming direct receiver offset initialization

2018-01-10 Thread Evo Eftimov
In the class CachedKafkaConsumer.scala
https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala
what is the purpose of the following condition check in the method get(offset: Long, timeout: Long): ConsumerRecord…
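
The preview is cut off before the check itself, but the check in question in CachedKafkaConsumer.get is presumably the comparison of the requested offset against the cached nextOffset; a paraphrased sketch of that pattern (not the actual Spark source):

```scala
import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

// Paraphrase of the cached-consumer pattern: remember the next offset we expect
// to read; only if the caller asks for a different offset do we re-seek and poll.
class CachedConsumerSketch(consumer: KafkaConsumer[String, String], tp: TopicPartition) {
  private var nextOffset: Long = -2L            // "unknown" sentinel
  private var buffer: Iterator[ConsumerRecord[String, String]] = Iterator.empty

  def get(offset: Long, timeout: Long): ConsumerRecord[String, String] = {
    if (offset != nextOffset) {                 // the condition being asked about (approximate)
      consumer.seek(tp, offset)                 // non-sequential request: reposition
      buffer = poll(timeout)                    // and refill the local buffer
    }
    if (!buffer.hasNext) buffer = poll(timeout)
    val record = buffer.next()
    nextOffset = record.offset + 1              // next sequential call can skip the seek
    record
  }

  private def poll(timeout: Long): Iterator[ConsumerRecord[String, String]] =
    consumer.poll(timeout).records(tp).asScala.iterator
}
```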