Re: One click to run Spark on Kubernetes

2022-02-22 Thread Mich Talebzadeh
Hi, There are two distinct actions here, namely Deploy and Run. Deployment can be done by a command-line script with autoscaling. In the newer versions of Kubernetes you don't even need to specify the node types; you can leave it to the Kubernetes cluster to scale up and down and decide on node types

Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
Merging another email from Prasad: it could co-exist with Livy. Livy is similar to the REST Service + Spark Operator. Unfortunately Livy is not very active right now. To Amihay, the link is: https://github.com/datapunchorg/punch. On Tue, Feb 22, 2022 at 8:53 PM amihay gonen wrote: > Can you s

Re: One click to run Spark on Kubernetes

2022-02-22 Thread amihay gonen
Can you share a link to the source? On Wed, 23 Feb 2022, 6:52, bo yang wrote: > We do not have SaaS yet. Now it is an open source project we build in our > part time, and we welcome more people working together on that. > > You could specify cluster size (EC2 instance type and number of i

Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
We do not have SaaS yet. Right now it is an open source project we build in our spare time, and we welcome more people working together on it. You could specify the cluster size (EC2 instance type and number of instances) and run it for 1 hour. Then you could run a one-click command to destroy the cluster.

Re: One click to run Spark on Kubernetes

2022-02-22 Thread Prasad Paravatha
Hi Bo Yang, Would it be something along the lines of Apache Livy? Thanks, Prasad On Tue, Feb 22, 2022 at 10:22 PM bo yang wrote: > It is not a standalone spark cluster. In some details, it deploys a Spark > Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) > and an extra

Re: One click to run Spark on Kubernetes

2022-02-22 Thread Bitfox
How can I specify the cluster memory and cores? For instance, I want to run a job with 16 cores and 300 GB memory for about 1 hour. Do you have a SaaS solution for this? I can pay as I did. Thanks On Wed, Feb 23, 2022 at 12:21 PM bo yang wrote: > It is not a standalone spark cluster. In some
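For reference, the resources asked about here map onto standard Spark-on-Kubernetes settings rather than anything specific to the tool in this thread. A minimal PySpark sketch, with the API-server address and container image as placeholder values:

    from pyspark.sql import SparkSession

    # Placeholder master URL and image; 4 executors x 4 cores = 16 cores,
    # and 4 x 75g gives roughly 300 GB of executor memory in total.
    spark = (
        SparkSession.builder
        .master("k8s://https://my-cluster-endpoint:443")
        .appName("sized-job")
        .config("spark.kubernetes.container.image", "my-repo/spark:3.2.0")
        .config("spark.executor.instances", "4")
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "75g")
        .getOrCreate()
    )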

Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
It is not a standalone Spark cluster. In more detail, it deploys the Spark Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) and an extra REST Service. When people submit a Spark application to that REST Service, the REST Service will create a CRD inside the Kubernetes cluster. T
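A rough sketch of what "create a CRD inside the Kubernetes cluster" can look like with the official kubernetes Python client; the SparkApplication body below is abbreviated, the namespace and image are placeholders, and the sparkoperator.k8s.io/v1beta2 API group is assumed from the Spark Operator project linked above:

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when running inside the cluster
    api = client.CustomObjectsApi()

    spark_app = {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": "spark-pi", "namespace": "spark"},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": "my-repo/spark:3.2.0",
            "mainClass": "org.apache.spark.examples.SparkPi",
            "mainApplicationFile": "local:///opt/spark/examples/jars/spark-examples.jar",
            "sparkVersion": "3.2.0",
            "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
            "executor": {"instances": 2, "cores": 1, "memory": "1g"},
        },
    }

    # A REST service like the one described would do something equivalent to this call.
    api.create_namespaced_custom_object(
        group="sparkoperator.k8s.io",
        version="v1beta2",
        namespace="spark",
        plural="sparkapplications",
        body=spark_app,
    )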

Re: One click to run Spark on Kubernetes

2022-02-22 Thread Bitfox
Can it be a cluster installation of Spark, or just a standalone node? Thanks On Wed, Feb 23, 2022 at 12:06 PM bo yang wrote: > Hi Spark Community, > > We built an open source tool to deploy and run Spark on Kubernetes with a > one click command. For example, on AWS, it could automatically cre

One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
Hi Spark Community, We built an open source tool to deploy and run Spark on Kubernetes with a one-click command. For example, on AWS, it can automatically create an EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will be able to use curl or a CLI tool to submit Spark applica
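The project's actual HTTP API is not shown in this thread, so purely to illustrate the "submit with curl or a CLI tool" idea, a hypothetical Python client (URL and payload fields invented for the example) could look like:

    import requests

    # Hypothetical gateway endpoint and payload shape -- not the project's documented API.
    resp = requests.post(
        "https://spark-gateway.example.com/api/submit",
        json={
            "mainApplicationFile": "s3a://my-bucket/jobs/etl.py",
            "sparkVersion": "3.2.0",
            "driverMemory": "2g",
            "executorInstances": 4,
        },
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())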

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-22 Thread Bitfox
TensorFlow itself can implement distributed computing via a parameter server. Why do you want Spark here? Regards. On Wed, Feb 23, 2022 at 11:27 AM Vijayant Kumar wrote: > Thanks Sean for your response. !! > > > > Want to add some more background here. > > > > I am using Spark3.0+ version

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-22 Thread Sean Owen
Dependencies? Sure, like any Python library. What are you asking about there? I don't know of a modern alternative on Spark. Did you read the docs or search? There are plenty of examples. On Tue, Feb 22, 2022, 9:27 PM Vijayant Kumar wrote: > Thanks Sean for your response. !! > > > > Want to add some more

RE: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-22 Thread Vijayant Kumar
Thanks, Sean, for your response! Want to add some more background here. I am using Spark 3.0+ with TensorFlow 2.0+. My use case is not for image data but for time-series data, where I am using LSTMs and transformers for forecasting. I evaluated SparkFlow and spark_tensorflow_distributor
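For readers doing the same evaluation: spark_tensorflow_distributor is driven through a MirroredStrategyRunner. A minimal sketch, assuming a CPU-only cluster and a toy in-memory LSTM standing in for the real time-series pipeline:

    from spark_tensorflow_distributor import MirroredStrategyRunner

    def train():
        # Runs inside each Spark task; the runner sets up the TF distribution strategy.
        import numpy as np
        import tensorflow as tf
        x = np.random.rand(256, 10, 1).astype("float32")  # toy (samples, timesteps, features)
        y = np.random.rand(256, 1).astype("float32")
        model = tf.keras.Sequential([
            tf.keras.layers.LSTM(16, input_shape=(10, 1)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")
        model.fit(x, y, epochs=2, batch_size=32)

    # Four training slots (Spark tasks) spread across the executors, CPU only.
    MirroredStrategyRunner(num_slots=4, use_gpu=False).run(train)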

Re: TensorFlow on Spark

2022-02-22 Thread Sean Owen
Sure, Horovod is commonly used on Spark for this: https://horovod.readthedocs.io/en/stable/spark_include.html On Tue, Feb 22, 2022 at 8:51 PM Vijayant Kumar wrote: > Hi All, > > > > Anyone using Apache spark with TensorFlow for building models. My > requirement is to use TensorFlow distributed m
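A minimal sketch of the pattern from that Horovod page, assuming Horovod is installed with Spark and TensorFlow support on the cluster; the tiny Keras model is a stand-in, not something from the thread:

    import horovod.spark

    def train():
        # Runs in parallel on the Spark executors; Horovod handles the allreduce.
        import numpy as np
        import tensorflow as tf
        import horovod.tensorflow.keras as hvd

        hvd.init()
        x = np.random.rand(512, 8).astype("float32")
        y = np.random.rand(512, 1).astype("float32")
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
            tf.keras.layers.Dense(1),
        ])
        opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001))
        model.compile(optimizer=opt, loss="mse")
        model.fit(x, y, epochs=2, batch_size=32,
                  callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)])
        return hvd.rank()

    # Launch two training processes on the existing Spark cluster.
    results = horovod.spark.run(train, num_proc=2)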

TensorFlow on Spark

2022-02-22 Thread Vijayant Kumar
Hi All, Is anyone using Apache Spark with TensorFlow for building models? My requirement is to use TensorFlow distributed model training across the Spark executors. Please help me with some resources or some sample code. Thanks, Vijayant

Re: Spark Explain Plan and Joins

2022-02-22 Thread Sid Kal
Hi Mich / Gourav, Thanks for your time :) Much appreciated. I went through the article Mich shared about the query execution plan. I pretty much understood most of it so far, except for the two things below. 1) HashAggregate in the plan: does this always indicate "group by" columns? 2) Pre
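On the HashAggregate side, a quick experiment shows where it appears. A small sketch (note that HashAggregate also shows up for distinct, which Spark plans as an aggregation, so it is not exclusively GROUP BY):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("explain-demo").getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["k", "v"])

    # GROUP BY: the physical plan contains a partial HashAggregate,
    # an Exchange (shuffle), and a final HashAggregate.
    df.groupBy("k").sum("v").explain()

    # distinct() is also planned as an aggregation, so it shows HashAggregate too.
    df.select("k").distinct().explain()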

Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

2022-02-22 Thread Saurabh Gulati
Hey Mich, We use Spark 3.2 now. We are using BQ but migrating away because: * It's not reflective of our current lake structure, with all the deltas/history tables/model outputs etc. * It's pretty expensive to load everything into BQ, and essentially it would be a copy of all the data in GCS. External

Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

2022-02-22 Thread Mich Talebzadeh
OK, interesting. I am surprised that you are using Hive rather than BigQuery. My assumption is that your Spark is version 3.1.1 with standard GKE on auto-scaler. What benefits are you getting from using Hive here? As you have your Hive tables on GCS buckets, you can easily download your hive tables

Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

2022-02-22 Thread Saurabh Gulati
To correct my last message, it's hive-metastore running as a service in a container, not Hive. We use Spark-thriftserver for query execution. From: Saurabh Gulati Sent: 22 February 2022 16:33 To: Mich Talebzadeh Cc: user@spark.apache.org Subject: Re: [EXTERNA
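For context on this setup: pointing Spark (or its thrift server) at an external hive-metastore service generally comes down to hive.metastore.uris plus Hive support. A minimal PySpark sketch, with a placeholder Kubernetes service address:

    from pyspark.sql import SparkSession

    # Placeholder metastore address; in a k8s setup this would be the
    # hive-metastore service's cluster-internal DNS name and port.
    spark = (
        SparkSession.builder
        .appName("lake-queries")
        .config("spark.hadoop.hive.metastore.uris", "thrift://hive-metastore.default.svc:9083")
        .enableHiveSupport()
        .getOrCreate()
    )

    spark.sql("SHOW DATABASES").show()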

Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

2022-02-22 Thread Saurabh Gulati
Thanks Sean for your response. @Mich Talebzadeh We run all workloads on GKE as Docker containers. So to answer your questions: Hive is running in a container as a K8S service, Spark thrift-server is in another container as a service, and Superset is in a third contai

Re: Need to make WHERE clause compulsory in Spark SQL

2022-02-22 Thread Mich Talebzadeh
Is your Hive on-prem, with external tables in cloud storage? Where is your Spark running, and which cloud buckets are you using? HTH On Tue, 22 Feb 2022 at 12:36, Saurabh Gulati wrote: > Hello, > We are trying to setup Spark as the execution engine for exposing our data > stored in lake. We

Re: Need to make WHERE clause compulsory in Spark SQL

2022-02-22 Thread Sean Owen
Spark does not use Hive for execution, so Hive params will not have an effect. I don't think you can enforce that in Spark. Typically you enforce things like that at a layer above your SQL engine, or at least you can, because there is probably other access you need to lock down. On Tue, Feb 22, 2022 at 6

RE: Spark-SQL : Getting current user name in UDF

2022-02-22 Thread Lavelle, Shawn
Apologies, this is Spark 3.2.0. ~ Shawn From: Lavelle, Shawn Sent: Monday, February 21, 2022 5:39 PM To: 'user@spark.apache.org' Subject: Spark-SQL : Getting current user name in UDF Hello Spark Users, I have a UDF I wrote for use with Spark-SQL that performs a lookup. In that lookup,
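The rest of the message is cut off here, but for readers with the same problem: the user is easy to get on the driver, and one workable pattern is to capture it there and pass it into the UDF as a literal. A sketch (the current_user() SQL built-in is assumed to be available from Spark 3.2 onward):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("whoami").getOrCreate()

    # User as seen by Spark on the driver.
    driver_user = spark.sparkContext.sparkUser()

    # Capture the user on the driver and pass it to the UDF as a literal,
    # so the lookup does not depend on what the executor process reports.
    @F.udf("string")
    def lookup(value, user):
        return f"{user}:{value}"  # stand-in for the real lookup logic

    df = spark.range(3).withColumn("tagged", lookup(F.col("id"), F.lit(driver_user)))
    df.show()

    # Spark 3.2 also exposes current_user() in SQL (assumption: 3.2.0+).
    spark.sql("SELECT current_user()").show()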

Need to make WHERE clause compulsory in Spark SQL

2022-02-22 Thread Saurabh Gulati
Hello, We are trying to set up Spark as the execution engine for exposing the data stored in our lake. We have the Hive metastore running along with the Spark thrift server, and we are using Superset as the UI. We save all tables as external tables in the Hive metastore, with storage on cloud buckets. We see that righ