Re: Difference between 'cores' config params: spark submit on k8s

2019-04-20 Thread Li Gao
Hi Battini, The limit is a k8s construct that tells k8s how many CPU cores your driver *can* consume. When you set the same value for 'spark.driver.cores' and 'spark.kubernetes.driver.limit.cores', your driver runs in the 'Guaranteed' k8s quality-of-service class, which can make your driver
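A minimal PySpark sketch of the two settings discussed above. In practice they are usually passed to spark-submit as --conf flags; the master URL and values below are placeholders, not from the thread.

    from pyspark import SparkConf

    conf = (
        SparkConf()
        .setMaster("k8s://https://kubernetes.default.svc")  # placeholder master URL
        # cores the driver requests from the scheduler
        .set("spark.driver.cores", "2")
        # k8s CPU limit for the driver pod; matching the request above
        # puts the pod in the "Guaranteed" QoS class
        .set("spark.kubernetes.driver.limit.cores", "2")
    )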

How to execute non-timestamp-based aggregations in spark structured streaming?

2019-04-20 Thread Stephen Boesch
Consider the following *intended* sql: select row_number() over (partition by Origin order by OnTimeDepPct desc) OnTimeDepRank,* from flights. This will *not* work in *structured streaming*: the culprit is "partition by Origin". The requirement is to use a timestamp-typed field such as par
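A hedged sketch of what structured streaming does support: a time-windowed aggregation keyed on an event-time column. Only Origin and OnTimeDepPct come from the thread; the event_time column, source format, and path are assumptions, and max() stands in for the intended ranking rather than reproducing row_number().

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("streaming-window-sketch").getOrCreate()

    flights = (
        spark.readStream
        .format("json")  # hypothetical source
        .schema("Origin STRING, OnTimeDepPct DOUBLE, event_time TIMESTAMP")
        .load("/tmp/flights")  # hypothetical path
    )

    best_per_origin = (
        flights
        .withWatermark("event_time", "1 hour")
        .groupBy(F.window("event_time", "15 minutes"), F.col("Origin"))
        .agg(F.max("OnTimeDepPct").alias("best_on_time_pct"))  # aggregation, not a true rank
    )

    query = (
        best_per_origin.writeStream
        .outputMode("append")
        .format("console")
        .start()
    )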

repartition in df vs partitionBy in df

2019-04-20 Thread kumar.rajat20del
Hi Spark Users, repartition and partitionBy seem very similar on a DataFrame. In which scenarios do we use each one? As per my understanding, repartition is a very expensive operation since it needs a full shuffle, so when do we use repartition? Thanks Rajat -- Sent from: http://apache-spark-user-list.1001560.n
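A minimal sketch of the two operations: repartition() reshuffles the in-memory DataFrame into N partitions, while partitionBy() controls the directory layout when writing out. The column names, bucket expression, and output path are placeholders, not from the thread.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-vs-partitionby").getOrCreate()

    df = spark.range(1000000).withColumnRenamed("id", "user_id")

    # Full shuffle of the data into 200 in-memory partitions (expensive).
    reshuffled = df.repartition(200)

    # partitionBy applies at write time: one output directory per bucket value.
    (reshuffled
     .withColumn("bucket", reshuffled.user_id % 10)
     .write
     .mode("overwrite")
     .partitionBy("bucket")
     .parquet("/tmp/users_partitioned"))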

Re: toDebugString - RDD Logical Plan

2019-04-20 Thread Dylan Guedes
Kanchan, the `toDebugString` output looks unformatted because in some scenarios you need to parse it first (can't remember the reason, though). I suggest you print the RDD lineage using `print(rdd.toDebugString().decode("utf-8"))` instead (note: this only occurs in PySpark). About the other question,
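A small sketch of the suggestion above: in PySpark, toDebugString() returns bytes, so decoding before printing restores the newlines. The getNumPartitions() line is an added illustration for the partition-count question, not part of the reply; the RDD itself is a made-up example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lineage-sketch").getOrCreate()
    sc = spark.sparkContext

    rdd = (sc.parallelize(range(100), 4)
             .map(lambda x: (x % 10, x))
             .reduceByKey(lambda a, b: a + b))

    print(rdd.toDebugString().decode("utf-8"))  # formatted, multi-line lineage
    print(rdd.getNumPartitions())               # number of partitions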

toDebugString - RDD Logical Plan

2019-04-20 Thread kanchan tewary
Dear All, Greetings! I am new to Apache Spark and working on RDDs using PySpark. I am trying to understand the logical plan provided by the toDebugString function, but I see two issues: a) the output is not formatted when I print the result; b) I do not see the number of partitions shown. Can anyone dire

Feature engineering ETL for machine learning

2019-04-20 Thread Subash Prabakar
Hi, I have a series of queries that extract from multiple tables in Hive and do feature engineering on the extracted final data. I can run the queries using Spark SQL and use MLlib to perform the feature transformations I need. The question is: do you guys use any kind of tool to perform this workfl
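A hedged sketch of the workflow described above, using plain Spark SQL for extraction and an MLlib Pipeline for the transformations. The table, columns, and stages are hypothetical; the thread does not specify them.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler

    spark = (SparkSession.builder
             .appName("feature-etl-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Extraction step: any number of SQL queries against Hive tables.
    raw = spark.sql("""
        SELECT customer_id, country, age, total_spend
        FROM warehouse.customers
    """)

    # Feature engineering step: chained MLlib transformers.
    pipeline = Pipeline(stages=[
        StringIndexer(inputCol="country", outputCol="country_idx"),
        VectorAssembler(inputCols=["country_idx", "age", "total_spend"],
                        outputCol="raw_features"),
        StandardScaler(inputCol="raw_features", outputCol="features"),
    ])

    features = pipeline.fit(raw).transform(raw)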

Re: --jars vs --spark.executor.extraClassPath vs --spark.driver.extraClassPath

2019-04-20 Thread Jason Nerothin
Hi Rajat, A little more color: The executor classpath will be used by the Spark workers/slaves, for example all JVMs started with $SPARK_HOME/sbin/start-slave.sh. If you run with --deploy-mode cluster, then the driver itself will be run on the cluster (with the executor classpath). If
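A minimal PySpark sketch of the configuration properties behind those flags. In practice they are usually set on the spark-submit command line; all paths below are placeholders, and note that extraClassPath entries are not shipped to the cluster, so they must already exist on each node.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("classpath-config-sketch")
        # --jars: the jar is shipped to the cluster and added to both
        # driver and executor classpaths
        .config("spark.jars", "/local/path/my-udfs.jar")
        # spark.executor.extraClassPath: prepended to each executor JVM's
        # classpath; the file must already be present on every worker node
        .config("spark.executor.extraClassPath", "/opt/libs/vendor-client.jar")
        # spark.driver.extraClassPath: same idea for the driver JVM; in client
        # mode this only takes effect when set via spark-submit or
        # spark-defaults.conf, since the driver JVM is already running here
        .config("spark.driver.extraClassPath", "/opt/libs/vendor-client.jar")
        .getOrCreate()
    )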

Re: --jars vs --spark.executor.extraClassPath vs --spark.driver.extraClassPath

2019-04-20 Thread Subash Prabakar
Hey Rajat, The documentation page is self-explanatory. You can refer to https://spark.apache.org/docs/2.0.0/configuration.html (or the docs for any version of Spark) for more configs. Thanks, Subash. On Sat, 20 Apr 2019 at 16:04, rajat kumar wrote: > Hi, > > Can anyone pls explain ? > > > O

Re: --jars vs --spark.executor.extraClassPath vs --spark.driver.extraClassPath

2019-04-20 Thread rajat kumar
Hi, Can anyone please explain? On Mon, 15 Apr 2019, 09:31 rajat kumar wrote: > Hi All, > > I came across different parameters in spark-submit: > > --jars, --spark.executor.extraClassPath, --spark.driver.extraClassPath > > What are the differences between them? When to use which one? Will it differ if