Hello, I'm not sure that's the cause in your case, but check this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-td20803.html
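If it does turn out to be the same regression, one cheap experiment on your side (my own suggestion, not something from that thread) is to re-run the job with whole-stage code generation disabled and compare the timings, since it is one of the larger execution changes between 1.6 and 2.0. The config key below exists in Spark 2.0; whether it matters for your job is only a guess.

  // "spark" is the SparkSession that spark-shell 2.0 already provides.
  // This is only an A/B experiment, not a recommended permanent setting.
  spark.conf.set("spark.sql.codegen.wholeStage", "false")
  // ... re-run the job and compare wall-clock time against the default (true) ...

  // Comparing the physical plans between 1.6 and 2.0 for the slow step also helps;
  // "slowDF" is a placeholder for whichever DataFrame dominates the runtime.
  slowDF.explain(true)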
On Tue, Feb 14, 2017 at 9:25 PM, anatva <[email protected]> wrote:
> Hi,
> I am reading an ORC file, performing some joins and aggregations, and finally
> generating a dense vector to perform analytics.
>
> The code runs in 45 minutes on Spark 1.6 on a 4-node cluster. When the same
> code is migrated to run on Spark 2.0 on the same cluster, it takes around
> 4-5 hours. It is surprising and frustrating.
>
> Can anyone take a look and help me with what I should change in order to get
> at least the same performance in Spark 2.0?
>
> spark-shell --master yarn-client --driver-memory 5G --executor-memory 20G \
>   --num-executors 15 --executor-cores 5 --queue grid_analytics \
>   --conf spark.yarn.executor.memoryOverhead=5120 \
>   --conf spark.executor.heartbeatInterval=1200
>
> import sqlContext.implicits._
> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.sql.functions.lit
> import org.apache.hadoop._
> import org.apache.spark.sql.functions.udf
> import org.apache.spark.mllib.linalg.{Vector, Vectors}
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.feature.StandardScaler
> import org.apache.spark.ml.feature.Normalizer
>
> Here are the steps:
> 1. Read the ORC file.
> 2. Filter some of the records.
> 3. Persist the resulting data frame.
> 4. Get the distinct accounts from the DF and take a sample of 50k accounts
>    from the distinct list.
> 5. Join the above data frame with the distinct 50k accounts to pull records
>    for only those 50k accounts.
> 6. Perform a group-by to get the avg, mean, sum, and count of readings for
>    the given 50k accounts.
> 7. Join the DF obtained by the group-by with the original DF.
> 8. Convert the resultant DF to an RDD, do a groupByKey(), and generate a
>    dense vector.
> 9. Convert the RDD back to a DF and store it in a Parquet file.
>
> The above steps worked fine in Spark 1.6, but I am not sure why they run
> painfully long in Spark 2.0.
>
> I am using Spark 1.6 & Spark 2.0 on HDP 2.5.3.
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/My-spark-job-runs-faster-in-spark-1-6-and-much-slower-in-spark-2-0-tp28390.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
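Since the code itself is not in the post, below is a rough sketch of how I read those nine steps on the Spark 2.0 API, just so we are talking about the same shape of job. Everything in it is a placeholder or an assumption: the column names (acct_id, reading), the filter, the sample fraction, the paths, and the switch from mllib.linalg to ml.linalg for the dense vector. "spark" is the SparkSession spark-shell 2.0 provides.

  import org.apache.spark.sql.functions._
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.ml.linalg.Vectors

  // 1-3: read the ORC file, filter some records, persist the result
  val df = spark.read.orc("/path/to/input.orc")
    .filter(col("some_flag") === lit(1))            // placeholder filter
    .persist(StorageLevel.MEMORY_AND_DISK)

  // 4: distinct accounts, sampled down to ~50k
  val sampleAccts = df.select("acct_id").distinct()
    .sample(withReplacement = false, 0.1)           // placeholder fraction
    .limit(50000)

  // 5: keep only the records for the sampled accounts
  val sampled = df.join(sampleAccts, Seq("acct_id"))

  // 6: per-account aggregates over the readings
  val aggs = sampled.groupBy("acct_id").agg(
    avg("reading").as("avg_reading"),
    sum("reading").as("sum_reading"),
    count("reading").as("cnt_reading"))

  // 7: join the aggregates back to the sampled records
  val joined = sampled.join(aggs, Seq("acct_id"))

  // 8: group the readings per account and build a dense vector
  val vectors = joined
    .select(col("acct_id"), col("reading").cast("double"))
    .rdd
    .map(r => (r.getString(0), r.getDouble(1)))
    .groupByKey()
    .map { case (acct, readings) => (acct, Vectors.dense(readings.toArray)) }

  // 9: back to a DataFrame and out to Parquet
  spark.createDataFrame(vectors).toDF("acct_id", "features")
    .write.mode("overwrite").parquet("/path/to/output.parquet")

If most of the extra time is in step 8, the RDD groupByKey shuffles every reading for every account, so it is worth timing that stage on its own in the Spark UI before blaming the SQL side; but that is only a guess without your event logs.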
