Hello,
I'm not sure this is the cause in your case, but check this discussion:

http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-td20803.html
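If you want to narrow it down yourself, comparing the physical plans produced by the two versions is usually the quickest way to see where the extra work comes from. A minimal sketch using the standard `DataFrame.explain` method (the path and filter below are placeholders, not taken from your job):

```scala
// Run the same snippet on both clusters and diff the output.
// On 1.6 use sqlContext.read; on 2.0 the SparkSession ("spark") also works.
val df = sqlContext.read.orc("path/to/file.orc")
  .filter("reading > 0")  // placeholder filter

// Prints the parsed, analyzed, optimized, and physical plans.
df.explain(true)
```

A large difference in the physical plan (extra exchanges, a different join strategy, missing whole-stage codegen) would point at the stage that regressed.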

On Tue, Feb 14, 2017 at 9:25 PM, anatva <[email protected]> wrote:

> Hi,
> I am reading an ORC file, performing some joins and aggregations, and
> finally generating a dense vector to perform analytics.
>
> The code runs in 45 minutes on Spark 1.6 on a 4-node cluster. When the same
> code is migrated to Spark 2.0 on the same cluster, it takes around 4-5
> hours. That is surprising and frustrating.
>
> Can anyone take a look and help me figure out what I should change in order
> to get at least the same performance in Spark 2.0?
>
> spark-shell --master yarn-client --driver-memory 5G --executor-memory 20G \
>   --num-executors 15 --executor-cores 5 --queue grid_analytics \
>   --conf spark.yarn.executor.memoryOverhead=5120 \
>   --conf spark.executor.heartbeatInterval=1200
>
> import sqlContext.implicits._
> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.sql.functions.lit
> import org.apache.hadoop._
> import org.apache.spark.sql.functions.udf
> import org.apache.spark.mllib.linalg.{Vector, Vectors}
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.feature.StandardScaler
> import org.apache.spark.ml.feature.Normalizer
>
> Here are the steps:
> 1. read the ORC file
> 2. filter some of the records
> 3. persist the resulting data frame
> 4. get the distinct accounts from the df and take a sample of 50k accounts
> from the distinct list
> 5. join the above data frame with the 50k distinct accounts to pull records
> for only those 50k accounts
> 6. perform a group by to get the avg, mean, sum, count of readings for the
> given 50k accounts
> 7. join the df obtained by the GROUP BY with the original df
> 8. convert the resultant df to an RDD, do a groupByKey(), and generate a
> dense vector
> 9. convert the RDD back to a df and store it in a Parquet file
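> The core of steps 4-8 could be sketched roughly like this (column names
> such as `acct` and `reading`, and the sample fraction, are placeholders,
> not from the original job):

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.Vectors

// Step 4: sample ~50k distinct accounts (fraction is a placeholder).
val accts = df.select("acct").distinct()
  .sample(withReplacement = false, 0.1).limit(50000)

// Step 5: keep only the readings for the sampled accounts.
val sampled = df.join(accts, Seq("acct"))

// Step 6: per-account aggregates.
val stats = sampled.groupBy("acct")
  .agg(avg("reading").as("avg_r"), sum("reading").as("sum_r"),
       count("reading").as("cnt_r"))

// Step 7: join the aggregates back onto the readings.
val joined = sampled.join(stats, Seq("acct"))

// Step 8: groupByKey on the RDD and build a dense vector per account.
// Note: RDD groupByKey shuffles every value for a key to one executor and is
// a common hotspot; keeping this step as a DataFrame aggregation often
// performs better on 2.x.
val vectors = joined.rdd
  .map(r => (r.getString(0), r.getDouble(1)))
  .groupByKey()
  .mapValues(vs => Vectors.dense(vs.toArray))

// Step 9: convert back to a df and write Parquet (omitted here).
```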
>
> The above steps worked fine in Spark 1.6, but I'm not sure why they run
> painfully slowly in Spark 2.0.
>
> I am using Spark 1.6 and Spark 2.0 on HDP 2.5.3.
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/My-spark-job-runs-faster-in-spark-1-6-and-much-slower-in-spark-2-0-tp28390.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [email protected]
>
>
