Hi all,

Can somebody shed some light on this, please?
Thanks,
Aakash.

---------- Forwarded message ----------
From: "Aakash Basu" <aakash.spark....@gmail.com>
Date: 15-Jun-2017 2:57 PM
Subject: Repartition vs PartitionBy Help/Understanding needed
To: "user" <user@spark.apache.org>

> Hi all,
>
> Everyone explains the difference between coalesce and repartition, but
> nowhere have I found the difference between partitionBy and repartition.
> My question is: is it better to write a dataset to Parquet partitioned by
> a column and then read back the respective directories to work on that
> column, or to use repartition on that column and do the same in memory?
>
> A) One scenario is -
>
> val partitioned_DF = df_factFundRate.repartition($"YrEqual") // New change for performance test
>
> val df_YrEq_true = partitioned_DF.filter("YrEqual = true").withColumnRenamed("validFromYr", "yr_id").drop("validThruYr")
>
> val exists = partitioned_DF.filter("YrEqual = false").count()
> if (exists > 0)
>
> B) And the other scenario is -
>
> val df_cluster = sqlContext.sql("select * from factFundRate cluster by YrEqual")
> df_factFundRate.coalesce(50).write.mode("overwrite").option("header", "true").partitionBy("YrEqual").parquet(args(25))
>
> val df_YrEq_true = sqlContext.read.parquet(args(25)+"YrEqual=true/").withColumnRenamed("validFromYr", "yr_id").drop("validThruYr")
>
> val hadoopconf = new Configuration()
> val fileSystem = FileSystem.get(hadoopconf)
>
> val exists = FileSystem.get(new URI(args(26)), sparkContext.hadoopConfiguration).exists(new Path(args(25)+"YrEqual=false"))
> if (exists)
>
> The second scenario finishes within 6 minutes, whereas the first scenario
> takes 13 minutes to complete.
>
> Please help!
>
> Thanks in adv,
> Aakash.
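One thing worth noting about scenario A: `repartition($"YrEqual")` hash-partitions rows by the column's value, and `YrEqual` appears to take only two values (true/false). With two distinct keys, at most two shuffle partitions receive any data, so parallelism collapses no matter how many partitions Spark creates. The sketch below illustrates the mechanism in plain Scala; it is a simplification (it uses JVM `hashCode`, whereas Spark's `HashPartitioning` uses Murmur3), and `partitionFor` is a hypothetical helper, not a Spark API — but the collapse to two non-empty partitions is the same either way.

```scala
// Sketch: why repartition on a boolean column can hurt performance.
// Hash partitioning sends each row to partition hash(key) % numPartitions.
// With only two distinct key values, at most two partitions get any rows.
object HashPartitionSketch {
  // Hypothetical helper mimicking hash partitioning (Spark itself uses
  // Murmur3 on the column value, not JVM hashCode).
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val h = key.hashCode % numPartitions
    if (h < 0) h + numPartitions else h // keep the result non-negative
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 200 // Spark's default spark.sql.shuffle.partitions
    val nonEmpty = Seq(true, false).map(partitionFor(_, numPartitions)).toSet
    // Only two of the 200 partitions ever receive data.
    println(s"non-empty partitions: ${nonEmpty.size} of $numPartitions")
  }
}
```

Scenario B avoids this entirely: `write.partitionBy("YrEqual")` lays the data out as `YrEqual=true/` and `YrEqual=false/` directories on disk, so the later read of `YrEqual=true/` touches only those files and no shuffle of the full dataset is needed.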