Hi all,

Can somebody shed some light on this, please?

Thanks,
Aakash.
---------- Forwarded message ----------
From: "Aakash Basu" <aakash.spark....@gmail.com>
Date: 15-Jun-2017 2:57 PM
Subject: Repartition vs PartitionBy Help/Understanding needed
To: "user" <user@spark.apache.org>
Cc:

Hi all,
>
> Everybody explains the difference between coalesce and repartition, but
> nowhere have I found the difference between partitionBy and repartition. My
> question is: is it better to write a dataset out as Parquet, partitioned by a
> column, and then read back the respective directories to work on that
> column, or to use repartition on that column to do the same work in memory?
>
>
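> For context, here is my understanding of the two operations as a minimal
> sketch (the SparkSession `spark`, the DataFrame `df`, and the output path
> are illustrative assumptions, not my actual job):
>
>     // repartition: shuffles in memory so that all rows with the same
>     // YrEqual value land in the same partition; nothing is written to disk
>     val byCol = df.repartition($"YrEqual")
>
>     // partitionBy: writes one sub-directory per distinct YrEqual value
>     // (e.g. .../YrEqual=true/), so a later read can prune directories
>     df.write.partitionBy("YrEqual").parquet("/tmp/factFundRate")
>     val onlyTrue = spark.read.parquet("/tmp/factFundRate")
>       .filter("YrEqual = true")
>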
> A) One scenario is -
>
> val partitioned_DF = df_factFundRate.repartition($"YrEqual") // new change for performance test
>
> val df_YrEq_true = partitioned_DF.filter("YrEqual = true")
>   .withColumnRenamed("validFromYr", "yr_id")
>   .drop("validThruYr")
>
> val exists = partitioned_DF.filter("YrEqual = false").count()
> if (exists > 0)
>
>
>
> B) And the other scenario is -
>
> val df_cluster = sqlContext.sql("select * from factFundRate cluster by YrEqual")
>
> df_factFundRate.coalesce(50).write.mode("overwrite")
>   .option("header", "true")
>   .partitionBy("YrEqual")
>   .parquet(args(25))
>
> val df_YrEq_true = sqlContext.read.parquet(args(25) + "YrEqual=true/")
>   .withColumnRenamed("validFromYr", "yr_id")
>   .drop("validThruYr")
>
> val hadoopconf = new Configuration()
> val fileSystem = FileSystem.get(hadoopconf)
>
> val exists = FileSystem.get(new URI(args(26)), sparkContext.hadoopConfiguration)
>   .exists(new Path(args(25) + "YrEqual=false"))
> if (exists)
>
>
> The second scenario finishes within 6 minutes, whereas the first scenario
> takes 13 minutes to complete.
>
> Please help!
>
>
> Thanks in advance,
> Aakash.
>