And could you paste the stage and task information from the Spark UI?

On Wed, Mar 29, 2017 at 11:30 AM, Ryan <ryan.hd....@gmail.com> wrote:
> How long does it take if you remove the repartition and just collect the
> result? I don't think repartition is needed here. There's already a shuffle
> for the group by.
>
> On Tue, Mar 28, 2017 at 10:35 PM, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi,
>>
>> I am working on a requirement where I need to join two tables and do a
>> group by to get the max value on some fields.
>>
>> Table1: 10 GB of data
>> Table2: 96 GB of data
>>
>> The same query takes around 20 minutes in Impala, and it took almost 3
>> hours to run in Spark SQL.
>>
>> I have added repartition to the dataframe and persisted it as memory and
>> disk, but the response is still very bad. Any suggestions?
>>
>> val results_group_dataframe=sqlContext.sql("SELECT
>> a.originalsamplingstate,a.vin,max(a.utctime) as geospatialtime FROM
>> GeoSpatialTemp A GROUP BY a.VIN,
>> a.OriginalSamplingState").repartition(numPartitions)
>>
>> Thanks,
>>
>> Asmath
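
A minimal sketch of the suggestion above: run the same aggregation without the trailing repartition, and print the plan to confirm that the group by already introduces a shuffle. It assumes an existing `sqlContext` with `GeoSpatialTemp` registered as a temporary table and a `numPartitions` value, as in the original message. Setting `spark.sql.shuffle.partitions` is one possible way to control the number of partitions that shuffle produces; it is not something the thread has tested.

```scala
// Assumption: `sqlContext` and `numPartitions` already exist, and GeoSpatialTemp
// is registered as a temp table, as in the original message.

// Optional: set the partition count used by the aggregation's own shuffle,
// instead of adding a second shuffle with repartition(). Must be set before
// the query is executed.
sqlContext.setConf("spark.sql.shuffle.partitions", numPartitions.toString)

// Same query as in the original message, without .repartition(numPartitions).
val resultsGroupDataframe = sqlContext.sql(
  """SELECT a.originalsamplingstate, a.vin, max(a.utctime) AS geospatialtime
    |FROM GeoSpatialTemp a
    |GROUP BY a.vin, a.originalsamplingstate""".stripMargin)

// Print the physical plan; the shuffle for the group by appears as an
// Exchange step between the partial and final aggregation.
resultsGroupDataframe.explain()
```

Comparing the run time of this version against the original, together with the stage and task details from the Spark UI, should show whether the extra repartition is the expensive step.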