Also, could you paste the stage and task information from the Spark UI?
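
For reference, here is a minimal sketch of the suggestion quoted below: run the aggregation without the extra repartition and inspect the plan. It assumes the `sqlContext` and the `GeoSpatialTemp` temp table from the original post already exist; the value 400 for `spark.sql.shuffle.partitions` is purely illustrative, not a recommendation.

```scala
import org.apache.spark.sql.DataFrame

// Assumption: sqlContext and the GeoSpatialTemp temp table come from the original job.
// The GROUP BY shuffle uses spark.sql.shuffle.partitions (default 200) partitions,
// so this setting, not repartition(), controls the parallelism of the aggregation.
sqlContext.setConf("spark.sql.shuffle.partitions", "400") // illustrative value only

val resultsGroup: DataFrame = sqlContext.sql(
  """SELECT a.originalsamplingstate, a.vin, max(a.utctime) AS geospatialtime
    |FROM GeoSpatialTemp a
    |GROUP BY a.vin, a.originalsamplingstate""".stripMargin)

// Print the physical plan; the exchange step is the shuffle that GROUP BY already
// introduces, which is why a trailing repartition() would add a second shuffle.
resultsGroup.explain()
```

Timing an action on `resultsGroup` with and without the trailing `repartition` would show how much of the three hours the second shuffle accounts for.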

On Wed, Mar 29, 2017 at 11:30 AM, Ryan <ryan.hd....@gmail.com> wrote:

> How long does it take if you remove the repartition and just collect the
> result? I don't think the repartition is needed here; the GROUP BY already
> triggers a shuffle.
>
> On Tue, Mar 28, 2017 at 10:35 PM, KhajaAsmath Mohammed <
> mdkhajaasm...@gmail.com> wrote:
>
>> Hi,
>>
>> I am working on a requirement where I need to join two tables and do a
>> GROUP BY to get the max value of some fields.
>>
>> Table1: 10 GB of data
>> Table2: 96 GB of data
>>
>> The same query takes around 20 minutes in Impala, but it took almost 3
>> hours to run in Spark SQL.
>>
>> I have added a repartition to the DataFrame and persisted it as memory and
>> disk, but the response is still very bad. Any suggestions?
>>
>> val results_group_dataframe = sqlContext.sql(
>>   """SELECT a.originalsamplingstate, a.vin, max(a.utctime) AS geospatialtime
>>     |FROM GeoSpatialTemp a
>>     |GROUP BY a.vin, a.originalsamplingstate""".stripMargin
>> ).repartition(numPartitions)
>>
>> Thanks,
>>
>> Asmath
>>
>>
>
