Hi,

I am working on a requirement where I need to perform a groupBy on a set of
data and find the max value within each group.

The groupBy on the DataFrame is resulting in data skew, and the job runs for
quite a long time (actually longer than Hive and Impala take for one day's
worth of data).

Any suggestions on how to overcome this?

dataframe.groupBy(
  Constants.Datapoint.Vin,
  Constants.Datapoint.Utctime,
  Constants.Datapoint.ProviderDesc,
  Constants.Datapoint.Latitude,
  Constants.Datapoint.Longitude)
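
For reference, here is a minimal sketch of the kind of aggregation described above, not the actual job. The snippet does not show which column the max is taken over, so the `valueCol` parameter and the helper name `maxPerGroup` are hypothetical, and the sketch assumes the `Constants.Datapoint.*` values are column-name strings.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, max}

// Hypothetical helper: groups `df` by the given key columns and keeps the
// maximum of `valueCol` for each group. `valueCol` is an assumed name; the
// original post does not say which column the max is computed on.
def maxPerGroup(df: DataFrame, keys: Seq[String], valueCol: String): DataFrame =
  df.groupBy(keys.map(col): _*)
    .agg(max(col(valueCol)).as(s"max_$valueCol"))
```

With the keys from the snippet, this would be called as `maxPerGroup(dataframe, Seq(Constants.Datapoint.Vin, Constants.Datapoint.Utctime, Constants.Datapoint.ProviderDesc, Constants.Datapoint.Latitude, Constants.Datapoint.Longitude), valueCol)`.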

*Note:* I have also tried coalesce and persisting the data to memory and disk,
but there is still no improvement.

Thanks,
Asmath.
