Hi, I am working on a requirement where I need to perform a groupBy on a set of data and find the max value in each group.
The groupBy on the DataFrame is resulting in skew, and the job runs for quite a long time (actually longer than Hive and Impala take for one day's worth of data). Any suggestions on how to overcome this?

    dataframe.groupBy(Constants.Datapoint.Vin, Constants.Datapoint.Utctime, Constants.Datapoint.ProviderDesc, Constants.Datapoint.Latitude, Constants.Datapoint.Longitude)

Note: I have already added a coalesce and persisted the data to memory and disk, but there is still no improvement.

Thanks,
Asmath