> > Ordering for a struct goes in order of the fields. So the max struct is > the one with the highest TotalValue (and then the highest category > if there are multiple entries with the same hour and total value). > > Is this due to "InterpretedOrdering" in StructType? >
That is one implementation, but the code generated ordering also follows the same contract. > 4) Is it faster doing it this way than doing a join or window function > in Spark SQL? > > Way faster. This is a very efficient way to calculate argmax. > > Can you explain how this is way faster than window function? I can > understand join doesn't make sense in this case. But to calculate the > grouping max, you just have to shuffle the data by grouping keys. You maybe > can do a combiner on the mapper side before shuffling, but that is it. Do > you mean windowing function in Spark SQL won't do any map side combiner, > even it is for max? > Windowing can't do partial aggregation and will have to collect all the data for a group so that it can be sorted before applying the function. In contrast a max aggregation will do partial aggregation (map side combining) and can be calculated in a streaming fashion. Also, aggregation is more common and thus has seen more optimization beyond the theoretical limits described above.