Hello all,

I'm trying to use the Spark SQL min_by aggregation function with PySpark. I'm relying on this distribution of Spark: spark-3.2.1-bin-hadoop3.2.
I have a dataframe made of these columns:

- productId : int
- sellerId : int
- price : double

For each product, I want to get the seller who sells it at the cheapest price. The naive approach would be the following, but I would expect it to trigger two shuffles:

    import pyspark.sql.functions as F

    cheapest_prices_df = df.groupby('productId').agg(F.min('price').alias('price'))
    cheapest_sellers_df = df.join(cheapest_prices_df, on=['productId', 'price'])

I would have loved to do this instead:

    import pyspark.sql.functions as F

    cheapest_sellers_df = df.groupby('productId').agg(F.min('price'), F.min_by('sellerId', 'price'))

Unfortunately min_by does not seem to be available in pyspark.sql.functions, whereas I can see it in the SQL functions doc: https://spark.apache.org/docs/latest/api/sql/index.html

I have managed to use min_by with the approach below, but it looks slow (maybe because of the temp table creation?):

    df.createOrReplaceTempView("table")
    cheapest_sellers_df = spark.sql("select min_by(sellerId, price) sellerId, min(price) from table group by productId")

Is there a way I can rely on min_by directly in a groupby? Is there some code missing in the PySpark wrapper that would make min_by visible somehow?

Thank you in advance for your help.

Cheers,
David
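
PS: for completeness, here is a minimal, self-contained sketch of the one-pass aggregation I am after, going through F.expr to reach the SQL function on the assumption that it can be called that way even though pyspark.sql.functions has no min_by wrapper in 3.2.x. The sample data is made up purely to illustrate, and I have not benchmarked whether this is any faster than the temp-view version:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Made-up sample data matching the schema described above (productId, sellerId, price).
    df = spark.createDataFrame(
        [(1, 10, 9.99), (1, 11, 7.49), (2, 10, 3.50), (2, 12, 4.25)],
        ['productId', 'sellerId', 'price'],
    )

    # min_by(sellerId, price) picks the sellerId of the row with the minimum price,
    # so a single groupBy/agg should cover the whole job in one shuffle.
    cheapest_sellers_df = (
        df.groupBy('productId')
          .agg(
              F.min('price').alias('price'),
              F.expr('min_by(sellerId, price)').alias('sellerId'),
          )
    )

    cheapest_sellers_df.show()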