Hello all,
I'm trying to use the spark.sql min_by aggregation function with pyspark.
I'm relying on this distribution of spark : spark-3.2.1-bin-hadoop3.2
I have a dataframe made of these columns:
- productId : int
- sellerId : int
- price : double
For each product, I want to get the seller who sells the product for the
cheapest price.
Naive approach would be to do this, but I would expect two shuffles:
import spark.sql.functions as F
cheapest_prices_df =
df.groupby('productId').agg(F.min('price').alias('price'))
cheapest_sellers_df = df.join(cheapest_prices_df, on=['productId', 'price'])
I would had loved to do this instead :
import spark.sql.functions as F
cheapest_sellers_df = df.groupby('productId').agg(F.min('price'),
F.min_by('sellerId', 'price'))
Unfortunately min_by does not seem available in pyspark sql functions,
whereas I can see it in the doc :
https://spark.apache.org/docs/latest/api/sql/index.html
I have managed to use min_by with this approach but it looks slow (maybe
because of temp table creation ?):
df.createOrReplaceTempView("table")
cheapest_sellers_df = spark.sql("select min_by(sellerId, price) sellerId,
min(price) from table group by productId")
Is there a way I can rely on min_by directly in groupby ?
Is there some code missing in pyspark wrapper to make min_by visible
somehow ?
Thank you in advance for your help.
Cheers
David