[pyspark 2.4+] BucketBy SortBy doesn't retain sort order

Rishi Shah Mon, 02 Mar 2020 18:22:01 -0800

Hi All,

I have two large tables (~1TB) and saved both of them using the code below.
When I then join the two tables on join_column, Spark still does a shuffle
and a sort before the join. Could someone please help?


df.repartition(2000).write.bucketBy(1, join_column).sortBy(join_column).saveAsTable(tablename)
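
For anyone trying to reproduce this, here is a minimal sketch of one way to confirm the shuffle and sort in the join's physical plan. It assumes an active SparkSession passed in as `spark`; the helper names and the table names in the usage comment are hypothetical and not from my actual job. It relies on PySpark's `explain()` printing the plan from Python, so the output can be captured from stdout.

```python
import contextlib
import io
import re

# Matches standalone Exchange / Sort operators in the plan text,
# but not SortMergeJoin, ReusedExchange or BroadcastExchange.
_NODE = re.compile(r"\b(Exchange|Sort)\b")


def shuffle_and_sort_nodes(plan_text):
    """Return the plan lines that contain an Exchange (shuffle) or Sort operator."""
    return [line.strip() for line in plan_text.splitlines() if _NODE.search(line)]


def joined_plan_text(spark, left_table, right_table, key):
    """Join two saved tables on `key` and capture what explain() prints."""
    joined = spark.table(left_table).join(spark.table(right_table), key)
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        joined.explain()  # prints the physical plan
    return buf.getvalue()


# Hypothetical usage (table and column names are placeholders):
# plan = joined_plan_text(spark, "table_a", "table_b", "join_column")
# for node in shuffle_and_sort_nodes(plan):
#     print(node)
```

If the bucketing and sort order were being used for the join, I would expect this list to come back empty.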

-- 
Regards,

Rishi Shah
