Thank you Peyman for clarifying this for me. Would you say there's a case for using bucketing in this case at all, or should I simply focus completely on the sorting solution? If so, when would you say bucketing is the preferred solution?
Patrik Iselind On Thu, Dec 31, 2020 at 4:15 PM Peyman Mohajerian <mohaj...@gmail.com> wrote: > You can save your data to hdfs or other targets using either a sorted or > bucketed dataframe. In the case of bucketing you will have a different data > skipping mechanism when you read back the data compared to the sorted > version. > > On Thu, Dec 31, 2020 at 5:40 AM Patrik Iselind <patrik....@gmail.com> > wrote: > >> Hi everyone, >> >> I am trying to push by understanding of bucketing vs sorting. I hope I >> can get som clarification from this list. >> >> Bucketing as I've come to understand it is primarily intended for when >> preparing the dataframe for join operations. Where the goal is to get data >> that will be joined together in the same partition, to make the joins >> faster. >> >> Sorting on the other hand is simply for when I want my data sorted, >> nothing strange there I guess. >> >> The effect of using bucketing, as I see it, would be the same as sorting >> if I'm not doing any joining and I use enough buckets, like in the >> following example program. Where the sorting or bucketing would replace the >> `?()` transformation. >> >> ```pseudo code >> df = spark.read.parquet("s3://...") >> // df contains the columns A, B, and C >> df2 = df.distinct().?().repartition(num_desired_partitions) >> df2.write.parquet("s3://,,,") >> ``` >> >> Is my understanding correct or am I missing something? >> >> Is there a performance consideration between sorting and bucketing that I >> need to keep in mind? >> >> The end goal for me here is not that the data as such is sorted on the A >> column, it's that each distinct value of A is kept together with all other >> rows which have the same value in A. If all rows with the same A value >> cannot fit within one partitions, then I accept that there's more than one >> partitions with the same value in the A column. If there's room left in the >> partitions, then it would be fine for rows with another value of A to fill >> up the partition. >> >> I would like something as depicted below >> ```desireable example >> -- Partition 1 >> A|B|C >> ===== >> 2|?|? >> 2|?|? >> 2|?|? >> 2|?|? >> 2|?|? >> 2|?|? >> 2|?|? >> -- Partition 2 >> A|B|C >> ===== >> 2|?|? >> 0|?|? >> 0|?|? >> 0|?|? >> 1|?|? >> ``` >> >> What I don't want is something like below >> >> ```undesireable example >> -- Partition 1 >> A|B|C >> ===== >> 0|?|? >> 0|?|? >> 1|?|? >> 0|?|? >> 1|?|? >> 2|?|? >> 1|?|? >> -- Partition 2 >> A|B|C >> ===== >> 0|?|? >> 0|?|? >> 0|?|? >> 1|?|? >> 2|?|? >> ``` >> Where the A value varies. >> >> Patrik Iselind >> >