You can save your data to hdfs or other targets using either a sorted or bucketed dataframe. In the case of bucketing you will have a different data skipping mechanism when you read back the data compared to the sorted version.
On Thu, Dec 31, 2020 at 5:40 AM Patrik Iselind <patrik....@gmail.com> wrote: > Hi everyone, > > I am trying to push by understanding of bucketing vs sorting. I hope I can > get som clarification from this list. > > Bucketing as I've come to understand it is primarily intended for when > preparing the dataframe for join operations. Where the goal is to get data > that will be joined together in the same partition, to make the joins > faster. > > Sorting on the other hand is simply for when I want my data sorted, > nothing strange there I guess. > > The effect of using bucketing, as I see it, would be the same as sorting > if I'm not doing any joining and I use enough buckets, like in the > following example program. Where the sorting or bucketing would replace the > `?()` transformation. > > ```pseudo code > df = spark.read.parquet("s3://...") > // df contains the columns A, B, and C > df2 = df.distinct().?().repartition(num_desired_partitions) > df2.write.parquet("s3://,,,") > ``` > > Is my understanding correct or am I missing something? > > Is there a performance consideration between sorting and bucketing that I > need to keep in mind? > > The end goal for me here is not that the data as such is sorted on the A > column, it's that each distinct value of A is kept together with all other > rows which have the same value in A. If all rows with the same A value > cannot fit within one partitions, then I accept that there's more than one > partitions with the same value in the A column. If there's room left in the > partitions, then it would be fine for rows with another value of A to fill > up the partition. > > I would like something as depicted below > ```desireable example > -- Partition 1 > A|B|C > ===== > 2|?|? > 2|?|? > 2|?|? > 2|?|? > 2|?|? > 2|?|? > 2|?|? > -- Partition 2 > A|B|C > ===== > 2|?|? > 0|?|? > 0|?|? > 0|?|? > 1|?|? > ``` > > What I don't want is something like below > > ```undesireable example > -- Partition 1 > A|B|C > ===== > 0|?|? > 0|?|? > 1|?|? > 0|?|? > 1|?|? > 2|?|? > 1|?|? > -- Partition 2 > A|B|C > ===== > 0|?|? > 0|?|? > 0|?|? > 1|?|? > 2|?|? > ``` > Where the A value varies. > > Patrik Iselind >