Thank you Peyman for clarifying this for me.
Would you say there's a case for using bucketing in this case at all, or
should I simply focus completely on the sorting solution? If so, when would
you say bucketing is the preferred solution?

Patrik Iselind


On Thu, Dec 31, 2020 at 4:15 PM Peyman Mohajerian <mohaj...@gmail.com>
wrote:

> You can save your data to hdfs or other targets using either a sorted or
> bucketed dataframe. In the case of bucketing you will have a different data
> skipping mechanism when you read back the data compared to the sorted
> version.
>
> On Thu, Dec 31, 2020 at 5:40 AM Patrik Iselind <patrik....@gmail.com>
> wrote:
>
>> Hi everyone,
>>
>> I am trying to push by understanding of bucketing vs sorting. I hope I
>> can get som clarification from this list.
>>
>> Bucketing as I've come to understand it is primarily intended for when
>> preparing the dataframe for join operations. Where the goal is to get data
>> that will be joined together in the same partition, to make the joins
>> faster.
>>
>> Sorting on the other hand is simply for when I want my data sorted,
>> nothing strange there I guess.
>>
>> The effect of using bucketing, as I see it, would be the same as sorting
>> if I'm not doing any joining and I use enough buckets, like in the
>> following example program. Where the sorting or bucketing would replace the
>> `?()` transformation.
>>
>> ```pseudo code
>> df = spark.read.parquet("s3://...")
>> // df contains the columns A, B, and C
>> df2 = df.distinct().?().repartition(num_desired_partitions)
>> df2.write.parquet("s3://,,,")
>> ```
>>
>> Is my understanding correct or am I missing something?
>>
>> Is there a performance consideration between sorting and bucketing that I
>> need to keep in mind?
>>
>> The end goal for me here is not that the data as such is sorted on the A
>> column, it's that each  distinct value of A is kept together with all other
>> rows which have the same value in A. If all rows with the same A value
>> cannot fit within one partitions, then I accept that there's more than one
>> partitions with the same value in the A column. If there's room left in the
>> partitions, then it would be fine for rows with another value of A to fill
>> up the partition.
>>
>> I would like something as depicted below
>> ```desireable example
>> -- Partition 1
>> A|B|C
>> =====
>> 2|?|?
>> 2|?|?
>> 2|?|?
>> 2|?|?
>> 2|?|?
>> 2|?|?
>> 2|?|?
>> -- Partition 2
>> A|B|C
>> =====
>> 2|?|?
>> 0|?|?
>> 0|?|?
>> 0|?|?
>> 1|?|?
>> ```
>>
>> What I don't want is something like below
>>
>> ```undesireable example
>> -- Partition 1
>> A|B|C
>> =====
>> 0|?|?
>> 0|?|?
>> 1|?|?
>> 0|?|?
>> 1|?|?
>> 2|?|?
>> 1|?|?
>> -- Partition 2
>> A|B|C
>> =====
>> 0|?|?
>> 0|?|?
>> 0|?|?
>> 1|?|?
>> 2|?|?
>> ```
>> Where the A value varies.
>>
>> Patrik Iselind
>>
>

Reply via email to