You can save your data to HDFS or other targets from either a sorted or a
bucketed dataframe. The difference shows up when you read the data back: a
bucketed table lets the reader skip whole buckets based on the hash of the
bucketing column, while sorted files are skipped based on the min/max
statistics stored for each file or row group.
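
To make the two skipping mechanisms concrete, here is a small toy model in
plain Python (not Spark code). It assumes each output "file" is just a list
of values of column A, and it uses a simple modulo in place of Spark's real
hash function. The helper names are made up for illustration. In Spark the
bucketed layout roughly corresponds to `DataFrameWriter.bucketBy`, and the
sorted layout to sorting before the write.

```python
# Toy model, not Spark: each "file" is a list of values of column A.

def sorted_layout(values, rows_per_file):
    """Sort by A, then cut the sorted rows into fixed-size files."""
    ordered = sorted(values)
    return [ordered[i:i + rows_per_file]
            for i in range(0, len(ordered), rows_per_file)]

def bucketed_layout(values, num_buckets):
    """Assign each row to a file by a hash of A (modulo is a stand-in)."""
    files = [[] for _ in range(num_buckets)]
    for a in values:
        files[a % num_buckets].append(a)
    return files

def files_read_sorted(files, wanted):
    """Min/max skipping: read a file only if wanted is within [min, max]."""
    return [i for i, f in enumerate(files) if f and min(f) <= wanted <= max(f)]

def files_read_bucketed(files, wanted):
    """Bucket pruning: the hash of wanted names the single file to read."""
    return [wanted % len(files)]

# Column A taken from the quoted "undesirable" example, both partitions.
values = [0, 0, 1, 0, 1, 2, 1, 0, 0, 0, 1, 2]

by_sort = sorted_layout(values, rows_per_file=5)
by_bucket = bucketed_layout(values, num_buckets=3)

print(by_sort)                          # files are evenly filled, values may spill over
print(files_read_sorted(by_sort, 0))    # files whose min/max range contains A == 0
print([len(f) for f in by_bucket])      # one value per bucket here, but uneven sizes
print(files_read_bucketed(by_bucket, 1))
```

In this model the sorted layout fills files evenly and lets a value spill
into the next file, which is close to the layout asked for in the quoted
message. The bucketed layout keeps each value in exactly one file but makes
the file sizes depend on how often each value occurs.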

On Thu, Dec 31, 2020 at 5:40 AM Patrik Iselind <patrik....@gmail.com> wrote:

> Hi everyone,
>
> I am trying to push my understanding of bucketing vs sorting. I hope I can
> get some clarification from this list.
>
> Bucketing, as I've come to understand it, is primarily intended for
> preparing the dataframe for join operations, where the goal is to get data
> that will be joined together into the same partition, to make the joins
> faster.
>
> Sorting on the other hand is simply for when I want my data sorted,
> nothing strange there I guess.
>
> The effect of using bucketing, as I see it, would be the same as sorting
> if I'm not doing any joining and I use enough buckets, as in the
> following example program, where the sorting or bucketing would replace the
> `?()` transformation.
>
> ```pseudo code
> df = spark.read.parquet("s3://...")
> // df contains the columns A, B, and C
> df2 = df.distinct().?().repartition(num_desired_partitions)
> df2.write.parquet("s3://,,,")
> ```
>
> Is my understanding correct or am I missing something?
>
> Is there a performance consideration between sorting and bucketing that I
> need to keep in mind?
>
> The end goal for me here is not that the data as such is sorted on the A
> column, it's that each distinct value of A is kept together with all other
> rows which have the same value in A. If all rows with the same A value
> cannot fit within one partition, then I accept that there's more than one
> partition with the same value in the A column. If there's room left in the
> partitions, then it would be fine for rows with another value of A to fill
> up the partition.
>
> I would like something as depicted below
> ```desirable example
> -- Partition 1
> A|B|C
> =====
> 2|?|?
> 2|?|?
> 2|?|?
> 2|?|?
> 2|?|?
> 2|?|?
> 2|?|?
> -- Partition 2
> A|B|C
> =====
> 2|?|?
> 0|?|?
> 0|?|?
> 0|?|?
> 1|?|?
> ```
>
> What I don't want is something like below
>
> ```undesirable example
> -- Partition 1
> A|B|C
> =====
> 0|?|?
> 0|?|?
> 1|?|?
> 0|?|?
> 1|?|?
> 2|?|?
> 1|?|?
> -- Partition 2
> A|B|C
> =====
> 0|?|?
> 0|?|?
> 0|?|?
> 1|?|?
> 2|?|?
> ```
> Here the A value varies within each partition.
>
> Patrik Iselind
>
