Hive partitioning is partitioning of data at rest, as opposed to Spark
partitioning; make sure you're not confusing the two. If the cardinality of
the column you want to bucket by isn't too high and your data isn't skewed
across the buckets, then you should use bucketing (and each
partition h
Thank you Peyman for clarifying this for me.
Would you say there's a case for using bucketing here at all, or
should I simply focus completely on the sorting solution? If so, when would
you say bucketing is the preferred solution?
Patrik Iselind
On Thu, Dec 31, 2020 at 4:15 PM Peyman Moh
Looking at the Big Picture https://backbutton.co.uk/about.html
This guy gives his reasons for choosing Flink over Spark. https://youtu.be/sYlbD_OoHhs
Airbus makes more of the sky with Flink - Jesse Anderson & Hassene Ben Salem
Is he leading people up the wrong garden path by making a
You can save your data to HDFS or other targets using either a sorted or
bucketed dataframe. With bucketing, a different data-skipping mechanism
applies when you read the data back than with the sorted
version.
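
To make the difference in data skipping concrete, here is a minimal
plain-Python sketch, not Spark code. It assumes a simplified model: bucketing
places each row in a file chosen by a hash of the key, so an equality lookup
opens only one bucket, while sorting lets each file carry min/max key
statistics, so a range lookup can skip files that cannot match. The function
names, the CRC32 hash (a stand-in; Spark uses its own Murmur3-based hash) and
the sample data are all made up for illustration.

```python
import zlib

def bucket_id(key: str, num_buckets: int) -> int:
    # Stand-in hash (CRC32), an assumption for this sketch only.
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def write_bucketed(rows, num_buckets):
    """Group rows into buckets by hash of the key (one 'file' per bucket)."""
    files = [[] for _ in range(num_buckets)]
    for key, value in rows:
        files[bucket_id(key, num_buckets)].append((key, value))
    return files

def write_sorted(rows, rows_per_file):
    """Sort rows by key, split into files, and record min/max key per file."""
    ordered = sorted(rows)
    files = []
    for i in range(0, len(ordered), rows_per_file):
        chunk = ordered[i:i + rows_per_file]
        files.append({"min": chunk[0][0], "max": chunk[-1][0], "rows": chunk})
    return files

def read_bucketed(files, key):
    """Equality lookup: only the bucket the key hashes to is opened."""
    bucket = files[bucket_id(key, len(files))]
    return [v for k, v in bucket if k == key], 1  # (values, files opened)

def read_sorted(files, lo, hi):
    """Range lookup: files whose [min, max] cannot overlap [lo, hi] are skipped."""
    hits, scanned = [], 0
    for f in files:
        if f["max"] < lo or f["min"] > hi:
            continue  # skipped using the min/max statistics alone
        scanned += 1
        hits.extend(v for k, v in f["rows"] if lo <= k <= hi)
    return hits, scanned

# Hypothetical sample data: 100 rows keyed user000 .. user099.
rows = [(f"user{i:03d}", i) for i in range(100)]
bucketed = write_bucketed(rows, num_buckets=8)
by_range = write_sorted(rows, rows_per_file=25)

print(read_bucketed(bucketed, "user042"))           # one of 8 buckets opened
print(read_sorted(by_range, "user010", "user020"))  # one of 4 files scanned
```

In this model the bucketed layout helps equality lookups on the bucketing
key (and joins on it), while the sorted layout helps range filters. Whether
that trade-off carries over to a real table depends on the file format and
the query, so treat it as an intuition aid only.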
On Thu, Dec 31, 2020 at 5:40 AM Patrik Iselind wrote:
> Hi everyone,
Hi everyone,
I am trying to push my understanding of bucketing vs sorting. I hope I can
get some clarification from this list.
Bucketing as I've come to understand it is primarily intended for
preparing the dataframe for join operations, where the goal is to get data
that will be joined toget
Holden Karau https://www.amazon.co.uk/High-Performance-Spark-Practices-Optimizing-ebook/dp/B0725YT69J
made the same point earlier (can be found in the archives) as seen in the Big Picture https://backbutton.co.uk/about.html
when she said Apache Spark and Apache Flink are not "enemies",