Hi Iceberg Devs,
I am evaluating the performance of bucketed joins across two bucketed
datasets to find an optimal bucketing strategy. I was able to ingest into a
bucketed table [1], and using the TableScan API I can see that only a
subset (total files / number of buckets) of the files is scanned [2]. I
also benchmarked joins between the two datasets with different bucket
configurations (a simplified sketch is in [3] below). However, I recently
came across this comment
<https://github.com/apache/iceberg/issues/430#issuecomment-533360026> on
issue #430 <https://github.com/apache/iceberg/issues/430> indicating that
some work is still pending for Spark to leverage Iceberg bucket values. Is
that comment still accurate? Is there anything I can contribute to help?

*[1] - Partition Spec*

import org.apache.iceberg.PartitionSpec

// Partition by namespace, then hash "id" into numberOfBuckets buckets.
val partitionSpec = PartitionSpec
    .builderFor(mergedSchema)
    .identity("namespace")
    .bucket("id", numberOfBuckets)
    .build()
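
For completeness, the table was created from that spec roughly like this
(simplified; the HadoopTables catalog and warehouse path here are just
placeholders for illustration):

import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.hadoop.HadoopTables

// Assumption for illustration: a Hadoop catalog with a placeholder location.
val tables = new HadoopTables(new Configuration())
val table  = tables.create(mergedSchema, partitionSpec, "/tmp/warehouse/bucketed_table")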

*[2] - TableScan API*

import org.apache.iceberg.expressions.Expressions
import scala.collection.JavaConverters._

// Filter on the bucketed column so planning prunes to the matching bucket's files.
val iBucketIdExp  = Expressions.equal("id", "1")
val iBucketIdScan = table.newScan().filter(iBucketIdExp)
val filesScanned  = iBucketIdScan.planFiles.asScala.size.toLong
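
*[3] - Join Benchmark (sketch)*

Roughly the shape of the benchmark, simplified for illustration; the table
locations and the exact timing setup are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketed-join-benchmark").getOrCreate()

// Load both bucketed tables through the Iceberg Spark source (paths are placeholders).
val left  = spark.read.format("iceberg").load("/tmp/warehouse/table_a")
val right = spark.read.format("iceberg").load("/tmp/warehouse/table_b")

// Join on the bucketed column and force full materialization to time the join.
val start  = System.nanoTime()
val rows   = left.join(right, Seq("id")).count()
val millis = (System.nanoTime() - start) / 1e6
println(s"join produced $rows rows in $millis ms")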


-- 
Thanks,
Romin
