Hi Scott,

There are some docs to help with this situation:
https://iceberg.apache.org/spark/#writing-against-partitioned-table

We added a helper function, IcebergSpark.registerBucketUDF, to register the
UDF that you need for the bucket column. If iceberg_bucket16 wasn't registered
through that helper, a bucket function that doesn't match Iceberg's bucket
transform is probably the source of the problem.
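
For reference, registration looks roughly like this (a sketch, assuming a
SparkSession named spark and a long id column; the name iceberg_bucket16 just
has to match what you use in the write):

    import org.apache.iceberg.spark.IcebergSpark
    import org.apache.spark.sql.types.DataTypes

    // Registers a UDF that applies Iceberg's bucket transform (16 buckets) to a long column
    IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)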

I always recommend an orderBy on the partition expressions before writing.
Spark seems to do best when it produces a global ordering.
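
Something along these lines should work (a sketch; the table name and the
append mode are placeholders for your setup):

    import org.apache.spark.sql.functions.{col, expr}

    // Globally sort by the partition expressions, in partition-spec order,
    // so each write task sees each partition's rows contiguously
    df.orderBy(col("identity"), col("day"), expr("iceberg_bucket16(id)"))
      .write
      .format("iceberg")
      .mode("append")
      .save("db.table")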

rb

On Fri, Nov 20, 2020 at 2:40 PM Kruger, Scott <sckru...@paypal.com.invalid>
wrote:

> I want to have a table that’s partitioned by the following, in order:
>
>
>
>    - Low-cardinality identity
>    - Day
>    - Bucketed long ID, 16 buckets
>
>
>
> Is this possible? If so, how should I do the dataframe write? This is what
> I’ve tried so far:
>
>
>
>    1. df.orderBy("identity", "day").sortWithinPartitions(expr("iceberg_bucket16(id)"))
>    2. df.orderBy("identity", "day", expr("iceberg_bucket16(id)"))
>    3. df.repartition("identity", "day").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
>    4. df.repartition("identity", "day", "id").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
>    5. df.repartitionByRange("identity", "day").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
>    6. df.repartitionByRange("identity", "day", "id").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
>
>
>
> But I keep getting the error indicating that a partition has already been
> closed.
>


-- 
Ryan Blue
Software Engineer
Netflix
