It should work if you use `ORDER BY category, ts, iceberg_bucket16(id)`.
You just need to ensure that each task receives data clustered by partition.
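
To illustrate what "clustered by partition" means here, below is a minimal plain-Python sketch, not Iceberg or Spark code. It assumes a simplified model in which a writer closes a partition as soon as it sees a row for a different one, so a partition key that reappears later in the same task triggers the "already closed" error. The `bucket16` function is a hypothetical stand-in (Iceberg's real bucket transform hashes the value rather than taking a modulo), and the `category`/`day`/`id` rows are made-up sample data.

```python
from itertools import groupby

NUM_BUCKETS = 16


def bucket16(id_value):
    # Hypothetical stand-in for iceberg_bucket16(id). The real transform
    # hashes the value; a modulo is used here only to keep the demo readable.
    return id_value % NUM_BUCKETS


def partition_key(row):
    # Full partition tuple: identity column, day, then the bucketed ID.
    return (row["category"], row["day"], bucket16(row["id"]))


def is_clustered(rows):
    """Return True if every partition key forms one contiguous run."""
    seen = set()
    for key, _ in groupby(rows, key=partition_key):
        if key in seen:
            # The key came back after its run ended: in this simplified
            # model the writer would already have closed that partition.
            return False
        seen.add(key)
    return True


# Made-up sample rows for a single task.
rows = [
    {"category": "a", "day": "2020-11-20", "id": 1},
    {"category": "a", "day": "2020-11-20", "id": 2},
    {"category": "a", "day": "2020-11-20", "id": 17},
    {"category": "b", "day": "2020-11-20", "id": 1},
    {"category": "a", "day": "2020-11-21", "id": 33},
]

# Sorting on only the outer partition columns leaves buckets interleaved.
by_category_day = sorted(rows, key=lambda r: (r["category"], r["day"]))

# Sorting on the full partition tuple, bucket included, clusters the rows.
by_full_key = sorted(rows, key=partition_key)

print(is_clustered(by_category_day))
print(is_clustered(by_full_key))
```

In this model, ordering by the outer columns alone is not enough because rows for different buckets stay interleaved within each category and day; including the bucket expression in the sort key is what makes each partition contiguous.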

On Tue, Nov 24, 2020 at 7:25 AM Kruger, Scott <> wrote:

> I did register the bucket UDF (you can see me using it in the examples),
> and the docs were helpful to an extent, but the issue is that they only show
> how to use bucketing when it’s the only partitioning scheme, not when it’s
> the innermost level of a multi-level partitioning scheme. That’s what I’m
> having trouble with (I can get things to work just fine if I follow the docs
> and only partition by the bucketed ID).
> *From: *Ryan Blue <>
> *Date: *Friday, November 20, 2020 at 8:11 PM
> *To: *Iceberg Dev List <>
> *Subject: *Re: Bucket partitioning in addition to regular partitioning
> Hi Scott,
> There are some docs to help with this situation:
> <>
> We added a helper function, IcebergSpark.registerBucketUDF, to register
> the UDF that you need for the bucket column. Not having that UDF registered
> is probably the source of the problem.
> I always recommend writing with an orderBy on the partition expressions.
> Spark seems to do best when it produces a global ordering.
> rb
> On Fri, Nov 20, 2020 at 2:40 PM Kruger, Scott wrote:
> I want to have a table that’s partitioned by the following, in order:
>    - Low-cardinality identity
>    - Day
>    - Bucketed long ID, 16 buckets
> Is this possible? If so, how should I do the dataframe write? This is what
> I’ve tried so far:
>    1. df.orderBy("identity", "day").sortWithinPartitions(expr("iceberg_bucket16(id)"))
>    2. df.orderBy("identity", "day", expr("iceberg_bucket16(id)"))
>    3. df.repartition("identity", "day").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
>    4. df.repartition("identity", "day", "id").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
>    5. df.repartitionByRange("identity", "day").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
>    6. df.repartitionByRange("identity", "day", "id").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
> But I keep getting the error indicating that a partition has already been
> closed.
> --
> Ryan Blue
> Software Engineer
> Netflix

Ryan Blue
Software Engineer
