It should work if you use `ORDER BY category, ts, iceberg_bucket16(id)`.
You just need to ensure that each task receives data clustered by partition.
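For reference, a minimal sketch of that write (the table identifier and
column names are illustrative, and this assumes the bucket UDF was already
registered under the name iceberg_bucket16):

import org.apache.spark.sql.functions.{col, expr}

// Global sort by the partition expressions: orderBy triggers a range
// shuffle, so each task receives rows clustered by (category, ts, bucket).
df.orderBy(col("category"), col("ts"), expr("iceberg_bucket16(id)"))
  .write
  .format("iceberg")
  .mode("append")
  .save("db.table")  // placeholder table identifier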

On Tue, Nov 24, 2020 at 7:25 AM Kruger, Scott <sckru...@paypal.com> wrote:

> I did register the bucket UDF (you can see me using it in the examples),
> and the docs were helpful to an extent, but the issue is that they only
> show how to use bucketing when it’s the only partition field, not when
> it’s the innermost field of a multi-level partition spec. That’s what I’m
> having trouble with (I can get things to work just fine if I follow the
> docs and only partition by the bucketed ID).
>
>
>
> *From: *Ryan Blue <rb...@netflix.com.INVALID>
> *Reply-To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org>, "
> rb...@netflix.com" <rb...@netflix.com>
> *Date: *Friday, November 20, 2020 at 8:11 PM
> *To: *Iceberg Dev List <dev@iceberg.apache.org>
> *Subject: *Re: Bucket partitioning in addition to regular partitioning
>
>
>
>
> Hi Scott,
>
>
>
> There are some docs to help with this situation:
> https://iceberg.apache.org/spark/#writing-against-partitioned-table
>
>
>
> We added a helper function, IcebergSpark.registerBucketUDF, to register
> the UDF that you need for the bucket column. A missing registration is
> probably the source of the problem.
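>
> A hedged sketch of that registration (the UDF name and bucket width here
> just mirror the iceberg_bucket16(id) calls in this thread, and the id
> column is assumed to be a long):
>
> import org.apache.iceberg.spark.IcebergSpark
> import org.apache.spark.sql.types.DataTypes
>
> // Registers a Spark UDF named "iceberg_bucket16" on the active session
> // that applies Iceberg's 16-way bucket transform to a long value.
> IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)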
>
>
>
> I always recommend an orderBy with the partition expressions when writing.
> Spark seems to do best when it produces a global ordering.
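>
> To make that concrete, a sketch using the column names from your message
> ("db.table" is a placeholder): sortWithinPartitions only sorts inside each
> task, while orderBy adds a global range shuffle, so all rows for one
> Iceberg partition land in the same task before the write.
>
> df.orderBy(col("identity"), col("day"), expr("iceberg_bucket16(id)"))
>   .write
>   .format("iceberg")
>   .mode("append")
>   .save("db.table")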
>
>
>
> rb
>
>
>
> On Fri, Nov 20, 2020 at 2:40 PM Kruger, Scott <sckru...@paypal.com.invalid>
> wrote:
>
> I want to have a table that’s partitioned by the following, in order (a
> rough sketch of the spec follows the list):
>
>
>
>    - Low-cardinality identity
>    - Day
>    - Bucketed long ID, 16 buckets
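>
> For context, the spec I’m after looks roughly like this (a sketch with the
> PartitionSpec builder; the schema value and the ts source column for the
> day transform are placeholders):
>
> import org.apache.iceberg.PartitionSpec
>
> // identity first, then day, then a 16-way bucket, matching the order above
> val spec = PartitionSpec.builderFor(schema)
>   .identity("identity")
>   .day("ts")
>   .bucket("id", 16)
>   .build()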
>
>
>
> Is this possible? If so, how should I do the dataframe write? This is what
> I’ve tried so far:
>
>
>
>    1. df.orderBy("identity", "day").sortWithinPartitions(expr("iceberg_bucket16(id)"))
>    2. df.orderBy("identity", "day", expr("iceberg_bucket16(id)"))
>    3. df.repartition("identity", "day").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
>    4. df.repartition("identity", "day", "id").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
>    5. df.repartitionByRange("identity", "day").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
>    6. df.repartitionByRange("identity", "day", "id").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
>
>
>
> But I keep getting an error indicating that a partition has already been
> closed.
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix
