I did register the bucket UDF (you can see me using it in the examples), and the docs were helpful to an extent, but the issue is that they only show how to use bucketing when it’s the only partitioning scheme, not as the innermost level of a multi-level partitioning scheme. That’s what I’m having trouble with (I can get things to work just fine if I follow the docs and only partition by the bucketed ID).
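For concreteness, here is a rough sketch of the layered spec I mean, built with the Iceberg PartitionSpec builder. The column names (category, ts, id) are placeholders, not the real schema:

```scala
import org.apache.iceberg.{PartitionSpec, Schema}
import org.apache.iceberg.types.Types

// Hypothetical schema; the real column names and IDs differ.
val schema = new Schema(
  Types.NestedField.required(1, "category", Types.StringType.get()), // low-cardinality identity
  Types.NestedField.required(2, "ts", Types.TimestampType.withZone()),
  Types.NestedField.required(3, "id", Types.LongType.get())
)

// Identity first, then day, then a 16-way bucket on the long ID as the innermost level.
val spec = PartitionSpec.builderFor(schema)
  .identity("category")
  .day("ts")
  .bucket("id", 16)
  .build()
```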
From: Ryan Blue <rb...@netflix.com.INVALID>
Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>, "rb...@netflix.com" <rb...@netflix.com>
Date: Friday, November 20, 2020 at 8:11 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Bucket partitioning in addition to regular partitioning

Hi Scott,

There are some docs to help with this situation: https://iceberg.apache.org/spark/#writing-against-partitioned-table

We added a helper function, IcebergSpark.registerBucketUDF, to register the UDF that you need for the bucket column. That's probably the source of the problem.

I always recommend an orderBy with the partition expressions to write. Spark seems to do best when it produces a global ordering.

rb

On Fri, Nov 20, 2020 at 2:40 PM Kruger, Scott <sckru...@paypal.com.invalid> wrote:

I want to have a table that’s partitioned by the following, in order:

* Low-cardinality identity
* Day
* Bucketed long ID, 16 buckets

Is this possible? If so, how should I do the dataframe write? This is what I’ve tried so far:

1. df.orderBy("identity", "day").sortWithinPartitions(expr("iceberg_bucket16(id)"))
2. df.orderBy("identity", "day", expr("iceberg_bucket16(id)"))
3. df.repartition("identity", "day").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
4. df.repartition("identity", "day", "id").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
5. df.repartitionByRange("identity", "day").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
6. df.repartitionByRange("identity", "day", "id").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))

But I keep getting the error indicating that a partition has already been closed.

--
Ryan Blue
Software Engineer
Netflix
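For reference, a minimal sketch of the approach Ryan recommends: register the bucket UDF with the same name and bucket count as the table spec, then write with a single global orderBy over every partition expression, outermost to innermost. Column names, the UDF name, and the table identifiers are placeholders, not confirmed details from the thread:

```scala
import org.apache.iceberg.spark.IcebergSpark
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}
import org.apache.spark.sql.types.DataTypes

val spark = SparkSession.builder().getOrCreate()

// Register a UDF that applies Iceberg's 16-way bucket transform to a long column.
// The UDF name and bucket count must match the table's partition spec.
IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)

// Hypothetical source DataFrame with columns identity, day, and id.
val df = spark.table("db.source_events")

// Global ordering over all partition expressions, outermost to innermost, so rows
// for the same partition reach each writer contiguously and files are not reopened.
df.orderBy(col("identity"), col("day"), expr("iceberg_bucket16(id)"))
  .write
  .format("iceberg")
  .mode("append")
  .save("db.events") // hypothetical target table
```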