It should work if you use `ORDER BY identity, day, iceberg_bucket16(id)`: a global sort over the full partition spec, with the bucket expression innermost. You just need to ensure that each task receives data clustered by partition.
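In case a fuller example helps, here is a minimal sketch of that write path (Scala; `spark`, `df`, and the table name `db.sample` are placeholders, and the column names are taken from your examples):

```scala
import org.apache.iceberg.spark.IcebergSpark
import org.apache.spark.sql.functions.{col, expr}
import org.apache.spark.sql.types.DataTypes

// Register the bucket UDF once per session; the source type and bucket count
// must match the table's bucket(16, id) partition transform.
IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)

// Global sort over the full partition spec, bucket expression innermost. The
// range exchange this triggers clusters rows by partition, so each write task
// only ever has one partition's file open at a time.
df.orderBy(col("identity"), col("day"), expr("iceberg_bucket16(id)"))
  .write
  .format("iceberg")
  .mode("append")
  .save("db.sample")
```

The sort keys mirror the table's partition spec in order, so the shuffle places all rows for a given (identity, day, bucket) combination into one contiguous run within a single task.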
On Tue, Nov 24, 2020 at 7:25 AM Kruger, Scott <sckru...@paypal.com> wrote:

> I did register the bucket UDF (you can see me using it in the examples),
> and the docs were helpful to an extent, but the issue is that they only show
> how to use bucketing when it's the only partitioning scheme, not when it's the
> innermost level of a multi-level partitioning scheme. That's what I'm having
> trouble with (I can get things to work just fine if I follow the docs and
> only partition by the bucketed ID).
>
> *From:* Ryan Blue <rb...@netflix.com.INVALID>
> *Reply-To:* "dev@iceberg.apache.org" <dev@iceberg.apache.org>, "rb...@netflix.com" <rb...@netflix.com>
> *Date:* Friday, November 20, 2020 at 8:11 PM
> *To:* Iceberg Dev List <dev@iceberg.apache.org>
> *Subject:* Re: Bucket partitioning in addition to regular partitioning
>
> Hi Scott,
>
> There are some docs to help with this situation:
> https://iceberg.apache.org/spark/#writing-against-partitioned-table
>
> We added a helper function, IcebergSpark.registerBucketUDF, to register
> the UDF that you need for the bucket column. That's probably the source of
> the problem.
>
> I always recommend an orderBy with the partition expressions to write.
> Spark seems to do best when it produces a global ordering.
>
> rb
>
> On Fri, Nov 20, 2020 at 2:40 PM Kruger, Scott <sckru...@paypal.com.invalid> wrote:
>
> I want to have a table that's partitioned by the following, in order:
>
> - Low-cardinality identity
> - Day
> - Bucketed long ID, 16 buckets
>
> Is this possible? If so, how should I do the dataframe write? This is what
> I've tried so far:
>
> 1. `df.orderBy("identity", "day").sortWithinPartitions(expr("iceberg_bucket16(id)"))`
> 2. `df.orderBy("identity", "day", expr("iceberg_bucket16(id)"))`
> 3. `df.repartition("identity", "day").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))`
> 4. `df.repartition("identity", "day", "id").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))`
> 5. `df.repartitionByRange("identity", "day").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))`
> 6. `df.repartitionByRange("identity", "day", "id").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))`
>
> But I keep getting an error indicating that a partition has already been closed.
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix