It should work if you use `ORDER BY identity, day, iceberg_bucket16(id)` — that is, order by all of the partition expressions, with the bucket innermost. You just need to ensure that each task receives data clustered by partition.
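Roughly something like this (a minimal, untested Scala sketch — the `spark` session variable, the `db.events` table identifier, and the write call are assumptions on my part; the UDF name, bucket count, and column names come from your earlier message):

    import org.apache.iceberg.spark.IcebergSpark
    import org.apache.spark.sql.functions.expr
    import org.apache.spark.sql.types.DataTypes

    // Register the bucket UDF once per session, matching the 16-bucket
    // long ID partition. The UDF name mirrors your examples.
    IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)

    // Globally sort by every partition expression, bucket innermost, so
    // each write task sees its rows clustered by partition before the
    // Iceberg writer opens files.
    df.orderBy(expr("identity"), expr("day"), expr("iceberg_bucket16(id)"))
      .write
      .format("iceberg")
      .mode("append")
      .save("db.events") // placeholder table identifier

The key difference from your attempts 1 and 3–6 is that the global orderBy includes the bucket expression itself, so rows for the same bucket never land in two open tasks at once.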
On Tue, Nov 24, 2020 at 7:25 AM Kruger, Scott <[email protected]> wrote:

> I did register the bucket UDF (you can see me using it in the examples),
> and the docs were helpful to an extent, but the issue is that they only show
> how to use bucketing when it’s the only partitioning scheme, not the
> innermost level of a multi-level partitioning scheme. That’s what I’m
> having trouble with (I can get things to work just fine if I follow the
> docs and partition only by the bucketed ID).
>
> *From:* Ryan Blue <[email protected]>
> *Reply-To:* "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
> *Date:* Friday, November 20, 2020 at 8:11 PM
> *To:* Iceberg Dev List <[email protected]>
> *Subject:* Re: Bucket partitioning in addition to regular partitioning
>
> Hi Scott,
>
> There are some docs to help with this situation:
> https://iceberg.apache.org/spark/#writing-against-partitioned-table
>
> We added a helper function, IcebergSpark.registerBucketUDF, to register
> the UDF that you need for the bucket column. Not registering it is
> probably the source of the problem.
>
> I always recommend an orderBy with the partition expressions when
> writing. Spark seems to do best when it produces a global ordering.
>
> rb
>
> On Fri, Nov 20, 2020 at 2:40 PM Kruger, Scott <[email protected]> wrote:
>
> I want to have a table that’s partitioned by the following, in order:
>
> - Low-cardinality identity
> - Day
> - Bucketed long ID, 16 buckets
>
> Is this possible? If so, how should I do the dataframe write? This is
> what I’ve tried so far:
>
> 1. df.orderBy("identity", "day").sortWithinPartitions(expr("iceberg_bucket16(id)"))
> 2. df.orderBy("identity", "day", expr("iceberg_bucket16(id)"))
> 3. df.repartition("identity", "day").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
> 4. df.repartition("identity", "day", "id").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
> 5. df.repartitionByRange("identity", "day").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
> 6. df.repartitionByRange("identity", "day", "id").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
>
> But I keep getting an error indicating that a partition has already been
> closed.
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Ryan Blue
Software Engineer
Netflix
