I did register the bucket UDF (you can see me using it in the examples), and 
the docs were helpful to an extent, but they only show how to use bucketing 
when it’s the only partitioning scheme, not as the innermost level of a 
multi-level partitioning scheme. That’s what I’m having trouble with (I can get 
things to work just fine if I follow the docs and only partition by the 
bucketed ID).
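
For reference, the spec I’m aiming for looks roughly like this (a sketch using 
the Iceberg Java builder API; the timestamp column name "ts" is hypothetical, 
and the column types are my assumptions):

    import org.apache.iceberg.PartitionSpec

    // Sketch only: "identity" is the low-cardinality string, "ts" a timestamp
    // truncated to day, and "id" the long that gets bucketed. Bucketing is the
    // innermost (last) partition field.
    val spec = PartitionSpec.builderFor(schema) // schema is the table's Iceberg Schema
      .identity("identity")
      .day("ts")
      .bucket("id", 16)
      .build()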

From: Ryan Blue <rb...@netflix.com.INVALID>
Reply-To: "dev@iceberg.apache.org" <dev@iceberg.apache.org>, 
"rb...@netflix.com" <rb...@netflix.com>
Date: Friday, November 20, 2020 at 8:11 PM
To: Iceberg Dev List <dev@iceberg.apache.org>
Subject: Re: Bucket partitioning in addition to regular partitioning

Hi Scott,

There are some docs to help with this situation: 
https://iceberg.apache.org/spark/#writing-against-partitioned-table

We added a helper function, IcebergSpark.registerBucketUDF, to register the UDF 
that you need for the bucket column. That's probably the source of the problem.
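
For a long ID with 16 buckets, registration looks roughly like this (a sketch; 
the UDF name is whatever you pass in, here matching the one in your examples):

    import org.apache.iceberg.spark.IcebergSpark
    import org.apache.spark.sql.types.DataTypes

    // Registers a Spark UDF named "iceberg_bucket16" that applies Iceberg's
    // 16-bucket transform to a long column, so written values line up with the
    // table's bucket partitioning.
    IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)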

I always recommend an orderBy on the partition expressions before the write. 
Spark seems to do best when it produces a global ordering.
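
For your layout, that would be something like this (a sketch; the table 
identifier is a placeholder, and it assumes the bucket UDF is registered as 
iceberg_bucket16):

    import org.apache.spark.sql.functions.{col, expr}

    // Global sort by all partition expressions, bucket expression last, so rows
    // for each (identity, day, bucket) combination reach the writer together.
    df.orderBy(col("identity"), col("day"), expr("iceberg_bucket16(id)"))
      .write
      .format("iceberg")
      .mode("append")
      .save("db.table")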

rb

On Fri, Nov 20, 2020 at 2:40 PM Kruger, Scott <sckru...@paypal.com.invalid> 
wrote:
I want to have a table that’s partitioned by the following, in order:


  *   Low-cardinality identity
  *   Day
  *   Bucketed long ID, 16 buckets

Is this possible? If so, how should I do the dataframe write? This is what I’ve 
tried so far:


  1.  df.orderBy("identity", "day").sortWithinPartitions(expr("iceberg_bucket16(id)"))
  2.  df.orderBy("identity", "day", expr("iceberg_bucket16(id)"))
  3.  df.repartition("identity", "day").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
  4.  df.repartition("identity", "day", "id").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
  5.  df.repartitionByRange("identity", "day").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))
  6.  df.repartitionByRange("identity", "day", "id").sortWithinPartitions("identity", "day", expr("iceberg_bucket16(id)"))

But I keep getting an error indicating that a partition has already been 
closed.


--
Ryan Blue
Software Engineer
Netflix
