Re: Bucket partitioning in addition to regular partitioning

2020-11-20 Thread Ryan Blue
Hi Scott, There are some docs to help with this situation: https://iceberg.apache.org/spark/#writing-against-partitioned-table We added a helper function, IcebergSpark.registerBucketUDF, to register the UDF that you need for the bucket column. That's probably the source of the problem. I always…
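
A minimal Scala sketch of the approach described above, assuming Spark 3 with the Iceberg runtime on the classpath; the table name (db.events), the column names (identity, day, id), and the UDF name iceberg_bucket16 are illustrative, not taken from the thread:

    import org.apache.iceberg.spark.IcebergSpark
    import org.apache.spark.sql.functions.{col, expr}
    import org.apache.spark.sql.types.DataTypes

    // Register a UDF that applies the same bucket transform Iceberg uses
    // for a 16-bucket long partition column; the function name is arbitrary.
    IcebergSpark.registerBucketUDF(spark, "iceberg_bucket16", DataTypes.LongType, 16)

    // Order by the leading partition columns, then cluster rows by the bucket
    // value so each task writes to as few partitions as possible, then append.
    df.orderBy(col("identity"), col("day"))
      .sortWithinPartitions(expr("iceberg_bucket16(id)"))
      .writeTo("db.events")
      .append()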

Bucket partitioning in addition to regular partitioning

2020-11-20 Thread Kruger, Scott
I want to have a table that’s partitioned by the following, in order: * Low-cardinality identity * Day * Bucketed long ID, 16 buckets Is this possible? If so, how should I do the dataframe write? This is what I’ve tried so far: 1. df.orderBy(“identity”, “day”).sortWithinPartitions…
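
For reference, a layout like the one asked about can be declared with Spark 3 SQL DDL roughly as follows; the table name, the schema, and the choice of a days(ts) transform for the day partition are assumptions for illustration:

    // Illustrative DDL: identity partition first, then day, then a 16-way
    // bucket over the long id column.
    spark.sql("""
      CREATE TABLE db.events (
        identity STRING,
        id BIGINT,
        ts TIMESTAMP,
        payload STRING)
      USING iceberg
      PARTITIONED BY (identity, days(ts), bucket(16, id))
    """)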

Re: Integrating Existing Iceberg Tables with a Metastore

2020-11-20 Thread Jacques Nadeau
FYI, I would avoid adopting HMS because you need a better catalog. While the HMS Iceberg catalog is mature, you're adopting something (HMS) that carries a lot of baggage. I'd look at the other catalogs that are up and coming if you can. For example, Nessie (projectnessie.org) was built to provide…

Re: Integrating Existing Iceberg Tables with a Metastore

2020-11-20 Thread Marko Babic
Hi Peter. Thanks for responding. > The command you mention below: `CREATE EXTERNAL TABLE` above an existing Iceberg table will not transfer the "responsibility" of tracking the snapshot to HMS. It only creates an HMS external table ... So my understanding is that the HiveCatalog is basically just…

Re: Integrating Existing Iceberg Tables with a Metastore

2020-11-20 Thread Marko Babic
Hi John, Thanks for the experience report and pointers to resources. :) If we do end up going down that road it'll be super helpful. Marko On Thu, Nov 19, 2020 at 12:29 PM John Clara wrote: > Hi, > > My team has been using the custom catalog along with atomic metadata > updates but we never mi…

Re: Proposal for additional fields in Iceberg manifest files

2020-11-20 Thread Mass Dosage
+1 - I also like the idea of having more data profiling info for the partition, but I worry about hostnames and IP addresses and maintaining those as things change, especially if you have hundreds of hosts; I'd rather leave that to the name node. On Fri, 20 Nov 2020 at 17:48, Ryan Blue wrote: > Tha…

Re: Proposal for additional fields in Iceberg manifest files

2020-11-20 Thread Ryan Blue
Thanks Vivekanand! I made some comments on the doc. Overall, I think a partition index is a good idea. We've thought about adding sketches that contain skew estimates for certain columns in a partition so that we can do better join estimation. Getting a start on how we would store data like this is…

Proposal for additional fields in Iceberg manifest files

2020-11-20 Thread Vivekanand Vellanki
Hi, I would like to propose additional fields in Iceberg manifest files to support the following scenarios: - Partition index to include per-partition stats to help support planning - Data locality information…

Re: Integrating Existing Iceberg Tables with a Metastore

2020-11-20 Thread Peter Vary
Hi Marko, The command you mention below: `CREATE EXTERNAL TABLE` above an existing Iceberg table will not transfer the "responsibility" of tracking the snapshot to HMS. It only creates an HMS external table which will allow Hive queries to read the given table. If you want to track the snapshot…
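
To have the catalog (rather than a plain external table) track the snapshot, one option is to register the existing table with a HiveCatalog so HMS owns the metadata pointer going forward. A rough Scala sketch, assuming an Iceberg version that exposes Catalog.registerTable; the metastore URI, namespace, table name, and metadata file path are placeholders:

    import org.apache.iceberg.catalog.TableIdentifier
    import org.apache.iceberg.hive.HiveCatalog

    // Point a HiveCatalog at the metastore (URI is a placeholder).
    val catalog = new HiveCatalog()
    catalog.setConf(spark.sparkContext.hadoopConfiguration)
    catalog.initialize("hive", java.util.Collections.singletonMap(
      "uri", "thrift://metastore-host:9083"))

    // Register the existing table by its current metadata file so the
    // catalog tracks the metadata location from this point on.
    catalog.registerTable(
      TableIdentifier.of("db", "existing_table"),
      "s3://bucket/warehouse/db/existing_table/metadata/v12.metadata.json")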