Yes, I think that is a good summary of the principles. #4 is correct because some of the information we provide is purely informational (the Hive schema) or is tracked only by the metastore (the best-effort current user). I also agree that it would be good to have a table identifier in HMS table metadata when loading from an external table. That gives us a way to handle name conflicts.
On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacq...@dremio.com> wrote:

> Minor error, my last example should have been:
>
> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>
>> I agree with Ryan on the core principles here. As I understand them:
>>
>> 1. Iceberg metadata describes all properties of a table
>> 2. Hive table properties describe "how to get to" Iceberg metadata (which catalog + possibly ptr, path, token, etc.)
>> 3. There could be default "how to get to" information set at a global level
>> 4. Best-effort schema should be stored in the table properties in HMS. This should be done for information schema retrieval purposes within Hive but should be ignored during Hive/other tool execution.
>>
>> Is that a fair summary of your statements, Ryan (except #4, which I just added)?
>>
>> One comment I have on #2 is that for different catalogs and use cases, it can be somewhat more complex: it would be desirable for a table that initially existed without Hive, and was later exposed in Hive, to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1). If you then want to expose that table to Hive, you might have this mapping for #2:
>>
>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>
>> Similarly, you might want to expose a particular branch version of a table. So it might say:
>>
>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>
>> Just saying that the address to the table in the catalog could itself have several properties. The key being that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.
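[Editor's note: the multi-part addresses in the examples above can be sketched as a tiny parser. This is purely illustrative — the function name, the default catalog, and the result shape are made up, not a Nessie or Iceberg API — but it shows how a pointer stored in HMS can carry a catalog, a path, and a branch while containing no table content/metadata.]

```python
def parse_address(address):
    """Split a hypothetical 'catalog:path.to.table@branch' pointer
    into its components.

    Both the catalog prefix and the branch suffix are optional; the
    'hive' default used here is an illustrative assumption.
    """
    catalog = "hive"
    if ":" in address:
        # Everything before the first ':' names the external catalog
        catalog, address = address.split(":", 1)
    branch = None
    if "@" in address:
        # Everything after the last '@' names the branch/version
        address, branch = address.rsplit("@", 1)
    return {"catalog": catalog, "path": address.split("."), "branch": branch}

# The pointer carries addressing properties only, never table metadata:
example = parse_address("nessie:folder1.folder2.folder3.table1@etl_branch")
assert example["catalog"] == "nessie"
assert example["branch"] == "etl_branch"
```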
>>
>> Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> Thanks for working on this, Laszlo. I’ve been thinking about these problems as well, so this is a good time to have a discussion about Hive config.
>>>
>>> I think that Hive configuration should work mostly like other engines, where different configurations are used for different purposes. Different purposes means that there is no single global configuration priority. Hopefully, I can explain how we use the different config sources elsewhere to clarify.
>>>
>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop Configuration, but it also has its own global configuration. There are also Iceberg table properties, and all of the various Hive properties if you’re tracking tables with a Hive MetaStore.
>>>
>>> The first step is to simplify where we can, so we effectively eliminate 2 sources of config:
>>>
>>> - The Hadoop Configuration is only used to instantiate Hadoop classes, like FileSystem. Iceberg should not use it for any other config.
>>> - Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it’s abstract), but it is available so they also don’t fail trying to load the class. Table-specific config should not be stored in table or serde properties.
>>>
>>> That leaves Spark configuration and Iceberg table configuration.
>>>
>>> Iceberg differs from other tables because it is opinionated: data configuration should be maintained at the table level.
>>> This is cleaner for users because config is standardized across engines and kept in one place. It also enables services that analyze a table and update its configuration to tune options that users almost never touch, like row group or stripe size in the columnar formats. Iceberg table configuration is used to configure table-specific concerns and behavior.
>>>
>>> Spark configuration is used for engine-specific concerns and runtime overrides. A good example of an engine-specific concern is the set of catalogs that are available to load Iceberg tables. Spark has a way to load and configure catalog implementations, and Iceberg uses that for all catalog-level config. Runtime overrides are things like target split size: Iceberg has a table-level default split size in table properties, but this can be overridden by a Spark option for each table, as well as by an option passed to the individual read. Note that these necessarily have different config names for how they are used: Iceberg uses read.split.target-size and the read-specific option is target-size.
>>>
>>> Applying this to Hive is a little strange for a couple of reasons. First, Hive’s engine configuration *is* a Hadoop Configuration. As a result, I think the right place to store engine-specific config is there, including Iceberg catalogs, using a strategy similar to what Spark does: what external Iceberg catalogs are available and their configuration should come from the HiveConf.
>>>
>>> The second way Hive is strange is that Hive needs to use its own MetaStore to track Hive table concerns. The MetaStore may have tables created by an Iceberg HiveCatalog, and Hive also needs to be able to load tables from other Iceberg catalogs by creating table entries for them.
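[Editor's note: the split-size example above implies a simple precedence order across config sources. A minimal sketch, assuming the three sources arrive as plain dicts — the resolver function and the fallback value are hypothetical; only the property names read.split.target-size and target-size come from the discussion:]

```python
def resolve_split_size(table_props, engine_table_options, read_options):
    """Resolve the target split size in the override order described:
    per-read option first, then the engine-level per-table override,
    then the Iceberg table property default."""
    for source, key in ((read_options, "target-size"),
                        (engine_table_options, "target-size"),
                        (table_props, "read.split.target-size")):
        if key in source:
            return int(source[key])
    return 134217728  # assumed 128 MB fallback, purely illustrative

# Table default is 256 MB; a per-read option of 64 MB wins over it:
table_props = {"read.split.target-size": "268435456"}
assert resolve_split_size(table_props, {}, {}) == 268435456
assert resolve_split_size(table_props, {}, {"target-size": "67108864"}) == 67108864
```

The point of the ordering is that the table property is the durable, engine-agnostic default, while engine and read options are scoped, temporary overrides.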
>>>
>>> Here’s how I think Hive should work:
>>>
>>> - There should be a default HiveCatalog that uses the current MetaStore URI, to be used for HiveCatalog tables tracked in the MetaStore
>>> - Other catalogs should be defined in HiveConf
>>> - HMS table properties should be used to determine how to load a table: using a Hadoop location, using the default metastore catalog, or using an external Iceberg catalog
>>>   - If there is a metadata_location, then use the HiveCatalog for this metastore (where it is tracked)
>>>   - If there is a catalog property, then load that catalog and use it to load the table identifier, or maybe an identifier from HMS table properties
>>>   - If there is no catalog or metadata_location, then use HadoopTables to load the table location as an Iceberg table
>>>
>>> This would make it possible to access all types of Iceberg tables in the same query, and would match how Spark and Flink configure catalogs. Other than the configuration above, I don’t think that config in HMS should be used at all, like how the other engines work. Iceberg is the source of truth for table metadata, HMS stores how to load the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>>>
>>> This isn’t quite how configuration works right now. Currently, the catalog is controlled by a HiveConf property, iceberg.mr.catalog. If that isn’t set, HadoopTables will be used to load table locations. If it is set, then that catalog will be used to load all tables by name. This makes it impossible to load tables from different catalogs at the same time. That’s why I think the Iceberg catalog for a table should be stored in HMS table properties.
>>>
>>> I should also explain the iceberg.hive.engine.enabled flag, but I think this is long enough for now.
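[Editor's note: the loading rules listed above can be sketched as a small resolution function. This is a hedged illustration assuming the HMS table properties are available as a dict; the string return values stand in for the real HiveCatalog/HadoopTables loaders and are not Iceberg classes.]

```python
def choose_loader(hms_props):
    """Pick how to load a table, following the proposed resolution order."""
    if "metadata_location" in hms_props:
        # Tracked by this metastore: use the default HiveCatalog
        return "HiveCatalog"
    if "catalog" in hms_props:
        # Load the named external catalog (defined in HiveConf),
        # then load the table identifier through it
        return "external:" + hms_props["catalog"]
    # Neither property present: treat the table location as a path-based
    # Iceberg table and load it with HadoopTables
    return "HadoopTables"

assert choose_loader({"metadata_location": "file:///tmp/00001.metadata.json"}) == "HiveCatalog"
assert choose_loader({"catalog": "nessie"}) == "external:nessie"
assert choose_loader({}) == "HadoopTables"
```

Because the decision is per-table rather than driven by a single global property like iceberg.mr.catalog, tables from different catalogs can coexist in one query.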
>>>
>>> rb
>>>
>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpin...@cloudera.com.invalid> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I would like to start a discussion: how should we handle properties from various sources like Iceberg, Hive, or global configuration? I've put together a short document <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>, please have a look and let me know what you think.
>>>>
>>>> Thanks,
>>>> Laszlo
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix

--
Ryan Blue
Software Engineer
Netflix