Minor error, my last example should have been:

db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com> wrote:

> I agree with Ryan on the core principles here. As I understand them:
>
> 1. Iceberg metadata describes all properties of a table
> 2. Hive table properties describe "how to get to" Iceberg metadata
>    (which catalog + possibly ptr, path, token, etc)
> 3. There could be default "how to get to" information set at a global
>    level
> 4. Best-effort schema should be stored in the table properties in HMS.
>    This should be done for information schema retrieval purposes within Hive
>    but should be ignored during Hive/other tool execution.
>
> Is that a fair summary of your statements Ryan (except 4, which I just
> added)?
>
> One comment I have on #2 is that for different catalogs and use cases, I
> think it can be somewhat more complex where it would be desirable for a
> table that initially existed without Hive that was later exposed in Hive to
> support a ptr/path/token for how the table is named externally. For
> example, in a Nessie context we support arbitrary paths for an Iceberg
> table (such as folder1.folder2.folder3.table1). If you then want to expose
> that table to Hive, you might have this mapping for #2
>
> db1.table1 => nessie:folder1.folder2.folder3.table1
>
> Similarly, you might want to expose a particular branch version of a
> table. So it might say:
>
> db1.table1_etl_branch => nessie.folder1@etl_branch
>
> Just saying that the address to the table in the catalog could itself have
> several properties. The key being that no matter what those are, we should
> follow #1 and only store properties that are about the ptr, not the
> content/metadata.
>
> Lastly, I believe #4 is the case but haven't tested it. Can someone
> confirm that it is true? And that it is possible/not problematic?
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Thanks for working on this, Laszlo. I’ve been thinking about these
>> problems as well, so this is a good time to have a discussion about Hive
>> config.
>>
>> I think that Hive configuration should work mostly like other engines,
>> where different configurations are used for different purposes. Different
>> purposes means that there is not a global configuration priority.
>> Hopefully, I can explain how we use the different config sources elsewhere
>> to clarify.
>>
>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
>> Configuration, but it also has its own global configuration. There are also
>> Iceberg table properties, and all of the various Hive properties if you’re
>> tracking tables with a Hive MetaStore.
>>
>> The first step is to simplify where we can, so we effectively eliminate 2
>> sources of config:
>>
>> - The Hadoop Configuration is only used to instantiate Hadoop
>>   classes, like FileSystem. Iceberg should not use it for any other config.
>> - Config in the Hive MetaStore is only used to identify that a table
>>   is Iceberg and point to its metadata location. All other config in HMS is
>>   informational. For example, the input format is FileInputFormat so that
>>   non-Iceberg readers cannot actually instantiate the format (it’s abstract)
>>   but it is available so they also don’t fail trying to load the class.
>>   Table-specific config should not be stored in table or serde properties.
>>
>> That leaves Spark configuration and Iceberg table configuration.
>>
>> Iceberg differs from other tables because it is opinionated: data
>> configuration should be maintained at the table level. This is cleaner for
>> users because config is standardized across engines and in one place.
>> And it also enables services that analyze a table and update its
>> configuration to tune options that users almost never set themselves, like
>> row group or stripe size in the columnar formats. Iceberg table
>> configuration is used to configure table-specific concerns and behavior.
>>
>> Spark configuration is used for engine-specific concerns and runtime
>> overrides. A good example of an engine-specific concern is the catalogs
>> that are available to load Iceberg tables. Spark has a way to load and
>> configure catalog implementations and Iceberg uses that for all
>> catalog-level config. Runtime overrides are things like target split size.
>> Iceberg has a table-level default split size in table properties, but this
>> can be overridden by a Spark option for each table, as well as an option
>> passed to the individual read. Note that these necessarily have different
>> config names for how they are used: Iceberg uses read.split.target-size
>> and the read-specific option is target-size.
>>
>> Applying this to Hive is a little strange for a couple reasons. First,
>> Hive’s engine configuration *is* a Hadoop Configuration. As a result, I
>> think the right place to store engine-specific config is there, including
>> Iceberg catalogs using a strategy similar to what Spark does: what external
>> Iceberg catalogs are available and their configuration should come from the
>> HiveConf.
>>
>> The second way Hive is strange is that Hive needs to use its own
>> MetaStore to track Hive table concerns. The MetaStore may have tables
>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>> tables from other Iceberg catalogs by creating table entries for them.
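
The override order described above for split size (table property as the default, then a per-table engine option, then an option on the individual read) can be illustrated with a minimal Python sketch. It models the three sources as plain dictionaries and values rather than calling any Iceberg or Spark API. The two key names are the ones quoted in the message; the function name, the `engine_override` parameter, and the `fallback` argument are assumptions made only for this illustration.

```python
from typing import Mapping, Optional

TABLE_PROPERTY = "read.split.target-size"  # table-level name, as quoted in the message
READ_OPTION = "target-size"                # read-level name, as quoted in the message


def effective_split_size(
    table_props: Mapping[str, str],
    engine_override: Optional[int],  # hypothetical per-table engine setting
    read_opts: Mapping[str, str],
    fallback: int,                   # caller-supplied default, not an Iceberg constant
) -> int:
    """Pick the most specific split size that is set."""
    # 1. An option passed to the individual read wins.
    if READ_OPTION in read_opts:
        return int(read_opts[READ_OPTION])
    # 2. Otherwise use the engine-level override for this table, if any.
    if engine_override is not None:
        return engine_override
    # 3. Otherwise use the table-level default stored with the table.
    if TABLE_PROPERTY in table_props:
        return int(table_props[TABLE_PROPERTY])
    # 4. Nothing is configured anywhere.
    return fallback
```

The point of the sketch is only the ordering: the table carries the default, and each more specific scope may replace it without changing what is stored with the table.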
>>
>> Here’s how I think Hive should work:
>>
>> - There should be a default HiveCatalog that uses the current
>>   MetaStore URI and is used for HiveCatalog tables tracked in the MetaStore
>> - Other catalogs should be defined in HiveConf
>> - HMS table properties should be used to determine how to load a
>>   table: using a Hadoop location, using the default metastore catalog, or
>>   using an external Iceberg catalog
>>    - If there is a metadata_location, then use the HiveCatalog for
>>      this metastore (where it is tracked)
>>    - If there is a catalog property, then load that catalog and use
>>      it to load the table identifier, or maybe an identifier from HMS table
>>      properties
>>    - If there is no catalog or metadata_location, then use
>>      HadoopTables to load the table location as an Iceberg table
>>
>> This would make it possible to access all types of Iceberg tables in the
>> same query, and would match how Spark and Flink configure catalogs. Other
>> than the configuration above, I don’t think that config in HMS should be
>> used at all, like how the other engines work. Iceberg is the source of
>> truth for table metadata, HMS stores how to load the Iceberg table, and
>> HiveConf defines the catalogs (or runtime overrides).
>>
>> This isn’t quite how configuration works right now. Currently, the
>> catalog is controlled by a HiveConf property, iceberg.mr.catalog. If
>> that isn’t set, HadoopTables will be used to load table locations. If it is
>> set, then that catalog will be used to load all tables by name. This makes
>> it impossible to load tables from different catalogs at the same time.
>> That’s why I think the Iceberg catalog for a table should be stored in HMS
>> table properties.
>>
>> I should also explain the iceberg.hive.engine.enabled flag, but I think
>> this is long enough for now.
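
The three table-loading rules proposed above can be sketched as a small decision function. This is a hedged illustration in plain Python, not Iceberg or Hive code: the HMS table properties are modeled as a dictionary, `metadata_location` is the property named in the message, and the `catalog` and `location` keys, the function name, and the returned labels are assumptions chosen for the example. The checks run in the order the rules are listed, so a table with both properties resolves to the HiveCatalog.

```python
from typing import Mapping, Tuple


def resolve_loader(hms_props: Mapping[str, str]) -> Tuple[str, str]:
    """Return (loader, argument) describing how a table would be loaded."""
    # Rule 1: a metadata_location means the table is tracked by the
    # HiveCatalog of this metastore.
    if "metadata_location" in hms_props:
        return ("hive_catalog", hms_props["metadata_location"])
    # Rule 2: a catalog property (key name assumed here) names an external
    # Iceberg catalog that would be defined in HiveConf.
    if "catalog" in hms_props:
        return ("external_catalog", hms_props["catalog"])
    # Rule 3: neither is set, so fall back to loading the table location
    # directly ("location" is a stand-in for the HMS table location).
    return ("hadoop_tables", hms_props.get("location", ""))
```

For example, `resolve_loader({"catalog": "nessie"})` selects the external-catalog path, while an entry carrying only a location falls through to the HadoopTables path. Because the decision is made per table, tables from different catalogs can appear in the same query, which is the property the proposal is after.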
>>
>> rb
>>
>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter
>> <lpin...@cloudera.com.invalid> wrote:
>>
>>> Hi All,
>>>
>>> I would like to start a discussion on how we should handle properties
>>> from various sources like Iceberg, Hive, or global configuration. I've put
>>> together a short document
>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>> please have a look and let me know what you think.
>>>
>>> Thanks,
>>> Laszlo
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>