I agree with Ryan on the core principles here. As I understand them:

1. Iceberg metadata describes all properties of a table.
2. Hive table properties describe "how to get to" Iceberg metadata (which catalog + possibly a ptr, path, token, etc.).
3. There could be default "how to get to" information set at a global level.
4. A best-effort schema should be stored in the table properties in HMS. This should be done for information schema retrieval purposes within Hive, but should be ignored during Hive/other tool execution.
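To make the split in #1 and #2 concrete, here is a rough sketch of what HMS table properties might hold under this model. The property names are illustrative only, not a proposal for exact keys:

```
# HMS table properties: only "how to get to" the Iceberg table
iceberg.catalog=nessie                          # which catalog resolves the table
iceberg.catalog.ptr=folder1.folder2.table1      # external name/path/token in that catalog
# ...plus a best-effort schema copy, used only for information schema (per #4)

# Iceberg metadata (NOT in HMS): everything about the table itself
# e.g. schema, partition spec, snapshots, write.format.default, ...
```

The point is that nothing in the HMS block above describes the table's content or behavior; it only describes where to find the metadata that does.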
Is that a fair summary of your statements, Ryan (except #4, which I just added)?

One comment I have on #2: for different catalogs and use cases, I think it can be somewhat more complex. It would be desirable for a table that initially existed without Hive, and was later exposed in Hive, to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1). If you then want to expose that table to Hive, you might have this mapping for #2:

db1.table1 => nessie:folder1.folder2.folder3.table1

Similarly, you might want to expose a particular branch version of a table. So it might say:

db1.table1_etl_branch => nessie.folder1@etl_branch

Just saying that the address to the table in the catalog could itself have several properties. The key is that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.

Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Thanks for working on this, Laszlo. I’ve been thinking about these
> problems as well, so this is a good time to have a discussion about Hive
> config.
>
> I think that Hive configuration should work mostly like other engines,
> where different configurations are used for different purposes. Different
> purposes means that there is not a global configuration priority.
> Hopefully, I can explain how we use the different config sources elsewhere
> to clarify.
>
> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
> Configuration, but it also has its own global configuration. There are also
> Iceberg table properties, and all of the various Hive properties if you’re
> tracking tables with a Hive MetaStore.
>
> The first step is to simplify where we can, so we effectively eliminate 2
> sources of config:
>
> - The Hadoop Configuration is only used to instantiate Hadoop classes,
>   like FileSystem. Iceberg should not use it for any other config.
> - Config in the Hive MetaStore is only used to identify that a table
>   is Iceberg and point to its metadata location. All other config in HMS is
>   informational. For example, the input format is FileInputFormat so that
>   non-Iceberg readers cannot actually instantiate the format (it’s abstract),
>   but it is available so they also don’t fail trying to load the class.
>   Table-specific config should not be stored in table or serde properties.
>
> That leaves Spark configuration and Iceberg table configuration.
>
> Iceberg differs from other tables because it is opinionated: data
> configuration should be maintained at the table level. This is cleaner for
> users because config is standardized across engines and in one place. And
> it also enables services that analyze a table and update its configuration
> to tune options that users almost never do, like row group or stripe size
> in the columnar formats. Iceberg table configuration is used to configure
> table-specific concerns and behavior.
>
> Spark configuration is used for engine-specific concerns, and runtime
> overrides. A good example of an engine-specific concern is the catalogs
> that are available to load Iceberg tables. Spark has a way to load and
> configure catalog implementations, and Iceberg uses that for all
> catalog-level config. Runtime overrides are things like target split size.
> Iceberg has a table-level default split size in table properties, but this
> can be overridden by a Spark option for each table, as well as an option
> passed to the individual read. Note that these necessarily have different
> config names for how they are used: Iceberg uses read.split.target-size
> and the read-specific option is target-size.
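The split-size override chain Ryan describes above can be sketched as a simple precedence function. This is illustrative only (not the actual Iceberg/Spark implementation, and the default value is hypothetical):

```python
# Sketch of the override chain: per-read option beats per-table Spark
# option, which beats the Iceberg table property, which beats a default.
DEFAULT_SPLIT_SIZE = 128 * 1024 * 1024  # hypothetical engine default, in bytes

def resolve_split_size(read_options, spark_table_options, table_properties):
    """Return the effective target split size for one read."""
    if "target-size" in read_options:                  # per-read override
        return int(read_options["target-size"])
    if "target-size" in spark_table_options:           # per-table runtime override
        return int(spark_table_options["target-size"])
    if "read.split.target-size" in table_properties:   # Iceberg table property
        return int(table_properties["read.split.target-size"])
    return DEFAULT_SPLIT_SIZE

# Example: the table sets a default; one specific read overrides it.
table_props = {"read.split.target-size": str(256 * 1024 * 1024)}
print(resolve_split_size({}, {}, table_props))                            # 268435456
print(resolve_split_size({"target-size": "134217728"}, {}, table_props))  # 134217728
```

Note how the table property and the runtime option use different names (read.split.target-size vs. target-size), matching their different scopes.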
>
> Applying this to Hive is a little strange for a couple of reasons. First,
> Hive’s engine configuration *is* a Hadoop Configuration. As a result, I
> think the right place to store engine-specific config is there, including
> Iceberg catalogs using a strategy similar to what Spark does: what external
> Iceberg catalogs are available and their configuration should come from the
> HiveConf.
>
> The second way Hive is strange is that Hive needs to use its own MetaStore
> to track Hive table concerns. The MetaStore may have tables created by an
> Iceberg HiveCatalog, and Hive also needs to be able to load tables from
> other Iceberg catalogs by creating table entries for them.
>
> Here’s how I think Hive should work:
>
> - There should be a default HiveCatalog that uses the current
>   MetaStore URI, to be used for HiveCatalog tables tracked in the MetaStore
> - Other catalogs should be defined in HiveConf
> - HMS table properties should be used to determine how to load a
>   table: using a Hadoop location, using the default metastore catalog, or
>   using an external Iceberg catalog
>   - If there is a metadata_location, then use the HiveCatalog for
>     this metastore (where it is tracked)
>   - If there is a catalog property, then load that catalog and use it
>     to load the table identifier, or maybe an identifier from HMS table
>     properties
>   - If there is no catalog or metadata_location, then use
>     HadoopTables to load the table location as an Iceberg table
>
> This would make it possible to access all types of Iceberg tables in the
> same query, and would match how Spark and Flink configure catalogs. Other
> than the configuration above, I don’t think that config in HMS should be
> used at all, like how the other engines work. Iceberg is the source of
> truth for table metadata, HMS stores how to load the Iceberg table, and
> HiveConf defines the catalogs (or runtime overrides).
>
> This isn’t quite how configuration works right now.
> Currently, the catalog
> is controlled by a HiveConf property, iceberg.mr.catalog. If that isn’t
> set, HadoopTables will be used to load table locations. If it is set, then
> that catalog will be used to load all tables by name. This makes it
> impossible to load tables from different catalogs at the same time. That’s
> why I think the Iceberg catalog for a table should be stored in HMS table
> properties.
>
> I should also explain the iceberg.hive.engine.enabled flag, but I think this
> is long enough for now.
>
> rb
>
> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpin...@cloudera.com.invalid>
> wrote:
>
>> Hi All,
>>
>> I would like to start a discussion: how should we handle properties from
>> various sources like Iceberg, Hive, or global configuration? I've put
>> together a short document
>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>> please have a look and let me know what you think.
>>
>> Thanks,
>> Laszlo
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
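The three loading rules Ryan lists above (metadata_location, catalog property, fall back to HadoopTables) can be sketched as a small resolution function. All names here are illustrative, not the actual Iceberg/Hive implementation:

```python
# Sketch of the proposed HMS-driven loading rules:
#   1. metadata_location present -> the default HiveCatalog owns this table
#   2. a catalog property present -> load via that externally configured catalog
#   3. neither -> treat the table location as a HadoopTables path
# Property names ("iceberg.catalog", "location") are hypothetical here.

def resolve_load_strategy(hms_props):
    """Decide how to load an Iceberg table from its HMS table properties."""
    if "metadata_location" in hms_props:
        # Tracked by this metastore's own HiveCatalog.
        return ("hive-catalog", hms_props["metadata_location"])
    if "iceberg.catalog" in hms_props:
        # The identifier may come from HMS properties or the catalog itself.
        return ("external-catalog", hms_props["iceberg.catalog"])
    # No catalog info at all: fall back to loading the location directly.
    return ("hadoop-tables", hms_props.get("location"))

print(resolve_load_strategy({"metadata_location": "s3://bucket/meta.json"}))
print(resolve_load_strategy({"iceberg.catalog": "nessie"}))
print(resolve_load_strategy({"location": "s3://bucket/table"}))
```

Because the decision is made per table from HMS properties, rather than from a single global iceberg.mr.catalog setting, tables from different catalogs can coexist in one query.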