Hi,

Based on the discussion below I understand we have the following kinds of properties:

- Iceberg table properties - engine-independent, storage-related parameters
- "How to get to" properties - I think these are mostly Hive table specific, since for Spark the Spark catalog configuration serves the same purpose. I think the best place for storing these would be the Hive SERDEPROPERTIES, as they describe the access information for the SerDe. Sidenote: I think we should decide whether we allow HiveCatalogs pointing to a different HMS; the 'iceberg.table_identifier' property would make sense only if we allow having multiple catalogs.
- Query-specific properties - These are engine specific and might be mapped to, or even override, the Iceberg table properties on the engine-specific code paths, but currently these properties have independent names and are mapped on a case-by-case basis.
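To make the placement concrete, here is a rough sketch (not a finalized proposal) of where the two kinds of HMS-side properties would end up on a metastore Table object if the "how to get to" information moved into SERDEPROPERTIES. The property keys are the ones discussed in this thread, the values are made up, and the exact names are still open:

    import org.apache.hadoop.hive.metastore.api.Table;

    public class PropertyPlacementSketch {
      static void illustrate(Table hmsTable) {
        // "How to get to" the Iceberg metadata: access information for the
        // SerDe, stored as SERDEPROPERTIES (assumes the maps are initialized).
        hmsTable.getSd().getSerdeInfo().getParameters()
            .put("iceberg.catalog", "hadoop.catalog");
        hmsTable.getSd().getSerdeInfo().getParameters()
            .put("iceberg.catalog_location", "/warehouse/iceberg_catalog");
        hmsTable.getSd().getSerdeInfo().getParameters()
            .put("iceberg.table_identifier", "db.tbl");

        // Iceberg table properties (for example write.format.default) stay in
        // the Iceberg table metadata itself; HMS TBLPROPERTIES would carry
        // informational entries only, so nothing table-specific is set here.
      }
    }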
Based on this:
- Shall we move the "how to get to" properties to SERDEPROPERTIES?
- Shall we define a prefix for setting Iceberg table properties from Hive queries, and omit other engine-specific properties?

Thanks,
Peter

> On Nov 27, 2020, at 17:45, Mass Dosage <massdos...@gmail.com> wrote:
>
> I like these suggestions, comments inline below on the last round...
>
> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <borokna...@apache.org> wrote:
> Hi,
>
> The above aligns with what we did in Impala, i.e. we store information about table loading in HMS table properties. We are just a bit more explicit about which catalog to use.
> We have table property 'iceberg.catalog' to determine the catalog type; right now the supported values are 'hadoop.tables', 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be set based on the catalog type.
>
> So, if the value of 'iceberg.catalog' is:
>
> I'm all for renaming this, having "mr" in the property name is confusing.
>
> hadoop.tables
> The table location is used to load the table.
> The only question I have is should we have this as the default? i.e. if you don't set a catalog it will assume it's HadoopTables and use the location? Or should we require this property to be here to be consistent and avoid any "magic"?
>
> hadoop.catalog
> Required table property 'iceberg.catalog_location' specifies the location of the hadoop catalog in the file system.
> Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
> I like this as it would allow you to use a different database and table name in Hive as opposed to the Hadoop Catalog - at the moment they have to match. The only thing here is that I think Hive requires a table LOCATION to be set, and it's then confusing as there are now two locations on the table. I'm not sure whether in the Hive storage handler or SerDe etc. we can get Hive to not require that and maybe even disallow it from being set. That would probably be best in conjunction with this. Another solution would be to not have the 'iceberg.catalog_location' property but instead use the table LOCATION for this, but that's a bit confusing from a Hive point of view.
>
> hive.catalog
> Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
> We have the assumption that the current Hive metastore stores the table, i.e. we don't support external Hive metastores currently.
> These sound fine for Hive catalog tables that are created outside of the automatic Hive table creation (see https://iceberg.apache.org/hive/ -> Using Hive Catalog); we'd just need to document how you can create these yourself and that one could use a different Hive database and table etc.
>
> Independent of catalog implementations, we also have table property 'iceberg.file_format' to specify the file format for the data files.
>
> OK, I don't think we need that for Hive?
>
> We haven't released it yet, so we are open to changes, but I think these properties are reasonable and it would be great if we could standardize the properties across engines that use HMS as the primary metastore of tables.
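For illustration only, the sketch below shows how an engine might resolve the properties Zoltán lists into an Iceberg table load. The property names come from his mail, defaulting to 'hadoop.tables' is exactly the open question raised above, and the catalog constructors match older Iceberg releases, so treat this as a sketch rather than the actual Impala behaviour:

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hadoop.HadoopCatalog;
    import org.apache.iceberg.hadoop.HadoopTables;
    import org.apache.iceberg.hive.HiveCatalog;

    public class CatalogPropertyResolutionSketch {
      static Table resolve(Configuration conf, Map<String, String> props,
                           String hiveDb, String hiveTable, String tableLocation) {
        // Fall back to <database_name>.<table_name> when no identifier is set.
        String id = props.getOrDefault("iceberg.table_identifier", hiveDb + "." + hiveTable);
        // Whether 'hadoop.tables' should be the implicit default is still open.
        String catalog = props.getOrDefault("iceberg.catalog", "hadoop.tables");
        switch (catalog) {
          case "hadoop.tables":
            // The table location is used to load the table.
            return new HadoopTables(conf).load(tableLocation);
          case "hadoop.catalog":
            // 'iceberg.catalog_location' points at the Hadoop catalog root.
            return new HadoopCatalog(conf, props.get("iceberg.catalog_location"))
                .loadTable(TableIdentifier.parse(id));
          case "hive.catalog":
            // Assumes the current Hive metastore stores the table.
            return new HiveCatalog(conf).loadTable(TableIdentifier.parse(id));
          default:
            throw new IllegalArgumentException("Unsupported iceberg.catalog: " + catalog);
        }
      }
    }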
>
> If others agree I think we should create an issue where we document the above changes so it's very clear what we're doing and can then go and implement them and update the docs etc.
>
> Cheers,
> Zoltan
>
> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
> Yes, I think that is a good summary of the principles.
>
> #4 is correct because we provide some information that is informational (Hive schema) or tracked only by the metastore (best-effort current user). I also agree that it would be good to have a table identifier in HMS table metadata when loading from an external table. That gives us a way to handle name conflicts.
>
> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacq...@dremio.com> wrote:
> Minor error, my last example should have been:
>
> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com> wrote:
> I agree with Ryan on the core principles here. As I understand them:
> 1. Iceberg metadata describes all properties of a table.
> 2. Hive table properties describe "how to get to" Iceberg metadata (which catalog + possibly ptr, path, token, etc).
> 3. There could be default "how to get to" information set at a global level.
> 4. Best-effort schema should be stored in the table properties in HMS. This should be done for information schema retrieval purposes within Hive but should be ignored during Hive/other tool execution.
> Is that a fair summary of your statements Ryan (except 4, which I just added)?
>
> One comment I have on #2 is that for different catalogs and use cases, I think it can be somewhat more complex: it would be desirable for a table that initially existed without Hive and was later exposed in Hive to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1). If you then want to expose that table to Hive, you might have this mapping for #2:
>
> db1.table1 => nessie:folder1.folder2.folder3.table1
>
> Similarly, you might want to expose a particular branch version of a table. So it might say:
>
> db1.table1_etl_branch => nessie.folder1@etl_branch
>
> Just saying that the address to the table in the catalog could itself have several properties. The key being that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.
>
> Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
> Thanks for working on this, Laszlo. I’ve been thinking about these problems as well, so this is a good time to have a discussion about Hive config.
>
> I think that Hive configuration should work mostly like other engines, where different configurations are used for different purposes. Different purposes means that there is not a global configuration priority. Hopefully, I can explain how we use the different config sources elsewhere to clarify.
>
> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop Configuration, but it also has its own global configuration.
> There are also Iceberg table properties, and all of the various Hive properties if you’re tracking tables with a Hive MetaStore.
>
> The first step is to simplify where we can, so we effectively eliminate 2 sources of config:
>
> 1. The Hadoop Configuration is only used to instantiate Hadoop classes, like FileSystem. Iceberg should not use it for any other config.
> 2. Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it’s abstract), but it is available so they also don’t fail trying to load the class. Table-specific config should not be stored in table or serde properties.
>
> That leaves Spark configuration and Iceberg table configuration.
>
> Iceberg differs from other tables because it is opinionated: data configuration should be maintained at the table level. This is cleaner for users because config is standardized across engines and kept in one place. It also enables services that analyze a table and update its configuration to tune options that users almost never touch, like row group or stripe size in the columnar formats. Iceberg table configuration is used to configure table-specific concerns and behavior.
>
> Spark configuration is used for engine-specific concerns and runtime overrides. A good example of an engine-specific concern is the catalogs that are available to load Iceberg tables. Spark has a way to load and configure catalog implementations, and Iceberg uses that for all catalog-level config. Runtime overrides are things like target split size. Iceberg has a table-level default split size in table properties, but this can be overridden by a Spark option for each table, as well as an option passed to the individual read. Note that these necessarily have different config names for how they are used: Iceberg uses read.split.target-size and the read-specific option is target-size.
>
> Applying this to Hive is a little strange for a couple of reasons. First, Hive’s engine configuration is a Hadoop Configuration. As a result, I think the right place to store engine-specific config is there, including Iceberg catalogs, using a strategy similar to what Spark does: which external Iceberg catalogs are available and their configuration should come from the HiveConf.
>
> The second way Hive is strange is that Hive needs to use its own MetaStore to track Hive table concerns. The MetaStore may have tables created by an Iceberg HiveCatalog, and Hive also needs to be able to load tables from other Iceberg catalogs by creating table entries for them.
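As a purely hypothetical illustration of catalogs being defined in HiveConf, mirroring the Spark approach described above, something like the following could work; the "iceberg.catalog.<name>.*" keys are invented for this sketch and are not an agreed naming scheme:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.hadoop.HadoopCatalog;
    import org.apache.iceberg.hive.HiveCatalog;

    public class CatalogFromHiveConfSketch {
      static Catalog loadNamedCatalog(HiveConf conf, String name) {
        // Hypothetical key naming; only the general idea (engine-level catalog
        // definitions living in HiveConf) is what the mail above proposes.
        String type = conf.get("iceberg.catalog." + name + ".type");
        if ("hadoop".equals(type)) {
          // Hadoop catalog rooted at a warehouse location taken from HiveConf.
          return new HadoopCatalog(conf, conf.get("iceberg.catalog." + name + ".warehouse"));
        }
        if ("hive".equals(type)) {
          // Hive catalog backed by the metastore URI configured in HiveConf.
          return new HiveCatalog(conf);
        }
        throw new IllegalArgumentException("Unknown catalog type for " + name);
      }
    }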
>
> Here’s how I think Hive should work:
>
> - There should be a default HiveCatalog that uses the current MetaStore URI, to be used for HiveCatalog tables tracked in the MetaStore.
> - Other catalogs should be defined in HiveConf.
> - HMS table properties should be used to determine how to load a table: using a Hadoop location, using the default metastore catalog, or using an external Iceberg catalog.
>   - If there is a metadata_location, then use the HiveCatalog for this metastore (where it is tracked).
>   - If there is a catalog property, then load that catalog and use it to load the table identifier, or maybe an identifier from HMS table properties.
>   - If there is no catalog or metadata_location, then use HadoopTables to load the table location as an Iceberg table.
>
> This would make it possible to access all types of Iceberg tables in the same query, and would match how Spark and Flink configure catalogs. Other than the configuration above, I don’t think that config in HMS should be used at all, like how the other engines work. Iceberg is the source of truth for table metadata, HMS stores how to load the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>
> This isn’t quite how configuration works right now. Currently, the catalog is controlled by a HiveConf property, iceberg.mr.catalog. If that isn’t set, HadoopTables will be used to load table locations. If it is set, then that catalog will be used to load all tables by name. This makes it impossible to load tables from different catalogs at the same time. That’s why I think the Iceberg catalog for a table should be stored in HMS table properties.
>
> I should also explain the iceberg.hive.engine.enabled flag, but I think this is long enough for now.
>
> rb
>
> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpin...@cloudera.com.invalid> wrote:
> Hi All,
>
> I would like to start a discussion about how we should handle properties from various sources like Iceberg, Hive or global configuration. I've put together a short document <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>, please have a look and let me know what you think.
>
> Thanks,
> Laszlo
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix
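For reference, the per-table loading order proposed in Ryan's mail above could be sketched roughly as follows. The metadata_location property is the one mentioned above; the literal "catalog" key and the two abstract helper methods are placeholders, since neither the property name nor the catalog-loading mechanism is settled in this thread:

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hadoop.HadoopTables;

    public abstract class HiveTableLoadingSketch {
      // Placeholders for "the default HiveCatalog for this metastore" and
      // "a catalog defined in HiveConf"; these are not existing APIs.
      abstract Catalog defaultHiveCatalog(Configuration conf);
      abstract Catalog externalCatalog(Configuration conf, String catalogName);

      Table load(Configuration conf, Map<String, String> hmsProps,
                 TableIdentifier id, String location) {
        if (hmsProps.containsKey("metadata_location")) {
          // Tracked by this metastore: load through the default HiveCatalog.
          return defaultHiveCatalog(conf).loadTable(id);
        }
        if (hmsProps.containsKey("catalog")) {
          // Tracked by an external Iceberg catalog defined in HiveConf.
          return externalCatalog(conf, hmsProps.get("catalog")).loadTable(id);
        }
        // Neither set: treat the table LOCATION as a HadoopTables table.
        return new HadoopTables(conf).load(location);
      }
    }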