Hey Peter, thanks for updating the doc, and for the heads-up in the other thread about your capacity to look at this before EOY.

I'm going to try to create a specification document based on the discussion document you put together. I think there is general consensus around what you call "Spark-like catalog configuration", so I'd like to formalize that more. There seems to be less consensus around the whitelist/blacklist side of things. You outline four approaches:

1. Hard-coded HMS-only property list
2. Hard-coded Iceberg-only property list
3. Prefix for Iceberg properties
4. Prefix for HMS-only properties

I generally think #2 is a no-go, as it creates too much coupling between catalog implementations and core Iceberg. It seems like Ryan Blue would prefer #4 (correct?). Any other strong opinions?
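To make the two prefix options concrete, here is a rough sketch of the filtering each would imply (the prefix strings and names are placeholders, not a proposal):

    import java.util.HashMap;
    import java.util.Map;

    class PropertyFilterSketch {
      // #3: push only properties carrying an Iceberg prefix, stripping it on the way.
      static Map<String, String> icebergPrefixOnly(Map<String, String> hmsProps, String prefix) {
        Map<String, String> icebergProps = new HashMap<>();
        for (Map.Entry<String, String> entry : hmsProps.entrySet()) {
          if (entry.getKey().startsWith(prefix)) {
            icebergProps.put(entry.getKey().substring(prefix.length()), entry.getValue());
          }
        }
        return icebergProps;
      }

      // #4: push everything through except properties carrying an HMS-only prefix.
      static Map<String, String> excludeHmsPrefix(Map<String, String> hmsProps, String prefix) {
        Map<String, String> icebergProps = new HashMap<>(hmsProps);
        icebergProps.keySet().removeIf(key -> key.startsWith(prefix));
        return icebergProps;
      }
    }

The user-facing difference: with #3 nothing reaches Iceberg unless the user remembers the prefix; with #4 everything flows through unless it is deliberately marked HMS-only.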
--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid> wrote:

> As Jacques suggested (with the help of Zoltan), I have collected the current state and the proposed solutions in a document:
>
> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>
> My feeling is that we do not have a final decision, so I tried to list all the possible solutions.
> Please comment!
>
> Thanks,
> Peter
>
> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>
> When I was working on the CREATE TABLE patch, I found the following TBLPROPERTIES on newly created tables:
>
> - external.table.purge
> - EXTERNAL
> - bucketing_version
> - numRows
> - rawDataSize
> - totalSize
> - numFiles
> - numFileErasureCoded
>
> I am afraid we cannot change the names of most of these properties, and it might not be useful to push most of them to Iceberg tables, since the Iceberg statistics are already there. Also, my feeling is that this is only the tip of the Iceberg (pun intended :) ), which is why I think we need a more targeted way to push properties to Iceberg tables.
>
> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>
> Sorry, I accidentally didn't copy the dev list on this reply. Resending:
>
> Also I expect that we want to add Hive write specific configs to table level when the general engine independent configuration is not ideal for Hive, but every Hive query for a given table should use some specific config.
>
> Hive may need configuration, but I think these should still be kept in the Iceberg table. There is no reason to make Hive config inaccessible from other engines. If someone wants to view all of the config for a table from Spark, the Hive config should be included too, right?
>
> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com> wrote:
>
>> I will ask Laszlo if he wants to update his doc.
>>
>> I see both pros and cons of catalog definition in config files. If there is an easy default, then I do not mind any of the proposed solutions.
>>
>> OTOH, I am in favor of the "use a prefix for Iceberg table properties" solution, because in Hive it is common to add new keys to the property list - no restriction is in place (I am not even sure that the currently implemented blacklist preventing properties from propagating to Iceberg tables is complete). Also, I expect that we will want to add Hive-specific write configs at the table level when the general engine-independent configuration is not ideal for Hive, but every Hive query for a given table should use some specific config.
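>>
>> For such a Hive-specific default, the value would still live on the Iceberg table itself, so it stays visible from every engine. A small sketch (the property key below is made up):
>>
>>     import org.apache.iceberg.Table;
>>     import org.apache.iceberg.catalog.Catalog;
>>     import org.apache.iceberg.catalog.TableIdentifier;
>>
>>     class HiveDefaultsSketch {
>>       // Store a Hive-specific write default as a normal Iceberg table property.
>>       static void setHiveDefault(Catalog catalog) {
>>         Table table = catalog.loadTable(TableIdentifier.of("db", "tbl"));
>>         table.updateProperties()
>>             .set("hive.write.custom-default", "some-value") // made-up key
>>             .commit();
>>       }
>>     }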
>> Thanks,
>> Peter
>>
>> Jacques Nadeau <jacq...@dremio.com> wrote (on Tue, Dec 1, 2020, at 17:06):
>>
>>> Would someone be willing to create a document that states the current proposal?
>>>
>>> It is becoming somewhat difficult to follow this thread. I also worry that, without a complete statement of the current shape, people may incorrectly think they are in alignment.
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <borokna...@cloudera.com> wrote:
>>>
>>>> Thanks, Ryan. I answered inline.
>>>>
>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> This sounds like a good plan overall, but I have a couple of notes:
>>>>>
>>>>> 1. We need to keep in mind that users plug in their own catalogs, so iceberg.catalog could be a Glue or Nessie catalog, not just Hive or Hadoop. I don't think it makes much sense to use separate hadoop.catalog and hive.catalog values. Those should just be names for catalogs configured in the Configuration, i.e., via hive-site.xml. We then only need a special value for loading Hadoop tables from paths.
>>>>
>>>> About extensibility, I think the usual Hive way is to use Java class names. So the value for 'iceberg.catalog' could be e.g. 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation would need a factory method that constructs the catalog object from a properties object (Map<String, String>). E.g. 'org.apache.iceberg.hadoop.HadoopCatalog' would require 'iceberg.catalog_location' to be present in the properties.
>>>>
>>>>> 2. I don't think that catalog configuration should be kept in table properties. A catalog should not be loaded for each table, so I don't think we need iceberg.catalog_location. Instead, we should have a way to define catalogs in the Configuration for tables in the metastore to reference.
>>>>
>>>> That makes sense; on the other hand, it would make adding new catalogs more heavyweight, i.e. you'd now need to edit configuration files and restart/reinitialize services, which can be cumbersome in some environments.
>>>>
>>>>> 3. I'd rather use a prefix to exclude properties from being passed to Iceberg than to include them. Otherwise, users don't know what to do to pass table properties from Hive or Impala. If we exclude a prefix or specific properties, then everything but the properties reserved for locating the table is passed, as the user would expect.
>>>>
>>>> I don't have a strong opinion about this, but yeah, this behavior would probably cause the least surprise.
>>>>
>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <borokna...@apache.org> wrote:
>>>>>
>>>>>> Thanks, Peter. I answered inline.
>>>>>>
>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>>>>>
>>>>>>> Hi Zoltan,
>>>>>>>
>>>>>>> Answers below:
>>>>>>>
>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <borokna...@cloudera.com.INVALID> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thanks for the replies. My takes on the above questions are as follows:
>>>>>>>
>>>>>>> - Should 'iceberg.catalog' be a required property?
>>>>>>>   - Yeah, I think it would be nice if this were required, to avoid any implicit behavior.
>>>>>>>
>>>>>>> Currently we have a Catalogs class to get/initialize/use the different Catalogs. At the time, the decision was to use HadoopTables as the default catalog. It might be worthwhile to use the same class in Impala as well, so the behavior is consistent.
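>>>>>>>
>>>>>>> For illustration, loading a catalog by class name from the table properties could look something like this (an untested sketch; the initialize hook is an assumption, not settled API):
>>>>>>>
>>>>>>>     import java.util.Map;
>>>>>>>     import org.apache.iceberg.catalog.Catalog;
>>>>>>>
>>>>>>>     class CatalogLoadingSketch {
>>>>>>>       // Instantiate the Catalog implementation named in the table properties
>>>>>>>       // and let it configure itself from the same property map.
>>>>>>>       static Catalog loadCatalog(Map<String, String> props) throws Exception {
>>>>>>>         String impl = props.get("iceberg.catalog"); // e.g. "org.apache.iceberg.hadoop.HadoopCatalog"
>>>>>>>         Catalog catalog = (Catalog) Class.forName(impl).getDeclaredConstructor().newInstance();
>>>>>>>         catalog.initialize("from-table-props", props); // assumed init hook
>>>>>>>         return catalog;
>>>>>>>       }
>>>>>>>     }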
>>>>>>
>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg classes whenever possible. The Catalogs class is very similar to what we have currently in Impala.
>>>>>>
>>>>>>> - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>   - In Impala we don't allow setting LOCATION for tables stored in 'hadoop.catalog', but Impala internally sets LOCATION to the Iceberg table's actual location. We were also thinking about using only the table LOCATION and setting it to the catalog location, but we found that confusing as well.
>>>>>>>
>>>>>>> It could definitely work, but it is somewhat strange to have the external table location set to an arbitrary path while a different location is generated from other configs. It would be nice to have the real location set in the external table location as well.
>>>>>>
>>>>>> Impala sets the real Iceberg table location for external tables. E.g. if the user issues
>>>>>>
>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>> STORED AS ICEBERG
>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>
>>>>>> If the end user had specified LOCATION, Impala would have raised an error. But the above DDL statement is correct, so Impala loads the Iceberg table via the Iceberg API, then creates the HMS table and sets LOCATION to the Iceberg table location (something like /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>>
>>>>>>> I like the flexibility of setting the table_identifier at the table level, which could help resolve naming conflicts. We might want to have this in the Iceberg Catalog implementation.
>>>>>>>
>>>>>>> - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>   - Yeah, it doesn't add much if we only allow using the current HMS. I think it is only useful if we allow external HMSes.
>>>>>>> - Moving properties to SERDEPROPERTIES
>>>>>>>   - I see that these properties are used by the SerDe classes in Hive, but I feel that these properties are just not about serialization and deserialization. And as I see it, the current SERDEPROPERTIES are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So properties about table loading more naturally belong in TBLPROPERTIES, in my opinion.
>>>>>>>
>>>>>>> I have seen it done both ways for HBaseSerDe (even the wiki page uses both :) ). Since Impala prefers TBLPROPERTIES, and if we start using a prefix to separate real Iceberg table properties from other properties, then we can keep them in TBLPROPERTIES.
>>>>>>
>>>>>> In the Google doc I also had a comment about prefixing Iceberg table properties. We could use a prefix like 'iceberg.tblproperties.' and pass every property with this prefix to the Iceberg table. Currently Impala passes every table property to the Iceberg table.
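>>>>>>
>>>>>> As a concrete sketch of what that prefix would do ('write.format.default' is a real Iceberg table property; the prefix itself is just the idea above):
>>>>>>
>>>>>>     import java.util.HashMap;
>>>>>>     import java.util.Map;
>>>>>>
>>>>>>     class PrefixMappingSketch {
>>>>>>       public static void main(String[] args) {
>>>>>>         // Properties as they might arrive from
>>>>>>         // TBLPROPERTIES('iceberg.tblproperties.write.format.default'='orc', 'EXTERNAL'='TRUE')
>>>>>>         Map<String, String> hmsProps = new HashMap<>();
>>>>>>         hmsProps.put("iceberg.tblproperties.write.format.default", "orc");
>>>>>>         hmsProps.put("EXTERNAL", "TRUE");
>>>>>>
>>>>>>         String prefix = "iceberg.tblproperties.";
>>>>>>         Map<String, String> icebergProps = new HashMap<>();
>>>>>>         hmsProps.forEach((key, value) -> {
>>>>>>           if (key.startsWith(prefix)) {
>>>>>>             icebergProps.put(key.substring(prefix.length()), value);
>>>>>>           }
>>>>>>         });
>>>>>>         // Only write.format.default=orc reaches the Iceberg table;
>>>>>>         // 'EXTERNAL' stays behind in HMS.
>>>>>>         System.out.println(icebergProps);
>>>>>>       }
>>>>>>     }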
>>>>>>> Thanks,
>>>>>>> Zoltan
>>>>>>>
>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Based on the discussion below, I understand we have the following kinds of properties:
>>>>>>>>
>>>>>>>> 1. Iceberg table properties - engine-independent, storage-related parameters.
>>>>>>>> 2. "How to get to" properties - I think these are mostly Hive-table-specific, since for Spark the Spark catalog configuration serves the same purpose. I think the best place for storing these would be the Hive SERDEPROPERTIES, as they describe the access information for the SerDe. Sidenote: I think we should decide whether we allow HiveCatalogs pointing to a different HMS; 'iceberg.table_identifier' only makes sense if we allow multiple catalogs.
>>>>>>>> 3. Query-specific properties - These are engine-specific and might be mapped to, or even override, the Iceberg table properties on the engine-specific code paths, but currently these properties have independent names and are mapped on a case-by-case basis.
>>>>>>>>
>>>>>>>> Based on this:
>>>>>>>>
>>>>>>>> - Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>>>>>>> - Shall we define a prefix for setting Iceberg table properties from Hive queries, omitting other engine-specific properties?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <massdos...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> I like these suggestions, comments inline below on the last round...
>>>>>>>>
>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <borokna...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> The above aligns with what we did in Impala, i.e. we store information about table loading in HMS table properties. We are just a bit more explicit about which catalog to use.
>>>>>>>>> We have the table property 'iceberg.catalog' to determine the catalog type; right now the supported values are 'hadoop.tables', 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be set based on the catalog type.
>>>>>>>>>
>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>
>>>>>>>> I'm all for renaming this; having "mr" in the property name is confusing.
>>>>>>>>
>>>>>>>>> - hadoop.tables
>>>>>>>>>   - the table location is used to load the table
>>>>>>>>
>>>>>>>> The only question I have is: should we have this as the default? I.e., if you don't set a catalog, it will assume it's HadoopTables and use the location? Or should we require this property to be present, to be consistent and avoid any "magic"?
>>>>>>>>
>>>>>>>>> - hadoop.catalog
>>>>>>>>>   - Required table property 'iceberg.catalog_location' specifies the location of the hadoop catalog in the file system
>>>>>>>>>   - Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
>>>>>>>>
>>>>>>>> I like this, as it would allow you to use a different database and table name in Hive as opposed to the Hadoop catalog - at the moment they have to match. The only thing here is that I think Hive requires a table LOCATION to be set, and it's then confusing that there are two locations on the table. I'm not sure whether, in the Hive storage handler or SerDe etc., we can get Hive to not require that, and maybe even disallow it from being set; that would probably be best in conjunction with this. Another solution would be to not have the 'iceberg.catalog_location' property but instead use the table LOCATION for it, but that's a bit confusing from a Hive point of view.
>>>>>>>>
>>>>>>>>> - hive.catalog
>>>>>>>>>   - Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
>>>>>>>>>   - We assume that the current Hive metastore stores the table, i.e. we don't currently support external Hive metastores.
>>>>>>>>
>>>>>>>> These sound fine for Hive catalog tables that are created outside of the automatic Hive table creation (see https://iceberg.apache.org/hive/ -> Using Hive Catalog); we'd just need to document how you can create these yourself, and that one could use a different Hive database and table name etc.
>>>>>>>>
>>>>>>>>> Independent of the catalog implementations, we also have the table property 'iceberg.file_format' to specify the file format for the data files.
>>>>>>>>
>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>
>>>>>>>>> We haven't released it yet, so we are open to changes, but I think these properties are reasonable, and it would be great if we could standardize the properties across engines that use HMS as the primary metastore of tables.
>>>>>>>>
>>>>>>>> If others agree, I think we should create an issue where we document the above changes, so it's very clear what we're doing, and we can then go and implement them and update the docs etc.
>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Zoltan
>>>>>>>>>
>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>
>>>>>>>>>> #4 is correct because we provide some information that is informational (the Hive schema) or tracked only by the metastore (best-effort current user). I also agree that it would be good to have a table identifier in HMS table metadata when loading from an external table. That gives us a way to handle name conflicts.
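>>>>>>>>>>
>>>>>>>>>> To spell out the loading rules sketched above (using the proposed property names; this is illustrative, not final, and the hive.catalog branch is elided):
>>>>>>>>>>
>>>>>>>>>>     import java.util.Map;
>>>>>>>>>>     import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>     import org.apache.iceberg.Table;
>>>>>>>>>>     import org.apache.iceberg.catalog.TableIdentifier;
>>>>>>>>>>     import org.apache.iceberg.hadoop.HadoopCatalog;
>>>>>>>>>>     import org.apache.iceberg.hadoop.HadoopTables;
>>>>>>>>>>
>>>>>>>>>>     class TableLoadingSketch {
>>>>>>>>>>       static Table load(Configuration conf, Map<String, String> props,
>>>>>>>>>>                         String hiveDb, String hiveTable, String hmsLocation) {
>>>>>>>>>>         String catalogType = props.get("iceberg.catalog");
>>>>>>>>>>         String id = props.getOrDefault("iceberg.table_identifier", hiveDb + "." + hiveTable);
>>>>>>>>>>         if (catalogType == null) {
>>>>>>>>>>           // default discussed above: fall back to HadoopTables
>>>>>>>>>>           return new HadoopTables(conf).load(hmsLocation);
>>>>>>>>>>         }
>>>>>>>>>>         switch (catalogType) {
>>>>>>>>>>           case "hadoop.tables":
>>>>>>>>>>             return new HadoopTables(conf).load(hmsLocation); // table LOCATION
>>>>>>>>>>           case "hadoop.catalog":
>>>>>>>>>>             HadoopCatalog hadoop = new HadoopCatalog(conf, props.get("iceberg.catalog_location"));
>>>>>>>>>>             return hadoop.loadTable(TableIdentifier.parse(id));
>>>>>>>>>>           default: // hive.catalog: load via HiveCatalog against the current HMS
>>>>>>>>>>             throw new UnsupportedOperationException("elided in this sketch");
>>>>>>>>>>         }
>>>>>>>>>>       }
>>>>>>>>>>     }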
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>
>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I agree with Ryan on the core principles here. As I understand them:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. Iceberg metadata describes all properties of a table.
>>>>>>>>>>>> 2. Hive table properties describe "how to get to" the Iceberg metadata (which catalog + possibly a ptr, path, token, etc.).
>>>>>>>>>>>> 3. There could be default "how to get to" information set at a global level.
>>>>>>>>>>>> 4. A best-effort schema should be stored in the table properties in HMS. This should be done for information-schema retrieval purposes within Hive, but should be ignored during Hive/other tool execution.
>>>>>>>>>>>>
>>>>>>>>>>>> Is that a fair summary of your statements, Ryan (except #4, which I just added)?
>>>>>>>>>>>>
>>>>>>>>>>>> One comment I have on #2: for different catalogs and use cases it can be somewhat more complex. It would be desirable for a table that initially existed without Hive, and was later exposed in Hive, to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1). If you then want to expose that table to Hive, you might have this mapping for #2:
>>>>>>>>>>>>
>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>
>>>>>>>>>>>> Similarly, you might want to expose a particular branch version of a table. So it might say:
>>>>>>>>>>>>
>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>
>>>>>>>>>>>> I'm just saying that the address of the table in the catalog could itself have several properties. The key point is that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.
>>>>>>>>>>>>
>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for working on this, Laszlo. I've been thinking about these problems as well, so this is a good time to have a discussion about Hive config.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think that Hive configuration should work mostly like other engines, where different configurations are used for different purposes. Because the purposes differ, there is no single global configuration priority.
>>>>>>>>>>>>> Hopefully, I can explain how we use the different config sources elsewhere to clarify.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let's take Spark as an example. Spark uses Hadoop, so it has a Hadoop Configuration, but it also has its own global configuration. There are also Iceberg table properties, and all of the various Hive properties if you're tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The first step is to simplify where we can, so we effectively eliminate 2 sources of config:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - The Hadoop Configuration is only used to instantiate Hadoop classes, like FileSystem. Iceberg should not use it for any other config.
>>>>>>>>>>>>> - Config in the Hive MetaStore is only used to identify that a table is Iceberg and to point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it's abstract), but it is available so they also don't fail trying to load the class. Table-specific config should not be stored in table or serde properties.
>>>>>>>>>>>>>
>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Iceberg differs from other tables because it is opinionated: data configuration should be maintained at the table level. This is cleaner for users because config is standardized across engines and kept in one place. It also enables services that analyze a table and update its configuration, to tune options that users almost never do, like row group or stripe size in the columnar formats. Iceberg table configuration is used to configure table-specific concerns and behavior.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Spark configuration is used for engine-specific concerns and runtime overrides. A good example of an engine-specific concern is the catalogs that are available to load Iceberg tables. Spark has a way to load and configure catalog implementations, and Iceberg uses that for all catalog-level config. Runtime overrides are things like target split size: Iceberg has a table-level default split size in table properties, but this can be overridden by a Spark option for each table, as well as by an option passed to the individual read. Note that these necessarily have different config names for how they are used: Iceberg uses read.split.target-size and the read-specific option is target-size.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple of reasons. First, Hive's engine configuration *is* a Hadoop Configuration. As a result, I think the right place to store engine-specific config is there, including Iceberg catalogs, using a strategy similar to what Spark does: which external Iceberg catalogs are available, and their configuration, should come from the HiveConf.
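>>>>>>>>>>>>>
>>>>>>>>>>>>> For example (illustrative only - this key scheme is a sketch, not a decided format), catalog definitions could live in hive-site.xml and be read from the Configuration like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>>>>
>>>>>>>>>>>>>     class HiveConfCatalogsSketch {
>>>>>>>>>>>>>       public static void main(String[] args) {
>>>>>>>>>>>>>         // In practice these entries would come from hive-site.xml.
>>>>>>>>>>>>>         Configuration conf = new Configuration();
>>>>>>>>>>>>>         conf.set("iceberg.catalog.prod_hadoop.type", "hadoop");
>>>>>>>>>>>>>         conf.set("iceberg.catalog.prod_hadoop.warehouse", "hdfs://namenode:8020/warehouse");
>>>>>>>>>>>>>         conf.set("iceberg.catalog.prod_nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog");
>>>>>>>>>>>>>         // An HMS table entry would then reference a catalog by name,
>>>>>>>>>>>>>         // e.g. TBLPROPERTIES('iceberg.catalog'='prod_nessie').
>>>>>>>>>>>>>         System.out.println(conf.get("iceberg.catalog.prod_hadoop.warehouse"));
>>>>>>>>>>>>>       }
>>>>>>>>>>>>>     }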
>>>>>>>>>>>>>
>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use its own MetaStore to track Hive table concerns. The MetaStore may have tables created by an Iceberg HiveCatalog, and Hive also needs to be able to load tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here's how I think Hive should work:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - There should be a default HiveCatalog, using the current MetaStore URI, for HiveCatalog tables tracked in the MetaStore.
>>>>>>>>>>>>> - Other catalogs should be defined in HiveConf.
>>>>>>>>>>>>> - HMS table properties should be used to determine how to load a table: using a Hadoop location, using the default metastore catalog, or using an external Iceberg catalog.
>>>>>>>>>>>>>   - If there is a metadata_location, then use the HiveCatalog for this metastore (where it is tracked).
>>>>>>>>>>>>>   - If there is a catalog property, then load that catalog and use it to load the table identifier, or maybe an identifier from HMS table properties.
>>>>>>>>>>>>>   - If there is no catalog or metadata_location, then use HadoopTables to load the table location as an Iceberg table.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This would make it possible to access all types of Iceberg tables in the same query, and would match how Spark and Flink configure catalogs. Other than the configuration above, I don't think config in HMS should be used at all, matching how the other engines work. Iceberg is the source of truth for table metadata, HMS stores how to load the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>>>>>
>>>>>>>>>>>>> This isn't quite how configuration works right now. Currently, the catalog is controlled by a HiveConf property, iceberg.mr.catalog. If that isn't set, HadoopTables will be used to load table locations. If it is set, then that catalog will be used to load all tables by name. This makes it impossible to load tables from different catalogs at the same time. That's why I think the Iceberg catalog for a table should be stored in HMS table properties.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I should also explain the iceberg.hive.engine.enabled flag, but I think this is long enough for now.
>>>>>>>>>>>>>
>>>>>>>>>>>>> rb
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpin...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to start a discussion about how we should handle properties from various sources like Iceberg, Hive, or global configuration.
>>>>>>>>>>>>>> I've put together a short document <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>; please have a look and let me know what you think.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>> Netflix
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix