Thanks, Ryan. I answered inline.

On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
> This sounds like a good plan overall, but I have a couple of notes:
>
> 1. We need to keep in mind that users plug in their own catalogs, so iceberg.catalog could be a Glue or Nessie catalog, not just Hive or Hadoop. I don't think it makes much sense to use separate hadoop.catalog and hive.catalog values. Those should just be names for catalogs configured in Configuration, i.e., via hive-site.xml. We then only need a special value for loading Hadoop tables from paths.

About extensibility, I think the usual Hive way is to use Java class names. That way the value of 'iceberg.catalog' could be e.g. 'org.apache.iceberg.hive.HiveCatalog'. Each catalog implementation would then need a factory method that constructs the catalog object from a properties object (Map<String, String>). E.g. 'org.apache.iceberg.hadoop.HadoopCatalog' would require 'iceberg.catalog_location' to be present in the properties.

> 2. I don't think that catalog configuration should be kept in table properties. A catalog should not be loaded for each table. So I don't think we need iceberg.catalog_location. Instead, we should have a way to define catalogs in the Configuration for tables in the metastore to reference.

I think that makes sense; on the other hand it would make adding new catalogs more heavyweight, i.e. you would now need to edit configuration files and restart/reinit services. That could be cumbersome in some environments.

> 3. I'd rather use a prefix to exclude properties from being passed to Iceberg than to include them. Otherwise, users don't know what to do to pass table properties from Hive or Impala. If we exclude a prefix or specific properties, then everything but the properties reserved for locating the table is passed as the user would expect.

I don't have a strong opinion about this, but yeah, this behavior would probably cause the least surprise.

> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <borokna...@apache.org> wrote:
>
>> Thanks, Peter. I answered inline.
>>
>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>
>>> Hi Zoltan,
>>>
>>> Answers below:
>>>
>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <borokna...@cloudera.com.INVALID> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for the replies. My take on the above questions is as follows:
>>>
>>> - Should 'iceberg.catalog' be a required property?
>>>   - Yeah, I think it would be nice if this were required, to avoid any implicit behavior.
>>>
>>> Currently we have a Catalogs class to get/initialize/use the different Catalogs. At that time the decision was to use HadoopTables as the default catalog. It might be worthwhile to use the same class in Impala as well, so the behavior is consistent.
>>
>> Yeah, I think it'd be beneficial for us to use the Iceberg classes whenever possible. The Catalogs class is very similar to what we currently have in Impala.
>>
>>> - 'hadoop.catalog' LOCATION and catalog_location
>>>   - In Impala we don't allow setting LOCATION for tables stored in 'hadoop.catalog', but Impala internally sets LOCATION to the Iceberg table's actual location. We were also thinking about using only the table LOCATION and setting it to the catalog location, but we found that confusing as well.
>>>
>>> It could definitely work, but it is somewhat strange that we have an external table location set to an arbitrary path, and we have a different location generated by other configs.
>>> It would be nice to have the real location set in the external table location as well.
>>
>> Impala sets the real Iceberg table location for external tables. E.g. if the user issues
>>
>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>> STORED AS ICEBERG
>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>
>> If the end user had specified LOCATION, then Impala would have raised an error. But the above DDL statement is correct, so Impala loads the Iceberg table via the Iceberg API, then creates the HMS table and sets LOCATION to the Iceberg table location (something like /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>
>>> I like the flexibility of setting the table_identifier at the table level, which could help avoid naming conflicts. We might want to have this in the Iceberg Catalog implementation.
>>>
>>> - 'iceberg.table_identifier' for HiveCatalog
>>>   - Yeah, it doesn't add much if we only allow using the current HMS. I think it is only useful if we allow external HMSes.
>>> - Moving properties to SERDEPROPERTIES
>>>   - I see that these properties are used by the SerDe classes in Hive, but I feel that these properties are just not about serialization and deserialization. The current SERDEPROPERTIES are things like 'field.delim', 'separatorChar', 'quoteChar', etc., so properties about table loading more naturally belong in TBLPROPERTIES in my opinion.
>>>
>>> I have seen it used both ways for HBaseSerDe (even the wiki page uses both :) ). Since Impala prefers TBLPROPERTIES, and if we start using a prefix to separate real Iceberg table properties from other properties, then we can keep them in TBLPROPERTIES.
>>
>> In the Google doc I also had a comment about prefixing Iceberg table properties. We could use a prefix like 'iceberg.tblproperties.' and pass every property with this prefix to the Iceberg table. Currently Impala passes every table property to the Iceberg table.
>>
>>> Thanks,
>>> Zoltan
>>>
>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>>
>>>> Hi,
>>>>
>>>> Based on the discussion below I understand we have the following kinds of properties:
>>>>
>>>> 1. Iceberg table properties - engine independent, storage related parameters.
>>>> 2. "How to get to" properties - I think these are mostly Hive table specific, since for Spark, the Spark catalog configuration serves the same purpose. I think the best place for storing these would be the Hive SERDEPROPERTIES, as these describe the access information for the SerDe. Sidenote: I think we should decide whether we allow HiveCatalogs pointing to a different HMS; the 'iceberg.table_identifier' would only make sense if we allow having multiple catalogs.
>>>> 3. Query specific properties - these are engine specific and might be mapped to / even override the Iceberg table properties on the engine specific code paths, but currently these properties have independent names and are mapped on a case-by-case basis.
>>>>
>>>> Based on this:
>>>>
>>>> - Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>>> - Shall we define a prefix for setting Iceberg table properties from Hive queries and omit other engine specific properties? (See the sketch just below.)
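To make the prefix idea a bit more concrete, here is a minimal sketch of both variants mentioned in the thread: forwarding only keys that carry an 'iceberg.tblproperties.' prefix, or forwarding everything except the reserved "how to get to" keys. The prefix, the reserved key names, and the class itself are placeholders for illustration, not an existing Iceberg or Impala API.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative only: shows how a prefix or an exclusion rule could decide which
// HMS table properties get forwarded to the Iceberg table. The prefix and the
// reserved key names are assumptions for this sketch, not an agreed convention.
public class PropertyFilterSketch {

  private static final String ICEBERG_PREFIX = "iceberg.tblproperties.";

  // Keys reserved for locating the table; these would never be forwarded.
  private static final Set<String> RESERVED_KEYS = Set.of(
      "iceberg.catalog", "iceberg.catalog_location", "iceberg.table_identifier");

  // Inclusion variant: only keys carrying the prefix are forwarded, prefix stripped.
  static Map<String, String> includeByPrefix(Map<String, String> hmsProps) {
    Map<String, String> forwarded = new HashMap<>();
    for (Map.Entry<String, String> entry : hmsProps.entrySet()) {
      if (entry.getKey().startsWith(ICEBERG_PREFIX)) {
        forwarded.put(entry.getKey().substring(ICEBERG_PREFIX.length()), entry.getValue());
      }
    }
    return forwarded;
  }

  // Exclusion variant (Ryan's preference): everything except the reserved
  // "how to get to" keys is forwarded unchanged.
  static Map<String, String> excludeReserved(Map<String, String> hmsProps) {
    Map<String, String> forwarded = new HashMap<>(hmsProps);
    forwarded.keySet().removeAll(RESERVED_KEYS);
    return forwarded;
  }
}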
>>>> Thanks,
>>>> Peter
>>>>
>>>> On Nov 27, 2020, at 17:45, Mass Dosage <massdos...@gmail.com> wrote:
>>>>
>>>> I like these suggestions, comments inline below on the last round...
>>>>
>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <borokna...@apache.org> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The above aligns with what we did in Impala, i.e. we store information about table loading in HMS table properties. We are just a bit more explicit about which catalog to use. We have the table property 'iceberg.catalog' to determine the catalog type; right now the supported values are 'hadoop.tables', 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be set based on the catalog type.
>>>>>
>>>>> So, if the value of 'iceberg.catalog' is
>>>>
>>>> I'm all for renaming this, having "mr" in the property name is confusing.
>>>>
>>>>> - hadoop.tables
>>>>>   - The table location is used to load the table.
>>>>
>>>> The only question I have is: should we have this as the default? I.e. if you don't set a catalog, it will assume it's HadoopTables and use the location? Or should we require this property to be present, to be consistent and avoid any "magic"?
>>>>
>>>>> - hadoop.catalog
>>>>>   - Required table property 'iceberg.catalog_location' specifies the location of the Hadoop catalog in the file system.
>>>>>   - Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
>>>>
>>>> I like this as it would allow you to use a different database and table name in Hive as opposed to the Hadoop catalog - at the moment they have to match. The only thing here is that I think Hive requires a table LOCATION to be set, and it's then confusing as there are now two locations on the table. I'm not sure whether in the Hive storage handler or SerDe etc. we can get Hive to not require that and maybe even disallow it from being set. That would probably be best in conjunction with this. Another solution would be to not have the 'iceberg.catalog_location' property but instead use the table LOCATION for this, but that's a bit confusing from a Hive point of view.
>>>>
>>>>> - hive.catalog
>>>>>   - Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
>>>>>   - We have the assumption that the current Hive metastore stores the table, i.e. we don't support external Hive metastores currently.
>>>>
>>>> These sound fine for Hive catalog tables that are created outside of the automatic Hive table creation (see https://iceberg.apache.org/hive/ -> Using Hive Catalog); we'd just need to document how you can create these yourself and that one could use a different Hive database and table etc.
>>>>
>>>>> Independent of the catalog implementations, we also have the table property 'iceberg.file_format' to specify the file format for the data files.
>>>>
>>>> OK, I don't think we need that for Hive?
>>>>
>>>>> We haven't released it yet, so we are open to changes, but I think these properties are reasonable and it would be great if we could standardize the properties across engines that use HMS as the primary metastore of tables.
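For illustration, a rough sketch of how an engine could dispatch on the 'iceberg.catalog' values described above. The property names follow the Impala proposal in this thread; the helper class is hypothetical and the catalog constructor calls are simplified, so they may not match every Iceberg version.

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.hive.HiveCatalog;

// Hypothetical helper: dispatches on the proposed 'iceberg.catalog' values.
public class CatalogDispatchSketch {

  static Table load(Configuration conf, String dbName, String tableName,
                    String hmsTableLocation, Map<String, String> tblProps) {
    String catalogType = tblProps.get("iceberg.catalog");
    String identifier = tblProps.getOrDefault(
        "iceberg.table_identifier", dbName + "." + tableName);

    if (catalogType == null) {
      throw new IllegalArgumentException("iceberg.catalog is required in this sketch");
    }
    switch (catalogType) {
      case "hadoop.tables":
        // The HMS table LOCATION points directly at the Iceberg table.
        return new HadoopTables(conf).load(hmsTableLocation);
      case "hadoop.catalog":
        // 'iceberg.catalog_location' is required for this catalog type.
        String catalogLocation = tblProps.get("iceberg.catalog_location");
        return new HadoopCatalog(conf, catalogLocation)
            .loadTable(TableIdentifier.parse(identifier));
      case "hive.catalog":
        // Assumes the current metastore; external HMSes are out of scope here.
        return new HiveCatalog(conf).loadTable(TableIdentifier.parse(identifier));
      default:
        throw new IllegalArgumentException("Unknown iceberg.catalog value: " + catalogType);
    }
  }
}

With properties like the ones in the CREATE EXTERNAL TABLE example earlier in the thread, the 'hadoop.catalog' branch would be taken.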
>>>> If others agree I think we should create an issue where we document the above changes so it's very clear what we're doing, and we can then go and implement them and update the docs etc.
>>>>
>>>>> Cheers,
>>>>> Zoltan
>>>>>
>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>
>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>
>>>>>> #4 is correct because we provide some information that is informational (Hive schema) or tracked only by the metastore (best-effort current user). I also agree that it would be good to have a table identifier in HMS table metadata when loading from an external table. That gives us a way to handle name conflicts.
>>>>>>
>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>
>>>>>>> Minor error, my last example should have been:
>>>>>>>
>>>>>>> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>
>>>>>>> --
>>>>>>> Jacques Nadeau
>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>
>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>>
>>>>>>>> I agree with Ryan on the core principles here. As I understand them:
>>>>>>>>
>>>>>>>> 1. Iceberg metadata describes all properties of a table.
>>>>>>>> 2. Hive table properties describe "how to get to" Iceberg metadata (which catalog + possibly ptr, path, token, etc).
>>>>>>>> 3. There could be default "how to get to" information set at a global level.
>>>>>>>> 4. A best-effort schema should be stored in the table properties in HMS. This should be done for information schema retrieval purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>
>>>>>>>> Is that a fair summary of your statements, Ryan (except 4, which I just added)?
>>>>>>>>
>>>>>>>> One comment I have on #2 is that for different catalogs and use cases, it can be somewhat more complex: it would be desirable for a table that initially existed outside Hive and was later exposed in Hive to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1). If you then want to expose that table to Hive, you might have this mapping for #2:
>>>>>>>>
>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>
>>>>>>>> Similarly, you might want to expose a particular branch version of a table. So it might say:
>>>>>>>>
>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>
>>>>>>>> Just saying that the address to the table in the catalog could itself have several properties. The key point is that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.
>>>>>>>>
>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jacques Nadeau
>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>
>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Thanks for working on this, Laszlo.
>>>>>>>>> I've been thinking about these problems as well, so this is a good time to have a discussion about Hive config.
>>>>>>>>>
>>>>>>>>> I think that Hive configuration should work mostly like other engines, where different configurations are used for different purposes. Different purposes mean that there is no single global configuration priority. Hopefully, I can explain how we use the different config sources elsewhere to clarify.
>>>>>>>>>
>>>>>>>>> Let's take Spark as an example. Spark uses Hadoop, so it has a Hadoop Configuration, but it also has its own global configuration. There are also Iceberg table properties, and all of the various Hive properties if you're tracking tables with a Hive MetaStore.
>>>>>>>>>
>>>>>>>>> The first step is to simplify where we can, so we effectively eliminate 2 sources of config:
>>>>>>>>>
>>>>>>>>> - The Hadoop Configuration is only used to instantiate Hadoop classes, like FileSystem. Iceberg should not use it for any other config.
>>>>>>>>> - Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it's abstract), but it is available so they also don't fail trying to load the class. Table-specific config should not be stored in table or serde properties.
>>>>>>>>>
>>>>>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>>>>>>
>>>>>>>>> Iceberg differs from other tables because it is opinionated: data configuration should be maintained at the table level. This is cleaner for users because config is standardized across engines and kept in one place. It also enables services that analyze a table and update its configuration to tune options that users almost never touch, like row group or stripe size in the columnar formats. Iceberg table configuration is used to configure table-specific concerns and behavior.
>>>>>>>>>
>>>>>>>>> Spark configuration is used for engine-specific concerns and runtime overrides. A good example of an engine-specific concern is the catalogs that are available to load Iceberg tables. Spark has a way to load and configure catalog implementations, and Iceberg uses that for all catalog-level config. Runtime overrides are things like target split size. Iceberg has a table-level default split size in table properties, but this can be overridden by a Spark option for each table, as well as an option passed to the individual read. Note that these necessarily have different config names for how they are used: Iceberg uses read.split.target-size and the read-specific option is target-size.
>>>>>>>>>
>>>>>>>>> Applying this to Hive is a little strange for a couple of reasons. First, Hive's engine configuration *is* a Hadoop Configuration.
>>>>>>>>> As a result, I think the right place to store engine-specific config is there, including Iceberg catalogs, using a strategy similar to what Spark does: which external Iceberg catalogs are available, and their configuration, should come from the HiveConf.
>>>>>>>>>
>>>>>>>>> The second way Hive is strange is that Hive needs to use its own MetaStore to track Hive table concerns. The MetaStore may have tables created by an Iceberg HiveCatalog, and Hive also needs to be able to load tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>
>>>>>>>>> Here's how I think Hive should work:
>>>>>>>>>
>>>>>>>>> - There should be a default HiveCatalog that uses the current MetaStore URI, to be used for HiveCatalog tables tracked in the MetaStore.
>>>>>>>>> - Other catalogs should be defined in HiveConf.
>>>>>>>>> - HMS table properties should be used to determine how to load a table: using a Hadoop location, using the default metastore catalog, or using an external Iceberg catalog.
>>>>>>>>>   - If there is a metadata_location, then use the HiveCatalog for this metastore (where it is tracked).
>>>>>>>>>   - If there is a catalog property, then load that catalog and use it to load the table identifier, or maybe an identifier from HMS table properties.
>>>>>>>>>   - If there is no catalog or metadata_location, then use HadoopTables to load the table location as an Iceberg table.
>>>>>>>>>
>>>>>>>>> This would make it possible to access all types of Iceberg tables in the same query, and would match how Spark and Flink configure catalogs. Other than the configuration above, I don't think that config in HMS should be used at all, just like how the other engines work. Iceberg is the source of truth for table metadata, HMS stores how to load the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>
>>>>>>>>> This isn't quite how configuration works right now. Currently, the catalog is controlled by a HiveConf property, iceberg.mr.catalog. If that isn't set, HadoopTables will be used to load table locations. If it is set, then that catalog will be used to load all tables by name. This makes it impossible to load tables from different catalogs at the same time. That's why I think the Iceberg catalog for a table should be stored in HMS table properties.
>>>>>>>>>
>>>>>>>>> I should also explain the iceberg.hive.engine.enabled flag, but I think this is long enough for now.
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>>
>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpin...@cloudera.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> I would like to start a discussion on how we should handle properties from various sources like Iceberg, Hive, or global configuration. I've put together a short document <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>, please have a look and let me know what you think.
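A sketch of the loading precedence Ryan proposes above (metadata_location first, then a named catalog from HiveConf, then HadoopTables as the fallback). The property keys and the two catalog-lookup helpers are assumptions for illustration, not the actual implementation.

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopTables;

// Sketch of the proposed precedence: metadata_location -> default HiveCatalog,
// catalog property -> external catalog from HiveConf, otherwise HadoopTables.
public class HiveLoadOrderSketch {

  static Table resolve(Configuration conf, Map<String, String> hmsProps,
                       String hmsTableLocation, TableIdentifier identifier) {
    if (hmsProps.containsKey("metadata_location")) {
      // Tracked by this metastore: load through the default HiveCatalog.
      return defaultHiveCatalog(conf).loadTable(identifier);
    }
    String catalogName = hmsProps.get("iceberg.catalog");
    if (catalogName != null) {
      // External catalog (Hadoop, Glue, Nessie, ...) defined in HiveConf.
      return catalogFromHiveConf(conf, catalogName).loadTable(identifier);
    }
    // Neither is set: treat the HMS table location as a HadoopTables path.
    return new HadoopTables(conf).load(hmsTableLocation);
  }

  // Placeholders for however the default metastore catalog and the
  // HiveConf-defined catalogs would actually be instantiated.
  static Catalog defaultHiveCatalog(Configuration conf) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }

  static Catalog catalogFromHiveConf(Configuration conf, String name) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }
}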
>>>>>>>>>> Thanks,
>>>>>>>>>> Laszlo
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix
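As a closing illustration of the "catalogs defined in HiveConf" idea discussed in this thread, here is one possible shape for such configuration. The 'iceberg.catalog.<name>.*' key scheme and the custom catalog class name are purely hypothetical placeholders, not settled property names.

import org.apache.hadoop.conf.Configuration;

// Purely illustrative: one possible shape for engine-level catalog definitions
// living in hive-site.xml / HiveConf. The 'iceberg.catalog.<name>.*' key scheme
// and the custom catalog class are placeholders, not settled property names.
public class HiveConfCatalogSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Equivalent hive-site.xml <property> entries would carry the same pairs.
    conf.set("iceberg.catalog.prod_hadoop.type", "hadoop");
    conf.set("iceberg.catalog.prod_hadoop.warehouse", "hdfs://namenode:8020/warehouse/iceberg");
    conf.set("iceberg.catalog.prod_custom.catalog-impl", "com.example.CustomIcebergCatalog");

    // An HMS table would then only reference a catalog by name, e.g.
    // TBLPROPERTIES ('iceberg.catalog'='prod_hadoop'), plus a table identifier.
    String prefix = "iceberg.catalog.prod_hadoop.";
    System.out.println("type      = " + conf.get(prefix + "type"));
    System.out.println("warehouse = " + conf.get(prefix + "warehouse"));
  }
}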