As Jacques suggested (with the help of Zoltan) I have collected the current state and the proposed solutions in a document: https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
My feeling is that we do not have a final decision yet, so I tried to list all the possible solutions. Please comment!

Thanks, Peter

> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>
> When I was working on the CREATE TABLE patch I found the following TBLPROPERTIES on newly created tables:
> - external.table.purge
> - EXTERNAL
> - bucketing_version
> - numRows
> - rawDataSize
> - totalSize
> - numFiles
> - numFileErasureCoded
>
> I am afraid that we cannot change the names of most of these properties, and it might not be useful to have most of them when the Iceberg statistics are already there. Also, my feeling is that this is only the tip of the Iceberg (pun intended :)), which is why I think we need a more targeted way to push properties to Iceberg tables.
>
>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>>
>> Sorry, I accidentally didn't copy the dev list on this reply. Resending:
>>
>> Also I expect that we want to add Hive write specific configs to table level when the general engine independent configuration is not ideal for Hive, but every Hive query for a given table should use some specific config.
>>
>> Hive may need configuration, but I think these should still be kept in the Iceberg table. There is no reason to make Hive config inaccessible from other engines. If someone wants to view all of the config for a table from Spark, the Hive config should also be included, right?
>>
>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com> wrote:
>> I will ask Laszlo if he wants to update his doc.
>>
>> I see both pros and cons of catalog definition in config files. If there is an easy default then I do not mind any of the proposed solutions.
>>
>> OTOH I am in favor of the "use a prefix for Iceberg table properties" solution, because in Hive it is common to add new keys to the property list - no restriction is in place (I am not even sure that the currently implemented blacklist for preventing properties from propagating to Iceberg tables is complete). Also, I expect that we want to add Hive write specific configs at the table level when the general, engine independent configuration is not ideal for Hive, but every Hive query for a given table should use some specific config.
>>
>> Thanks, Peter
>>
>> Jacques Nadeau <jacq...@dremio.com> wrote (on Dec 1, 2020, Tue 17:06):
>> Would someone be willing to create a document that states the current proposal?
>>
>> It is becoming somewhat difficult to follow this thread. I also worry that without a complete statement of the current shape, people may incorrectly think they are in alignment.
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <borokna...@cloudera.com> wrote:
>> Thanks, Ryan. I answered inline.
>>
>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
>> This sounds like a good plan overall, but I have a couple of notes:
>>
>> We need to keep in mind that users plug in their own catalogs, so iceberg.catalog could be a Glue or Nessie catalog, not just Hive or Hadoop. I don't think it makes much sense to use separate hadoop.catalog and hive.catalog values. Those should just be names for catalogs configured in the Configuration, i.e., via hive-site.xml.
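>>
>> (For illustration, a hedged sketch of how named catalogs might be declared in the Configuration - the property scheme below is hypothetical, since exact keys were not settled in this thread; the same keys could equally live in hive-site.xml:)
>>
>>   -- a catalog named "prod" backed by a Hive MetaStore
>>   SET iceberg.catalog.prod.type=hive;
>>   SET iceberg.catalog.prod.uri=thrift://metastore-host:9083;
>>
>>   -- a catalog named "lake" backed by a Hadoop warehouse location
>>   SET iceberg.catalog.lake.type=hadoop;
>>   SET iceberg.catalog.lake.warehouse=hdfs://namenode:8020/warehouse;
>>
>>   -- a table would then reference a catalog by name, e.g.
>>   -- TBLPROPERTIES('iceberg.catalog'='lake')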
>> We then only need a special value for loading Hadoop tables from paths.
>>
>> About extensibility, I think the usual Hive way is to use Java class names. So the value for 'iceberg.catalog' could be e.g. 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation would need a factory method that constructs the catalog object from a properties object (Map<String, String>). E.g. 'org.apache.iceberg.hadoop.HadoopCatalog' would require 'iceberg.catalog_location' to be present in the properties.
>>
>> I don't think that catalog configuration should be kept in table properties. A catalog should not be loaded for each table. So I don't think we need iceberg.catalog_location. Instead, we should have a way to define catalogs in the Configuration for tables in the metastore to reference.
>>
>> I think it makes sense; on the other hand, it would make adding new catalogs more heavy-weight, i.e. now you'd need to edit configuration files and restart/reinit services. That can be cumbersome in some environments.
>>
>> I'd rather use a prefix to exclude properties from being passed to Iceberg than to include them. Otherwise, users don't know what to do to pass table properties from Hive or Impala. If we exclude a prefix or specific properties, then everything but the properties reserved for locating the table is passed as the user would expect.
>>
>> I don't have a strong opinion about this, but yeah, maybe this behavior would cause the least surprises.
>>
>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <borokna...@apache.org> wrote:
>> Thanks, Peter. I answered inline.
>>
>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>> Hi Zoltan,
>>
>> Answers below:
>>
>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <borokna...@cloudera.com.INVALID> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for the replies. My take on the above questions is as follows:
>>>
>>> Should 'iceberg.catalog' be a required property?
>>> Yeah, I think it would be nice if this were required, to avoid any implicit behavior.
>>
>> Currently we have a Catalogs class to get/initialize/use the different catalogs. At that time the decision was to use HadoopTables as the default catalog. It might be worthwhile to use the same class in Impala as well, so the behavior is consistent.
>>
>> Yeah, I think it'd be beneficial for us to use the Iceberg classes whenever possible. The Catalogs class is very similar to what we have currently in Impala.
>>
>>> 'hadoop.catalog' LOCATION and catalog_location
>>> In Impala we don't allow setting LOCATION for tables stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg table's actual location. We were also thinking about using only the table LOCATION and setting it to the catalog location, but we found that confusing too.
>>
>> It could definitely work, but it is somewhat strange that we have an external table location set to an arbitrary path, and we have a different location generated by other configs. It would be nice to have the real location set in the external table location as well.
>>
>> Impala sets the real Iceberg table location for external tables. E.g., if the user issues:
>>
>>   CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>   STORED AS ICEBERG
>>   TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>                 'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>                 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>
>> If the end user had specified LOCATION, then Impala would have raised an error. But the above DDL statement is correct, so Impala loads the Iceberg table via the Iceberg API, then creates the HMS table and sets LOCATION to the Iceberg table location (something like /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>
>> I like the flexibility of setting the table_identifier at table level, which could help resolve naming conflicts. We might want to have this in the Iceberg Catalog implementation.
>>
>>> 'iceberg.table_identifier' for HiveCatalog
>>> Yeah, it doesn't add much if we only allow using the current HMS. I think it is only useful if we allow external HMSes.
>>>
>>> Moving properties to SERDEPROPERTIES
>>> I see that these properties are used by the SerDe classes in Hive, but I feel that these properties are just not about serialization and deserialization. And as I see it, the current SERDEPROPERTIES are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So properties about table loading more naturally belong to TBLPROPERTIES in my opinion.
>>
>> I have seen it used both ways for HBaseSerDe (even the wiki page uses both :) ). Since Impala prefers TBLPROPERTIES, and if we start using a prefix for separating real Iceberg table properties from other properties, then we can keep it in TBLPROPERTIES.
>>
>> In the google doc I also had a comment about prefixing Iceberg table properties. We could use a prefix like 'iceberg.tblproperties.', and pass every property with this prefix to the Iceberg table. Currently Impala passes every table property to the Iceberg table. (A sketch of this follows below.)
>>
>>> Thanks,
>>> Zoltan
>>>
>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>> Hi,
>>>
>>> Based on the discussion below I understand we have the following kinds of properties:
>>> - Iceberg table properties - engine independent, storage related parameters.
>>> - "How to get to" properties - I think these are mostly Hive table specific, since for Spark, the Spark catalog configuration serves the same purpose. I think the best place for storing these would be the Hive SERDEPROPERTIES, as this describes the access information for the SerDe. Sidenote: I think we should decide whether we allow HiveCatalogs pointing to a different HMS; the 'iceberg.table_identifier' would make sense only if we allow having multiple catalogs.
>>> - Query specific properties - These are engine specific and might be mapped to / even override the Iceberg table properties on the engine specific code paths, but currently these properties have independent names and are mapped on a case-by-case basis.
>>>
>>> Based on this:
>>> - Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>> - Shall we define a prefix for setting Iceberg table properties from Hive queries, while omitting other engine specific properties?
>>>
>>> Thanks,
>>> Peter
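>>>
>>> (For illustration, a hedged sketch of the prefix idea from the questions above - the 'iceberg.tblproperties.' prefix is one candidate from the google doc, not a settled convention, and the individual keys are just examples:)
>>>
>>>   -- keys under the prefix would be stripped of it and pushed to the Iceberg table;
>>>   -- unprefixed keys would stay Hive-only:
>>>   CREATE EXTERNAL TABLE db.ice_t STORED AS ICEBERG
>>>   TBLPROPERTIES('iceberg.tblproperties.write.format.default'='orc',
>>>                 'iceberg.tblproperties.read.split.target-size'='134217728',
>>>                 'some.hive.only.property'='x');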
>>>> On Nov 27, 2020, at 17:45, Mass Dosage <massdos...@gmail.com> wrote:
>>>>
>>>> I like these suggestions, comments inline below on the last round...
>>>>
>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <borokna...@apache.org> wrote:
>>>> Hi,
>>>>
>>>> The above aligns with what we did in Impala, i.e. we store information about table loading in HMS table properties. We are just a bit more explicit about which catalog to use. We have the table property 'iceberg.catalog' to determine the catalog type; right now the supported values are 'hadoop.tables', 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be set based on the catalog type.
>>>>
>>>> So, if the value of 'iceberg.catalog' is
>>>>
>>>> I'm all for renaming this, having "mr" in the property name is confusing.
>>>>
>>>> hadoop.tables
>>>> - The table location is used to load the table.
>>>> The only question I have is: should we have this as the default? I.e. if you don't set a catalog, it will assume HadoopTables and use the location? Or should we require this property to be present, to be consistent and avoid any "magic"?
>>>>
>>>> hadoop.catalog
>>>> - Required table property 'iceberg.catalog_location' specifies the location of the hadoop catalog in the file system.
>>>> - Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
>>>> I like this as it would allow you to use a different database and table name in Hive as opposed to the Hadoop catalog - at the moment they have to match. The only thing here is that I think Hive requires a table LOCATION to be set, and it's then confusing as there are now two locations on the table. I'm not sure whether in the Hive storage handler or SerDe etc. we can get Hive to not require that, and maybe even disallow it from being set. That would probably be best in conjunction with this. Another solution would be to not have the 'iceberg.catalog_location' property but instead use the table LOCATION for this, but that's a bit confusing from a Hive point of view.
>>>>
>>>> hive.catalog
>>>> - Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
>>>> - We have the assumption that the current Hive metastore stores the table, i.e. we don't support external Hive metastores currently.
>>>> These sound fine for Hive catalog tables that are created outside of the automatic Hive table creation (see https://iceberg.apache.org/hive/ -> Using Hive Catalog); we'd just need to document how you can create these yourself, and that one could use a different Hive database and table etc.
>>>>
>>>> Independent of catalog implementations, we also have the table property 'iceberg.file_format' to specify the file format for the data files.
>>>>
>>>> OK, I don't think we need that for Hive?
>>>>
>>>> We haven't released it yet, so we are open to changes, but I think these properties are reasonable and it would be great if we could standardize the properties across engines that use HMS as the primary metastore of tables.
>>>>
>>>> If others agree I think we should create an issue where we document the above changes so it's very clear what we're doing, and we can then go and implement them and update the docs etc. (Hedged sketches of the three variants follow below.)
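>>>>
>>>> (For illustration, hedged sketches of the three variants as Impala-style DDL, based on the property names above - exact syntax and defaults may differ per engine:)
>>>>
>>>>   -- hadoop.tables: the table LOCATION is the Iceberg table itself
>>>>   CREATE EXTERNAL TABLE db.t1 STORED AS ICEBERG
>>>>   LOCATION '/warehouse/db/t1'
>>>>   TBLPROPERTIES('iceberg.catalog'='hadoop.tables');
>>>>
>>>>   -- hadoop.catalog: catalog location plus an optional identifier
>>>>   CREATE EXTERNAL TABLE db.t2 STORED AS ICEBERG
>>>>   TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>                 'iceberg.catalog_location'='/warehouse/hadoop_catalog',
>>>>                 'iceberg.table_identifier'='ns1.ns2.t2');
>>>>
>>>>   -- hive.catalog: the current HMS is the catalog; the identifier
>>>>   -- defaults to <database_name>.<table_name>
>>>>   CREATE EXTERNAL TABLE db.t3 STORED AS ICEBERG
>>>>   TBLPROPERTIES('iceberg.catalog'='hive.catalog');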
>>>> Cheers,
>>>> Zoltan
>>>>
>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>> Yes, I think that is a good summary of the principles.
>>>>
>>>> #4 is correct because we provide some information that is informational (the Hive schema) or tracked only by the metastore (best-effort current user). I also agree that it would be good to have a table identifier in HMS table metadata when loading from an external table. That gives us a way to handle name conflicts.
>>>>
>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>> Minor error, my last example should have been:
>>>>
>>>>   db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>>>>
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>>
>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>> I agree with Ryan on the core principles here. As I understand them:
>>>> 1. Iceberg metadata describes all properties of a table.
>>>> 2. Hive table properties describe "how to get to" the Iceberg metadata (which catalog + possibly a ptr, path, token, etc).
>>>> 3. There could be default "how to get to" information set at a global level.
>>>> 4. A best-effort schema should be stored in the table properties in HMS. This should be done for information schema retrieval purposes within Hive, but should be ignored during Hive/other tool execution.
>>>> Is that a fair summary of your statements, Ryan (except 4, which I just added)?
>>>>
>>>> One comment I have on #2 is that for different catalogs and use cases, it can be somewhat more complex: it would be desirable for a table that initially existed without Hive, and was later exposed in Hive, to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1). If you then want to expose that table to Hive, you might have this mapping for #2:
>>>>
>>>>   db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>
>>>> Similarly, you might want to expose a particular branch version of a table. So it might say:
>>>>
>>>>   db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>
>>>> Just saying that the address to the table in the catalog could itself have several properties (a hypothetical sketch follows below). The key being that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.
>>>>
>>>> Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?
>>>>
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
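>>>>
>>>> (For illustration, a hypothetical sketch of how such a Nessie mapping might be registered as an HMS table entry - every property key here apart from 'iceberg.catalog' is invented for the example:)
>>>>
>>>>   CREATE EXTERNAL TABLE db1.table1_etl_branch STORED AS ICEBERG
>>>>   TBLPROPERTIES('iceberg.catalog'='nessie',
>>>>                 'iceberg.table_identifier'='folder1.folder2.folder3.table1',
>>>>                 'iceberg.table_ref'='etl_branch');  -- hypothetical branch pointer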
>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>> Thanks for working on this, Laszlo. I've been thinking about these problems as well, so this is a good time to have a discussion about Hive config.
>>>>
>>>> I think that Hive configuration should work mostly like other engines, where different configurations are used for different purposes. Different purposes means that there is not a global configuration priority. Hopefully, I can explain how we use the different config sources elsewhere to clarify.
>>>>
>>>> Let's take Spark as an example. Spark uses Hadoop, so it has a Hadoop Configuration, but it also has its own global configuration. There are also Iceberg table properties, and all of the various Hive properties if you're tracking tables with a Hive MetaStore.
>>>>
>>>> The first step is to simplify where we can, so we effectively eliminate 2 sources of config:
>>>> - The Hadoop Configuration is only used to instantiate Hadoop classes, like FileSystem. Iceberg should not use it for any other config.
>>>> - Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it's abstract), but it is available so they also don't fail trying to load the class. Table-specific config should not be stored in table or serde properties.
>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>
>>>> Iceberg differs from other tables because it is opinionated: data configuration should be maintained at the table level. This is cleaner for users because config is standardized across engines and kept in one place. And it also enables services that analyze a table and update its configuration, to tune options that users almost never do, like row group or stripe size in the columnar formats. Iceberg table configuration is used to configure table-specific concerns and behavior.
>>>>
>>>> Spark configuration is used for engine-specific concerns and runtime overrides. A good example of an engine-specific concern is the catalogs that are available to load Iceberg tables. Spark has a way to load and configure catalog implementations, and Iceberg uses that for all catalog-level config. Runtime overrides are things like target split size. Iceberg has a table-level default split size in table properties, but this can be overridden by a Spark option for each table, as well as an option passed to the individual read. Note that these necessarily have different config names for how they are used: Iceberg uses read.split.target-size and the read-specific option is target-size.
>>>>
>>>> Applying this to Hive is a little strange for a couple of reasons. First, Hive's engine configuration is a Hadoop Configuration. As a result, I think the right place to store engine-specific config is there, including Iceberg catalogs, using a strategy similar to what Spark does: what external Iceberg catalogs are available and their configuration should come from the HiveConf.
>>>>
>>>> The second way Hive is strange is that Hive needs to use its own MetaStore to track Hive table concerns. The MetaStore may have tables created by an Iceberg HiveCatalog, and Hive also needs to be able to load tables from other Iceberg catalogs by creating table entries for them.
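>>>>
>>>> (For illustration, a hedged sketch of what such a table entry might look like - the catalog name and property keys are illustrative, anticipating the proposal below:)
>>>>
>>>>   -- an HMS entry that merely points at a table in an external Iceberg catalog
>>>>   CREATE EXTERNAL TABLE db.events
>>>>   STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
>>>>   TBLPROPERTIES('iceberg.catalog'='lake',
>>>>                 'iceberg.table_identifier'='ns.events');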
>>>> Here's how I think Hive should work:
>>>> - There should be a default HiveCatalog that uses the current MetaStore URI, to be used for HiveCatalog tables tracked in the MetaStore.
>>>> - Other catalogs should be defined in HiveConf.
>>>> - HMS table properties should be used to determine how to load a table: using a Hadoop location, using the default metastore catalog, or using an external Iceberg catalog:
>>>>   - If there is a metadata_location, then use the HiveCatalog for this metastore (where the table is tracked).
>>>>   - If there is a catalog property, then load that catalog and use it to load the table identifier, or maybe an identifier from HMS table properties.
>>>>   - If there is no catalog or metadata_location, then use HadoopTables to load the table location as an Iceberg table.
>>>> This would make it possible to access all types of Iceberg tables in the same query, and would match how Spark and Flink configure catalogs. Other than the configuration above, I don't think that config in HMS should be used at all, like how the other engines work. Iceberg is the source of truth for table metadata, HMS stores how to load the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>>>>
>>>> This isn't quite how configuration works right now. Currently, the catalog is controlled by a HiveConf property, iceberg.mr.catalog. If that isn't set, HadoopTables will be used to load table locations. If it is set, then that catalog will be used to load all tables by name. This makes it impossible to load tables from different catalogs at the same time. That's why I think the Iceberg catalog for a table should be stored in HMS table properties.
>>>>
>>>> I should also explain the iceberg.hive.engine.enabled flag, but I think this is long enough for now.
>>>>
>>>> rb
>>>>
>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpin...@cloudera.com.invalid> wrote:
>>>> Hi All,
>>>>
>>>> I would like to start a discussion about how we should handle properties from various sources like Iceberg, Hive, or global configuration. I've put together a short document <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>; please have a look and let me know what you think.
>>>>
>>>> Thanks,
>>>> Laszlo
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix