Hi Jacques,

Thanks for picking this up!

Coming from the Hive side, I prefer solution #3.
Knowing Hive, there will be plenty of surprising new properties which we either 
have to handle specifically (hard-coded omission) or which will litter the quite 
neat list of Iceberg table properties. This will become even more pronounced if 
we implement HiveMetaHook.preAlterTable 
<https://hive.apache.org/javadocs/r3.1.2/api/org/apache/hadoop/hive/metastore/HiveMetaHook.html#preAlterTable-org.apache.hadoop.hive.metastore.api.Table-org.apache.hadoop.hive.metastore.api.EnvironmentContext->
 for propagating Hive ALTER TABLE changes to Iceberg tables. Hive calls alter 
table every time it updates statistics on the table - if we forget to filter 
out anything that changes continuously, we would end up creating two 
modifications on the Iceberg table every time we insert a single row of data.
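
To make this concrete, here is a minimal sketch of the filtering I have in mind 
for #3, assuming we settle on a prefix like "iceberg.tblproperties." (the prefix 
and the helper name are illustrative only - nothing is decided yet):

  import java.util.HashMap;
  import java.util.Map;

  class IcebergPropertyFilter {
    // Only keys carrying the agreed prefix are pushed to the Iceberg table;
    // everything else (EXTERNAL, numRows, totalSize, ...) stays HMS-only, so
    // continuously changing statistics never generate extra Iceberg commits.
    static Map<String, String> icebergProperties(Map<String, String> hmsProps) {
      String prefix = "iceberg.tblproperties.";  // assumed prefix, not final
      Map<String, String> filtered = new HashMap<>();
      hmsProps.forEach((key, value) -> {
        if (key.startsWith(prefix)) {
          filtered.put(key.substring(prefix.length()), value);
        }
      });
      return filtered;
    }
  }

Solution #4 would be the same check inverted: drop the keys carrying the 
HMS-only prefix and pass everything else through to the Iceberg table.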

Thanks,
Peter

> On Dec 7, 2020, at 18:58, Jacques Nadeau <jacq...@dremio.com> wrote:
> 
> Hey Peter, thanks for updating the doc and your heads up in the other thread 
> on your capacity to look at this before EOY.
> 
> I'm going to try to create a specification document based on the discussion 
> document you put together. I think there is general consensus around what you 
> call "Spark-like catalog configuration" so I'd like to formalize that more. 
> 
> It seems like there is less consensus around the whitelist/blacklist side of 
> things. You outline four approaches: 
> 1. Hard coded HMS only property list
> 2. Hard coded Iceberg only property list
> 3. Prefix for Iceberg properties
> 4. Prefix for HMS only properties
> I generally think #2 is a no-go as it creates too much coupling between 
> catalog implementations and core Iceberg. It seems like Ryan Blue would 
> prefer #4 (correct?). Any other strong opinions?
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
> 
> 
> On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid> wrote:
> As Jacques suggested (with the help of Zoltan) I have collected the current 
> state and the proposed solutions in a document:
> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
> 
> My feeling is that we do not have a final decision, so I tried to list all 
> the possible solutions. 
> Please comment!
> 
> Thanks,
> Peter
> 
>> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com 
>> <mailto:pv...@cloudera.com>> wrote:
>> 
>> When I was working on the CREATE TABLE patch I found the following 
>> TBLPROPERTIES on newly created tables:
>> external.table.purge
>> EXTERNAL
>> bucketing_version
>> numRows
>> rawDataSize
>> totalSize
>> numFiles
>> numFileErasureCoded
>> 
>> I am afraid that we cannot change the names of most of these properties, and 
>> it might not be useful to have most of them when the Iceberg statistics are 
>> already there. Also, my feeling is that this is only the tip of the Iceberg 
>> (pun intended :)), which is why I think we should have a more targeted way 
>> to push properties to the Iceberg tables.
>> 
>>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com 
>>> <mailto:rb...@netflix.com>> wrote:
>>> 
>>> Sorry, I accidentally didn’t copy the dev list on this reply. Resending:
>>> 
>>> Also I expect that we will want to add Hive-specific write configs at the 
>>> table level when the general, engine-independent configuration is not ideal 
>>> for Hive, but every Hive query for a given table should use some specific 
>>> config.
>>> 
>>> Hive may need configuration, but I think these should still be kept in the 
>>> Iceberg table. There is no reason to make Hive config inaccessible from 
>>> other engines. If someone wants to view all of the config for a table from 
>>> Spark, the Hive config should also be included, right?
>>> 
>>> 
>>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com 
>>> <mailto:pv...@cloudera.com>> wrote:
>>> I will ask Laszlo if he wants to update his doc.
>>> 
>>> I see both pros and cons of catalog definition in config files. If there is 
>>> an easy default then I do not mind any of the proposed solutions.
>>> 
>>> OTOH I am in favor of the "use prefix for Iceberg table properties" 
>>> solution, because in Hive it is common to add new keys to the property list 
>>> - no restriction is in place (I am not even sure that the currently 
>>> implemented blacklist for preventing properties from propagating to Iceberg 
>>> tables is complete). Also I expect that we will want to add Hive-specific 
>>> write configs at the table level when the general, engine-independent 
>>> configuration is not ideal for Hive, but every Hive query for a given table 
>>> should use some specific config.
>>> 
>>> Thanks, Peter 
>>> 
>>> On Tue, Dec 1, 2020 at 5:06 PM Jacques Nadeau <jacq...@dremio.com 
>>> <mailto:jacq...@dremio.com>> wrote:
>>> Would someone be willing to create a document that states the current 
>>> proposal? 
>>> 
>>> It is becoming somewhat difficult to follow this thread. I also worry that, 
>>> without a complete statement of the current shape, people may be 
>>> incorrectly thinking they are in alignment.
>>> 
>>> 
>>> 
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>> 
>>> 
>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <borokna...@cloudera.com 
>>> <mailto:borokna...@cloudera.com>> wrote:
>>> Thanks, Ryan. I answered inline.
>>> 
>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com 
>>> <mailto:rb...@netflix.com>> wrote:
>>> This sounds like a good plan overall, but I have a couple of notes:
>>> 
>>> We need to keep in mind that users plug in their own catalogs, so 
>>> iceberg.catalog could be a Glue or Nessie catalog, not just Hive or Hadoop. 
>>> I don’t think it makes much sense to use separate hadoop.catalog and 
>>> hive.catalog values. Those should just be names for catalogs configured in 
>>> Configuration, i.e., via hive-site.xml. We then only need a special value 
>>> for loading Hadoop tables from paths.
>>> About extensibility, I think the usual Hive way is to use Java class names. 
>>> That way the value for 'iceberg.catalog' could be e.g. 
>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation 
>>> would need to have a factory method that constructs the catalog object from 
>>> a properties object (Map<String, String>). E.g. 
>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require 
>>> 'iceberg.catalog_location' to be present in properties.
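>>> A rough sketch of the factory lookup I mean - the static 'from' method does 
>>> not exist today, it is exactly what each implementation would need to add:
>>> 
>>>   import java.util.Map;
>>>   import org.apache.iceberg.catalog.Catalog;
>>> 
>>>   class CatalogFactory {
>>>     static Catalog create(Map<String, String> props) throws Exception {
>>>       // 'iceberg.catalog' carries the implementation class name,
>>>       // e.g. 'org.apache.iceberg.hive.HiveCatalog'
>>>       String impl = props.get("iceberg.catalog");
>>>       // assumes a static factory that builds the catalog from the
>>>       // properties map and validates its own required keys, e.g.
>>>       // HadoopCatalog checking for 'iceberg.catalog_location'
>>>       return (Catalog) Class.forName(impl)
>>>           .getMethod("from", Map.class)
>>>           .invoke(null, props);
>>>     }
>>>   }
>>> 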
>>> I don’t think that catalog configuration should be kept in table 
>>> properties. A catalog should not be loaded for each table. So I don’t think 
>>> we need iceberg.catalog_location. Instead, we should have a way to define 
>>> catalogs in the Configuration for tables in the metastore to reference.
>>> I think it makes sense; on the other hand, it would make adding new 
>>> catalogs more heavyweight, i.e. now you'd need to edit configuration files 
>>> and restart/reinitialize services. That might be cumbersome in some 
>>> environments.
>>> I’d rather use a prefix to exclude properties from being passed to Iceberg 
>>> than to include them. Otherwise, users don’t know what to do to pass table 
>>> properties from Hive or Impala. If we exclude a prefix or specific 
>>> properties, then everything but the properties reserved for locating the 
>>> table is passed as the user would expect.
>>> I don't have a strong opinion about this, but yeah, maybe this behavior 
>>> would cause the least surprises.
>>> 
>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <borokna...@apache.org 
>>> <mailto:borokna...@apache.org>> wrote:
>>> Thanks, Peter. I answered inline.
>>> 
>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid 
>>> <mailto:pv...@cloudera.com.invalid>> wrote:
>>> Hi Zoltan,
>>> 
>>> Answers below:
>>> 
>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy 
>>>> <borokna...@cloudera.com.INVALID <mailto:borokna...@cloudera.com.INVALID>> 
>>>> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> Thanks for the replies. My take on the above questions is as follows:
>>>> Should 'iceberg.catalog' be a required property?
>>>> Yeah, I think it would be nice if this were required, to avoid any 
>>>> implicit behavior
>>> Currently we have a Catalogs class to get/initialize/use the different 
>>> Catalogs. At that time the decision was to use HadoopTables as a default 
>>> catalog.
>>> It might be worthwhile to use the same class in Impala as well, so the 
>>> behavior is consistent.
>>>  
>>> Yeah, I think it'd be beneficial for us to use the Iceberg classes whenever 
>>> possible. The Catalogs class is very similar to what we have currently in 
>>> Impala.  
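>>> For reference, if I read the class right, usage is roughly this (untested 
>>> sketch; without a configured catalog it falls back to HadoopTables):
>>> 
>>>   import org.apache.hadoop.conf.Configuration;
>>>   import org.apache.iceberg.Table;
>>>   import org.apache.iceberg.mr.Catalogs;
>>>   import org.apache.iceberg.mr.InputFormatConfig;
>>> 
>>>   class CatalogsExample {
>>>     public static void main(String[] args) {
>>>       Configuration conf = new Configuration();
>>>       // tell Catalogs which table to load
>>>       conf.set(InputFormatConfig.TABLE_IDENTIFIER, "default.tbl");
>>>       Table table = Catalogs.loadTable(conf);
>>>       System.out.println(table.location());
>>>     }
>>>   }
>>> 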
>>>> 'hadoop.catalog' LOCATION and catalog_location
>>>> In Impala we don't allow setting LOCATION for tables stored in 
>>>> 'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg 
>>>> table's actual location. We were also thinking about using only the table 
>>>> LOCATION, and setting it to the catalog location, but we found that 
>>>> confusing too.
>>> It could definitely work, but it is somewhat strange that we have an 
>>> external table location set to an arbitrary path, and we have a different 
>>> location generated by other configs. It would be nice to have the real 
>>> location set in the external table location as well.
>>> 
>>> Impala sets the real Iceberg table location for external tables. E.g. if 
>>> the user issues
>>> 
>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>> STORED AS ICEBERG
>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>> 
>>> If the end user had specified LOCATION, then Impala would have raised an 
>>> error. But the above DDL statement is correct, so Impala loads the Iceberg 
>>> table via the Iceberg API, then creates the HMS table and sets LOCATION to the 
>>> Iceberg table location (something like 
>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>> 
>>> 
>>> I like the flexibility of setting the table_identifier at the table level, 
>>> which could help resolve naming conflicts. We might want to have this in 
>>> the Iceberg Catalog implementation.
>>>> 'iceberg.table_identifier' for HiveCatalog
>>>> Yeah, it doesn't add much if we only allow using the current HMS. I think 
>>>> it is only useful if we allow external HMSes.
>>>> Moving properties to SERDEPROPERTIES
>>>> I see that these properties are used by the SerDe classes in Hive, but I 
>>>> feel that these properties are just not about serialization and 
>>>> deserialization. And as I see the current SERDEPROPERTIES are things like 
>>>> 'field.delim', 'separatorChar', 'quoteChar', etc. So properties about 
>>>> table loading more naturally belong to TBLPROPERTIES in my opinion.
>>> I have seen it used both ways for HBaseSerDe (even the wiki page uses both 
>>> :) ). Since Impala prefers TBLPROPERTIES, and if we start using a prefix for 
>>> separating real Iceberg table properties from other properties, then we can 
>>> keep them in TBLPROPERTIES.
>>> 
>>> In the Google doc I also had a comment about prefixing Iceberg table 
>>> properties. We could use a prefix like 'iceberg.tblproperties.', and pass 
>>> every property with this prefix to the Iceberg table. Currently Impala 
>>> passes every table property to the Iceberg table.
>>> 
>>> 
>>> 
>>>> Thanks,
>>>>     Zoltan
>>>> 
>>>> 
>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid 
>>>> <mailto:pv...@cloudera.com.invalid>> wrote:
>>>> Hi,
>>>> 
>>>> Based on the discussion below I understand we have the following kinds of 
>>>> properties:
>>>> Iceberg table properties - Engine independent, storage related parameters
>>>> "how to get to" - I think these are mostly Hive table specific properties, 
>>>> since for Spark, the Spark catalog configuration serves for the same 
>>>> purpose. I think the best place for storing these would be the Hive 
>>>> SERDEPROPERTIES, as this describes the access information for the SerDe. 
>>>> Sidenote: I think we should decide if we allow HiveCatalogs pointing to a 
>>>> different HMS and the 'iceberg.table_identifier' would make sense only if 
>>>> we allow having multiple catalogs.
>>>> Query specific properties - These are engine specific and might be mapped 
>>>> to / even override the Iceberg table properties on the engine specific 
>>>> code paths, but currently these properties have independent names and 
>>>> mapped on a case-by-case basis. 
>>>> 
>>>> Based on this:
>>>> Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>>> Shall we define a prefix for setting Iceberg table properties from Hive 
>>>> queries and omit other engine specific properties?
>>>> 
>>>> Thanks,
>>>> Peter
>>>> 
>>>> 
>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <massdos...@gmail.com 
>>>>> <mailto:massdos...@gmail.com>> wrote:
>>>>> 
>>>>> I like these suggestions, comments inline below on the last round...
>>>>> 
>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <borokna...@apache.org 
>>>>> <mailto:borokna...@apache.org>> wrote:
>>>>> Hi,
>>>>> 
>>>>> The above aligns with what we did in Impala, i.e. we store information 
>>>>> about table loading in HMS table properties. We are just a bit more 
>>>>> explicit about which catalog to use.
>>>>> We have table property 'iceberg.catalog' to determine the catalog type, 
>>>>> right now the supported values are 'hadoop.tables', 'hadoop.catalog', and 
>>>>> 'hive.catalog'. Additional table properties can be set based on the 
>>>>> catalog type.
>>>>> 
>>>>> So, if the value of 'iceberg.catalog' is
>>>>> 
>>>>> I'm all for renaming this, having "mr" in the property name is confusing.
>>>>>  
>>>>> hadoop.tables
>>>>> the table location is used to load the table
>>>>> The only question I have is: should we have this as the default? I.e. if 
>>>>> you don't set a catalog, it will assume it's HadoopTables and use the 
>>>>> location? Or should we require this property to be here to be consistent 
>>>>> and avoid any "magic"?
>>>>>  
>>>>> hadoop.catalog
>>>>> Required table property 'iceberg.catalog_location' specifies the location 
>>>>> of the hadoop catalog in the file system
>>>>> Optional table property 'iceberg.table_identifier' specifies the table 
>>>>> id. If it's not set, then <database_name>.<table_name> is used as table 
>>>>> identifier
>>>>> I like this as it would allow you to use a different database and table 
>>>>> name in Hive as opposed to the Hadoop Catalog - at the moment they have 
>>>>> to match. The only thing here is that I think Hive requires a table 
>>>>> LOCATION to be set, and it's then confusing as there are now two locations 
>>>>> on the table. I'm not sure whether in the Hive storage handler or SerDe 
>>>>> etc. we can get Hive to not require that, and maybe even disallow it from 
>>>>> being set. That would probably be best in conjunction with this. Another 
>>>>> solution would be to not have the 'iceberg.catalog_location' property but 
>>>>> instead use the table LOCATION for this, but that's a bit confusing from a 
>>>>> Hive point of view.
>>>>> 
>>>>> hive.catalog
>>>>> Optional table property 'iceberg.table_identifier' specifies the table 
>>>>> id. If it's not set, then <database_name>.<table_name> is used as table 
>>>>> identifier
>>>>> We have the assumption that the current Hive metastore stores the table, 
>>>>> i.e. we don't support external Hive metastores currently
>>>>> These sound fine for Hive catalog tables that are created outside of the 
>>>>> automatic Hive table creation (see https://iceberg.apache.org/hive/ -> 
>>>>> Using Hive Catalog); we'd just need 
>>>>> to document how you can create these yourself and that one could use a 
>>>>> different Hive database and table etc.
>>>>>  
>>>>> Independent of catalog implementations, we also have table property 
>>>>> 'iceberg.file_format' to specify the file format for the data files.
>>>>> 
>>>>> OK, I don't think we need that for Hive?
>>>>>  
>>>>> We haven't released it yet, so we are open to changes, but I think these 
>>>>> properties are reasonable and it would be great if we could standardize 
>>>>> the properties across engines that use HMS as the primary metastore of 
>>>>> tables.
>>>>> 
>>>>> 
>>>>> If others agree I think we should create an issue where we document the 
>>>>> above changes so it's very clear what we're doing and can then go and 
>>>>> implement them and update the docs etc.
>>>>>  
>>>>> Cheers,
>>>>>     Zoltan
>>>>> 
>>>>> 
>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid 
>>>>> <mailto:rb...@netflix.com.invalid>> wrote:
>>>>> Yes, I think that is a good summary of the principles.
>>>>> 
>>>>> #4 is correct because some of the information we provide is informational 
>>>>> only (Hive schema) or tracked only by the metastore (best-effort current 
>>>>> user). I also agree that it would be good to have a table identifier in 
>>>>> HMS table metadata when loading from an external table. That gives us a 
>>>>> way to handle name conflicts.
>>>>> 
>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacq...@dremio.com 
>>>>> <mailto:jacq...@dremio.com>> wrote:
>>>>> Minor error, my last example should have been:
>>>>> 
>>>>> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>>>>> 
>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio
>>>>> 
>>>>> 
>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com 
>>>>> <mailto:jacq...@dremio.com>> wrote:
>>>>> I agree with Ryan on the core principles here. As I understand them:
>>>>> 1. Iceberg metadata describes all properties of a table
>>>>> 2. Hive table properties describe "how to get to" Iceberg metadata (which 
>>>>> catalog + possibly ptr, path, token, etc)
>>>>> 3. There could be default "how to get to" information set at a global level
>>>>> 4. Best-effort schema should be stored in the table properties in HMS. This 
>>>>> should be done for information schema retrieval purposes within Hive but 
>>>>> should be ignored during Hive/other tool execution.
>>>>> Is that a fair summary of your statements Ryan (except 4, which I just 
>>>>> added)?
>>>>> 
>>>>> One comment I have on #2 is that for different catalogs and use cases it 
>>>>> can be somewhat more complex: for a table that initially existed without 
>>>>> Hive and was later exposed in Hive, it would be desirable to support a 
>>>>> ptr/path/token for how the table is named externally. For example, in a 
>>>>> Nessie context we support arbitrary paths for an Iceberg table (such as 
>>>>> folder1.folder2.folder3.table1). If you then want to expose that table to 
>>>>> Hive, you might have this mapping for #2:
>>>>> 
>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>> 
>>>>> Similarly, you might want to expose a particular branch version of a 
>>>>> table. So it might say:
>>>>> 
>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>> 
>>>>> Just saying that the address to the table in the catalog could itself 
>>>>> have several properties. The key being that no matter what those are, we 
>>>>> should follow #1 and only store properties that are about the ptr, not 
>>>>> the content/metadata.
>>>>> 
>>>>> Lastly, I believe #4 is the case but haven't tested it. Can someone 
>>>>> confirm that it is true? And that it is possible/not problematic?
>>>>> 
>>>>> 
>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio
>>>>> 
>>>>> 
>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid 
>>>>> <mailto:rb...@netflix.com.invalid>> wrote:
>>>>> Thanks for working on this, Laszlo. I’ve been thinking about these 
>>>>> problems as well, so this is a good time to have a discussion about Hive 
>>>>> config.
>>>>> 
>>>>> I think that Hive configuration should work mostly like other engines, 
>>>>> where different configurations are used for different purposes. Different 
>>>>> purposes means that there is not a global configuration priority. 
>>>>> Hopefully, I can explain how we use the different config sources 
>>>>> elsewhere to clarify.
>>>>> 
>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop 
>>>>> Configuration, but it also has its own global configuration. There are 
>>>>> also Iceberg table properties, and all of the various Hive properties if 
>>>>> you’re tracking tables with a Hive MetaStore.
>>>>> 
>>>>> The first step is to simplify where we can, so we effectively eliminate 2 
>>>>> sources of config:
>>>>> 
>>>>> 1. The Hadoop Configuration is only used to instantiate Hadoop classes, like 
>>>>> FileSystem. Iceberg should not use it for any other config.
>>>>> 2. Config in the Hive MetaStore is only used to identify that a table is 
>>>>> Iceberg and point to its metadata location. All other config in HMS is 
>>>>> informational. For example, the input format is FileInputFormat so that 
>>>>> non-Iceberg readers cannot actually instantiate the format (it's 
>>>>> abstract) but it is available so they also don't fail trying to load the 
>>>>> class. Table-specific config should not be stored in table or serde 
>>>>> properties.
>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>> 
>>>>> Iceberg differs from other tables because it is opinionated: data 
>>>>> configuration should be maintained at the table level. This is cleaner 
>>>>> for users because config is standardized across engines and in one place. 
>>>>> And it also enables services that analyze a table and update its 
>>>>> configuration to tune options that users almost never do, like row group 
>>>>> or stripe size in the columnar formats. Iceberg table configuration is 
>>>>> used to configure table-specific concerns and behavior.
>>>>> 
>>>>> Spark configuration is used for engine-specific concerns, and runtime 
>>>>> overrides. A good example of an engine-specific concern is the catalogs 
>>>>> that are available to load Iceberg tables. Spark has a way to load and 
>>>>> configure catalog implementations and Iceberg uses that for all 
>>>>> catalog-level config. Runtime overrides are things like target split 
>>>>> size. Iceberg has a table-level default split size in table properties, 
>>>>> but this can be overridden by a Spark option for each table, as well as 
>>>>> an option passed to the individual read. Note that these necessarily have 
>>>>> different config names for how they are used: Iceberg uses 
>>>>> read.split.target-size and the read-specific option is target-size.
>>>>> 
>>>>> Applying this to Hive is a little strange for a couple reasons. First, 
>>>>> Hive’s engine configuration is a Hadoop Configuration. As a result, I 
>>>>> think the right place to store engine-specific config is there, including 
>>>>> Iceberg catalogs using a strategy similar to what Spark does: what 
>>>>> external Iceberg catalogs are available and their configuration should 
>>>>> come from the HiveConf.
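>>>>> 
>>>>> To make that concrete, catalog definitions in HiveConf could look something 
>>>>> like this - the key layout below is made up for illustration, not an 
>>>>> existing convention:
>>>>> 
>>>>>   import org.apache.hadoop.conf.Configuration;
>>>>> 
>>>>>   class HiveConfCatalogs {
>>>>>     // hypothetical keys, one set per named catalog:
>>>>>     //   iceberg.catalog.<name>.impl -- catalog implementation class
>>>>>     //   iceberg.catalog.<name>.uri  -- e.g. a metastore or Nessie URI
>>>>>     static String catalogImpl(Configuration conf, String name) {
>>>>>       return conf.get("iceberg.catalog." + name + ".impl");
>>>>>     }
>>>>>   }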
>>>>> 
>>>>> The second way Hive is strange is that Hive needs to use its own 
>>>>> MetaStore to track Hive table concerns. The MetaStore may have tables 
>>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load 
>>>>> tables from other Iceberg catalogs by creating table entries for them.
>>>>> 
>>>>> Here’s how I think Hive should work:
>>>>> 
>>>>> - There should be a default HiveCatalog that uses the current MetaStore URI, 
>>>>> to be used for HiveCatalog tables tracked in the MetaStore
>>>>> - Other catalogs should be defined in HiveConf
>>>>> - HMS table properties should be used to determine how to load a table: 
>>>>> using a Hadoop location, using the default metastore catalog, or using an 
>>>>> external Iceberg catalog:
>>>>>   - If there is a metadata_location, then use the HiveCatalog for this 
>>>>>     metastore (where it is tracked)
>>>>>   - If there is a catalog property, then load that catalog and use it to 
>>>>>     load the table identifier, or maybe an identifier from HMS table 
>>>>>     properties
>>>>>   - If there is no catalog or metadata_location, then use HadoopTables to 
>>>>>     load the table location as an Iceberg table
>>>>> This would make it possible to access all types of Iceberg tables in the 
>>>>> same query, and would match how Spark and Flink configure catalogs. Other 
>>>>> than the configuration above, I don’t think that config in HMS should be 
>>>>> used at all, like how the other engines work. Iceberg is the source of 
>>>>> truth for table metadata, HMS stores how to load the Iceberg table, and 
>>>>> HiveConf defines the catalogs (or runtime overrides).
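>>>>> 
>>>>> In code, that resolution order would look roughly like this (untested; the 
>>>>> 'catalog' property name and the reflective catalog loading are placeholders 
>>>>> for whatever we standardize on):
>>>>> 
>>>>>   import java.util.Map;
>>>>>   import org.apache.hadoop.conf.Configuration;
>>>>>   import org.apache.iceberg.Table;
>>>>>   import org.apache.iceberg.catalog.Catalog;
>>>>>   import org.apache.iceberg.catalog.TableIdentifier;
>>>>>   import org.apache.iceberg.hadoop.HadoopTables;
>>>>> 
>>>>>   class HiveTableResolver {
>>>>>     static Table load(Configuration conf, Map<String, String> hmsParams,
>>>>>         Catalog defaultHiveCatalog, TableIdentifier id, String location)
>>>>>         throws Exception {
>>>>>       if (hmsParams.containsKey("metadata_location")) {
>>>>>         // tracked by this metastore: load through the default HiveCatalog
>>>>>         return defaultHiveCatalog.loadTable(id);
>>>>>       } else if (hmsParams.containsKey("catalog")) {
>>>>>         // named external catalog, defined in the HiveConf
>>>>>         // (using the hypothetical keys sketched earlier)
>>>>>         String impl = conf.get(
>>>>>             "iceberg.catalog." + hmsParams.get("catalog") + ".impl");
>>>>>         Catalog external = (Catalog) Class.forName(impl)
>>>>>             .getDeclaredConstructor().newInstance();
>>>>>         return external.loadTable(id);
>>>>>       } else {
>>>>>         // neither: treat the table location as a HadoopTables path
>>>>>         return new HadoopTables(conf).load(location);
>>>>>       }
>>>>>     }
>>>>>   }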
>>>>> 
>>>>> This isn’t quite how configuration works right now. Currently, the 
>>>>> catalog is controlled by a HiveConf property, iceberg.mr.catalog. If that 
>>>>> isn’t set, HadoopTables will be used to load table locations. If it is 
>>>>> set, then that catalog will be used to load all tables by name. This 
>>>>> makes it impossible to load tables from different catalogs at the same 
>>>>> time. That’s why I think the Iceberg catalog for a table should be stored 
>>>>> in HMS table properties.
>>>>> 
>>>>> I should also explain the iceberg.hive.engine.enabled flag, but I think this 
>>>>> is long enough for now.
>>>>> 
>>>>> rb
>>>>> 
>>>>> 
>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter 
>>>>> <lpin...@cloudera.com.invalid <mailto:lpin...@cloudera.com.invalid>> 
>>>>> wrote:
>>>>> Hi All,
>>>>> 
>>>>> I would like to start a discussion on how we should handle properties from 
>>>>> various sources like Iceberg, Hive, or the global configuration. I've put 
>>>>> together a short document 
>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>  please have a look and let me know what you think.
>>>>> 
>>>>> Thanks,
>>>>> Laszlo
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>> 
>>> 
>>> -- 
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>> 
> 
