As Jacques suggested (with the help of Zoltan) I have collected the current state and the proposed solutions in a document: https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
My feeling is that we do not have a final decision yet, so I tried to list all the possible solutions. Please comment!

Thanks, Peter

> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>
> When I was working on the CREATE TABLE patch I found the following TBLPROPERTIES on newly created tables:
> - external.table.purge
> - EXTERNAL
> - bucketing_version
> - numRows
> - rawDataSize
> - totalSize
> - numFiles
> - numFileErasureCoded
>
> I am afraid that we cannot change the names of most of these properties, and it might not be useful to have most of them when the Iceberg statistics are already there. Also, my feeling is that this is only the tip of the Iceberg (pun intended :)), which is why I think we need a more targeted way to push properties to Iceberg tables.
>
>> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>>
>> Sorry, I accidentally didn't copy the dev list on this reply. Resending:
>>
>> Also I expect that we want to add Hive write specific configs to table level when the general engine independent configuration is not ideal for Hive, but every Hive query for a given table should use some specific config.
>>
>> Hive may need configuration, but I think these should still be kept in the Iceberg table. There is no reason to make Hive config inaccessible from other engines. If someone wants to view all of the config for a table from Spark, the Hive config should also be included, right?
>>
>> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com> wrote:
>> I will ask Laszlo if he wants to update his doc.
>>
>> I see both pros and cons of catalog definition in config files. If there is an easy default then I do not mind any of the proposed solutions.
>>
>> OTOH I am in favor of the "use a prefix for Iceberg table properties" solution, because in Hive it is common to add new keys to the property list - no restriction is in place (I am not even sure that the currently implemented blacklist for preventing properties from propagating to Iceberg tables is complete). Also, I expect that we want to add Hive write specific configs at the table level when the general, engine independent configuration is not ideal for Hive, but every Hive query for a given table should use some specific config.
>>
>> Thanks, Peter
>>
>> Jacques Nadeau <jacq...@dremio.com> wrote (on Dec 1, 2020, Tue 17:06):
>> Would someone be willing to create a document that states the current proposal?
>>
>> It is becoming somewhat difficult to follow this thread. I also worry that without a complete statement of the current shape, people may incorrectly think they are in alignment.
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <borokna...@cloudera.com> wrote:
>> Thanks, Ryan. I answered inline.
>>
>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
>> This sounds like a good plan overall, but I have a couple of notes:
>>
>> We need to keep in mind that users plug in their own catalogs, so iceberg.catalog could be a Glue or Nessie catalog, not just Hive or Hadoop. I don't think it makes much sense to use separate hadoop.catalog and hive.catalog values. Those should just be names for catalogs configured in the Configuration, i.e., via hive-site.xml.
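>>
>> (For illustration, a hedged sketch of how named catalogs might be declared in the Configuration - the property scheme below is hypothetical, since exact keys were not settled in this thread; the same keys could equally live in hive-site.xml:)
>>
>>   -- a catalog named "prod" backed by a Hive MetaStore
>>   SET iceberg.catalog.prod.type=hive;
>>   SET iceberg.catalog.prod.uri=thrift://metastore-host:9083;
>>
>>   -- a catalog named "lake" backed by a Hadoop warehouse location
>>   SET iceberg.catalog.lake.type=hadoop;
>>   SET iceberg.catalog.lake.warehouse=hdfs://namenode:8020/warehouse;
>>
>>   -- a table would then reference a catalog by name, e.g.
>>   -- TBLPROPERTIES('iceberg.catalog'='lake')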
>> We then only need a special value for loading Hadoop tables from paths.
>>
>> About extensibility, I think the usual Hive way is to use Java class names. So the value for 'iceberg.catalog' could be e.g. 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation would need a factory method that constructs the catalog object from a properties object (Map<String, String>). E.g. 'org.apache.iceberg.hadoop.HadoopCatalog' would require 'iceberg.catalog_location' to be present in the properties.
>>
>> I don't think that catalog configuration should be kept in table properties. A catalog should not be loaded for each table. So I don't think we need iceberg.catalog_location. Instead, we should have a way to define catalogs in the Configuration for tables in the metastore to reference.
>>
>> I think it makes sense; on the other hand, it would make adding new catalogs more heavy-weight, i.e. now you'd need to edit configuration files and restart/reinit services. That can be cumbersome in some environments.
>>
>> I'd rather use a prefix to exclude properties from being passed to Iceberg than to include them. Otherwise, users don't know what to do to pass table properties from Hive or Impala. If we exclude a prefix or specific properties, then everything but the properties reserved for locating the table is passed as the user would expect.
>>
>> I don't have a strong opinion about this, but yeah, maybe this behavior would cause the least surprises.
>>
>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <borokna...@apache.org> wrote:
>> Thanks, Peter. I answered inline.
>>
>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>> Hi Zoltan,
>>
>> Answers below:
>>
>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <borokna...@cloudera.com.INVALID> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for the replies. My take on the above questions is as follows:
>>>
>>> Should 'iceberg.catalog' be a required property?
>>> Yeah, I think it would be nice if this were required, to avoid any implicit behavior.
>>
>> Currently we have a Catalogs class to get/initialize/use the different catalogs. At that time the decision was to use HadoopTables as the default catalog. It might be worthwhile to use the same class in Impala as well, so the behavior is consistent.
>>
>> Yeah, I think it'd be beneficial for us to use the Iceberg classes whenever possible. The Catalogs class is very similar to what we have currently in Impala.
>>
>>> 'hadoop.catalog' LOCATION and catalog_location
>>> In Impala we don't allow setting LOCATION for tables stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg table's actual location. We were also thinking about using only the table LOCATION and setting it to the catalog location, but we found that confusing too.
>>
>> It could definitely work, but it is somewhat strange that we have an external table location set to an arbitrary path, and we have a different location generated by other configs. It would be nice to have the real location set in the external table location as well.
>>
>> Impala sets the real Iceberg table location for external tables. E.g., if the user issues:
>>
>>   CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>   STORED AS ICEBERG
>>   TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>                 'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>                 'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>
>> If the end user had specified LOCATION, then Impala would have raised an error. But the above DDL statement is correct, so Impala loads the Iceberg table via the Iceberg API, then creates the HMS table and sets LOCATION to the Iceberg table location (something like /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>
>> I like the flexibility of setting the table_identifier at table level, which could help resolve naming conflicts. We might want to have this in the Iceberg Catalog implementation.
>>
>>> 'iceberg.table_identifier' for HiveCatalog
>>> Yeah, it doesn't add much if we only allow using the current HMS. I think it is only useful if we allow external HMSes.
>>>
>>> Moving properties to SERDEPROPERTIES
>>> I see that these properties are used by the SerDe classes in Hive, but I feel that these properties are just not about serialization and deserialization. And as I see it, the current SERDEPROPERTIES are things like 'field.delim', 'separatorChar', 'quoteChar', etc. So properties about table loading more naturally belong to TBLPROPERTIES in my opinion.
>>
>> I have seen it used both ways for HBaseSerDe (even the wiki page uses both :) ). Since Impala prefers TBLPROPERTIES, and if we start using a prefix for separating real Iceberg table properties from other properties, then we can keep it in TBLPROPERTIES.
>>
>> In the google doc I also had a comment about prefixing Iceberg table properties. We could use a prefix like 'iceberg.tblproperties.', and pass every property with this prefix to the Iceberg table. Currently Impala passes every table property to the Iceberg table. (A sketch of this follows below.)
>>
>>> Thanks,
>>> Zoltan
>>>
>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>> Hi,
>>>
>>> Based on the discussion below I understand we have the following kinds of properties:
>>> - Iceberg table properties - engine independent, storage related parameters.
>>> - "How to get to" properties - I think these are mostly Hive table specific, since for Spark, the Spark catalog configuration serves the same purpose. I think the best place for storing these would be the Hive SERDEPROPERTIES, as this describes the access information for the SerDe. Sidenote: I think we should decide whether we allow HiveCatalogs pointing to a different HMS; the 'iceberg.table_identifier' would make sense only if we allow having multiple catalogs.
>>> - Query specific properties - These are engine specific and might be mapped to / even override the Iceberg table properties on the engine specific code paths, but currently these properties have independent names and are mapped on a case-by-case basis.
>>>
>>> Based on this:
>>> - Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>> - Shall we define a prefix for setting Iceberg table properties from Hive queries, while omitting other engine specific properties?
>>>
>>> Thanks,
>>> Peter
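>>>
>>> (For illustration, a hedged sketch of the prefix idea from the questions above - the 'iceberg.tblproperties.' prefix is one candidate from the google doc, not a settled convention, and the individual keys are just examples:)
>>>
>>>   -- keys under the prefix would be stripped of it and pushed to the Iceberg table;
>>>   -- unprefixed keys would stay Hive-only:
>>>   CREATE EXTERNAL TABLE db.ice_t STORED AS ICEBERG
>>>   TBLPROPERTIES('iceberg.tblproperties.write.format.default'='orc',
>>>                 'iceberg.tblproperties.read.split.target-size'='134217728',
>>>                 'some.hive.only.property'='x');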
>>>> On Nov 27, 2020, at 17:45, Mass Dosage <massdos...@gmail.com> wrote:
>>>>
>>>> I like these suggestions, comments inline below on the last round...
>>>>
>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <borokna...@apache.org> wrote:
>>>> Hi,
>>>>
>>>> The above aligns with what we did in Impala, i.e. we store information about table loading in HMS table properties. We are just a bit more explicit about which catalog to use. We have the table property 'iceberg.catalog' to determine the catalog type; right now the supported values are 'hadoop.tables', 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be set based on the catalog type.
>>>>
>>>> So, if the value of 'iceberg.catalog' is
>>>>
>>>> I'm all for renaming this, having "mr" in the property name is confusing.
>>>>
>>>> hadoop.tables
>>>> - The table location is used to load the table.
>>>> The only question I have is: should we have this as the default? I.e. if you don't set a catalog, it will assume HadoopTables and use the location? Or should we require this property to be present, to be consistent and avoid any "magic"?
>>>>
>>>> hadoop.catalog
>>>> - Required table property 'iceberg.catalog_location' specifies the location of the hadoop catalog in the file system.
>>>> - Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
>>>> I like this as it would allow you to use a different database and table name in Hive as opposed to the Hadoop catalog - at the moment they have to match. The only thing here is that I think Hive requires a table LOCATION to be set, and it's then confusing as there are now two locations on the table. I'm not sure whether in the Hive storage handler or SerDe etc. we can get Hive to not require that, and maybe even disallow it from being set. That would probably be best in conjunction with this. Another solution would be to not have the 'iceberg.catalog_location' property but instead use the table LOCATION for this, but that's a bit confusing from a Hive point of view.
>>>>
>>>> hive.catalog
>>>> - Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
>>>> - We have the assumption that the current Hive metastore stores the table, i.e. we don't support external Hive metastores currently.
>>>> These sound fine for Hive catalog tables that are created outside of the automatic Hive table creation (see https://iceberg.apache.org/hive/ -> Using Hive Catalog); we'd just need to document how you can create these yourself, and that one could use a different Hive database and table etc.
>>>>
>>>> Independent of catalog implementations, we also have the table property 'iceberg.file_format' to specify the file format for the data files.
>>>>
>>>> OK, I don't think we need that for Hive?
>>>>
>>>> We haven't released it yet, so we are open to changes, but I think these properties are reasonable and it would be great if we could standardize the properties across engines that use HMS as the primary metastore of tables.
>>>>
>>>> If others agree I think we should create an issue where we document the above changes so it's very clear what we're doing, and we can then go and implement them and update the docs etc. (Hedged sketches of the three variants follow below.)
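>>>>
>>>> (For illustration, hedged sketches of the three variants as Impala-style DDL, based on the property names above - exact syntax and defaults may differ per engine:)
>>>>
>>>>   -- hadoop.tables: the table LOCATION is the Iceberg table itself
>>>>   CREATE EXTERNAL TABLE db.t1 STORED AS ICEBERG
>>>>   LOCATION '/warehouse/db/t1'
>>>>   TBLPROPERTIES('iceberg.catalog'='hadoop.tables');
>>>>
>>>>   -- hadoop.catalog: catalog location plus an optional identifier
>>>>   CREATE EXTERNAL TABLE db.t2 STORED AS ICEBERG
>>>>   TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>                 'iceberg.catalog_location'='/warehouse/hadoop_catalog',
>>>>                 'iceberg.table_identifier'='ns1.ns2.t2');
>>>>
>>>>   -- hive.catalog: the current HMS is the catalog; the identifier
>>>>   -- defaults to <database_name>.<table_name>
>>>>   CREATE EXTERNAL TABLE db.t3 STORED AS ICEBERG
>>>>   TBLPROPERTIES('iceberg.catalog'='hive.catalog');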
>>>> Cheers,
>>>> Zoltan
>>>>
>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>> Yes, I think that is a good summary of the principles.
>>>>
>>>> #4 is correct because we provide some information that is informational (the Hive schema) or tracked only by the metastore (best-effort current user). I also agree that it would be good to have a table identifier in HMS table metadata when loading from an external table. That gives us a way to handle name conflicts.
>>>>
>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>> Minor error, my last example should have been:
>>>>
>>>>   db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>>>>
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
>>>>
>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>> I agree with Ryan on the core principles here. As I understand them:
>>>> 1. Iceberg metadata describes all properties of a table.
>>>> 2. Hive table properties describe "how to get to" the Iceberg metadata (which catalog + possibly a ptr, path, token, etc).
>>>> 3. There could be default "how to get to" information set at a global level.
>>>> 4. A best-effort schema should be stored in the table properties in HMS. This should be done for information schema retrieval purposes within Hive, but should be ignored during Hive/other tool execution.
>>>> Is that a fair summary of your statements, Ryan (except 4, which I just added)?
>>>>
>>>> One comment I have on #2 is that for different catalogs and use cases, it can be somewhat more complex: it would be desirable for a table that initially existed without Hive, and was later exposed in Hive, to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1). If you then want to expose that table to Hive, you might have this mapping for #2:
>>>>
>>>>   db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>
>>>> Similarly, you might want to expose a particular branch version of a table. So it might say:
>>>>
>>>>   db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>
>>>> Just saying that the address to the table in the catalog could itself have several properties (a hypothetical sketch follows below). The key being that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.
>>>>
>>>> Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?
>>>>
>>>> --
>>>> Jacques Nadeau
>>>> CTO and Co-Founder, Dremio
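>>>>
>>>> (For illustration, a hypothetical sketch of how such a Nessie mapping might be registered as an HMS table entry - every property key here apart from 'iceberg.catalog' is invented for the example:)
>>>>
>>>>   CREATE EXTERNAL TABLE db1.table1_etl_branch STORED AS ICEBERG
>>>>   TBLPROPERTIES('iceberg.catalog'='nessie',
>>>>                 'iceberg.table_identifier'='folder1.folder2.folder3.table1',
>>>>                 'iceberg.table_ref'='etl_branch');  -- hypothetical branch pointer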
>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>> Thanks for working on this, Laszlo. I've been thinking about these problems as well, so this is a good time to have a discussion about Hive config.
>>>>
>>>> I think that Hive configuration should work mostly like other engines, where different configurations are used for different purposes. Different purposes means that there is not a global configuration priority. Hopefully, I can explain how we use the different config sources elsewhere to clarify.
>>>>
>>>> Let's take Spark as an example. Spark uses Hadoop, so it has a Hadoop Configuration, but it also has its own global configuration. There are also Iceberg table properties, and all of the various Hive properties if you're tracking tables with a Hive MetaStore.
>>>>
>>>> The first step is to simplify where we can, so we effectively eliminate 2 sources of config:
>>>> - The Hadoop Configuration is only used to instantiate Hadoop classes, like FileSystem. Iceberg should not use it for any other config.
>>>> - Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it's abstract), but it is available so they also don't fail trying to load the class. Table-specific config should not be stored in table or serde properties.
>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>
>>>> Iceberg differs from other tables because it is opinionated: data configuration should be maintained at the table level. This is cleaner for users because config is standardized across engines and kept in one place. And it also enables services that analyze a table and update its configuration, to tune options that users almost never do, like row group or stripe size in the columnar formats. Iceberg table configuration is used to configure table-specific concerns and behavior.
>>>>
>>>> Spark configuration is used for engine-specific concerns and runtime overrides. A good example of an engine-specific concern is the catalogs that are available to load Iceberg tables. Spark has a way to load and configure catalog implementations, and Iceberg uses that for all catalog-level config. Runtime overrides are things like target split size. Iceberg has a table-level default split size in table properties, but this can be overridden by a Spark option for each table, as well as an option passed to the individual read. Note that these necessarily have different config names for how they are used: Iceberg uses read.split.target-size and the read-specific option is target-size.
>>>>
>>>> Applying this to Hive is a little strange for a couple of reasons. First, Hive's engine configuration is a Hadoop Configuration. As a result, I think the right place to store engine-specific config is there, including Iceberg catalogs, using a strategy similar to what Spark does: what external Iceberg catalogs are available and their configuration should come from the HiveConf.
>>>>
>>>> The second way Hive is strange is that Hive needs to use its own MetaStore to track Hive table concerns. The MetaStore may have tables created by an Iceberg HiveCatalog, and Hive also needs to be able to load tables from other Iceberg catalogs by creating table entries for them.
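>>>>
>>>> (For illustration, a hedged sketch of what such a table entry might look like - the catalog name and property keys are illustrative, anticipating the proposal below:)
>>>>
>>>>   -- an HMS entry that merely points at a table in an external Iceberg catalog
>>>>   CREATE EXTERNAL TABLE db.events
>>>>   STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
>>>>   TBLPROPERTIES('iceberg.catalog'='lake',
>>>>                 'iceberg.table_identifier'='ns.events');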
>>>> Here's how I think Hive should work:
>>>> - There should be a default HiveCatalog that uses the current MetaStore URI, to be used for HiveCatalog tables tracked in the MetaStore.
>>>> - Other catalogs should be defined in HiveConf.
>>>> - HMS table properties should be used to determine how to load a table: using a Hadoop location, using the default metastore catalog, or using an external Iceberg catalog:
>>>>   - If there is a metadata_location, then use the HiveCatalog for this metastore (where the table is tracked).
>>>>   - If there is a catalog property, then load that catalog and use it to load the table identifier, or maybe an identifier from HMS table properties.
>>>>   - If there is no catalog or metadata_location, then use HadoopTables to load the table location as an Iceberg table.
>>>> This would make it possible to access all types of Iceberg tables in the same query, and would match how Spark and Flink configure catalogs. Other than the configuration above, I don't think that config in HMS should be used at all, like how the other engines work. Iceberg is the source of truth for table metadata, HMS stores how to load the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>>>>
>>>> This isn't quite how configuration works right now. Currently, the catalog is controlled by a HiveConf property, iceberg.mr.catalog. If that isn't set, HadoopTables will be used to load table locations. If it is set, then that catalog will be used to load all tables by name. This makes it impossible to load tables from different catalogs at the same time. That's why I think the Iceberg catalog for a table should be stored in HMS table properties.
>>>>
>>>> I should also explain the iceberg.hive.engine.enabled flag, but I think this is long enough for now.
>>>>
>>>> rb
>>>>
>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpin...@cloudera.com.invalid> wrote:
>>>> Hi All,
>>>>
>>>> I would like to start a discussion about how we should handle properties from various sources like Iceberg, Hive, or global configuration. I've put together a short document <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>; please have a look and let me know what you think.
>>>>
>>>> Thanks,
>>>> Laszlo
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix