Hi,

Based on the discussion below I understand we have the following kinds of properties:

- Iceberg table properties - engine-independent, storage-related parameters
- "How to get to" properties - I think these are mostly Hive table specific, since for Spark the Spark catalog configuration serves the same purpose. I think the best place for storing these would be the Hive SERDEPROPERTIES, as they describe the access information for the SerDe. Sidenote: I think we should decide whether we allow HiveCatalogs pointing to a different HMS; the 'iceberg.table_identifier' property would make sense only if we allow having multiple catalogs.
- Query-specific properties - These are engine specific and might be mapped to, or even override, the Iceberg table properties on the engine-specific code paths, but currently these properties have independent names and are mapped on a case-by-case basis.
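To make the placement concrete, here is a rough sketch (not a finalized proposal) of where the two kinds of HMS-side properties would end up on a metastore Table object if the "how to get to" information moved into SERDEPROPERTIES. The property keys are the ones discussed in this thread, the values are made up, and the exact names are still open:

    import org.apache.hadoop.hive.metastore.api.Table;

    public class PropertyPlacementSketch {
      static void illustrate(Table hmsTable) {
        // "How to get to" the Iceberg metadata: access information for the
        // SerDe, stored as SERDEPROPERTIES (assumes the maps are initialized).
        hmsTable.getSd().getSerdeInfo().getParameters()
            .put("iceberg.catalog", "hadoop.catalog");
        hmsTable.getSd().getSerdeInfo().getParameters()
            .put("iceberg.catalog_location", "/warehouse/iceberg_catalog");
        hmsTable.getSd().getSerdeInfo().getParameters()
            .put("iceberg.table_identifier", "db.tbl");

        // Iceberg table properties (for example write.format.default) stay in
        // the Iceberg table metadata itself; HMS TBLPROPERTIES would carry
        // informational entries only, so nothing table-specific is set here.
      }
    }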
Based on this:
- Shall we move the "how to get to" properties to SERDEPROPERTIES?
- Shall we define a prefix for setting Iceberg table properties from Hive queries, and omit other engine-specific properties?

Thanks,
Peter

> On Nov 27, 2020, at 17:45, Mass Dosage <massdos...@gmail.com> wrote:
>
> I like these suggestions, comments inline below on the last round...
>
> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <borokna...@apache.org> wrote:
> Hi,
>
> The above aligns with what we did in Impala, i.e. we store information about table loading in HMS table properties. We are just a bit more explicit about which catalog to use.
> We have table property 'iceberg.catalog' to determine the catalog type; right now the supported values are 'hadoop.tables', 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be set based on the catalog type.
>
> So, if the value of 'iceberg.catalog' is:
>
> I'm all for renaming this, having "mr" in the property name is confusing.
>
> hadoop.tables
> The table location is used to load the table.
> The only question I have is should we have this as the default? i.e. if you don't set a catalog it will assume it's HadoopTables and use the location? Or should we require this property to be here to be consistent and avoid any "magic"?
>
> hadoop.catalog
> Required table property 'iceberg.catalog_location' specifies the location of the hadoop catalog in the file system.
> Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
> I like this as it would allow you to use a different database and table name in Hive as opposed to the Hadoop Catalog - at the moment they have to match. The only thing here is that I think Hive requires a table LOCATION to be set, and it's then confusing as there are now two locations on the table. I'm not sure whether in the Hive storage handler or SerDe etc. we can get Hive to not require that and maybe even disallow it from being set. That would probably be best in conjunction with this. Another solution would be to not have the 'iceberg.catalog_location' property but instead use the table LOCATION for this, but that's a bit confusing from a Hive point of view.
>
> hive.catalog
> Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
> We have the assumption that the current Hive metastore stores the table, i.e. we don't support external Hive metastores currently.
> These sound fine for Hive catalog tables that are created outside of the automatic Hive table creation (see https://iceberg.apache.org/hive/ -> Using Hive Catalog); we'd just need to document how you can create these yourself and that one could use a different Hive database and table etc.
>
> Independent of catalog implementations, we also have table property 'iceberg.file_format' to specify the file format for the data files.
>
> OK, I don't think we need that for Hive?
>
> We haven't released it yet, so we are open to changes, but I think these properties are reasonable and it would be great if we could standardize the properties across engines that use HMS as the primary metastore of tables.
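For illustration only, the sketch below shows how an engine might resolve the properties Zoltán lists into an Iceberg table load. The property names come from his mail, defaulting to 'hadoop.tables' is exactly the open question raised above, and the catalog constructors match older Iceberg releases, so treat this as a sketch rather than the actual Impala behaviour:

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hadoop.HadoopCatalog;
    import org.apache.iceberg.hadoop.HadoopTables;
    import org.apache.iceberg.hive.HiveCatalog;

    public class CatalogPropertyResolutionSketch {
      static Table resolve(Configuration conf, Map<String, String> props,
                           String hiveDb, String hiveTable, String tableLocation) {
        // Fall back to <database_name>.<table_name> when no identifier is set.
        String id = props.getOrDefault("iceberg.table_identifier", hiveDb + "." + hiveTable);
        // Whether 'hadoop.tables' should be the implicit default is still open.
        String catalog = props.getOrDefault("iceberg.catalog", "hadoop.tables");
        switch (catalog) {
          case "hadoop.tables":
            // The table location is used to load the table.
            return new HadoopTables(conf).load(tableLocation);
          case "hadoop.catalog":
            // 'iceberg.catalog_location' points at the Hadoop catalog root.
            return new HadoopCatalog(conf, props.get("iceberg.catalog_location"))
                .loadTable(TableIdentifier.parse(id));
          case "hive.catalog":
            // Assumes the current Hive metastore stores the table.
            return new HiveCatalog(conf).loadTable(TableIdentifier.parse(id));
          default:
            throw new IllegalArgumentException("Unsupported iceberg.catalog: " + catalog);
        }
      }
    }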
>
> If others agree I think we should create an issue where we document the above changes so it's very clear what we're doing and can then go and implement them and update the docs etc.
>
> Cheers,
> Zoltan
>
> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
> Yes, I think that is a good summary of the principles.
>
> #4 is correct because we provide some information that is informational (Hive schema) or tracked only by the metastore (best-effort current user). I also agree that it would be good to have a table identifier in HMS table metadata when loading from an external table. That gives us a way to handle name conflicts.
>
> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacq...@dremio.com> wrote:
> Minor error, my last example should have been:
>
> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com> wrote:
> I agree with Ryan on the core principles here. As I understand them:
> 1. Iceberg metadata describes all properties of a table.
> 2. Hive table properties describe "how to get to" Iceberg metadata (which catalog + possibly ptr, path, token, etc).
> 3. There could be default "how to get to" information set at a global level.
> 4. Best-effort schema should be stored in the table properties in HMS. This should be done for information schema retrieval purposes within Hive but should be ignored during Hive/other tool execution.
> Is that a fair summary of your statements Ryan (except 4, which I just added)?
>
> One comment I have on #2 is that for different catalogs and use cases, I think it can be somewhat more complex: it would be desirable for a table that initially existed without Hive and was later exposed in Hive to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1). If you then want to expose that table to Hive, you might have this mapping for #2:
>
> db1.table1 => nessie:folder1.folder2.folder3.table1
>
> Similarly, you might want to expose a particular branch version of a table. So it might say:
>
> db1.table1_etl_branch => nessie.folder1@etl_branch
>
> Just saying that the address to the table in the catalog could itself have several properties. The key being that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.
>
> Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
> Thanks for working on this, Laszlo. I’ve been thinking about these problems as well, so this is a good time to have a discussion about Hive config.
>
> I think that Hive configuration should work mostly like other engines, where different configurations are used for different purposes. Different purposes means that there is not a global configuration priority. Hopefully, I can explain how we use the different config sources elsewhere to clarify.
>
> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop Configuration, but it also has its own global configuration.
> There are also Iceberg table properties, and all of the various Hive properties if you’re tracking tables with a Hive MetaStore.
>
> The first step is to simplify where we can, so we effectively eliminate 2 sources of config:
>
> 1. The Hadoop Configuration is only used to instantiate Hadoop classes, like FileSystem. Iceberg should not use it for any other config.
> 2. Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it’s abstract), but it is available so they also don’t fail trying to load the class. Table-specific config should not be stored in table or serde properties.
>
> That leaves Spark configuration and Iceberg table configuration.
>
> Iceberg differs from other tables because it is opinionated: data configuration should be maintained at the table level. This is cleaner for users because config is standardized across engines and kept in one place. It also enables services that analyze a table and update its configuration to tune options that users almost never touch, like row group or stripe size in the columnar formats. Iceberg table configuration is used to configure table-specific concerns and behavior.
>
> Spark configuration is used for engine-specific concerns and runtime overrides. A good example of an engine-specific concern is the catalogs that are available to load Iceberg tables. Spark has a way to load and configure catalog implementations, and Iceberg uses that for all catalog-level config. Runtime overrides are things like target split size. Iceberg has a table-level default split size in table properties, but this can be overridden by a Spark option for each table, as well as an option passed to the individual read. Note that these necessarily have different config names for how they are used: Iceberg uses read.split.target-size and the read-specific option is target-size.
>
> Applying this to Hive is a little strange for a couple of reasons. First, Hive’s engine configuration is a Hadoop Configuration. As a result, I think the right place to store engine-specific config is there, including Iceberg catalogs, using a strategy similar to what Spark does: which external Iceberg catalogs are available and their configuration should come from the HiveConf.
>
> The second way Hive is strange is that Hive needs to use its own MetaStore to track Hive table concerns. The MetaStore may have tables created by an Iceberg HiveCatalog, and Hive also needs to be able to load tables from other Iceberg catalogs by creating table entries for them.
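As a purely hypothetical illustration of catalogs being defined in HiveConf, mirroring the Spark approach described above, something like the following could work; the "iceberg.catalog.<name>.*" keys are invented for this sketch and are not an agreed naming scheme:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.hadoop.HadoopCatalog;
    import org.apache.iceberg.hive.HiveCatalog;

    public class CatalogFromHiveConfSketch {
      static Catalog loadNamedCatalog(HiveConf conf, String name) {
        // Hypothetical key naming; only the general idea (engine-level catalog
        // definitions living in HiveConf) is what the mail above proposes.
        String type = conf.get("iceberg.catalog." + name + ".type");
        if ("hadoop".equals(type)) {
          // Hadoop catalog rooted at a warehouse location taken from HiveConf.
          return new HadoopCatalog(conf, conf.get("iceberg.catalog." + name + ".warehouse"));
        }
        if ("hive".equals(type)) {
          // Hive catalog backed by the metastore URI configured in HiveConf.
          return new HiveCatalog(conf);
        }
        throw new IllegalArgumentException("Unknown catalog type for " + name);
      }
    }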
>
> Here’s how I think Hive should work:
>
> - There should be a default HiveCatalog that uses the current MetaStore URI, to be used for HiveCatalog tables tracked in the MetaStore.
> - Other catalogs should be defined in HiveConf.
> - HMS table properties should be used to determine how to load a table: using a Hadoop location, using the default metastore catalog, or using an external Iceberg catalog.
>   - If there is a metadata_location, then use the HiveCatalog for this metastore (where it is tracked).
>   - If there is a catalog property, then load that catalog and use it to load the table identifier, or maybe an identifier from HMS table properties.
>   - If there is no catalog or metadata_location, then use HadoopTables to load the table location as an Iceberg table.
>
> This would make it possible to access all types of Iceberg tables in the same query, and would match how Spark and Flink configure catalogs. Other than the configuration above, I don’t think that config in HMS should be used at all, like how the other engines work. Iceberg is the source of truth for table metadata, HMS stores how to load the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>
> This isn’t quite how configuration works right now. Currently, the catalog is controlled by a HiveConf property, iceberg.mr.catalog. If that isn’t set, HadoopTables will be used to load table locations. If it is set, then that catalog will be used to load all tables by name. This makes it impossible to load tables from different catalogs at the same time. That’s why I think the Iceberg catalog for a table should be stored in HMS table properties.
>
> I should also explain the iceberg.hive.engine.enabled flag, but I think this is long enough for now.
>
> rb
>
> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpin...@cloudera.com.invalid> wrote:
> Hi All,
>
> I would like to start a discussion about how we should handle properties from various sources like Iceberg, Hive or global configuration. I've put together a short document <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>, please have a look and let me know what you think.
>
> Thanks,
> Laszlo
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix
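For reference, the per-table loading order proposed in Ryan's mail above could be sketched roughly as follows. The metadata_location property is the one mentioned above; the literal "catalog" key and the two abstract helper methods are placeholders, since neither the property name nor the catalog-loading mechanism is settled in this thread:

    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hadoop.HadoopTables;

    public abstract class HiveTableLoadingSketch {
      // Placeholders for "the default HiveCatalog for this metastore" and
      // "a catalog defined in HiveConf"; these are not existing APIs.
      abstract Catalog defaultHiveCatalog(Configuration conf);
      abstract Catalog externalCatalog(Configuration conf, String catalogName);

      Table load(Configuration conf, Map<String, String> hmsProps,
                 TableIdentifier id, String location) {
        if (hmsProps.containsKey("metadata_location")) {
          // Tracked by this metastore: load through the default HiveCatalog.
          return defaultHiveCatalog(conf).loadTable(id);
        }
        if (hmsProps.containsKey("catalog")) {
          // Tracked by an external Iceberg catalog defined in HiveConf.
          return externalCatalog(conf, hmsProps.get("catalog")).loadTable(id);
        }
        // Neither set: treat the table LOCATION as a HadoopTables table.
        return new HadoopTables(conf).load(location);
      }
    }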