This sounds like a good plan overall, but I have a couple of notes:

   1. We need to keep in mind that users plug in their own catalogs, so
   iceberg.catalog could be a Glue or Nessie catalog, not just Hive or
   Hadoop. I don’t think it makes much sense to use separate hadoop.catalog
   and hive.catalog values. Those should just be names for catalogs
   configured in the Configuration, i.e., via hive-site.xml (see the sketch
   after this list). We then only need a special value for loading Hadoop
   tables from paths.
   2. I don’t think that catalog configuration should be kept in table
   properties. A catalog should not be loaded for each table. So I don’t think
   we need iceberg.catalog_location. Instead, we should have a way to
   define catalogs in the Configuration for tables in the metastore to
   reference.
   3. I’d rather use a prefix to exclude properties from being passed to
   Iceberg than to include them. Otherwise, users don’t know how to pass
   table properties from Hive or Impala. If we exclude a prefix or
   specific properties, then everything except the properties reserved for
   locating the table is passed, as the user would expect.
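
To make (1) concrete, here is a rough sketch of what the named-catalog
approach could look like (the property names are hypothetical, only meant
to show the shape of the idea; Hive SQL SET syntax used for brevity):

-- defined once in hive-site.xml / HiveConf, not per table:
SET iceberg.catalog.prod_hive.type=hive;
SET iceberg.catalog.prod_hive.uri=thrift://metastore-host:9083;
SET iceberg.catalog.lake.type=hadoop;
SET iceberg.catalog.lake.warehouse=hdfs://namenode:8020/warehouse;

-- a table in HMS then only names its catalog:
ALTER TABLE db.t SET TBLPROPERTIES ('iceberg.catalog'='lake');

-- plus one special value for loading Hadoop tables from paths, e.g.:
ALTER TABLE db.t2 SET TBLPROPERTIES ('iceberg.catalog'='hadoop.tables');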


On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <borokna...@apache.org>
wrote:

> Thanks, Peter. I answered inline.
>
> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid>
> wrote:
>
>> Hi Zoltan,
>>
>> Answers below:
>>
>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>> borokna...@cloudera.com.INVALID> wrote:
>>
>> Hi,
>>
>> Thanks for the replies. My take for the above questions are as follows
>>
>>    - Should 'iceberg.catalog' be a required property?
>>       - Yeah, I think it would be nice if this were required, to avoid
>>       any implicit behavior
>>
>> Currently we have a Catalogs class to get/initialize/use the different
>> Catalogs. At that time the decision was to use HadoopTables as a default
>> catalog.
>> It might be worthwhile to use the same class in Impala as well, so the
>> behavior is consistent.
>>
>
> Yeah, I think it'd be beneficial for us to use the Iceberg classes
> whenever possible. The Catalogs class is very similar to what we have
> currently in Impala.
>
>>
>>    - 'hadoop.catalog' LOCATION and catalog_location
>>       - In Impala we don't allow setting LOCATION for tables stored in
>>       'hadoop.catalog'. But Impala internally sets LOCATION to the
>>       Iceberg table's actual location. We were also thinking about using
>>       only the table LOCATION and setting it to the catalog location,
>>       but we found that confusing as well.
>>
>> It could definitely work, but it is somewhat strange to have an
>> external table location set to an arbitrary path while a different
>> location is generated from other configs. It would be nice to have the
>> real location set as the external table location as well.
>>
>
> Impala sets the real Iceberg table location for external tables. E.g. if
> the user issues
>
> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
> STORED AS ICEBERG
> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>
> If the end user had specified LOCATION, Impala would have raised an
> error. But the above DDL statement is correct, so Impala loads the
> Iceberg table via the Iceberg API, then creates the HMS table and sets
> LOCATION to the Iceberg table's location (something like
> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>
>
>> I like the flexibility of setting the table_identifier at the table
>> level, which could help remove naming conflicts. We might want to have
>> this in the Iceberg Catalog implementation.
>>
>>
>>    - 'iceberg.table_identifier' for HiveCatalog
>>       - Yeah, it doesn't add much if we only allow using the current
>>       HMS. I think it is only useful if we allow external HMSes.
>>    - Moving properties to SERDEPROPERTIES
>>       - I see that these properties are used by the SerDe classes in
>>       Hive, but I feel that these properties are just not about
>>       serialization and deserialization. As I see it, the current
>>       SERDEPROPERTIES are things like 'field.delim', 'separatorChar',
>>       'quoteChar', etc., so properties about table loading more
>>       naturally belong in TBLPROPERTIES, in my opinion.
>>
>> I have seen it used both ways for HBaseSerDe (even the wiki page uses
>> both :) ). Since Impala prefers TBLPROPERTIES, if we start using a
>> prefix to separate real Iceberg table properties from other properties,
>> then we can keep them in TBLPROPERTIES.
>>
>
> In the Google doc I also had a comment about prefixing Iceberg table
> properties. We could use a prefix like 'iceberg.tblproperties.' and pass
> every property with this prefix to the Iceberg table. Currently Impala
> passes every table property to the Iceberg table.
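>
> For illustration, a hypothetical DDL using such a prefix (the prefix name
> is still up for discussion; 'write.format.default' is a standard Iceberg
> table property):
>
> CREATE EXTERNAL TABLE my_db.ice_t
> STORED AS ICEBERG
> TBLPROPERTIES('iceberg.catalog'='hadoop.tables',
>               'iceberg.tblproperties.write.format.default'='orc');
>
> Only 'write.format.default'='orc' would be passed through to the Iceberg
> table; the 'iceberg.*' locator properties would stay in HMS.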
>
>
>>
>> Thanks,
>>     Zoltan
>>
>>
>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid>
>> wrote:
>>
>>> Hi,
>>>
>>> Based on the discussion below I understand we have the following kinds
>>> of properties:
>>>
>>>    1. Iceberg table properties - Engine-independent, storage-related
>>>    parameters
>>>    2. "how to get to" - I think these are mostly Hive-table-specific
>>>    properties, since for Spark, the Spark catalog configuration serves
>>>    the same purpose. I think the best place for storing these would be
>>>    the Hive SERDEPROPERTIES, as this describes the access information
>>>    for the SerDe. Sidenote: I think we should decide whether we allow
>>>    HiveCatalogs pointing to a different HMS; the
>>>    'iceberg.table_identifier' would make sense only if we allow having
>>>    multiple catalogs.
>>>    3. Query-specific properties - These are engine specific and might
>>>    be mapped to / even override the Iceberg table properties on the
>>>    engine-specific code paths, but currently these properties have
>>>    independent names and are mapped on a case-by-case basis.
>>>
>>>
>>> Based on this:
>>>
>>>    - Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>>    - Shall we define a prefix for setting Iceberg table properties
>>>    from Hive queries, and omit other engine-specific properties?
>>>
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>>> On Nov 27, 2020, at 17:45, Mass Dosage <massdos...@gmail.com> wrote:
>>>
>>> I like these suggestions, comments inline below on the last round...
>>>
>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <borokna...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> The above aligns with what we did in Impala, i.e. we store information
>>>> about table loading in HMS table properties. We are just a bit more
>>>> explicit about which catalog to use.
>>>> We have the table property 'iceberg.catalog' to determine the catalog
>>>> type; right now the supported values are 'hadoop.tables',
>>>> 'hadoop.catalog', and 'hive.catalog'. Additional table properties can
>>>> be set based on the catalog type.
>>>>
>>>> So, if the value of 'iceberg.catalog' is
>>>>
>>>
>>> I'm all for renaming this; having "mr" in the property name is confusing.
>>>
>>>
>>>>
>>>>    - hadoop.tables
>>>>       - the table location is used to load the table
>>>>
>>> The only question I have is: should this be the default? I.e., if you
>>> don't set a catalog, will it assume it's HadoopTables and use the
>>> location? Or should we require this property to be present, to be
>>> consistent and avoid any "magic"?
>>>
>>>
>>>>
>>>>    - hadoop.catalog
>>>>       - Required table property 'iceberg.catalog_location' specifies
>>>>       the location of the hadoop catalog in the file system
>>>>       - Optional table property 'iceberg.table_identifier' specifies
>>>>       the table id. If it's not set, then <database_name>.<table_name>
>>>>       is used as the table identifier
>>>>
>>>> I like this as it would allow you to use a different database and table
>>> name in Hive as opposed to the Hadoop Catalog - at the moment they have to
>>> match. The only thing here is that I think Hive requires a table LOCATION
>>> to be set and it's then confusing as there are now two locations on the
>>> table. I'm not sure whether in the Hive storage handler or SerDe etc. we
>>> can get Hive to not require that and maybe even disallow it from being set.
>>> That would probably be best in conjunction with this. Another solution
>>> would be to not have the 'iceberg.catalog_location' property but instead
>>> use the table LOCATION for this but that's a bit confusing from a Hive
>>> point of view.
>>>
>>>
>>>>    - hive.catalog
>>>>       - Optional table property 'iceberg.table_identifier' specifies
>>>>       the table id. If it's not set, then <database_name>.<table_name>
>>>>       is used as the table identifier
>>>>       - We assume that the current Hive metastore stores the table,
>>>>       i.e. we don't support external Hive metastores currently
>>>>
>>> These sound fine for Hive catalog tables that are created outside of
>>> the automatic Hive table creation (see https://iceberg.apache.org/hive/
>>> -> Using Hive Catalog); we'd just need to document how you can create
>>> these yourself, and that one could use a different Hive database and
>>> table name etc.
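>>>
>>> As a sketch of what that documentation could show (property names as
>>> proposed earlier in this thread, not final), exposing an existing Hive
>>> catalog table under a different name might look like:
>>>
>>> CREATE EXTERNAL TABLE other_db.other_name
>>> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
>>> TBLPROPERTIES('iceberg.catalog'='hive.catalog',
>>>               'iceberg.table_identifier'='orig_db.orig_table');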
>>>
>>>
>>>> Independent of the catalog implementation, we also have the table
>>>> property 'iceberg.file_format' to specify the file format for the data
>>>> files.
>>>>
>>>
>>> OK, I don't think we need that for Hive?
>>>
>>>
>>>> We haven't released it yet, so we are open to changes, but I think
>>>> these properties are reasonable and it would be great if we could
>>>> standardize the properties across engines that use HMS as the primary
>>>> metastore of tables.
>>>>
>>>>
>>> If others agree, I think we should create an issue where we document
>>> the above changes so it's very clear what we're doing, and we can then
>>> go and implement them and update the docs etc.
>>>
>>>
>>>> Cheers,
>>>>     Zoltan
>>>>
>>>>
>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> Yes, I think that is a good summary of the principles.
>>>>>
>>>>> #4 is correct, because some of the information we provide is purely
>>>>> informational (the Hive schema) or tracked only by the metastore (the
>>>>> best-effort current user). I also agree that it would be good to have
>>>>> a table identifier in HMS table metadata when loading from an external
>>>>> table. That gives us a way to handle name conflicts.
>>>>>
>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacq...@dremio.com>
>>>>> wrote:
>>>>>
>>>>>> Minor error: my last example should have been:
>>>>>>
>>>>>> db1.table1_etl_branch =>
>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>
>>>>>> --
>>>>>> Jacques Nadeau
>>>>>> CTO and Co-Founder, Dremio
>>>>>>
>>>>>>
>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I agree with Ryan on the core principles here. As I understand them:
>>>>>>>
>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>    2. Hive table properties describe "how to get to" Iceberg
>>>>>>>    metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>    3. There could be default "how to get to" information set at a
>>>>>>>    global level
>>>>>>>    4. Best-effort schema should be stored in the table properties
>>>>>>>    in HMS. This should be done for information schema retrieval
>>>>>>>    purposes within Hive but should be ignored during Hive/other tool
>>>>>>>    execution.
>>>>>>>
>>>>>>> Is that a fair summary of your statements, Ryan (except #4, which I
>>>>>>> just added)?
>>>>>>>
>>>>>>> One comment I have on #2 is that for different catalogs and use
>>>>>>> cases it can be somewhat more complex: it would be desirable for a
>>>>>>> table that initially existed without Hive, and was later exposed in
>>>>>>> Hive, to support a ptr/path/token for how the table is named
>>>>>>> externally. For example, in a Nessie context we support arbitrary
>>>>>>> paths for an Iceberg table (such as folder1.folder2.folder3.table1).
>>>>>>> If you then want to expose that table to Hive, you might have this
>>>>>>> mapping for #2:
>>>>>>>
>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>
>>>>>>> Similarly, you might want to expose a particular branch version of a
>>>>>>> table. So it might say:
>>>>>>>
>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>
>>>>>>> I'm just saying that the address of the table in the catalog could
>>>>>>> itself have several properties. The key is that no matter what those
>>>>>>> are, we should follow #1 and only store properties that are about
>>>>>>> the ptr, not the content/metadata.
>>>>>>>
>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can someone
>>>>>>> confirm that it is true? And that it is possible/not problematic?
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jacques Nadeau
>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking about
>>>>>>>> these problems as well, so this is a good time to have a discussion
>>>>>>>> about Hive config.
>>>>>>>>
>>>>>>>> I think that Hive configuration should work mostly like other
>>>>>>>> engines, where different configurations are used for different
>>>>>>>> purposes. Different purposes means that there is no single global
>>>>>>>> configuration priority. Hopefully, I can explain how we use the
>>>>>>>> different config sources elsewhere to clarify.
>>>>>>>>
>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a
>>>>>>>> Hadoop Configuration, but it also has its own global configuration.
>>>>>>>> There are also Iceberg table properties, and all of the various
>>>>>>>> Hive properties if you’re tracking tables with a Hive MetaStore.
>>>>>>>>
>>>>>>>> The first step is to simplify where we can, so we effectively
>>>>>>>> eliminate two sources of config:
>>>>>>>>
>>>>>>>>    - The Hadoop Configuration is only used to instantiate Hadoop
>>>>>>>>    classes, like FileSystem. Iceberg should not use it for any
>>>>>>>>    other config.
>>>>>>>>    - Config in the Hive MetaStore is only used to identify that a
>>>>>>>>    table is Iceberg and to point to its metadata location. All
>>>>>>>>    other config in HMS is informational. For example, the input
>>>>>>>>    format is FileInputFormat so that non-Iceberg readers cannot
>>>>>>>>    actually instantiate the format (it’s abstract), but it is
>>>>>>>>    available so they also don’t fail trying to load the class.
>>>>>>>>    Table-specific config should not be stored in table or serde
>>>>>>>>    properties.
>>>>>>>>
>>>>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>>>>>
>>>>>>>> Iceberg differs from other tables because it is opinionated: data
>>>>>>>> configuration should be maintained at the table level. This is
>>>>>>>> cleaner for users because config is standardized across engines and
>>>>>>>> kept in one place. And it also enables services that analyze a
>>>>>>>> table and update its configuration to tune options that users
>>>>>>>> almost never do, like row group or stripe size in the columnar
>>>>>>>> formats. Iceberg table configuration is used to configure
>>>>>>>> table-specific concerns and behavior.
>>>>>>>>
>>>>>>>> Spark configuration is used for engine-specific concerns and
>>>>>>>> runtime overrides. A good example of an engine-specific concern is
>>>>>>>> the catalogs that are available to load Iceberg tables. Spark has a
>>>>>>>> way to load and configure catalog implementations, and Iceberg uses
>>>>>>>> that for all catalog-level config. Runtime overrides are things
>>>>>>>> like target split size. Iceberg has a table-level default split
>>>>>>>> size in table properties, but this can be overridden by a Spark
>>>>>>>> option for each table, as well as an option passed to the
>>>>>>>> individual read. Note that these necessarily have different config
>>>>>>>> names for how they are used: Iceberg uses read.split.target-size
>>>>>>>> and the read-specific option is target-size.
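>>>>>>>>
>>>>>>>> A quick sketch of that layering (the SQL below is illustrative;
>>>>>>>> read.split.target-size is the real Iceberg table property):
>>>>>>>>
>>>>>>>> -- table-level default, kept in Iceberg table properties:
>>>>>>>> ALTER TABLE db.t
>>>>>>>> SET TBLPROPERTIES ('read.split.target-size'='134217728');
>>>>>>>> -- a single read can still override it by passing the engine's
>>>>>>>> -- read option (target-size) without touching the table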
>>>>>>>>
>>>>>>>> Applying this to Hive is a little strange for a couple of
>>>>>>>> reasons. First, Hive’s engine configuration *is* a Hadoop
>>>>>>>> Configuration. As a result, I think the right place to store
>>>>>>>> engine-specific config is there, including Iceberg catalogs, using
>>>>>>>> a strategy similar to what Spark does: which external Iceberg
>>>>>>>> catalogs are available, and their configuration, should come from
>>>>>>>> the HiveConf.
>>>>>>>>
>>>>>>>> The second way Hive is strange is that Hive needs to use its own
>>>>>>>> MetaStore to track Hive table concerns. The MetaStore may have
>>>>>>>> tables created by an Iceberg HiveCatalog, and Hive also needs to be
>>>>>>>> able to load tables from other Iceberg catalogs by creating table
>>>>>>>> entries for them.
>>>>>>>>
>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>
>>>>>>>>    - There should be a default HiveCatalog, using the current
>>>>>>>>    MetaStore URI, for HiveCatalog tables tracked in the MetaStore
>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>    - HMS table properties should be used to determine how to load
>>>>>>>>    a table: using a Hadoop location, using the default metastore
>>>>>>>>    catalog, or using an external Iceberg catalog (see the sketch
>>>>>>>>    after this list)
>>>>>>>>       - If there is a metadata_location, then use the HiveCatalog
>>>>>>>>       for this metastore (where it is tracked)
>>>>>>>>       - If there is a catalog property, then load that catalog and
>>>>>>>>       use it to load the table identifier, or maybe an identifier
>>>>>>>>       from HMS table properties
>>>>>>>>       - If there is no catalog or metadata_location, then use
>>>>>>>>       HadoopTables to load the table location as an Iceberg table
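>>>>>>>>
>>>>>>>> A sketch of the three cases from the table side (DDL syntax and
>>>>>>>> property names are illustrative, not final):
>>>>>>>>
>>>>>>>> -- (1) created by the HiveCatalog: metadata_location is set by the
>>>>>>>> --     catalog itself, nothing for the user to declare
>>>>>>>> -- (2) external catalog defined in HiveConf, referenced by name:
>>>>>>>> CREATE EXTERNAL TABLE db.t2
>>>>>>>> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
>>>>>>>> TBLPROPERTIES ('iceberg.catalog'='prod_catalog');
>>>>>>>> -- (3) neither set: the table LOCATION is loaded via HadoopTables
>>>>>>>> CREATE EXTERNAL TABLE db.t3
>>>>>>>> STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
>>>>>>>> LOCATION '/path/to/iceberg/table';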
>>>>>>>>
>>>>>>>> This would make it possible to access all types of Iceberg tables
>>>>>>>> in the same query, and would match how Spark and Flink configure
>>>>>>>> catalogs. Other than the configuration above, I don’t think that
>>>>>>>> config in HMS should be used at all, like how the other engines
>>>>>>>> work. Iceberg is the source of truth for table metadata, HMS stores
>>>>>>>> how to load the Iceberg table, and HiveConf defines the catalogs
>>>>>>>> (or runtime overrides).
>>>>>>>>
>>>>>>>> This isn’t quite how configuration works right now. Currently, the
>>>>>>>> catalog is controlled by a HiveConf property, iceberg.mr.catalog.
>>>>>>>> If that isn’t set, HadoopTables will be used to load table
>>>>>>>> locations. If it is set, then that catalog will be used to load all
>>>>>>>> tables by name. This makes it impossible to load tables from
>>>>>>>> different catalogs at the same time. That’s why I think the Iceberg
>>>>>>>> catalog for a table should be stored in HMS table properties.
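>>>>>>>>
>>>>>>>> In other words, today the choice is global rather than per table
>>>>>>>> (sketch; the value shown is illustrative):
>>>>>>>>
>>>>>>>> -- one catalog for every Iceberg table in the session:
>>>>>>>> SET iceberg.mr.catalog=hive;
>>>>>>>> -- if unset, tables are loaded from their locations via
>>>>>>>> -- HadoopTables; the two modes cannot be mixed in one query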
>>>>>>>>
>>>>>>>> I should also explain the iceberg.hive.engine.enabled flag, but I
>>>>>>>> think this is long enough for now.
>>>>>>>>
>>>>>>>> rb
>>>>>>>>
>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>> lpin...@cloudera.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I would like to start a discussion about how we should handle
>>>>>>>>> properties from various sources like Iceberg, Hive, or the global
>>>>>>>>> configuration. I've put together a short document
>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>;
>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Laszlo
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>
>>

-- 
Ryan Blue
Software Engineer
Netflix
