Thanks, Ryan. I answered inline.

On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
> This sounds like a good plan overall, but I have a couple of notes:
>
> 1. We need to keep in mind that users plug in their own catalogs, so iceberg.catalog could be a Glue or Nessie catalog, not just Hive or Hadoop. I don't think it makes much sense to use separate hadoop.catalog and hive.catalog values. Those should just be names for catalogs configured in Configuration, i.e., via hive-site.xml. We then only need a special value for loading Hadoop tables from paths.

About extensibility, I think the usual Hive way is to use Java class names. That way the value of 'iceberg.catalog' could be e.g. 'org.apache.iceberg.hive.HiveCatalog'. Each catalog implementation would then need a factory method that constructs the catalog object from a properties object (Map<String, String>). E.g. 'org.apache.iceberg.hadoop.HadoopCatalog' would require 'iceberg.catalog_location' to be present in the properties.

> 2. I don't think that catalog configuration should be kept in table properties. A catalog should not be loaded for each table. So I don't think we need iceberg.catalog_location. Instead, we should have a way to define catalogs in the Configuration for tables in the metastore to reference.

I think that makes sense; on the other hand it would make adding new catalogs more heavyweight, i.e. you would now need to edit configuration files and restart/reinit services. That could be cumbersome in some environments.

> 3. I'd rather use a prefix to exclude properties from being passed to Iceberg than to include them. Otherwise, users don't know what to do to pass table properties from Hive or Impala. If we exclude a prefix or specific properties, then everything but the properties reserved for locating the table is passed as the user would expect.

I don't have a strong opinion about this, but yeah, this behavior would probably cause the least surprise.

> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <borokna...@apache.org> wrote:
>
>> Thanks, Peter. I answered inline.
>>
>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>
>>> Hi Zoltan,
>>>
>>> Answers below:
>>>
>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <borokna...@cloudera.com.INVALID> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for the replies. My take on the above questions is as follows:
>>>
>>> - Should 'iceberg.catalog' be a required property?
>>>   - Yeah, I think it would be nice if this were required, to avoid any implicit behavior.
>>>
>>> Currently we have a Catalogs class to get/initialize/use the different Catalogs. At that time the decision was to use HadoopTables as the default catalog. It might be worthwhile to use the same class in Impala as well, so the behavior is consistent.
>>
>> Yeah, I think it'd be beneficial for us to use the Iceberg classes whenever possible. The Catalogs class is very similar to what we currently have in Impala.
>>
>>> - 'hadoop.catalog' LOCATION and catalog_location
>>>   - In Impala we don't allow setting LOCATION for tables stored in 'hadoop.catalog', but Impala internally sets LOCATION to the Iceberg table's actual location. We were also thinking about using only the table LOCATION and setting it to the catalog location, but we found that confusing as well.
>>>
>>> It could definitely work, but it is somewhat strange that we have an external table location set to an arbitrary path, and we have a different location generated by other configs.
>>> It would be nice to have the real location set in the external table location as well.
>>
>> Impala sets the real Iceberg table location for external tables. E.g. if the user issues
>>
>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>> STORED AS ICEBERG
>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>
>> If the end user had specified LOCATION, then Impala would have raised an error. But the above DDL statement is correct, so Impala loads the Iceberg table via the Iceberg API, then creates the HMS table and sets LOCATION to the Iceberg table location (something like /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>
>>> I like the flexibility of setting the table_identifier at the table level, which could help avoid naming conflicts. We might want to have this in the Iceberg Catalog implementation.
>>>
>>> - 'iceberg.table_identifier' for HiveCatalog
>>>   - Yeah, it doesn't add much if we only allow using the current HMS. I think it is only useful if we allow external HMSes.
>>> - Moving properties to SERDEPROPERTIES
>>>   - I see that these properties are used by the SerDe classes in Hive, but I feel that these properties are just not about serialization and deserialization. The current SERDEPROPERTIES are things like 'field.delim', 'separatorChar', 'quoteChar', etc., so properties about table loading more naturally belong in TBLPROPERTIES in my opinion.
>>>
>>> I have seen it used both ways for HBaseSerDe (even the wiki page uses both :) ). Since Impala prefers TBLPROPERTIES, and if we start using a prefix to separate real Iceberg table properties from other properties, then we can keep them in TBLPROPERTIES.
>>
>> In the Google doc I also had a comment about prefixing Iceberg table properties. We could use a prefix like 'iceberg.tblproperties.' and pass every property with this prefix to the Iceberg table. Currently Impala passes every table property to the Iceberg table.
>>
>>> Thanks,
>>> Zoltan
>>>
>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>>
>>>> Hi,
>>>>
>>>> Based on the discussion below I understand we have the following kinds of properties:
>>>>
>>>> 1. Iceberg table properties - engine independent, storage related parameters.
>>>> 2. "How to get to" properties - I think these are mostly Hive table specific, since for Spark, the Spark catalog configuration serves the same purpose. I think the best place for storing these would be the Hive SERDEPROPERTIES, as these describe the access information for the SerDe. Sidenote: I think we should decide whether we allow HiveCatalogs pointing to a different HMS; the 'iceberg.table_identifier' would only make sense if we allow having multiple catalogs.
>>>> 3. Query specific properties - these are engine specific and might be mapped to / even override the Iceberg table properties on the engine specific code paths, but currently these properties have independent names and are mapped on a case-by-case basis.
>>>>
>>>> Based on this:
>>>>
>>>> - Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>>> - Shall we define a prefix for setting Iceberg table properties from Hive queries and omit other engine specific properties? (See the sketch just below.)
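To make the prefix idea a bit more concrete, here is a minimal sketch of both variants mentioned in the thread: forwarding only keys that carry an 'iceberg.tblproperties.' prefix, or forwarding everything except the reserved "how to get to" keys. The prefix, the reserved key names, and the class itself are placeholders for illustration, not an existing Iceberg or Impala API.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative only: shows how a prefix or an exclusion rule could decide which
// HMS table properties get forwarded to the Iceberg table. The prefix and the
// reserved key names are assumptions for this sketch, not an agreed convention.
public class PropertyFilterSketch {

  private static final String ICEBERG_PREFIX = "iceberg.tblproperties.";

  // Keys reserved for locating the table; these would never be forwarded.
  private static final Set<String> RESERVED_KEYS = Set.of(
      "iceberg.catalog", "iceberg.catalog_location", "iceberg.table_identifier");

  // Inclusion variant: only keys carrying the prefix are forwarded, prefix stripped.
  static Map<String, String> includeByPrefix(Map<String, String> hmsProps) {
    Map<String, String> forwarded = new HashMap<>();
    for (Map.Entry<String, String> entry : hmsProps.entrySet()) {
      if (entry.getKey().startsWith(ICEBERG_PREFIX)) {
        forwarded.put(entry.getKey().substring(ICEBERG_PREFIX.length()), entry.getValue());
      }
    }
    return forwarded;
  }

  // Exclusion variant (Ryan's preference): everything except the reserved
  // "how to get to" keys is forwarded unchanged.
  static Map<String, String> excludeReserved(Map<String, String> hmsProps) {
    Map<String, String> forwarded = new HashMap<>(hmsProps);
    forwarded.keySet().removeAll(RESERVED_KEYS);
    return forwarded;
  }
}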
>>>> Thanks,
>>>> Peter
>>>>
>>>> On Nov 27, 2020, at 17:45, Mass Dosage <massdos...@gmail.com> wrote:
>>>>
>>>> I like these suggestions, comments inline below on the last round...
>>>>
>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <borokna...@apache.org> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The above aligns with what we did in Impala, i.e. we store information about table loading in HMS table properties. We are just a bit more explicit about which catalog to use. We have the table property 'iceberg.catalog' to determine the catalog type; right now the supported values are 'hadoop.tables', 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be set based on the catalog type.
>>>>>
>>>>> So, if the value of 'iceberg.catalog' is
>>>>
>>>> I'm all for renaming this, having "mr" in the property name is confusing.
>>>>
>>>>> - hadoop.tables
>>>>>   - The table location is used to load the table.
>>>>
>>>> The only question I have is: should we have this as the default? I.e. if you don't set a catalog, it will assume it's HadoopTables and use the location? Or should we require this property to be present, to be consistent and avoid any "magic"?
>>>>
>>>>> - hadoop.catalog
>>>>>   - Required table property 'iceberg.catalog_location' specifies the location of the Hadoop catalog in the file system.
>>>>>   - Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
>>>>
>>>> I like this as it would allow you to use a different database and table name in Hive as opposed to the Hadoop catalog - at the moment they have to match. The only thing here is that I think Hive requires a table LOCATION to be set, and it's then confusing as there are now two locations on the table. I'm not sure whether in the Hive storage handler or SerDe etc. we can get Hive to not require that and maybe even disallow it from being set. That would probably be best in conjunction with this. Another solution would be to not have the 'iceberg.catalog_location' property but instead use the table LOCATION for this, but that's a bit confusing from a Hive point of view.
>>>>
>>>>> - hive.catalog
>>>>>   - Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
>>>>>   - We have the assumption that the current Hive metastore stores the table, i.e. we don't support external Hive metastores currently.
>>>>
>>>> These sound fine for Hive catalog tables that are created outside of the automatic Hive table creation (see https://iceberg.apache.org/hive/ -> Using Hive Catalog); we'd just need to document how you can create these yourself and that one could use a different Hive database and table etc.
>>>>
>>>>> Independent of the catalog implementations, we also have the table property 'iceberg.file_format' to specify the file format for the data files.
>>>>
>>>> OK, I don't think we need that for Hive?
>>>>
>>>>> We haven't released it yet, so we are open to changes, but I think these properties are reasonable and it would be great if we could standardize the properties across engines that use HMS as the primary metastore of tables.
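For illustration, a rough sketch of how an engine could dispatch on the 'iceberg.catalog' values described above. The property names follow the Impala proposal in this thread; the helper class is hypothetical and the catalog constructor calls are simplified, so they may not match every Iceberg version.

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.hive.HiveCatalog;

// Hypothetical helper: dispatches on the proposed 'iceberg.catalog' values.
public class CatalogDispatchSketch {

  static Table load(Configuration conf, String dbName, String tableName,
                    String hmsTableLocation, Map<String, String> tblProps) {
    String catalogType = tblProps.get("iceberg.catalog");
    String identifier = tblProps.getOrDefault(
        "iceberg.table_identifier", dbName + "." + tableName);

    if (catalogType == null) {
      throw new IllegalArgumentException("iceberg.catalog is required in this sketch");
    }
    switch (catalogType) {
      case "hadoop.tables":
        // The HMS table LOCATION points directly at the Iceberg table.
        return new HadoopTables(conf).load(hmsTableLocation);
      case "hadoop.catalog":
        // 'iceberg.catalog_location' is required for this catalog type.
        String catalogLocation = tblProps.get("iceberg.catalog_location");
        return new HadoopCatalog(conf, catalogLocation)
            .loadTable(TableIdentifier.parse(identifier));
      case "hive.catalog":
        // Assumes the current metastore; external HMSes are out of scope here.
        return new HiveCatalog(conf).loadTable(TableIdentifier.parse(identifier));
      default:
        throw new IllegalArgumentException("Unknown iceberg.catalog value: " + catalogType);
    }
  }
}

With properties like the ones in the CREATE EXTERNAL TABLE example earlier in the thread, the 'hadoop.catalog' branch would be taken.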
>>>> If others agree I think we should create an issue where we document the above changes so it's very clear what we're doing, and we can then go and implement them and update the docs etc.
>>>>
>>>>> Cheers,
>>>>> Zoltan
>>>>>
>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>
>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>
>>>>>> #4 is correct because we provide some information that is informational (Hive schema) or tracked only by the metastore (best-effort current user). I also agree that it would be good to have a table identifier in HMS table metadata when loading from an external table. That gives us a way to handle name conflicts.
>>>>>>
>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>
>>>>>>> Minor error, my last example should have been:
>>>>>>>
>>>>>>> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>
>>>>>>> --
>>>>>>> Jacques Nadeau
>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>
>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>>
>>>>>>>> I agree with Ryan on the core principles here. As I understand them:
>>>>>>>>
>>>>>>>> 1. Iceberg metadata describes all properties of a table.
>>>>>>>> 2. Hive table properties describe "how to get to" Iceberg metadata (which catalog + possibly ptr, path, token, etc).
>>>>>>>> 3. There could be default "how to get to" information set at a global level.
>>>>>>>> 4. A best-effort schema should be stored in the table properties in HMS. This should be done for information schema retrieval purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>
>>>>>>>> Is that a fair summary of your statements, Ryan (except 4, which I just added)?
>>>>>>>>
>>>>>>>> One comment I have on #2 is that for different catalogs and use cases, it can be somewhat more complex: it would be desirable for a table that initially existed outside Hive and was later exposed in Hive to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1). If you then want to expose that table to Hive, you might have this mapping for #2:
>>>>>>>>
>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>
>>>>>>>> Similarly, you might want to expose a particular branch version of a table. So it might say:
>>>>>>>>
>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>
>>>>>>>> Just saying that the address to the table in the catalog could itself have several properties. The key point is that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.
>>>>>>>>
>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jacques Nadeau
>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>
>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Thanks for working on this, Laszlo.
>>>>>>>>> I've been thinking about these problems as well, so this is a good time to have a discussion about Hive config.
>>>>>>>>>
>>>>>>>>> I think that Hive configuration should work mostly like other engines, where different configurations are used for different purposes. Different purposes mean that there is no single global configuration priority. Hopefully, I can explain how we use the different config sources elsewhere to clarify.
>>>>>>>>>
>>>>>>>>> Let's take Spark as an example. Spark uses Hadoop, so it has a Hadoop Configuration, but it also has its own global configuration. There are also Iceberg table properties, and all of the various Hive properties if you're tracking tables with a Hive MetaStore.
>>>>>>>>>
>>>>>>>>> The first step is to simplify where we can, so we effectively eliminate 2 sources of config:
>>>>>>>>>
>>>>>>>>> - The Hadoop Configuration is only used to instantiate Hadoop classes, like FileSystem. Iceberg should not use it for any other config.
>>>>>>>>> - Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it's abstract), but it is available so they also don't fail trying to load the class. Table-specific config should not be stored in table or serde properties.
>>>>>>>>>
>>>>>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>>>>>>
>>>>>>>>> Iceberg differs from other tables because it is opinionated: data configuration should be maintained at the table level. This is cleaner for users because config is standardized across engines and kept in one place. It also enables services that analyze a table and update its configuration to tune options that users almost never touch, like row group or stripe size in the columnar formats. Iceberg table configuration is used to configure table-specific concerns and behavior.
>>>>>>>>>
>>>>>>>>> Spark configuration is used for engine-specific concerns and runtime overrides. A good example of an engine-specific concern is the catalogs that are available to load Iceberg tables. Spark has a way to load and configure catalog implementations, and Iceberg uses that for all catalog-level config. Runtime overrides are things like target split size. Iceberg has a table-level default split size in table properties, but this can be overridden by a Spark option for each table, as well as an option passed to the individual read. Note that these necessarily have different config names for how they are used: Iceberg uses read.split.target-size and the read-specific option is target-size.
>>>>>>>>>
>>>>>>>>> Applying this to Hive is a little strange for a couple of reasons. First, Hive's engine configuration *is* a Hadoop Configuration.
>>>>>>>>> As a result, I think the right place to store engine-specific config is there, including Iceberg catalogs, using a strategy similar to what Spark does: which external Iceberg catalogs are available, and their configuration, should come from the HiveConf.
>>>>>>>>>
>>>>>>>>> The second way Hive is strange is that Hive needs to use its own MetaStore to track Hive table concerns. The MetaStore may have tables created by an Iceberg HiveCatalog, and Hive also needs to be able to load tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>
>>>>>>>>> Here's how I think Hive should work:
>>>>>>>>>
>>>>>>>>> - There should be a default HiveCatalog that uses the current MetaStore URI, to be used for HiveCatalog tables tracked in the MetaStore.
>>>>>>>>> - Other catalogs should be defined in HiveConf.
>>>>>>>>> - HMS table properties should be used to determine how to load a table: using a Hadoop location, using the default metastore catalog, or using an external Iceberg catalog.
>>>>>>>>>   - If there is a metadata_location, then use the HiveCatalog for this metastore (where it is tracked).
>>>>>>>>>   - If there is a catalog property, then load that catalog and use it to load the table identifier, or maybe an identifier from HMS table properties.
>>>>>>>>>   - If there is no catalog or metadata_location, then use HadoopTables to load the table location as an Iceberg table.
>>>>>>>>>
>>>>>>>>> This would make it possible to access all types of Iceberg tables in the same query, and would match how Spark and Flink configure catalogs. Other than the configuration above, I don't think that config in HMS should be used at all, just like how the other engines work. Iceberg is the source of truth for table metadata, HMS stores how to load the Iceberg table, and HiveConf defines the catalogs (or runtime overrides).
>>>>>>>>>
>>>>>>>>> This isn't quite how configuration works right now. Currently, the catalog is controlled by a HiveConf property, iceberg.mr.catalog. If that isn't set, HadoopTables will be used to load table locations. If it is set, then that catalog will be used to load all tables by name. This makes it impossible to load tables from different catalogs at the same time. That's why I think the Iceberg catalog for a table should be stored in HMS table properties.
>>>>>>>>>
>>>>>>>>> I should also explain the iceberg.hive.engine.enabled flag, but I think this is long enough for now.
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>>
>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpin...@cloudera.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> I would like to start a discussion on how we should handle properties from various sources like Iceberg, Hive, or global configuration. I've put together a short document <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>, please have a look and let me know what you think.
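A sketch of the loading precedence Ryan proposes above (metadata_location first, then a named catalog from HiveConf, then HadoopTables as the fallback). The property keys and the two catalog-lookup helpers are assumptions for illustration, not the actual implementation.

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopTables;

// Sketch of the proposed precedence: metadata_location -> default HiveCatalog,
// catalog property -> external catalog from HiveConf, otherwise HadoopTables.
public class HiveLoadOrderSketch {

  static Table resolve(Configuration conf, Map<String, String> hmsProps,
                       String hmsTableLocation, TableIdentifier identifier) {
    if (hmsProps.containsKey("metadata_location")) {
      // Tracked by this metastore: load through the default HiveCatalog.
      return defaultHiveCatalog(conf).loadTable(identifier);
    }
    String catalogName = hmsProps.get("iceberg.catalog");
    if (catalogName != null) {
      // External catalog (Hadoop, Glue, Nessie, ...) defined in HiveConf.
      return catalogFromHiveConf(conf, catalogName).loadTable(identifier);
    }
    // Neither is set: treat the HMS table location as a HadoopTables path.
    return new HadoopTables(conf).load(hmsTableLocation);
  }

  // Placeholders for however the default metastore catalog and the
  // HiveConf-defined catalogs would actually be instantiated.
  static Catalog defaultHiveCatalog(Configuration conf) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }

  static Catalog catalogFromHiveConf(Configuration conf, String name) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }
}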
>>>>>>>>>> Thanks,
>>>>>>>>>> Laszlo
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Netflix
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix
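As a closing illustration of the "catalogs defined in HiveConf" idea discussed in this thread, here is one possible shape for such configuration. The 'iceberg.catalog.<name>.*' key scheme and the custom catalog class name are purely hypothetical placeholders, not settled property names.

import org.apache.hadoop.conf.Configuration;

// Purely illustrative: one possible shape for engine-level catalog definitions
// living in hive-site.xml / HiveConf. The 'iceberg.catalog.<name>.*' key scheme
// and the custom catalog class are placeholders, not settled property names.
public class HiveConfCatalogSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Equivalent hive-site.xml <property> entries would carry the same pairs.
    conf.set("iceberg.catalog.prod_hadoop.type", "hadoop");
    conf.set("iceberg.catalog.prod_hadoop.warehouse", "hdfs://namenode:8020/warehouse/iceberg");
    conf.set("iceberg.catalog.prod_custom.catalog-impl", "com.example.CustomIcebergCatalog");

    // An HMS table would then only reference a catalog by name, e.g.
    // TBLPROPERTIES ('iceberg.catalog'='prod_hadoop'), plus a table identifier.
    String prefix = "iceberg.catalog.prod_hadoop.";
    System.out.println("type      = " + conf.get(prefix + "type"));
    System.out.println("warehouse = " + conf.get(prefix + "warehouse"));
  }
}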