Yes, I think that is a good summary of the principles.

#4 is correct because some of the information we provide is purely
informational (the Hive schema) or tracked only by the metastore (the
best-effort current user).
I also agree that it would be good to have a table identifier in HMS table
metadata when loading from an external table. That gives us a way to handle
name conflicts.
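
To illustrate (the property names here are hypothetical, not a settled
spec), the HMS entry for a table loaded from an external catalog might
carry something like:

   iceberg.catalog=nessie
   iceberg.table_identifier=folder1.folder2.folder3.table1

so the Hive-side name can differ from the catalog-side name without
ambiguity.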

On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacq...@dremio.com> wrote:

> Minor error, my last example should have been:
>
> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>
>> I agree with Ryan on the core principles here. As I understand them:
>>
>>    1. Iceberg metadata describes all properties of a table
>>    2. Hive table properties describe "how to get to" Iceberg metadata
>>    (which catalog + possibly ptr, path, token, etc)
>>    3. There could be default "how to get to" information set at a global
>>    level
>>    4. A best-effort schema should be stored in the table properties in
>>    HMS. This should be done for information-schema retrieval purposes within
>>    Hive but should be ignored during Hive/other tool execution (a sketch of
>>    the split follows below).
>>
>> Is that a fair summary of your statements, Ryan (except #4, which I just
>> added)?
>>
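>> To make the split concrete, here is a rough sketch (names and values are
>> made up; metadata_location is the pointer the Hive integration uses):
>>
>>    # lives only in Iceberg table metadata (#1)
>>    write.format.default=parquet
>>
>>    # lives only in HMS table properties (#2 and #4)
>>    metadata_location=s3://bucket/db1/table1/metadata/00001.metadata.json
>>    schema=<best-effort schema JSON, informational only>
>>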
>> One comment I have on #2 is that for different catalogs and use cases, it
>> can be somewhat more complex: it would be desirable for a table that
>> initially existed without Hive, and was later exposed in Hive, to support a
>> ptr/path/token for how the table is named externally. For example, in a
>> Nessie context we support arbitrary paths for an Iceberg table (such as
>> folder1.folder2.folder3.table1). If you then want to expose that table to
>> Hive, you might have this mapping for #2:
>>
>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>
>> Similarly, you might want to expose a particular branch version of a
>> table. So it might say:
>>
>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>
>> Just saying that the address to the table in the catalog could itself
>> have several properties. The key is that no matter what those are, we
>> should follow #1 and store only properties that describe the ptr, not the
>> content/metadata.
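>>
>> As a rough sketch (property names are illustrative, not a settled spec),
>> that multi-part pointer could be stored as HMS table properties:
>>
>>    iceberg.catalog=nessie
>>    iceberg.table_identifier=folder1.folder2.folder3.table1
>>    iceberg.reference=etl_branch
>>
>> All three describe how to reach the table, so they stay on the ptr side
>> of the line.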
>>
>> Lastly, I believe #4 is possible but haven't tested it. Can someone
>> confirm that it works, and that it isn't problematic?
>>
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> Thanks for working on this, Laszlo. I’ve been thinking about these
>>> problems as well, so this is a good time to have a discussion about Hive
>>> config.
>>>
>>> I think that Hive configuration should work mostly like other engines,
>>> where different configurations are used for different purposes. Because the
>>> purposes differ, there is no single global configuration priority.
>>> Hopefully, I can explain how we use the different config sources elsewhere
>>> to clarify.
>>>
>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
>>> Configuration, but it also has its own global configuration. There are also
>>> Iceberg table properties, and all of the various Hive properties if you’re
>>> tracking tables with a Hive MetaStore.
>>>
>>> The first step is to simplify where we can, so we effectively eliminate
>>> 2 sources of config:
>>>
>>>    - The Hadoop Configuration is only used to instantiate Hadoop
>>>    classes, like FileSystem. Iceberg should not use it for any other config.
>>>    - Config in the Hive MetaStore is only used to identify that a table
>>>    is Iceberg and point to its metadata location. All other config in HMS is
>>>    informational. For example, the input format is FileInputFormat so that
>>>    non-Iceberg readers cannot actually instantiate the format (it’s abstract),
>>>    but it is available so they also don’t fail trying to load the class.
>>>    Table-specific config should not be stored in table or serde properties
>>>    (see the sketch below).
>>>
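>>> To illustrate the second point, a HiveCatalog table entry in HMS looks
>>> roughly like this (exact values vary by setup, so treat this as a sketch):
>>>
>>>    # the only property Iceberg actually reads
>>>    metadata_location=s3://bucket/db/tbl/metadata/00004.metadata.json
>>>    # informational entries for non-Iceberg readers
>>>    table_type=ICEBERG
>>>    inputformat=org.apache.hadoop.mapred.FileInputFormat
>>>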
>>> That leaves Spark configuration and Iceberg table configuration.
>>>
>>> Iceberg differs from other table formats because it is opinionated: data
>>> configuration should be maintained at the table level. This is cleaner for
>>> users because config is standardized across engines and in one place. And
>>> it also enables services that analyze a table and update its configuration,
>>> tuning options that users almost never touch, like row group or stripe size
>>> in the columnar formats. Iceberg table configuration is used to configure
>>> table-specific concerns and behavior.
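>>>
>>> As a sketch of what that enables (Java; the property name is a real
>>> Iceberg table property, the value is just an example), a tuning service
>>> could commit new table-level config once and every engine picks it up:
>>>
>>>    // table is an org.apache.iceberg.Table loaded from some catalog
>>>    table.updateProperties()
>>>        .set("write.parquet.row-group-size-bytes", "134217728")  // 128 MB
>>>        .commit();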
>>>
>>> Spark configuration is used for engine-specific concerns, and runtime
>>> overrides. A good example of an engine-specific concern is the catalogs
>>> that are available to load Iceberg tables. Spark has a way to load and
>>> configure catalog implementations and Iceberg uses that for all
>>> catalog-level config. Runtime overrides are things like target split size.
>>> Iceberg has a table-level default split size in table properties, but this
>>> can be overridden by a Spark option for each table, as well as an option
>>> passed to the individual read. Note that these necessarily have different
>>> config names for how they are used: Iceberg uses read.split.target-size
>>> and the read-specific option is target-size.
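>>>
>>> A quick sketch of that layering (Java; the per-read option name follows
>>> the description above rather than a fixed spec):
>>>
>>>    // spark is an existing SparkSession; read.split.target-size in table
>>>    // properties supplies the default, the option overrides this scan only
>>>    Dataset<Row> df = spark.read()
>>>        .format("iceberg")
>>>        .option("target-size", "268435456")  // 256 MB for this read
>>>        .load("db.table");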
>>>
>>> Applying this to Hive is a little strange for a couple reasons. First,
>>> Hive’s engine configuration *is* a Hadoop Configuration. As a result, I
>>> think the right place to store engine-specific config is there, including
>>> Iceberg catalog definitions, using a strategy similar to Spark’s: which
>>> external Iceberg catalogs are available, and their configuration, should
>>> come from the HiveConf.
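>>>
>>> For example (the property names here are hypothetical, just to show the
>>> shape), catalogs could be declared in hive-site.xml the way Spark declares
>>> them under spark.sql.catalog.*:
>>>
>>>    iceberg.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
>>>    iceberg.catalog.nessie.uri=https://nessie.example.com/api/v1
>>>
>>> HMS table entries would then only need to name the catalog to use.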
>>>
>>> The second way Hive is strange is that Hive needs to use its own
>>> MetaStore to track Hive table concerns. The MetaStore may have tables
>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>> tables from other Iceberg catalogs by creating table entries for them.
>>>
>>> Here’s how I think Hive should work:
>>>
>>>    - There should be a default HiveCatalog, using the current MetaStore
>>>    URI, for HiveCatalog tables tracked in that MetaStore
>>>    - Other catalogs should be defined in HiveConf
>>>    - HMS table properties should be used to determine how to load a
>>>    table: using a Hadoop location, using the default metastore catalog, or
>>>    using an external Iceberg catalog
>>>       - If there is a metadata_location, then use the HiveCatalog for
>>>       this metastore (where it is tracked)
>>>       - If there is a catalog property, then load that catalog and use
>>>       it to load the table by its identifier, possibly an identifier stored
>>>       in HMS table properties
>>>       - If there is no catalog or metadata_location, then use
>>>       HadoopTables to load the table location as an Iceberg table (this
>>>       resolution order is sketched below)
>>>
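>>> A sketch of that resolution order (Java; the helper methods are
>>> hypothetical, and this is just the rule set above, not an implementation):
>>>
>>>    static Table loadIcebergTable(Properties hmsProps, HiveConf conf,
>>>                                  String location) {
>>>      if (hmsProps.getProperty("metadata_location") != null) {
>>>        // tracked by this metastore: use the default HiveCatalog
>>>        return defaultHiveCatalog(conf).loadTable(identifierFrom(hmsProps));
>>>      } else if (hmsProps.getProperty("catalog") != null) {
>>>        // load the external catalog defined in HiveConf, then the table
>>>        Catalog catalog = catalogFromConf(conf, hmsProps.getProperty("catalog"));
>>>        return catalog.loadTable(identifierFrom(hmsProps));
>>>      } else {
>>>        // no catalog info at all: treat the location as a HadoopTables table
>>>        return new HadoopTables(conf).load(location);
>>>      }
>>>    }
>>>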
>>> This would make it possible to access all types of Iceberg tables in the
>>> same query, and would match how Spark and Flink configure catalogs. Other
>>> than the configuration above, I don’t think config in HMS should be used at
>>> all, just as in the other engines. Iceberg is the source of
>>> truth for table metadata, HMS stores how to load the Iceberg table, and
>>> HiveConf defines the catalogs (or runtime overrides).
>>>
>>> This isn’t quite how configuration works right now. Currently, the
>>> catalog is controlled by a HiveConf property, iceberg.mr.catalog. If
>>> that isn’t set, HadoopTables will be used to load table locations. If it is
>>> set, then that catalog will be used to load all tables by name. This makes
>>> it impossible to load tables from different catalogs at the same time.
>>> That’s why I think the Iceberg catalog for a table should be stored in HMS
>>> table properties.
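>>>
>>> For reference, the current global setting looks like this (the value
>>> shown is just one example):
>>>
>>>    # in hive-site.xml / HiveConf: every table loads through this catalog
>>>    iceberg.mr.catalog=hive
>>>
>>> which is exactly why a per-table catalog property in HMS would help.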
>>>
>>> I should also explain the iceberg.hive.engine.enabled flag, but I think
>>> this is long enough for now.
>>>
>>> rb
>>>
>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter
>>> <lpin...@cloudera.com.invalid> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I would like to start a discussion about how we should handle properties
>>>> from various sources like Iceberg, Hive, or global configuration. I've put
>>>> together a short document
>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>;
>>>> please have a look and let me know what you think.
>>>>
>>>> Thanks,
>>>> Laszlo
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>

-- 
Ryan Blue
Software Engineer
Netflix
