Minor correction: my last example should have been:

db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch

--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com> wrote:

> I agree with Ryan on the core principles here. As I understand them:
>
>    1. Iceberg metadata describes all properties of a table
>    2. Hive table properties describe "how to get to" Iceberg metadata
>    (which catalog + possibly ptr, path, token, etc)
>    3. There could be default "how to get to" information set at a global
>    level
>    4. A best-effort schema should be stored in the table properties in HMS.
>    This should be done for information schema retrieval purposes within Hive
>    but should be ignored during Hive/other tool execution.
>
> Is that a fair summary of your statements Ryan (except 4, which I just
> added)?
>
> One comment I have on #2: for different catalogs and use cases, I think it
> can be somewhat more complex. It would be desirable for a table that
> initially existed without Hive, and was later exposed in Hive, to support a
> ptr/path/token describing how the table is named externally. For
> example, in a Nessie context we support arbitrary paths for an Iceberg
> table (such as folder1.folder2.folder3.table1). If you then want to expose
> that table to Hive, you might have this mapping for #2:
>
> db1.table1 => nessie:folder1.folder2.folder3.table1
>
> Similarly, you might want to expose a particular branch version of a
> table. So it might say:
>
> db1.table1_etl_branch => nessie.folder1@etl_branch
>
> Just saying that the address of the table in the catalog could itself have
> several properties. The key is that no matter what those are, we should
> follow #1 and only store properties that are about the ptr, not the
> content/metadata.
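>
> To make that concrete, a pointer-only HMS entry for the branch example
> might look something like this sketch in Java (the "catalog" key and the
> Nessie-specific names here are hypothetical, just to show the shape):
>
>     Map<String, String> ptrProps = new HashMap<>();
>     ptrProps.put("catalog", "nessie");              // which catalog resolves this table
>     ptrProps.put("table-path", "folder1.folder2.folder3.table1");
>     ptrProps.put("reference", "etl_branch");        // Nessie branch/tag to read
>     // nothing else: schema, stats, etc. stay in Iceberg metadata per #1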
>
> Lastly, I believe #4 is the case but haven't tested it. Can someone
> confirm that it is true, and that it is possible/not problematic?
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Thanks for working on this, Laszlo. I’ve been thinking about these
>> problems as well, so this is a good time to have a discussion about Hive
>> config.
>>
>> I think that Hive configuration should work mostly like other engines,
>> where different configurations are used for different purposes. Different
>> purposes mean that there is no global configuration priority. Hopefully, I
>> can explain how we use the different config sources elsewhere to clarify.
>>
>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
>> Configuration, but it also has its own global configuration. There are also
>> Iceberg table properties, and all of the various Hive properties if you’re
>> tracking tables with a Hive MetaStore.
>>
>> The first step is to simplify where we can, so we effectively eliminate
>> two sources of config:
>>
>>    - The Hadoop Configuration is only used to instantiate Hadoop
>>    classes, like FileSystem. Iceberg should not use it for any other config.
>>    - Config in the Hive MetaStore is only used to identify that a table
>>    is Iceberg and point to its metadata location. All other config in HMS is
>>    informational. For example, the input format is FileInputFormat so that
>>    non-Iceberg readers cannot actually instantiate the format (it’s abstract),
>>    but it is available so they also don’t fail trying to load the class.
>>    Table-specific config should not be stored in table or serde properties.
>>    (A sketch of such an HMS entry follows this list.)
>>
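>> For illustration, a minimal sketch in Java of what such an HMS entry would
>> carry; table_type and metadata_location are the identification properties
>> used by the HiveCatalog, everything else here (the path, the sd variable
>> standing in for the HMS StorageDescriptor) is illustrative:
>>
>>     Map<String, String> hmsProps = new HashMap<>();
>>     hmsProps.put("table_type", "ICEBERG");      // marks the table as Iceberg
>>     hmsProps.put("metadata_location",           // pointer to current metadata
>>         "s3://bucket/warehouse/db/tbl/metadata/00001.metadata.json");
>>     // informational only: FileInputFormat is abstract, so non-Iceberg
>>     // readers can resolve the class but never instantiate it
>>     sd.setInputFormat("org.apache.hadoop.mapred.FileInputFormat");
>>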
>> That leaves Spark configuration and Iceberg table configuration.
>>
>> Iceberg differs from other tables because it is opinionated: data
>> configuration should be maintained at the table level. This is cleaner for
>> users because config is standardized across engines and kept in one place.
>> It also enables services that analyze a table and update its configuration
>> to tune options that users almost never set themselves, like row group or
>> stripe size in the columnar formats. Iceberg table configuration is used to
>> configure table-specific concerns and behavior.
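>>
>> For example, such a service could tune a table directly through Iceberg’s
>> UpdateProperties API; a small sketch (the property name follows Iceberg’s
>> table property conventions, the value is illustrative):
>>
>>     // table is an org.apache.iceberg.Table loaded from any catalog
>>     table.updateProperties()
>>         .set("write.parquet.row-group-size-bytes", "134217728")  // 128 MB row groups
>>         .commit();  // the change is visible to every engine reading the table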
>>
>> Spark configuration is used for engine-specific concerns, and runtime
>> overrides. A good example of an engine-specific concern is the catalogs
>> that are available to load Iceberg tables. Spark has a way to load and
>> configure catalog implementations and Iceberg uses that for all
>> catalog-level config. Runtime overrides are things like target split size.
>> Iceberg has a table-level default split size in table properties, but this
>> can be overridden by a Spark option for each table, as well as an option
>> passed to the individual read. Note that these necessarily have different
>> config names for how they are used: Iceberg uses read.split.target-size
>> and the read-specific option is target-size.
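>>
>> As a sketch of that layering in Java (spark is a SparkSession; the option
>> name is the read-specific one described above):
>>
>>     // per-read override of the table-level read.split.target-size default
>>     Dataset<Row> df = spark.read()
>>         .format("iceberg")
>>         .option("target-size", "268435456")  // applies to this read only
>>         .load("db.table");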
>>
>> Applying this to Hive is a little strange for a couple of reasons. First,
>> Hive’s engine configuration *is* a Hadoop Configuration. As a result, I
>> think the right place to store engine-specific config is there, including
>> Iceberg catalogs using a strategy similar to what Spark does: what external
>> Iceberg catalogs are available and their configuration should come from the
>> HiveConf.
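>>
>> As a sketch of what that could look like (the property naming convention
>> here is hypothetical, mirroring Spark’s spark.sql.catalog.* style):
>>
>>     HiveConf conf = new HiveConf();
>>     // an external Nessie catalog, defined entirely in HiveConf
>>     conf.set("iceberg.catalog.nessie.catalog-impl",
>>         "org.apache.iceberg.nessie.NessieCatalog");
>>     conf.set("iceberg.catalog.nessie.uri", "http://localhost:19120/api/v1");
>>     // a second catalog backed by a Hadoop warehouse
>>     conf.set("iceberg.catalog.prod.type", "hadoop");
>>     conf.set("iceberg.catalog.prod.warehouse", "hdfs://nn:8020/warehouse");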
>>
>> The second way Hive is strange is that Hive needs to use its own
>> MetaStore to track Hive table concerns. The MetaStore may have tables
>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>> tables from other Iceberg catalogs by creating table entries for them.
>>
>> Here’s how I think Hive should work (a rough code sketch follows this list):
>>
>>    - There should be a default HiveCatalog, using the current MetaStore
>>    URI, for HiveCatalog tables tracked in the MetaStore
>>    - Other catalogs should be defined in HiveConf
>>    - HMS table properties should be used to determine how to load a
>>    table: using a Hadoop location, using the default metastore catalog, or
>>    using an external Iceberg catalog
>>       - If there is a metadata_location, then use the HiveCatalog for
>>       this metastore (where it is tracked)
>>       - If there is a catalog property, then load that catalog and use
>>       it to load the table by identifier, possibly using an identifier from
>>       HMS table properties
>>       - If there is no catalog or metadata_location, then use
>>       HadoopTables to load the table location as an Iceberg table
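>>
>> A rough sketch of that dispatch in Java (the catalog property name and the
>> helper methods are assumptions for illustration, not an existing API):
>>
>>     static Table loadIcebergTable(
>>         org.apache.hadoop.hive.metastore.api.Table hmsTable, HiveConf conf) {
>>       Map<String, String> props = hmsTable.getParameters();
>>       if (props.containsKey("metadata_location")) {
>>         // tracked by this metastore: load through the default HiveCatalog
>>         return defaultHiveCatalog(conf).loadTable(toIdentifier(hmsTable));
>>       } else if (props.containsKey("catalog")) {
>>         // external catalog defined in HiveConf; the identifier may also
>>         // come from HMS table properties
>>         Catalog catalog = catalogFromConf(conf, props.get("catalog"));
>>         return catalog.loadTable(toIdentifier(hmsTable));
>>       } else {
>>         // neither is set: treat the location as a Hadoop location table
>>         return new HadoopTables(conf).load(hmsTable.getSd().getLocation());
>>       }
>>     }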
>>
>> This would make it possible to access all types of Iceberg tables in the
>> same query, and would match how Spark and Flink configure catalogs. Other
>> than the configuration above, I don’t think that config in HMS should be
>> used at all, like how the other engines work. Iceberg is the source of
>> truth for table metadata, HMS stores how to load the Iceberg table, and
>> HiveConf defines the catalogs (or runtime overrides).
>>
>> This isn’t quite how configuration works right now. Currently, the
>> catalog is controlled by a HiveConf property, iceberg.mr.catalog. If
>> that isn’t set, HadoopTables will be used to load table locations. If it is
>> set, then that catalog will be used to load all tables by name. This makes
>> it impossible to load tables from different catalogs at the same time.
>> That’s why I think the Iceberg catalog for a table should be stored in HMS
>> table properties.
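>>
>> In code terms, today’s behavior amounts to a single global switch (a
>> sketch; the value shown is illustrative):
>>
>>     // one HiveConf property selects the catalog for ALL tables at once;
>>     // when unset, HadoopTables loads tables by location instead
>>     conf.set("iceberg.mr.catalog", "hive");
>>     // there is no per-table choice, which is what storing the catalog in
>>     // HMS table properties would fix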
>>
>> I should also explain the iceberg.hive.engine.enabled flag, but I think
>> this is long enough for now.
>>
>> rb
>>
>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter
>> <lpin...@cloudera.com.invalid> wrote:
>>
>>> Hi All,
>>>
>>> I would like to start a discussion about how we should handle properties
>>> from various sources like Iceberg, Hive, or global configuration. I've put
>>> together a short document
>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>> please have a look and let me know what you think.
>>>
>>> Thanks,
>>> Laszlo
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
