I agree with Ryan on the core principles here. As I understand them:

1. Iceberg metadata describes all properties of a table.
2. Hive table properties describe "how to get to" Iceberg metadata (which catalog + possibly a ptr, path, token, etc.).
3. There could be default "how to get to" information set at a global level.
4. A best-effort schema should be stored in the table properties in HMS. This should be done for information schema retrieval purposes within Hive, but should be ignored during Hive/other tool execution.
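To make the split in #1 and #2 concrete, here is a rough sketch of what HMS table properties might hold under this model. The property names are illustrative only, not a proposal for exact keys:

```
# HMS table properties: only "how to get to" the Iceberg table
iceberg.catalog=nessie                          # which catalog resolves the table
iceberg.catalog.ptr=folder1.folder2.table1      # external name/path/token in that catalog
# ...plus a best-effort schema copy, used only for information schema (per #4)

# Iceberg metadata (NOT in HMS): everything about the table itself
# e.g. schema, partition spec, snapshots, write.format.default, ...
```

The point is that nothing in the HMS block above describes the table's content or behavior; it only describes where to find the metadata that does.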
Is that a fair summary of your statements, Ryan (except #4, which I just added)?

One comment I have on #2: for different catalogs and use cases, I think it can be somewhat more complex. It would be desirable for a table that initially existed without Hive, and was later exposed in Hive, to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1). If you then want to expose that table to Hive, you might have this mapping for #2:

db1.table1 => nessie:folder1.folder2.folder3.table1

Similarly, you might want to expose a particular branch version of a table. So it might say:

db1.table1_etl_branch => nessie.folder1@etl_branch

Just saying that the address to the table in the catalog could itself have several properties. The key is that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.

Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Thanks for working on this, Laszlo. I’ve been thinking about these
> problems as well, so this is a good time to have a discussion about Hive
> config.
>
> I think that Hive configuration should work mostly like other engines,
> where different configurations are used for different purposes. Different
> purposes means that there is not a global configuration priority.
> Hopefully, I can explain how we use the different config sources elsewhere
> to clarify.
>
> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
> Configuration, but it also has its own global configuration. There are also
> Iceberg table properties, and all of the various Hive properties if you’re
> tracking tables with a Hive MetaStore.
>
> The first step is to simplify where we can, so we effectively eliminate 2
> sources of config:
>
> - The Hadoop Configuration is only used to instantiate Hadoop classes,
>   like FileSystem. Iceberg should not use it for any other config.
> - Config in the Hive MetaStore is only used to identify that a table
>   is Iceberg and point to its metadata location. All other config in HMS is
>   informational. For example, the input format is FileInputFormat so that
>   non-Iceberg readers cannot actually instantiate the format (it’s abstract),
>   but it is available so they also don’t fail trying to load the class.
>   Table-specific config should not be stored in table or serde properties.
>
> That leaves Spark configuration and Iceberg table configuration.
>
> Iceberg differs from other tables because it is opinionated: data
> configuration should be maintained at the table level. This is cleaner for
> users because config is standardized across engines and in one place. And
> it also enables services that analyze a table and update its configuration
> to tune options that users almost never do, like row group or stripe size
> in the columnar formats. Iceberg table configuration is used to configure
> table-specific concerns and behavior.
>
> Spark configuration is used for engine-specific concerns, and runtime
> overrides. A good example of an engine-specific concern is the catalogs
> that are available to load Iceberg tables. Spark has a way to load and
> configure catalog implementations, and Iceberg uses that for all
> catalog-level config. Runtime overrides are things like target split size.
> Iceberg has a table-level default split size in table properties, but this
> can be overridden by a Spark option for each table, as well as an option
> passed to the individual read. Note that these necessarily have different
> config names for how they are used: Iceberg uses read.split.target-size
> and the read-specific option is target-size.
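The split-size override chain Ryan describes above can be sketched as a simple precedence function. This is illustrative only (not the actual Iceberg/Spark implementation, and the default value is hypothetical):

```python
# Sketch of the override chain: per-read option beats per-table Spark
# option, which beats the Iceberg table property, which beats a default.
DEFAULT_SPLIT_SIZE = 128 * 1024 * 1024  # hypothetical engine default, in bytes

def resolve_split_size(read_options, spark_table_options, table_properties):
    """Return the effective target split size for one read."""
    if "target-size" in read_options:                  # per-read override
        return int(read_options["target-size"])
    if "target-size" in spark_table_options:           # per-table runtime override
        return int(spark_table_options["target-size"])
    if "read.split.target-size" in table_properties:   # Iceberg table property
        return int(table_properties["read.split.target-size"])
    return DEFAULT_SPLIT_SIZE

# Example: the table sets a default; one specific read overrides it.
table_props = {"read.split.target-size": str(256 * 1024 * 1024)}
print(resolve_split_size({}, {}, table_props))                            # 268435456
print(resolve_split_size({"target-size": "134217728"}, {}, table_props))  # 134217728
```

Note how the table property and the runtime option use different names (read.split.target-size vs. target-size), matching their different scopes.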
>
> Applying this to Hive is a little strange for a couple of reasons. First,
> Hive’s engine configuration *is* a Hadoop Configuration. As a result, I
> think the right place to store engine-specific config is there, including
> Iceberg catalogs using a strategy similar to what Spark does: what external
> Iceberg catalogs are available and their configuration should come from the
> HiveConf.
>
> The second way Hive is strange is that Hive needs to use its own MetaStore
> to track Hive table concerns. The MetaStore may have tables created by an
> Iceberg HiveCatalog, and Hive also needs to be able to load tables from
> other Iceberg catalogs by creating table entries for them.
>
> Here’s how I think Hive should work:
>
> - There should be a default HiveCatalog that uses the current
>   MetaStore URI, to be used for HiveCatalog tables tracked in the MetaStore
> - Other catalogs should be defined in HiveConf
> - HMS table properties should be used to determine how to load a
>   table: using a Hadoop location, using the default metastore catalog, or
>   using an external Iceberg catalog
>   - If there is a metadata_location, then use the HiveCatalog for
>     this metastore (where it is tracked)
>   - If there is a catalog property, then load that catalog and use it
>     to load the table identifier, or maybe an identifier from HMS table
>     properties
>   - If there is no catalog or metadata_location, then use
>     HadoopTables to load the table location as an Iceberg table
>
> This would make it possible to access all types of Iceberg tables in the
> same query, and would match how Spark and Flink configure catalogs. Other
> than the configuration above, I don’t think that config in HMS should be
> used at all, like how the other engines work. Iceberg is the source of
> truth for table metadata, HMS stores how to load the Iceberg table, and
> HiveConf defines the catalogs (or runtime overrides).
>
> This isn’t quite how configuration works right now.
> Currently, the catalog
> is controlled by a HiveConf property, iceberg.mr.catalog. If that isn’t
> set, HadoopTables will be used to load table locations. If it is set, then
> that catalog will be used to load all tables by name. This makes it
> impossible to load tables from different catalogs at the same time. That’s
> why I think the Iceberg catalog for a table should be stored in HMS table
> properties.
>
> I should also explain the iceberg.hive.engine.enabled flag, but I think this
> is long enough for now.
>
> rb
>
> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpin...@cloudera.com.invalid>
> wrote:
>
>> Hi All,
>>
>> I would like to start a discussion: how should we handle properties from
>> various sources like Iceberg, Hive, or global configuration? I've put
>> together a short document
>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>> please have a look and let me know what you think.
>>
>> Thanks,
>> Laszlo
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
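The three loading rules Ryan lists above (metadata_location, catalog property, fall back to HadoopTables) can be sketched as a small resolution function. All names here are illustrative, not the actual Iceberg/Hive implementation:

```python
# Sketch of the proposed HMS-driven loading rules:
#   1. metadata_location present -> the default HiveCatalog owns this table
#   2. a catalog property present -> load via that externally configured catalog
#   3. neither -> treat the table location as a HadoopTables path
# Property names ("iceberg.catalog", "location") are hypothetical here.

def resolve_load_strategy(hms_props):
    """Decide how to load an Iceberg table from its HMS table properties."""
    if "metadata_location" in hms_props:
        # Tracked by this metastore's own HiveCatalog.
        return ("hive-catalog", hms_props["metadata_location"])
    if "iceberg.catalog" in hms_props:
        # The identifier may come from HMS properties or the catalog itself.
        return ("external-catalog", hms_props["iceberg.catalog"])
    # No catalog info at all: fall back to loading the location directly.
    return ("hadoop-tables", hms_props.get("location"))

print(resolve_load_strategy({"metadata_location": "s3://bucket/meta.json"}))
print(resolve_load_strategy({"iceberg.catalog": "nessie"}))
print(resolve_load_strategy({"location": "s3://bucket/table"}))
```

Because the decision is made per table from HMS properties, rather than from a single global iceberg.mr.catalog setting, tables from different catalogs can coexist in one query.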