Re: Iceberg/Hive properties handling

Ryan Blue Wed, 25 Nov 2020 16:28:36 -0800

Thanks for working on this, Laszlo. I’ve been thinking about these problems
as well, so this is a good time to have a discussion about Hive config.

I think that Hive configuration should work mostly like other engines,
where different configurations are used for different purposes. Because the
purposes differ, there is no single global priority order across sources.
Hopefully, I can explain how we use the different config sources elsewhere
to clarify.

Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
Configuration, but it also has its own global configuration. There are also
Iceberg table properties, and all of the various Hive properties if you’re
tracking tables with a Hive MetaStore.

The first step is to simplify where we can, so we effectively eliminate two
sources of config:

   - The Hadoop Configuration is only used to instantiate Hadoop classes,
   like FileSystem. Iceberg should not use it for any other config.
   - Config in the Hive MetaStore is only used to identify that a table is
   Iceberg and point to its metadata location. All other config in HMS is
   informational. For example, the input format is set to FileInputFormat,
   which is abstract: non-Iceberg readers can load the class without
   failing, but they cannot instantiate it. Table-specific config should
   not be stored in table or serde properties.

That leaves Spark configuration and Iceberg table configuration.

Iceberg differs from other tables because it is opinionated: data
configuration should be maintained at the table level. This is cleaner for
users because config is standardized across engines and kept in one place.
It also enables services that analyze a table and update its configuration
to tune options that users almost never set themselves, like row group or
stripe size in the columnar formats. Iceberg table configuration is used to
configure table-specific concerns and behavior.

Spark configuration is used for engine-specific concerns and runtime
overrides. A good example of an engine-specific concern is the catalogs
that are available to load Iceberg tables. Spark has a way to load and
configure catalog implementations and Iceberg uses that for all
catalog-level config. Runtime overrides are things like target split size.
Iceberg has a table-level default split size in table properties, but this
can be overridden by a Spark option for each table, as well as an option
passed to the individual read. Note that these necessarily have different
config names for how they are used: the table property is
read.split.target-size and the read-specific option is split-size.
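
To make the override order concrete, here is a minimal sketch of how a
reader might resolve the split size. It is not Iceberg's implementation.
The table property name comes from the text above; the session-level key,
the function name, and the 128 MB fallback are assumptions made for
illustration.

```python
from typing import Mapping, Optional

# Table-level default, stored in Iceberg table properties.
TABLE_SPLIT_SIZE = "read.split.target-size"
# Hypothetical engine/session-level key, invented for this sketch.
SESSION_SPLIT_SIZE = "example.engine.split-size"
# Assumed fallback when nothing is configured anywhere (128 MB).
FALLBACK_SPLIT_SIZE = 128 * 1024 * 1024


def resolve_split_size(
    table_props: Mapping[str, str],
    session_conf: Mapping[str, str],
    read_override: Optional[int] = None,
) -> int:
    """Return the split size, preferring the most specific source."""
    # 1. An option passed to the individual read wins.
    if read_override is not None:
        return read_override
    # 2. Next, an engine-level runtime override.
    if SESSION_SPLIT_SIZE in session_conf:
        return int(session_conf[SESSION_SPLIT_SIZE])
    # 3. Otherwise, the table's own default from table properties.
    if TABLE_SPLIT_SIZE in table_props:
        return int(table_props[TABLE_SPLIT_SIZE])
    return FALLBACK_SPLIT_SIZE
```

The table property stays the default for every engine; the engine only
layers narrower, runtime-scoped overrides on top of it.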

Applying this to Hive is a little strange for a couple of reasons. First,
Hive’s engine configuration *is* a Hadoop Configuration. As a result, I
think the right place to store engine-specific config is there. That
includes Iceberg catalogs, using a strategy similar to what Spark does:
which external Iceberg catalogs are available, and how they are
configured, should come from the HiveConf.
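
As a rough illustration of defining catalogs in a flat engine config, the
sketch below groups prefixed keys by catalog name, in the spirit of Spark's
spark.sql.catalog.<name> convention. The iceberg.catalog. prefix, the
property names, and the example hosts are hypothetical; this email does not
specify them.

```python
from collections import defaultdict
from typing import Dict, Mapping

# Hypothetical key prefix: iceberg.catalog.<name>.<property>
CATALOG_PREFIX = "iceberg.catalog."


def catalog_configs(conf: Mapping[str, str]) -> Dict[str, Dict[str, str]]:
    """Group prefixed entries of a flat config by catalog name."""
    catalogs: Dict[str, Dict[str, str]] = defaultdict(dict)
    for key, value in conf.items():
        if not key.startswith(CATALOG_PREFIX):
            continue  # unrelated engine config
        name, sep, prop = key[len(CATALOG_PREFIX):].partition(".")
        if name and sep and prop:
            catalogs[name][prop] = value
    return dict(catalogs)


# A plain dict standing in for HiveConf; all values are made up.
hive_conf = {
    "iceberg.catalog.prod.type": "hive",
    "iceberg.catalog.prod.uri": "thrift://metastore.example.com:9083",
    "iceberg.catalog.lake.type": "hadoop",
    "iceberg.catalog.lake.warehouse": "hdfs://namenode.example.com/warehouse",
    "hive.exec.parallel": "true",
}

configs = catalog_configs(hive_conf)
```

Here configs holds one property map per catalog name ("prod" and "lake"),
and the unrelated hive.exec.parallel entry is ignored.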

The second way Hive is strange is that Hive needs to use its own MetaStore
to track Hive table concerns. The MetaStore may have tables created by an
Iceberg HiveCatalog, and Hive also needs to be able to load tables from
other Iceberg catalogs by creating table entries for them.

Here’s how I think Hive should work:

   - There should be a default HiveCatalog, configured with the current
   MetaStore URI, that loads HiveCatalog tables tracked in the MetaStore
   - Other catalogs should be defined in HiveConf
   - HMS table properties should be used to determine how to load a table:
   using a Hadoop location, using the default metastore catalog, or using an
   external Iceberg catalog
      - If there is a metadata_location, then use the HiveCatalog for this
      metastore (where it is tracked)
      - If there is a catalog property, then load that catalog and use it
      to load the table identifier, or maybe an identifier from HMS table
      properties
      - If there is no catalog or metadata_location, then use HadoopTables
      to load the table location as an Iceberg table
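
The decision rules in the list above can be sketched as a small function.
This is only an illustration of the proposal, not existing Iceberg code.
metadata_location is the property named above; the iceberg.catalog and
iceberg.table.identifier keys, and all type and function names, are
hypothetical.

```python
from enum import Enum
from typing import Mapping, NamedTuple, Optional


class Loader(Enum):
    HIVE_CATALOG = "default HiveCatalog for this metastore"
    EXTERNAL_CATALOG = "catalog defined in HiveConf"
    HADOOP_TABLES = "HadoopTables by location"


METADATA_LOCATION = "metadata_location"       # named in the list above
CATALOG_PROP = "iceberg.catalog"              # hypothetical key
IDENTIFIER_PROP = "iceberg.table.identifier"  # hypothetical key


class LoadPlan(NamedTuple):
    loader: Loader
    target: str                    # table identifier or table location
    catalog: Optional[str] = None  # only set for external catalogs


def plan_load(hms_name: str, location: str,
              props: Mapping[str, str]) -> LoadPlan:
    """Choose how to load an Iceberg table from its HMS table properties."""
    # Tracked by this metastore: use the default HiveCatalog.
    if METADATA_LOCATION in props:
        return LoadPlan(Loader.HIVE_CATALOG, hms_name)
    # Points at an external catalog: load it by name, optionally using
    # an identifier stored in the HMS table properties.
    if CATALOG_PROP in props:
        identifier = props.get(IDENTIFIER_PROP, hms_name)
        return LoadPlan(Loader.EXTERNAL_CATALOG, identifier, props[CATALOG_PROP])
    # Neither property: treat the table location as a HadoopTables table.
    return LoadPlan(Loader.HADOOP_TABLES, location)
```

Because the choice is made per table, a single query can mix tables that
resolve to all three loaders.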

This would make it possible to access all types of Iceberg tables in the
same query, and would match how Spark and Flink configure catalogs. Other
than the configuration above, I don’t think that config in HMS should be
used at all, just as in the other engines. Iceberg is the source of truth
for table metadata, HMS stores how to load the Iceberg table, and HiveConf
defines the catalogs (or runtime overrides).

This isn’t quite how configuration works right now. Currently, the catalog
is controlled by a HiveConf property, iceberg.mr.catalog. If that isn’t
set, HadoopTables will be used to load table locations. If it is set, then
that catalog will be used to load all tables by name. This makes it
impossible to load tables from different catalogs at the same time. That’s
why I think the Iceberg catalog for a table should be stored in HMS table
properties.
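
For contrast, the current behavior described in this paragraph can be
modeled roughly as follows. The iceberg.mr.catalog property is the one named
above; the function, the return shape, and the example values are
illustrative assumptions, not the actual Iceberg code.

```python
from typing import Mapping, Tuple

GLOBAL_CATALOG_PROP = "iceberg.mr.catalog"


def current_loader(hive_conf: Mapping[str, str],
                   table_name: str, table_location: str) -> Tuple[str, str]:
    """Model of the current behavior: one global setting covers every table."""
    catalog = hive_conf.get(GLOBAL_CATALOG_PROP)
    if catalog is None:
        # No catalog configured: load by location with HadoopTables.
        return ("hadoop-tables", table_location)
    # Catalog configured: every table is loaded by name from that catalog.
    return (catalog, table_name)


# Two tables in the same session always get the same answer, whatever
# catalog each one really belongs to.
conf = {GLOBAL_CATALOG_PROP: "hive"}
first = current_loader(conf, "db.a", "hdfs://nn/wh/db/a")
second = current_loader(conf, "db.b", "s3://bucket/b")
```

Both first and second resolve to the "hive" catalog here, which is the
limitation that storing the catalog per table in HMS would remove.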

I should also explain the iceberg.hive.engine.enabled flag, but I think this is
long enough for now.

rb

On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpin...@cloudera.com.invalid>
wrote:

> Hi All,
>
> I would like to start a discussion about how we should handle properties
> from various sources like Iceberg, Hive, or global configuration. I've put
> together a short document
> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
> please have a look and let me know what you think.
>
> Thanks,
> Laszlo
>

-- 
Ryan Blue
Software Engineer
Netflix
