Hey Peter, thanks for updating the doc, and for the heads-up in the other
thread about your capacity to look at this before EOY.

I'm going to try to create a specification document based on the discussion
document you put together. I think there is general consensus around what
you call "Spark-like catalog configuration", so I'd like to formalize that
further.

It seems like there is less consensus around the whitelist/blacklist side
of things. You outline four approaches:

   1. Hard coded HMS only property list
   2. Hard coded Iceberg only property list
   3. Prefix for Iceberg properties
   4. Prefix for HMS only properties

I generally think #2 is a no-go, as it creates too much coupling between
catalog implementations and core Iceberg. It seems like Ryan Blue would
prefer #4 (correct?); a rough sketch of #4 follows below. Any other strong
opinions?
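
For concreteness, a minimal sketch of what #4 could look like (the "hms."
prefix below is a placeholder I just made up, not a proposal):

import java.util.HashMap;
import java.util.Map;

class HmsOnlyFilter {
  static final String HMS_ONLY_PREFIX = "hms.";  // hypothetical marker prefix

  // Approach #4: every HMS table property is passed through to the Iceberg
  // table except the ones explicitly marked as HMS-only.
  static Map<String, String> toIceberg(Map<String, String> hmsProps) {
    Map<String, String> iceberg = new HashMap<>();
    hmsProps.forEach((key, value) -> {
      if (!key.startsWith(HMS_ONLY_PREFIX)) {
        iceberg.put(key, value);
      }
    });
    return iceberg;
  }
}

The appeal of #4 is that an unprefixed property set by a user lands on the
Iceberg table by default, which matches the "least surprise" argument below.
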
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Thu, Dec 3, 2020 at 9:27 AM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> As Jacques suggested (with the help of Zoltan) I have collected the
> current state and the proposed solutions in a document:
>
> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>
> My feeling is that we do not have a final decision, so I tried to list all
> the possible solutions.
> Please comment!
>
> Thanks,
> Peter
>
> On Dec 2, 2020, at 18:10, Peter Vary <pv...@cloudera.com> wrote:
>
> When I was working on the CREATE TABLE patch I found the following
> TBLPROPERTIES on newly created tables:
>
>    - external.table.purge
>    - EXTERNAL
>    - bucketing_version
>    - numRows
>    - rawDataSize
>    - totalSize
>    - numFiles
>    - numFileErasureCoded
>
>
> I am afraid that we cannot change the names of most of these properties,
> and it might not be useful to have most of them alongside the Iceberg
> statistics that are already there. Also, my feeling is that this is only the
> tip of the Iceberg (pun intended :)), which is why I think we should have a
> more targeted way to push properties to the Iceberg tables.
>
> On Dec 2, 2020, at 18:04, Ryan Blue <rb...@netflix.com> wrote:
>
> Sorry, I accidentally didn’t copy the dev list on this reply. Resending:
>
> Also I expect that we will want to add Hive-specific write configs at the
> table level when the general engine-independent configuration is not ideal
> for Hive, but every Hive query for a given table should use some specific
> config.
>
> Hive may need configuration, but I think these should still be kept in the
> Iceberg table. There is no reason to make Hive config inaccessible from
> other engines. If someone wants to view all of the config for a table from
> Spark, the Hive config should also be included, right?
>
> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary <pv...@cloudera.com> wrote:
>
>> I will ask Laszlo if he wants to update his doc.
>>
>> I see both pros and cons of catalog definition in config files. If there
>> is an easy default then I do not mind any of the proposed solutions.
>>
>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>> solution, because in Hive it is common to add new keys to the property list
>> with no restriction in place (I am not even sure that the currently
>> implemented blacklist preventing properties from propagating to Iceberg
>> tables is complete). Also I expect that we will want to add Hive-specific
>> write configs at the table level when the general engine-independent
>> configuration is not ideal for Hive, but every Hive query for a given table
>> should use some specific config.
>>
>> Thanks, Peter
>>
>> Jacques Nadeau <jacq...@dremio.com> ezt írta (időpont: 2020. dec. 1., Ke
>> 17:06):
>>
>>> Would someone be willing to create a document that states the current
>>> proposal?
>>>
>>> It is becoming somewhat difficult to follow this thread. I also worry
>>> that, without a complete statement of the current shape, people may
>>> incorrectly think they are in alignment.
>>>
>>>
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>>
>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>> borokna...@cloudera.com> wrote:
>>>
>>>> Thanks, Ryan. I answered inline.
>>>>
>>>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>>> This sounds like a good plan overall, but I have a couple of notes:
>>>>>
>>>>>    1. We need to keep in mind that users plug in their own catalogs,
>>>>>    so iceberg.catalog could be a Glue or Nessie catalog, not just
>>>>>    Hive or Hadoop. I don’t think it makes much sense to use separate
>>>>>    hadoop.catalog and hive.catalog values. Those should just be names for
>>>>>    catalogs configured in Configuration, i.e., via hive-site.xml. We
>>>>>    then only need a special value for loading Hadoop tables from paths.
>>>>>
>>>> About extensibility, I think the usual Hive way is to use Java class
>>>> names. This way the value for 'iceberg.catalog' could be e.g.
>>>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>>>> would need to have a factory method that constructs the catalog object
>>>> from a properties object (Map<String, String>). E.g.
>>>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>>>> 'iceberg.catalog_location' to be present in the properties.
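>>>>
>>>> A rough sketch of the reflective loading I have in mind (the static
>>>> 'fromProperties' factory method is a made-up name, not an existing API):
>>>>
>>>> import java.lang.reflect.Method;
>>>> import java.util.Map;
>>>> import org.apache.iceberg.catalog.Catalog;
>>>>
>>>> class CatalogLoader {
>>>>   static Catalog load(Map<String, String> props) throws Exception {
>>>>     // the property value is the catalog class name, e.g.
>>>>     // 'org.apache.iceberg.hadoop.HadoopCatalog'
>>>>     Class<?> impl = Class.forName(props.get("iceberg.catalog"));
>>>>     // assume each implementation ships a static factory that validates
>>>>     // its own requirements, e.g. 'iceberg.catalog_location'
>>>>     Method factory = impl.getMethod("fromProperties", Map.class);
>>>>     return (Catalog) factory.invoke(null, props);
>>>>   }
>>>> }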
>>>>
>>>>>
>>>>>    2. I don’t think that catalog configuration should be kept in
>>>>>    table properties. A catalog should not be loaded for each table, so I
>>>>>    don’t think we need iceberg.catalog_location. Instead, we should have
>>>>>    a way to define catalogs in the Configuration for tables in the
>>>>>    metastore to reference.
>>>>>
>>>>  I think it makes sense; on the other hand, it would make adding new
>>>> catalogs more heavyweight, i.e. now you'd need to edit configuration
>>>> files and restart/reinit services. That could be cumbersome in some
>>>> environments.
>>>>
>>>>>
>>>>>    3. I’d rather use a prefix to exclude properties from being passed
>>>>>    to Iceberg than one to include them. Otherwise, users don’t know what
>>>>>    to do to pass table properties from Hive or Impala. If we exclude a
>>>>>    prefix or specific properties, then everything except the properties
>>>>>    reserved for locating the table is passed, as the user would expect.
>>>>>
>>>> I don't have a strong opinion about this, but yeah, maybe this
>>>> behavior would cause the least surprise.
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <
>>>>> borokna...@apache.org> wrote:
>>>>>
>>>>>> Thanks, Peter. I answered inline.
>>>>>>
>>>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <
>>>>>> pv...@cloudera.com.invalid> wrote:
>>>>>>
>>>>>>> Hi Zoltan,
>>>>>>>
>>>>>>> Answers below:
>>>>>>>
>>>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>>>> borokna...@cloudera.com.INVALID> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thanks for the replies. My take on the above questions is as
>>>>>>> follows:
>>>>>>>
>>>>>>>    - Should 'iceberg.catalog' be a required property?
>>>>>>>    - Yeah, I think it would be nice if this were required, to
>>>>>>>       avoid any implicit behavior
>>>>>>>
>>>>>>> Currently we have a Catalogs class to get/initialize/use the
>>>>>>> different catalogs. At that time the decision was to use HadoopTables
>>>>>>> as the default catalog. It might be worthwhile to use the same class
>>>>>>> in Impala as well, so the behavior is consistent.
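>>>>>>>
>>>>>>> For reference, loading through that class looks roughly like this
>>>>>>> (from memory, so treat the exact property keys and signatures as
>>>>>>> approximate):
>>>>>>>
>>>>>>> import java.util.Properties;
>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>> import org.apache.iceberg.Table;
>>>>>>> import org.apache.iceberg.mr.Catalogs;
>>>>>>>
>>>>>>> class LoadExample {
>>>>>>>   public static void main(String[] args) {
>>>>>>>     Configuration conf = new Configuration();
>>>>>>>     Properties props = new Properties();
>>>>>>>     // with no catalog configured, Catalogs falls back to HadoopTables,
>>>>>>>     // which loads from a location instead of an identifier
>>>>>>>     props.setProperty("location", "hdfs://nn:8020/warehouse/tbl");
>>>>>>>     Table table = Catalogs.loadTable(conf, props);
>>>>>>>     System.out.println(table.location());
>>>>>>>   }
>>>>>>> }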
>>>>>>>
>>>>>>
>>>>>> Yeah, I think it'd be beneficial for us to use the Iceberg classes
>>>>>> whenever possible. The Catalogs class is very similar to what we have
>>>>>> currently in Impala.
>>>>>>
>>>>>>>
>>>>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>>>>       - In Impala we don't allow setting LOCATION for tables stored
>>>>>>>       in 'hadoop.catalog'. But Impala internally sets LOCATION to the
>>>>>>>       Iceberg table's actual location. We were also thinking about
>>>>>>>       using only the table LOCATION, setting it to the catalog
>>>>>>>       location, but we found that confusing.
>>>>>>>
>>>>>>> It could definitely work, but it is somewhat strange to have the
>>>>>>> external table location set to an arbitrary path while a different
>>>>>>> location is generated from other configs. It would be nice to have the
>>>>>>> real location set as the external table location as well.
>>>>>>>
>>>>>>
>>>>>> Impala sets the real Iceberg table location for external tables. E.g.
>>>>>> if the user issues
>>>>>>
>>>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>>>> STORED AS ICEBERG
>>>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>>>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>>>
>>>>>> If the end user had specified LOCATION, then Impala would have raised
>>>>>> an error. But the above DDL statement is correct, so Impala loads the
>>>>>> Iceberg table via the Iceberg API, then creates the HMS table and sets
>>>>>> LOCATION to the Iceberg table location (something like
>>>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>>>>
>>>>>>
>>>>>>> I like the flexibility of setting the table_identifier at the table
>>>>>>> level, which could help avoid naming conflicts. We might want to have
>>>>>>> this in the Iceberg Catalog implementation.
>>>>>>>
>>>>>>>
>>>>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>>>>       - Yeah, it doesn't add much if we only allow using the current
>>>>>>>       HMS. I think it is only useful if we allow external HMSes.
>>>>>>>    - Moving properties to SERDEPROPERTIES
>>>>>>>       - I see that these properties are used by the SerDe classes in
>>>>>>>       Hive, but I feel that they are just not about serialization and
>>>>>>>       deserialization. As I see it, the current SERDEPROPERTIES are
>>>>>>>       things like 'field.delim', 'separatorChar', 'quoteChar', etc., so
>>>>>>>       properties about table loading more naturally belong in
>>>>>>>       TBLPROPERTIES in my opinion.
>>>>>>>
>>>>>>> I have seen it used both ways for HBaseSerDe (even the wiki page uses
>>>>>>> both :) ). Since Impala prefers TBLPROPERTIES, and if we start using a
>>>>>>> prefix to separate real Iceberg table properties from other properties,
>>>>>>> then we can keep them in TBLPROPERTIES.
>>>>>>>
>>>>>>
>>>>>> In the Google doc I also had a comment about prefixing Iceberg table
>>>>>> properties. We could use a prefix like 'iceberg.tblproperties.' and pass
>>>>>> every property with this prefix to the Iceberg table; currently Impala
>>>>>> passes every table property to the Iceberg table.
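>>>>>>
>>>>>> As a sketch, the prefix handling could be as simple as this (the prefix
>>>>>> string is just the placeholder from my comment, nothing agreed):
>>>>>>
>>>>>> import java.util.HashMap;
>>>>>> import java.util.Map;
>>>>>>
>>>>>> class PrefixedProps {
>>>>>>   static final String PREFIX = "iceberg.tblproperties.";  // placeholder
>>>>>>
>>>>>>   // Pass only the prefixed HMS properties to Iceberg, prefix stripped.
>>>>>>   static Map<String, String> toIceberg(Map<String, String> hmsProps) {
>>>>>>     Map<String, String> result = new HashMap<>();
>>>>>>     hmsProps.forEach((key, value) -> {
>>>>>>       if (key.startsWith(PREFIX)) {
>>>>>>         result.put(key.substring(PREFIX.length()), value);
>>>>>>       }
>>>>>>     });
>>>>>>     return result;
>>>>>>   }
>>>>>> }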
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>     Zoltan
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <
>>>>>>> pv...@cloudera.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Based on the discussion below I understand we have the following
>>>>>>>> kinds of properties:
>>>>>>>>
>>>>>>>>    1. Iceberg table properties - Engine independent, storage
>>>>>>>>    related parameters
>>>>>>>>    2. "how to get to" - I think these are mostly Hive-specific table
>>>>>>>>    properties, since for Spark the Spark catalog configuration serves
>>>>>>>>    the same purpose. I think the best place for storing these would be
>>>>>>>>    the Hive SERDEPROPERTIES, as they describe the access information
>>>>>>>>    for the SerDe. Sidenote: I think we should decide whether we allow
>>>>>>>>    HiveCatalogs pointing to a different HMS; 'iceberg.table_identifier'
>>>>>>>>    would only make sense if we allow multiple catalogs.
>>>>>>>>    3. Query-specific properties - These are engine specific and might
>>>>>>>>    be mapped to / even override the Iceberg table properties on the
>>>>>>>>    engine-specific code paths, but currently these properties have
>>>>>>>>    independent names and are mapped on a case-by-case basis.
>>>>>>>>
>>>>>>>>
>>>>>>>> Based on this:
>>>>>>>>
>>>>>>>>    - Shall we move the "how to get to" properties to
>>>>>>>>    SERDEPROPERTIES?
>>>>>>>>    - Shall we define a prefix for setting Iceberg table properties
>>>>>>>>    from Hive queries, omitting other engine-specific properties?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Peter
>>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <massdos...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I like these suggestions, comments inline below on the last round...
>>>>>>>>
>>>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>>>> borokna...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>>>> information about table loading in HMS table properties. We are just
>>>>>>>>> a bit more explicit about which catalog to use. We have the table
>>>>>>>>> property 'iceberg.catalog' to determine the catalog type; right now
>>>>>>>>> the supported values are 'hadoop.tables', 'hadoop.catalog', and
>>>>>>>>> 'hive.catalog'. Additional table properties can be set based on the
>>>>>>>>> catalog type.
>>>>>>>>>
>>>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'm all for renaming this; having "mr" in the property name is
>>>>>>>> confusing.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - hadoop.tables
>>>>>>>>>       - the table location is used to load the table
>>>>>>>>>
>>>>>>>> The only question I have is: should we have this as the default?
>>>>>>>> i.e. if you don't set a catalog, will it assume it's HadoopTables and
>>>>>>>> use the location? Or should we require this property to be present, to
>>>>>>>> be consistent and avoid any "magic"?
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>    - hadoop.catalog
>>>>>>>>>       - Required table property 'iceberg.catalog_location'
>>>>>>>>>       specifies the location of the hadoop catalog in the file system
>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>       specifies the table id. If it's not set, then 
>>>>>>>>> <database_name>.<table_name>
>>>>>>>>>       is used as table identifier
>>>>>>>>>
>>>>>>>> I like this as it would allow you to use a different database and
>>>>>>>> table name in Hive as opposed to the Hadoop catalog - at the moment
>>>>>>>> they have to match. The only thing here is that I think Hive requires
>>>>>>>> a table LOCATION to be set, and it's then confusing that there are two
>>>>>>>> locations on the table. I'm not sure whether, in the Hive storage
>>>>>>>> handler or SerDe etc., we can get Hive to not require that and maybe
>>>>>>>> even disallow it from being set. That would probably be best in
>>>>>>>> conjunction with this. Another solution would be to not have the
>>>>>>>> 'iceberg.catalog_location' property but instead use the table LOCATION
>>>>>>>> for this, but that's a bit confusing from a Hive point of view.
>>>>>>>>
>>>>>>>>
>>>>>>>>>    - hive.catalog
>>>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>>>       specifies the table id. If it's not set, then 
>>>>>>>>> <database_name>.<table_name>
>>>>>>>>>       is used as table identifier
>>>>>>>>>       - We have the assumption that the current Hive metastore
>>>>>>>>>       stores the table, i.e. we don't support external Hive 
>>>>>>>>> metastores currently
>>>>>>>>>
>>>>>>>> These sound fine for Hive catalog tables that are created outside of
>>>>>>>> the automatic Hive table creation (see https://iceberg.apache.org/hive/
>>>>>>>> -> Using Hive Catalog); we'd just need to document how you can create
>>>>>>>> these yourself, and that one could use a different Hive database and
>>>>>>>> table etc.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Independent of the catalog implementation, we also have the table
>>>>>>>>> property 'iceberg.file_format' to specify the file format for the
>>>>>>>>> data files.
>>>>>>>>>
>>>>>>>>
>>>>>>>> OK, I don't think we need that for Hive?
>>>>>>>>
>>>>>>>>
>>>>>>>>> We haven't released it yet, so we are open to changes, but I think
>>>>>>>>> these properties are reasonable and it would be great if we could
>>>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>>>> metastore of tables.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> If others agree, I think we should create an issue where we document
>>>>>>>> the above changes so it's very clear what we're doing, and we can then
>>>>>>>> go and implement them and update the docs etc.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>     Zoltan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <
>>>>>>>>> rb...@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>>>
>>>>>>>>>> #4 is correct because some of the information we provide is purely
>>>>>>>>>> informational (Hive schema) or tracked only by the metastore
>>>>>>>>>> (best-effort current user). I also agree that it would be good to
>>>>>>>>>> have a table identifier in HMS table metadata when loading from an
>>>>>>>>>> external table. That gives us a way to handle name conflicts.
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <
>>>>>>>>>> jacq...@dremio.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>>>
>>>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <
>>>>>>>>>>> jacq...@dremio.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I agree with Ryan on the core principles here. As I understand
>>>>>>>>>>>> them:
>>>>>>>>>>>>
>>>>>>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>>>>>>    2. Hive table properties describe "how to get to" Iceberg
>>>>>>>>>>>>    metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>>>>>    3. There could be default "how to get to" information set
>>>>>>>>>>>>    at a global level
>>>>>>>>>>>>    4. Best-effort schema should be stored in the table properties
>>>>>>>>>>>>    in HMS. This should be done for information-schema retrieval
>>>>>>>>>>>>    purposes within Hive but should be ignored during Hive/other
>>>>>>>>>>>>    tool execution.
>>>>>>>>>>>>
>>>>>>>>>>>> Is that a fair summary of your statements, Ryan (except #4,
>>>>>>>>>>>> which I just added)?
>>>>>>>>>>>>
>>>>>>>>>>>> One comment I have on #2 is that, for different catalogs and use
>>>>>>>>>>>> cases, it can be somewhat more complex: it would be desirable for
>>>>>>>>>>>> a table that initially existed without Hive, and was later exposed
>>>>>>>>>>>> in Hive, to support a ptr/path/token for how the table is named
>>>>>>>>>>>> externally. For example, in a Nessie context we support arbitrary
>>>>>>>>>>>> paths for an Iceberg table (such as folder1.folder2.folder3.table1).
>>>>>>>>>>>> If you then want to expose that table to Hive, you might have this
>>>>>>>>>>>> mapping for #2:
>>>>>>>>>>>>
>>>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>>>
>>>>>>>>>>>> Similarly, you might want to expose a particular branch version
>>>>>>>>>>>> of a table. So it might say:
>>>>>>>>>>>>
>>>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>>>
>>>>>>>>>>>> Just saying that the address to the table in the catalog could
>>>>>>>>>>>> itself have several properties. The key point is that no matter
>>>>>>>>>>>> what those are, we should follow #1 and only store properties that
>>>>>>>>>>>> are about the ptr, not the content/metadata.
>>>>>>>>>>>>
>>>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can
>>>>>>>>>>>> someone confirm that it is true? And that it is possible/not
>>>>>>>>>>>> problematic?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Jacques Nadeau
>>>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>>>> rb...@netflix.com.invalid> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking about
>>>>>>>>>>>>> these problems as well, so this is a good time to have a 
>>>>>>>>>>>>> discussion about
>>>>>>>>>>>>> Hive config.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think that Hive configuration should work mostly like other
>>>>>>>>>>>>> engines, where different configurations are used for different 
>>>>>>>>>>>>> purposes.
>>>>>>>>>>>>> Different purposes means that there is not a global configuration 
>>>>>>>>>>>>> priority.
>>>>>>>>>>>>> Hopefully, I can explain how we use the different config sources 
>>>>>>>>>>>>> elsewhere
>>>>>>>>>>>>> to clarify.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a
>>>>>>>>>>>>> Hadoop Configuration, but it also has its own global 
>>>>>>>>>>>>> configuration. There
>>>>>>>>>>>>> are also Iceberg table properties, and all of the various Hive 
>>>>>>>>>>>>> properties
>>>>>>>>>>>>> if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The first step is to simplify where we can, so we effectively
>>>>>>>>>>>>> eliminate 2 sources of config:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - The Hadoop Configuration is only used to instantiate Hadoop
>>>>>>>>>>>>>    classes, like FileSystem. Iceberg should not use it for any
>>>>>>>>>>>>>    other config.
>>>>>>>>>>>>>    - Config in the Hive MetaStore is only used to identify that a
>>>>>>>>>>>>>    table is Iceberg and to point to its metadata location. All
>>>>>>>>>>>>>    other config in HMS is informational. For example, the input
>>>>>>>>>>>>>    format is FileInputFormat so that non-Iceberg readers cannot
>>>>>>>>>>>>>    actually instantiate the format (it’s abstract), but it is
>>>>>>>>>>>>>    available so they also don’t fail trying to load the class.
>>>>>>>>>>>>>    Table-specific config should not be stored in table or serde
>>>>>>>>>>>>>    properties.
>>>>>>>>>>>>>
>>>>>>>>>>>>> That leaves Spark configuration and Iceberg table
>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Iceberg differs from other tables because it is opinionated: data
>>>>>>>>>>>>> configuration should be maintained at the table level. This is
>>>>>>>>>>>>> cleaner for users because config is standardized across engines
>>>>>>>>>>>>> and kept in one place. It also enables services that analyze a
>>>>>>>>>>>>> table and update its configuration to tune options that users
>>>>>>>>>>>>> almost never touch, like row group or stripe size in the columnar
>>>>>>>>>>>>> formats. Iceberg table configuration is used to configure
>>>>>>>>>>>>> table-specific concerns and behavior.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Spark configuration is used for engine-specific concerns and
>>>>>>>>>>>>> runtime overrides. A good example of an engine-specific concern is
>>>>>>>>>>>>> the catalogs that are available to load Iceberg tables. Spark has
>>>>>>>>>>>>> a way to load and configure catalog implementations, and Iceberg
>>>>>>>>>>>>> uses that for all catalog-level config. Runtime overrides are
>>>>>>>>>>>>> things like target split size. Iceberg has a table-level default
>>>>>>>>>>>>> split size in table properties, but this can be overridden by a
>>>>>>>>>>>>> Spark option for each table, as well as by an option passed to the
>>>>>>>>>>>>> individual read. Note that these necessarily have different config
>>>>>>>>>>>>> names for how they are used: Iceberg uses read.split.target-size
>>>>>>>>>>>>> and the read-specific option is target-size.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Applying this to Hive is a little strange for a couple of
>>>>>>>>>>>>> reasons. First, Hive’s engine configuration *is* a Hadoop
>>>>>>>>>>>>> Configuration. As a result, I think the right place to store
>>>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs, using
>>>>>>>>>>>>> a strategy similar to what Spark does: which external Iceberg
>>>>>>>>>>>>> catalogs are available, and their configuration, should come from
>>>>>>>>>>>>> the HiveConf.
>>>>>>>>>>>>>
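>>>>>>>>>>>>> For example, defining two catalogs in the HiveConf might look
>>>>>>>>>>>>> like this (the key scheme is hypothetical; the exact property
>>>>>>>>>>>>> names are what we would need to specify):
>>>>>>>>>>>>>
>>>>>>>>>>>>> import org.apache.hadoop.hive.conf.HiveConf;
>>>>>>>>>>>>>
>>>>>>>>>>>>> class CatalogConfExample {
>>>>>>>>>>>>>   public static void main(String[] args) {
>>>>>>>>>>>>>     HiveConf conf = new HiveConf();
>>>>>>>>>>>>>     // a Hive catalog backed by an external metastore (made-up keys)
>>>>>>>>>>>>>     conf.set("iceberg.catalog.prod.type", "hive");
>>>>>>>>>>>>>     conf.set("iceberg.catalog.prod.uri", "thrift://metastore:9083");
>>>>>>>>>>>>>     // a Hadoop catalog rooted at a warehouse path (made-up keys)
>>>>>>>>>>>>>     conf.set("iceberg.catalog.lake.type", "hadoop");
>>>>>>>>>>>>>     conf.set("iceberg.catalog.lake.warehouse", "hdfs://nn:8020/warehouse");
>>>>>>>>>>>>>   }
>>>>>>>>>>>>> }
>>>>>>>>>>>>>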
>>>>>>>>>>>>> The second way Hive is strange is that Hive needs to use its
>>>>>>>>>>>>> own MetaStore to track Hive table concerns. The MetaStore may 
>>>>>>>>>>>>> have tables
>>>>>>>>>>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able 
>>>>>>>>>>>>> to load
>>>>>>>>>>>>> tables from other Iceberg catalogs by creating table entries for 
>>>>>>>>>>>>> them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - There should be a default HiveCatalog, using the current
>>>>>>>>>>>>>    MetaStore URI, for HiveCatalog tables tracked in the MetaStore
>>>>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>>>>    - HMS table properties should be used to determine how to load
>>>>>>>>>>>>>    a table: using a Hadoop location, using the default metastore
>>>>>>>>>>>>>    catalog, or using an external Iceberg catalog (see the sketch
>>>>>>>>>>>>>    after this list)
>>>>>>>>>>>>>       - If there is a metadata_location, then use the HiveCatalog
>>>>>>>>>>>>>       for this metastore (where it is tracked)
>>>>>>>>>>>>>       - If there is a catalog property, then load that catalog and
>>>>>>>>>>>>>       use it to load the table identifier, or maybe an identifier
>>>>>>>>>>>>>       from HMS table properties
>>>>>>>>>>>>>       - If there is no catalog or metadata_location, then use
>>>>>>>>>>>>>       HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>>>
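>>>>>>>>>>>>> In rough pseudo-Java, the resolution order above would be (the
>>>>>>>>>>>>> helper methods are illustrative, not existing code):
>>>>>>>>>>>>>
>>>>>>>>>>>>> import java.util.Map;
>>>>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>>>>> import org.apache.iceberg.Table;
>>>>>>>>>>>>> import org.apache.iceberg.catalog.Catalog;
>>>>>>>>>>>>> import org.apache.iceberg.catalog.TableIdentifier;
>>>>>>>>>>>>> import org.apache.iceberg.hadoop.HadoopTables;
>>>>>>>>>>>>>
>>>>>>>>>>>>> abstract class TableResolver {
>>>>>>>>>>>>>   abstract Catalog defaultHiveCatalog();          // this metastore
>>>>>>>>>>>>>   abstract Catalog catalogFromConf(String name);  // from HiveConf
>>>>>>>>>>>>>
>>>>>>>>>>>>>   Table resolve(Configuration conf, Map<String, String> props,
>>>>>>>>>>>>>                 String db, String tbl) {
>>>>>>>>>>>>>     if (props.containsKey("metadata_location")) {
>>>>>>>>>>>>>       // tracked by this metastore: use the default HiveCatalog
>>>>>>>>>>>>>       return defaultHiveCatalog().loadTable(TableIdentifier.of(db, tbl));
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>     if (props.containsKey("iceberg.catalog")) {
>>>>>>>>>>>>>       // external catalog; HMS may override the identifier
>>>>>>>>>>>>>       String id = props.getOrDefault(
>>>>>>>>>>>>>           "iceberg.table_identifier", db + "." + tbl);
>>>>>>>>>>>>>       Catalog catalog = catalogFromConf(props.get("iceberg.catalog"));
>>>>>>>>>>>>>       return catalog.loadTable(TableIdentifier.parse(id));
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>     // neither: treat the table LOCATION as a HadoopTables path
>>>>>>>>>>>>>     return new HadoopTables(conf).load(props.get("location"));
>>>>>>>>>>>>>   }
>>>>>>>>>>>>> }
>>>>>>>>>>>>>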
>>>>>>>>>>>>> This would make it possible to access all types of Iceberg tables
>>>>>>>>>>>>> in the same query, and would match how Spark and Flink configure
>>>>>>>>>>>>> catalogs. Other than the configuration above, I don’t think that
>>>>>>>>>>>>> config in HMS should be used at all, like how the other engines
>>>>>>>>>>>>> work. Iceberg is the source of truth for table metadata, HMS
>>>>>>>>>>>>> stores how to load the Iceberg table, and HiveConf defines the
>>>>>>>>>>>>> catalogs (or runtime overrides).
>>>>>>>>>>>>>
>>>>>>>>>>>>> This isn’t quite how configuration works right now. Currently,
>>>>>>>>>>>>> the catalog is controlled by a HiveConf property,
>>>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will be used
>>>>>>>>>>>>> to load table locations. If it is set, then that catalog will be
>>>>>>>>>>>>> used to load all tables by name. This makes it impossible to load
>>>>>>>>>>>>> tables from different catalogs at the same time. That’s why I
>>>>>>>>>>>>> think the Iceberg catalog for a table should be stored in HMS
>>>>>>>>>>>>> table properties.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I should also explain the iceberg.hive.engine.enabled flag, but I
>>>>>>>>>>>>> think this is long enough for now.
>>>>>>>>>>>>>
>>>>>>>>>>>>> rb
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>>>> lpin...@cloudera.com.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to start a discussion, how should we handle
>>>>>>>>>>>>>> properties from various sources like Iceberg, Hive or global 
>>>>>>>>>>>>>> configuration.
>>>>>>>>>>>>>> I've put together a short document
>>>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ryan Blue
>>>>>>>>>>>>> Software Engineer
>>>>>>>>>>>>> Netflix
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
