Would someone be willing to create a document that states the current proposal?
It is becoming somewhat difficult to follow this thread. I also worry that without a complete statement of the current shape, people may incorrectly think they are in alignment.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <borokna...@cloudera.com> wrote:

> Thanks, Ryan. I answered inline.
>
> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> This sounds like a good plan overall, but I have a couple of notes:
>>
>> 1. We need to keep in mind that users plug in their own catalogs, so iceberg.catalog could be a Glue or Nessie catalog, not just Hive or Hadoop. I don’t think it makes much sense to use separate hadoop.catalog and hive.catalog values. Those should just be names for catalogs configured in Configuration, i.e., via hive-site.xml. We then only need a special value for loading Hadoop tables from paths.
>
> About extensibility, I think the usual Hive way is to use Java class names. That way the value for 'iceberg.catalog' could be e.g. 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation would need to have a factory method that constructs the catalog object from a properties object (Map<String, String>). E.g. 'org.apache.iceberg.hadoop.HadoopCatalog' would require 'iceberg.catalog_location' to be present in the properties.
>
>> 2. I don’t think that catalog configuration should be kept in table properties. A catalog should not be loaded for each table. So I don’t think we need iceberg.catalog_location. Instead, we should have a way to define catalogs in the Configuration for tables in the metastore to reference.
>
> I think that makes sense; on the other hand, it would make adding new catalogs more heavyweight, i.e. you'd now need to edit configuration files and restart/reinitialize services. That could be cumbersome in some environments.
>
>> 3. I’d rather use a prefix to exclude properties from being passed to Iceberg than to include them. Otherwise, users don’t know what to do to pass table properties from Hive or Impala. If we exclude a prefix or specific properties, then everything but the properties reserved for locating the table is passed as the user would expect.
>
> I don't have a strong opinion about this, but yeah, maybe this behavior would cause the fewest surprises.
>
>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <borokna...@apache.org> wrote:
>>
>>> Thanks, Peter. I answered inline.
>>>
>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>>
>>>> Hi Zoltan,
>>>>
>>>> Answers below:
>>>>
>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <borokna...@cloudera.com.INVALID> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Thanks for the replies. My take on the above questions is as follows:
>>>>
>>>> - Should 'iceberg.catalog' be a required property?
>>>>   - Yeah, I think it would be nice if this were required, to avoid any implicit behavior.
>>>>
>>>> Currently we have a Catalogs class to get/initialize/use the different Catalogs. At that time the decision was to use HadoopTables as the default catalog. It might be worthwhile to use the same class in Impala as well, so the behavior is consistent.
>>>
>>> Yeah, I think it'd be beneficial for us to use the Iceberg classes whenever possible. The Catalogs class is very similar to what we have currently in Impala.
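>>> To illustrate the class-name idea from above, a rough sketch (the static create(Map<String, String>) factory is an assumed convention for this proposal, not an existing Iceberg API):
>>>
>>> import java.util.Map;
>>> import org.apache.iceberg.catalog.Catalog;
>>>
>>> // Load a Catalog from the implementation class name stored in
>>> // 'iceberg.catalog', assuming each implementation provides a
>>> // static create(Map<String, String>) factory method.
>>> static Catalog loadCatalog(Map<String, String> props) throws Exception {
>>>   String impl = props.get("iceberg.catalog"); // e.g. org.apache.iceberg.hive.HiveCatalog
>>>   return (Catalog) Class.forName(impl)
>>>       .getMethod("create", Map.class)
>>>       .invoke(null, props);
>>> }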
>>>> - 'hadoop.catalog' LOCATION and catalog_location
>>>>   - In Impala we don't allow setting LOCATION for tables stored in 'hadoop.catalog'. But Impala internally sets LOCATION to the Iceberg table's actual location. We were also thinking about using only the table LOCATION and setting it to the catalog location, but we found that confusing as well.
>>>>
>>>> It could definitely work, but it is somewhat strange that we have an external table location set to an arbitrary path, and we have a different location generated by other configs. It would be nice to have the real location set in the external table location as well.
>>>
>>> Impala sets the real Iceberg table location for external tables. E.g. if the user issues
>>>
>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>> STORED AS ICEBERG
>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>
>>> If the end user had specified LOCATION, then Impala would have raised an error. But the above DDL statement is correct, so Impala loads the Iceberg table via the Iceberg API, then creates the HMS table and sets LOCATION to the Iceberg table location (something like /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
>>>
>>>> I like the flexibility of setting the table_identifier at the table level, which could help avoid naming conflicts. We might want to have this in the Iceberg Catalog implementation.
>>>>
>>>> - 'iceberg.table_identifier' for HiveCatalog
>>>>   - Yeah, it doesn't add much if we only allow using the current HMS. I think it can only be useful if we allow external HMSes.
>>>> - Moving properties to SERDEPROPERTIES
>>>>   - I see that these properties are used by the SerDe classes in Hive, but I feel that these properties are just not about serialization and deserialization. As I see it, the current SERDEPROPERTIES are things like 'field.delim', 'separatorChar', 'quoteChar', etc., so properties about table loading more naturally belong in TBLPROPERTIES in my opinion.
>>>>
>>>> I have seen it used both ways for HBaseSerDe (even the wiki page uses both :) ). Since Impala prefers TBLPROPERTIES, and if we start using a prefix for separating real Iceberg table properties from other properties, then we can keep them in TBLPROPERTIES.
>>>
>>> In the google doc I also had a comment about prefixing Iceberg table properties. We could use a prefix like 'iceberg.tblproperties.', and pass every property with this prefix to the Iceberg table. Currently Impala passes every table property to the Iceberg table.
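>>> As a sketch, the include-prefix variant could look roughly like this (names are illustrative; Ryan's exclude variant would instead copy everything except the reserved table-loading keys):
>>>
>>> import java.util.HashMap;
>>> import java.util.Map;
>>>
>>> // Copy HMS table properties carrying the 'iceberg.tblproperties.'
>>> // prefix into Iceberg table properties, stripping the prefix.
>>> static Map<String, String> icebergProperties(Map<String, String> hmsProps) {
>>>   String prefix = "iceberg.tblproperties.";
>>>   Map<String, String> result = new HashMap<>();
>>>   for (Map.Entry<String, String> e : hmsProps.entrySet()) {
>>>     if (e.getKey().startsWith(prefix)) {
>>>       result.put(e.getKey().substring(prefix.length()), e.getValue());
>>>     }
>>>   }
>>>   return result;
>>> }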
>>>>
>>>> Thanks,
>>>> Zoltan
>>>>
>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Based on the discussion below I understand we have the following kinds of properties:
>>>>>
>>>>> 1. Iceberg table properties - Engine-independent, storage-related parameters.
>>>>> 2. "How to get to" properties - I think these are mostly Hive-specific table properties, since for Spark, the Spark catalog configuration serves the same purpose. I think the best place for storing these would be the Hive SERDEPROPERTIES, as this describes the access information for the SerDe. Sidenote: I think we should decide whether we allow HiveCatalogs pointing to a different HMS; 'iceberg.table_identifier' would make sense only if we allow having multiple catalogs.
>>>>> 3. Query-specific properties - These are engine-specific and might be mapped to / even override the Iceberg table properties on the engine-specific code paths, but currently these properties have independent names and are mapped on a case-by-case basis.
>>>>>
>>>>> Based on this:
>>>>>
>>>>> - Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>>>> - Shall we define a prefix for setting Iceberg table properties from Hive queries and omitting other engine-specific properties?
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <massdos...@gmail.com> wrote:
>>>>>
>>>>> I like these suggestions, comments inline below on the last round...
>>>>>
>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <borokna...@apache.org> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> The above aligns with what we did in Impala, i.e. we store information about table loading in HMS table properties. We are just a bit more explicit about which catalog to use.
>>>>>> We have the table property 'iceberg.catalog' to determine the catalog type; right now the supported values are 'hadoop.tables', 'hadoop.catalog', and 'hive.catalog'. Additional table properties can be set based on the catalog type.
>>>>>>
>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>
>>>>> I'm all for renaming this, having "mr" in the property name is confusing.
>>>>>
>>>>>> - hadoop.tables
>>>>>>   - the table location is used to load the table
>>>>>
>>>>> The only question I have is: should we have this as the default? I.e. if you don't set a catalog it will assume it's HadoopTables and use the location? Or should we require this property to be present, to be consistent and avoid any "magic"?
>>>>>
>>>>>> - hadoop.catalog
>>>>>>   - Required table property 'iceberg.catalog_location' specifies the location of the Hadoop catalog in the file system.
>>>>>>   - Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
>>>>>
>>>>> I like this as it would allow you to use a different database and table name in Hive as opposed to the Hadoop catalog - at the moment they have to match. The only thing here is that I think Hive requires a table LOCATION to be set, and it's then confusing as there are now two locations on the table. I'm not sure whether in the Hive storage handler or SerDe etc. we can get Hive to not require that and maybe even disallow it from being set. That would probably be best in conjunction with this. Another solution would be to not have the 'iceberg.catalog_location' property but instead use the table LOCATION for this, but that's a bit confusing from a Hive point of view.
>>>>>
>>>>>> - hive.catalog
>>>>>>   - Optional table property 'iceberg.table_identifier' specifies the table id. If it's not set, then <database_name>.<table_name> is used as the table identifier.
>>>>>>   - We have the assumption that the current Hive metastore stores the table, i.e. we don't support external Hive metastores currently.
>>>>>
>>>>> These sound fine for Hive catalog tables that are created outside of the automatic Hive table creation (see https://iceberg.apache.org/hive/ -> Using Hive Catalog); we'd just need to document how you can create these yourself and that one could use a different Hive database and table etc.
>>>>>
>>>>>> Independent of the catalog implementations, we also have the table property 'iceberg.file_format' to specify the file format for the data files.
>>>>>
>>>>> OK, I don't think we need that for Hive?
>>>>>
>>>>>> We haven't released it yet, so we are open to changes, but I think these properties are reasonable and it would be great if we could standardize the properties across engines that use HMS as the primary metastore of tables.
>>>>>
>>>>> If others agree I think we should create an issue where we document the above changes so it's very clear what we're doing, and we can then go and implement them and update the docs etc.
>>>>>
>>>>>> Cheers,
>>>>>> Zoltan
>>>>>>
>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>
>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>
>>>>>>> #4 is correct because we provide some information that is informational (Hive schema) or tracked only by the metastore (best-effort current user). I also agree that it would be good to have a table identifier in HMS table metadata when loading from an external table. That gives us a way to handle name conflicts.
>>>>>>>
>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>>
>>>>>>>> Minor error, my last example should have been:
>>>>>>>>
>>>>>>>> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jacques Nadeau
>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>
>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>>>>>>>>
>>>>>>>>> I agree with Ryan on the core principles here. As I understand them:
>>>>>>>>>
>>>>>>>>> 1. Iceberg metadata describes all properties of a table
>>>>>>>>> 2. Hive table properties describe "how to get to" Iceberg metadata (which catalog + possibly ptr, path, token, etc)
>>>>>>>>> 3. There could be default "how to get to" information set at a global level
>>>>>>>>> 4. Best-effort schema should be stored in the table properties in HMS. This should be done for information schema retrieval purposes within Hive but should be ignored during Hive/other tool execution.
>>>>>>>>>
>>>>>>>>> Is that a fair summary of your statements, Ryan (except 4, which I just added)?
>>>>>>>>>
>>>>>>>>> One comment I have on #2 is that for different catalogs and use cases, it can be somewhat more complex: it would be desirable for a table that initially existed without Hive and was later exposed in Hive to support a ptr/path/token for how the table is named externally. For example, in a Nessie context we support arbitrary paths for an Iceberg table (such as folder1.folder2.folder3.table1).
>>>>>>>>> If you then want to expose that table to Hive, you might have this mapping for #2:
>>>>>>>>>
>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>
>>>>>>>>> Similarly, you might want to expose a particular branch version of a table. So it might say:
>>>>>>>>>
>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>
>>>>>>>>> Just saying that the address to the table in the catalog could itself have several properties. The key being that no matter what those are, we should follow #1 and only store properties that are about the ptr, not the content/metadata.
>>>>>>>>>
>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can someone confirm that it is true? And that it is possible/not problematic?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jacques Nadeau
>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>
>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking about these problems as well, so this is a good time to have a discussion about Hive config.
>>>>>>>>>>
>>>>>>>>>> I think that Hive configuration should work mostly like other engines, where different configurations are used for different purposes. Different purposes mean that there is not a global configuration priority. Hopefully, I can explain how we use the different config sources elsewhere to clarify.
>>>>>>>>>>
>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop Configuration, but it also has its own global configuration. There are also Iceberg table properties, and all of the various Hive properties if you’re tracking tables with a Hive MetaStore.
>>>>>>>>>>
>>>>>>>>>> The first step is to simplify where we can, so we effectively eliminate 2 sources of config:
>>>>>>>>>>
>>>>>>>>>> - The Hadoop Configuration is only used to instantiate Hadoop classes, like FileSystem. Iceberg should not use it for any other config.
>>>>>>>>>> - Config in the Hive MetaStore is only used to identify that a table is Iceberg and point to its metadata location. All other config in HMS is informational. For example, the input format is FileInputFormat so that non-Iceberg readers cannot actually instantiate the format (it’s abstract), but it is available so they also don’t fail trying to load the class. Table-specific config should not be stored in table or serde properties.
>>>>>>>>>>
>>>>>>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>>>>>>>
>>>>>>>>>> Iceberg differs from other tables because it is opinionated: data configuration should be maintained at the table level. This is cleaner for users because config is standardized across engines and in one place. And it also enables services that analyze a table and update its configuration to tune options that users almost never do, like row group or stripe size in the columnar formats. Iceberg table configuration is used to configure table-specific concerns and behavior.
>>>>>>>>>>
>>>>>>>>>> Spark configuration is used for engine-specific concerns and runtime overrides. A good example of an engine-specific concern is the catalogs that are available to load Iceberg tables. Spark has a way to load and configure catalog implementations, and Iceberg uses that for all catalog-level config. Runtime overrides are things like target split size. Iceberg has a table-level default split size in table properties, but this can be overridden by a Spark option for each table, as well as an option passed to the individual read. Note that these necessarily have different config names for how they are used: Iceberg uses read.split.target-size and the read-specific option is target-size.
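>>>>>>>>>> To make that precedence concrete, a small sketch (the table and readOptions variables and the 128 MB fallback are illustrative):
>>>>>>>>>>
>>>>>>>>>> // A per-read option overrides the table property, which supplies the default.
>>>>>>>>>> long tableDefault = org.apache.iceberg.util.PropertyUtil.propertyAsLong(
>>>>>>>>>>     table.properties(), "read.split.target-size", 128L * 1024 * 1024);
>>>>>>>>>> long targetSplitSize = readOptions.containsKey("target-size")
>>>>>>>>>>     ? Long.parseLong(readOptions.get("target-size"))
>>>>>>>>>>     : tableDefault;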
>>>>>>>>>>
>>>>>>>>>> Applying this to Hive is a little strange for a couple of reasons. First, Hive’s engine configuration *is* a Hadoop Configuration. As a result, I think the right place to store engine-specific config is there, including Iceberg catalogs, using a strategy similar to what Spark does: what external Iceberg catalogs are available and their configuration should come from the HiveConf.
>>>>>>>>>>
>>>>>>>>>> The second way Hive is strange is that Hive needs to use its own MetaStore to track Hive table concerns. The MetaStore may have tables created by an Iceberg HiveCatalog, and Hive also needs to be able to load tables from other Iceberg catalogs by creating table entries for them.
>>>>>>>>>>
>>>>>>>>>> Here’s how I think Hive should work:
>>>>>>>>>>
>>>>>>>>>> - There should be a default HiveCatalog that uses the current MetaStore URI, to be used for HiveCatalog tables tracked in the MetaStore
>>>>>>>>>> - Other catalogs should be defined in HiveConf
>>>>>>>>>> - HMS table properties should be used to determine how to load a table: using a Hadoop location, using the default metastore catalog, or using an external Iceberg catalog
>>>>>>>>>>   - If there is a metadata_location, then use the HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>   - If there is a catalog property, then load that catalog and use it to load the table identifier, or maybe an identifier from HMS table properties
>>>>>>>>>>   - If there is no catalog or metadata_location, then use HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>
>>>>>>>>>> This would make it possible to access all types of Iceberg tables in the same query, and would match how Spark and Flink configure catalogs. Other than the configuration above, I don’t think that config in HMS should be used at all, like how the other engines work. Iceberg is the source of truth for table metadata, HMS stores how to load the Iceberg table, and HiveConf defines the catalogs (or runtime overrides). A rough sketch of this loading logic follows.
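>>>>>>>>>> The sketch below is illustrative only: the property names and the defaultHiveCatalog/catalogFromConf helpers are assumptions, not an existing API.
>>>>>>>>>>
>>>>>>>>>> import java.util.Map;
>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>> import org.apache.iceberg.Table;
>>>>>>>>>> import org.apache.iceberg.catalog.Catalog;
>>>>>>>>>> import org.apache.iceberg.catalog.TableIdentifier;
>>>>>>>>>> import org.apache.iceberg.hadoop.HadoopTables;
>>>>>>>>>>
>>>>>>>>>> // Decide how to load an Iceberg table from its HMS entry.
>>>>>>>>>> static Table loadTable(Configuration conf, org.apache.hadoop.hive.metastore.api.Table hmsTable) {
>>>>>>>>>>   Map<String, String> props = hmsTable.getParameters();
>>>>>>>>>>   if (props.containsKey("metadata_location")) {
>>>>>>>>>>     // Tracked by this metastore: use the default HiveCatalog.
>>>>>>>>>>     return defaultHiveCatalog(conf).loadTable(
>>>>>>>>>>         TableIdentifier.of(hmsTable.getDbName(), hmsTable.getTableName()));
>>>>>>>>>>   } else if (props.containsKey("iceberg.catalog")) {
>>>>>>>>>>     // An external catalog defined in HiveConf.
>>>>>>>>>>     Catalog catalog = catalogFromConf(conf, props.get("iceberg.catalog"));
>>>>>>>>>>     String id = props.getOrDefault("iceberg.table_identifier",
>>>>>>>>>>         hmsTable.getDbName() + "." + hmsTable.getTableName());
>>>>>>>>>>     return catalog.loadTable(TableIdentifier.parse(id));
>>>>>>>>>>   } else {
>>>>>>>>>>     // No catalog and no metadata_location: treat the location as a Hadoop table.
>>>>>>>>>>     return new HadoopTables(conf).load(hmsTable.getSd().getLocation());
>>>>>>>>>>   }
>>>>>>>>>> }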
>>>>>>>>>> This isn’t quite how configuration works right now. Currently, the catalog is controlled by a HiveConf property, iceberg.mr.catalog. If that isn’t set, HadoopTables will be used to load table locations. If it is set, then that catalog will be used to load all tables by name. This makes it impossible to load tables from different catalogs at the same time. That’s why I think the Iceberg catalog for a table should be stored in HMS table properties.
>>>>>>>>>>
>>>>>>>>>> I should also explain the iceberg.hive.engine.enabled flag, but I think this is long enough for now.
>>>>>>>>>>
>>>>>>>>>> rb
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <lpin...@cloudera.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> I would like to start a discussion about how we should handle properties from various sources like Iceberg, Hive, or global configuration. I've put together a short document <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>, please have a look and let me know what you think.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Laszlo
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Software Engineer
>>>>>>>>>> Netflix
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix