Hi, I’d like to get some thoughts on a Hive metastore system/feature that we’ve been toying with to provide greater independence and flexibility to our data teams when operating in cloud environments. It can probably best be thought of as a federation layer for metastores, but from the user’s perspective it’s more similar to the external or remote table features found in some traditional RDBMSes.
For some time I've been involved in projects to move data processing functions from a single, large, on-premises cluster to smaller, team-owned clusters in the cloud. While this has had many benefits, it has reinforced our reliance on the Hive metastore. We use it as the source of truth for descriptions of our data, and as a directory to help us locate it. Additionally, in the cloud it adds a layer of consistency to data stored on eventually consistent file stores. Finally, it serves as a broadly supported integration point for the many data processing frameworks that we might choose to use.

The problem we may face is that, in line with a distributed, self-service ethos, teams will spin up their own metastores and unintentionally create isolated silos of data. Naturally we often want to share datasets, and this arrangement, for all its benefits, is a technological barrier to that.

To solve this problem we've been experimenting with a federated metastore. Quite simply, this is a service that presents a metastore Thrift API and routes requests to different metastores based on mappings derived from the database and table names. By default we provide a companion 'local' metastore instance that serves as the cluster's primary read/write metadata store. Users can then add read-only mappings to 'external' metastores using database name prefixes, such as:

'extdb_' → thrift://external.metastore:9038/
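To make the routing concrete, here's a rough sketch in Java of the kind of prefix lookup involved. This is purely illustrative and not our actual code: the class and method names are made up, and it assumes the prefix is stripped from the database name before the request is forwarded on (so 'extdb_etl' is seen as 'etl' by the external metastore).

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: resolves a database name to a target
// metastore based on configured prefix mappings.
public class MetastoreRouter {

    // The result of routing a database name: where to send the request,
    // the database name as the target metastore knows it, and whether
    // writes should be rejected.
    public static final class Route {
        final String metastoreUri;
        final String databaseName;
        final boolean readOnly;

        Route(String metastoreUri, String databaseName, boolean readOnly) {
            this.metastoreUri = metastoreUri;
            this.databaseName = databaseName;
            this.readOnly = readOnly;
        }
    }

    private final String localUri;
    // Ordered prefix -> Thrift URI mappings for read-only external
    // metastores; insertion order decides which prefix wins on overlap.
    private final Map<String, String> externalPrefixes = new LinkedHashMap<>();

    public MetastoreRouter(String localUri) {
        this.localUri = localUri;
    }

    public void addExternalMapping(String prefix, String thriftUri) {
        externalPrefixes.put(prefix, thriftUri);
    }

    // A matching prefix sends the request (prefix stripped) to the
    // mapped external metastore; anything else falls through to the
    // local read/write metastore.
    public Route route(String databaseName) {
        for (Map.Entry<String, String> e : externalPrefixes.entrySet()) {
            if (databaseName.startsWith(e.getKey())) {
                String stripped = databaseName.substring(e.getKey().length());
                return new Route(e.getValue(), stripped, true);
            }
        }
        return new Route(localUri, databaseName, false);
    }

    public static void main(String[] args) {
        MetastoreRouter router = new MetastoreRouter("thrift://localhost:9083/");
        router.addExternalMapping("extdb_", "thrift://external.metastore:9038/");

        Route remote = router.route("extdb_etl"); // external metastore, db 'etl', read-only
        Route local = router.route("etl");        // local metastore, db 'etl', read/write
        System.out.println(remote.metastoreUri + " " + remote.databaseName
                + " readOnly=" + remote.readOnly);
        System.out.println(local.metastoreUri + " " + local.databaseName
                + " readOnly=" + local.readOnly);
    }
}

In our service this kind of resolution sits in front of delegate Thrift clients, with write operations presumably rejected on read-only routes.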
With this implementation we've been able to perform joins between tables in different metastores and query 'remote' tables as if they were local. Conceptually this architecture can allow individual teams to take full ownership of the publishing and maintenance of their datasets while being free to share them with other teams or divisions. Additionally, the costs involved in transporting and processing the data downstream are borne by the consumer, not the owner, which seems fair. Finally, there is no requirement for a centralised team to run and manage a single organisation-wide metastore.

Below is an example of a cross-metastore query:

hive> show databases;
OK
default
etl        -- database in local metastore
extdb_etl  -- database in thrift://external.metastore:9038/

-- Example of a query joining data from a 'local' db called 'etl'
-- and a 'remote' db called 'extdb_etl':
hive> select l.id, r.name
    > from etl.local_table l
    > join extdb_etl.remote_table r on (r.id = l.id)
    > where l.load_date = '2016-05-13';

To get to this point we've had to side-step some of the trickier issues, such as authentication (we simply turned it off for now!) and compatibility across different metastore versions (we've stuck with one version only). Clearly these would need to be addressed if we were to use this in the real world.

I see that this recent Hortonworks blog post (http://hortonworks.com/blog/making-elephant-fly-cloud/) alludes to issues of 'Shared Metadata and Governance', and perhaps the role of the metastore in this regard. I'm therefore wondering where to take this next, as I can envisage a number of possible forms this feature could take:

1. A separate, stand-alone federation service that sits between Hive clients and metastore instances.

2. Metastore federation added as a feature to the current Hive metastore. Similar to 1, but integrated into the current hive-metastore module.

3. Support for remote tables added to Hive. This, while similar in implementation to 2, might provide a user experience consistent with that of configuring external tables in traditional RDBMSes:

CREATE REMOTE TABLE my_table
AS their_database.their_table
ON SERVER 'thrift://external.metastore:9083/'

I'd appreciate any thoughts or suggestions.

Thanks,
Elliot.