Hi, I’d like to get some thoughts on a Hive metastore system/feature that we’ve been toying with to provide greater independence and flexibility to our data teams when operating in cloud environments. It can probably best be thought of as a federation layer for metastores, but from the user’s perspective it’s more similar to the external or remote table features found in some traditional RDBMSes.
For some time I've been involved in projects to move data processing functions from a single, large, on-premises cluster to smaller, team-owned clusters in the cloud. While this has had many benefits, it has reinforced our reliance on the Hive metastore. We use it as the source of truth for descriptions of our data, and as a directory to help us locate it. Additionally, in the cloud it adds a layer of consistency to data stored on eventually consistent file stores. Finally, it serves as a broadly supported integration point for the many data processing frameworks that we might choose to use.

The problem we may face is that, in line with a distributed, self-service ethos, teams will spin up their own metastores and unintentionally create isolated silos of data. Naturally we often want to share datasets, and this arrangement, for all its benefits, is a technological barrier to that.

To solve this problem we've been experimenting with a federated metastore. Quite simply, this is a service that presents a metastore Thrift API and routes requests to different metastores based on mappings derived from the database and table names. By default we provide a companion 'local' metastore instance that serves as the cluster's primary read/write metadata store. Users can then add read-only mappings to 'external' metastores using database name prefixes, such as:

'extdb_' → thrift://external.metastore:9038/
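To make the routing concrete, here's a rough sketch in Java of the kind of prefix lookup involved. This is purely illustrative and not our actual code: the class and method names are made up, and it assumes the prefix is stripped from the database name before the request is forwarded on (so 'extdb_etl' is seen as 'etl' by the external metastore).

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: resolves a database name to a target
// metastore based on configured prefix mappings.
public class MetastoreRouter {

    // The result of routing a database name: where to send the request,
    // the database name as the target metastore knows it, and whether
    // writes should be rejected.
    public static final class Route {
        final String metastoreUri;
        final String databaseName;
        final boolean readOnly;

        Route(String metastoreUri, String databaseName, boolean readOnly) {
            this.metastoreUri = metastoreUri;
            this.databaseName = databaseName;
            this.readOnly = readOnly;
        }
    }

    private final String localUri;
    // Ordered prefix -> Thrift URI mappings for read-only external
    // metastores; insertion order decides which prefix wins on overlap.
    private final Map<String, String> externalPrefixes = new LinkedHashMap<>();

    public MetastoreRouter(String localUri) {
        this.localUri = localUri;
    }

    public void addExternalMapping(String prefix, String thriftUri) {
        externalPrefixes.put(prefix, thriftUri);
    }

    // A matching prefix sends the request (prefix stripped) to the
    // mapped external metastore; anything else falls through to the
    // local read/write metastore.
    public Route route(String databaseName) {
        for (Map.Entry<String, String> e : externalPrefixes.entrySet()) {
            if (databaseName.startsWith(e.getKey())) {
                String stripped = databaseName.substring(e.getKey().length());
                return new Route(e.getValue(), stripped, true);
            }
        }
        return new Route(localUri, databaseName, false);
    }

    public static void main(String[] args) {
        MetastoreRouter router = new MetastoreRouter("thrift://localhost:9083/");
        router.addExternalMapping("extdb_", "thrift://external.metastore:9038/");

        Route remote = router.route("extdb_etl"); // external metastore, db 'etl', read-only
        Route local = router.route("etl");        // local metastore, db 'etl', read/write
        System.out.println(remote.metastoreUri + " " + remote.databaseName
                + " readOnly=" + remote.readOnly);
        System.out.println(local.metastoreUri + " " + local.databaseName
                + " readOnly=" + local.readOnly);
    }
}

In our service this kind of resolution sits in front of delegate Thrift clients, with write operations presumably rejected on read-only routes.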
With this implementation we've been able to perform joins between tables in different metastores and query 'remote' tables as if they were local. Conceptually this architecture can allow individual teams to take full ownership of the publishing and maintenance of their datasets while being free to share them with other teams or divisions. Additionally, the costs involved in transporting and processing the data downstream are borne by the consumer, not the owner, which seems fair. Finally, there is no requirement for a centralised team to run and manage a single organisation-wide metastore.

Below is an example of a cross-metastore query:

hive> show databases;
OK
default
etl        -- database in local metastore
extdb_etl  -- database in thrift://external.metastore:9038/

-- Example of a query joining data from a 'local' db called 'etl'
-- and a 'remote' db called 'extdb_etl':
hive> select l.id, r.name
    > from etl.local_table l
    > join extdb_etl.remote_table r on (r.id = l.id)
    > where l.load_date = '2016-05-13';

To get to this point we've had to side-step some of the trickier issues, such as authentication (we simply turned it off for now!) and compatibility across different metastore versions (we've stuck with one version only). Clearly these would need to be addressed if we were to use this in the real world.

I see that this recent Hortonworks blog post (http://hortonworks.com/blog/making-elephant-fly-cloud/) alludes to issues of 'Shared Metadata and Governance', and perhaps the role of the metastore in this regard. I'm therefore wondering where to take this next, as I can envisage a number of possible forms this feature could take:

1. A separate, stand-alone federation service that sits between Hive clients and metastore instances.

2. Metastore federation added as a feature to the current Hive metastore. Similar to 1, but integrated into the current hive-metastore module.

3. Support for remote tables added to Hive. This, while similar in implementation to 2, might provide a user experience consistent with that of configuring external tables in traditional RDBMSes:

CREATE REMOTE TABLE my_table
AS their_database.their_table
ON SERVER 'thrift://external.metastore:9083/'

I'd appreciate any thoughts or suggestions.

Thanks,
Elliot.