Re: Hive federation service

Elliot West Thu, 27 Jul 2017 08:47:27 -0700

Hi Carter,

1. In theory, under certain conditions, I believe so (note that my
Kerberos experience is very limited). Waggle Dance will simply forward the
Kerberos principals presented to the client. If these principals are
meaningful to the target cluster then my assumption is that this would
work. In our case not all of our target platforms support kerberos yet, so
it's not something we've tried.

2. You are correct in identifying the layers in the namespace
hierarchy. As Hive has no cluster representation, we flatten the cluster
and database coordinates into a single database name coordinate. At first
we simply applied a cluster specific prefix to all database names. However,
we found in our case that database name collisions were very infrequent
('default' being a notable exception), and that users disliked updating
database name references in existing HQL. Therefore we now also provide an
alternative strategy that only applies the cluster prefix to database names
that are overloaded in the context of the hierarchy.
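
As an illustration only, the two naming strategies described above can be
sketched as follows. This is not Waggle Dance's actual implementation: the
function names, the cluster names, and the underscore separator are all
hypothetical, chosen just to make the idea concrete.

```python
from collections import Counter


def prefix_all(clusters):
    """Strategy 1: prefix every database name with its cluster name.

    `clusters` maps a (hypothetical) cluster name to its database names.
    Returns a dict from federated name to (cluster, original database).
    """
    return {
        f"{cluster}_{db}": (cluster, db)
        for cluster, dbs in clusters.items()
        for db in set(dbs)
    }


def prefix_on_collision(clusters):
    """Strategy 2: prefix only names that occur in more than one cluster."""
    # Count in how many clusters each database name appears.
    counts = Counter(db for dbs in clusters.values() for db in set(dbs))
    mapping = {}
    for cluster, dbs in clusters.items():
        for db in set(dbs):
            # Overloaded names get the cluster prefix; unique names are kept.
            name = f"{cluster}_{db}" if counts[db] > 1 else db
            mapping[name] = (cluster, db)
    return mapping


# Hypothetical example: two clusters that both have a 'default' database.
clusters = {
    "onprem": ["default", "sales"],
    "cloud": ["default", "bookings"],
}

if __name__ == "__main__":
    print(sorted(prefix_all(clusters)))
    print(sorted(prefix_on_collision(clusters)))
```

With the second strategy, 'sales' and 'bookings' keep their original names,
so existing HQL continues to work, while the two 'default' databases are
exposed as 'onprem_default' and 'cloud_default'. The sketch does not handle
a prefixed name clashing with a database that already has that name.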

3. To be clear, Waggle Dance operates at the Hive Metastore Thrift layer;
however, JDBC also has some relevance in two areas. JDBC clients connecting
to Hive for the purposes of querying do so using HiveServer2 which in turn
uses the metastore API for metadata retrieval. If HiveServer2 is configured
to use a Waggle Dance instance as a metastore, these JDBC clients will also
benefit from the 'federated view'. The Hive metastore also uses JDBC when
persisting its internal model to relational stores. Normally, we'd be able
to ignore this as a backend implementation detail. However, we've seen that
some tools (GUI based schema explorers) bypass the Thrift based metastore
and interrogate the relational backing store directly, presumably as an
optimization. These tools don't see the 'federated view' as they 'step
over' our point of integration.

Hope this helps,

Elliot.

On 27 July 2017 at 15:56, Carter Shanklin <car...@hortonworks.com> wrote:

> Elliot,
>
> Interesting stuff
>
> I have 3 questions
> 1. Can Waggle Dance deal with multiple kerberized Hadoop clusters?
> 2. Do you support 3 layers in the hierarchy (i.e. cluster.database.table)
> or 2 layers, with a requirement to avoid any possible name collisions in
> the mapping layer.
> 3. Is it compatible with JDBC? It wasn't clear to me since the diagrams
> all mention thrift.
>
> Thanks!
>
>
> From: Elliot West <tea...@gmail.com>
> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
> Date: Thursday, July 27, 2017 at 06:21
> To: "user@hive.apache.org" <user@hive.apache.org>
> Subject: Hive federation service
>
> Hello,
>
> We've recently contributed our Hive federation service to the open source
> community:
>
> https://github.com/HotelsDotCom/waggle-dance
>
>
> Waggle Dance is a request routing Hive metastore proxy that allows tables
> to be concurrently accessed across multiple Hive deployments. It was
> created to tackle the appearance of the dataset silos that arose as our
> large organization gradually migrated from monolithic on-premises clusters,
> to cloud based platforms.
>
> In short, Waggle Dance enables a unified end point with which you can
> describe, query, and join tables that may exist in multiple distinct Hive
> deployments. Such deployments may exist in disparate regions, accounts, or
> clouds (security and network permitting). Dataset access is not limited to
> the Hive query engine, and should work with any Hive metastore enabled
> platform. We've been successfully using it with Spark for example.
>
> More recently we've employed Waggle Dance to apply a simple security layer
> to cloud based platforms such as Qubole, Databricks, and EMR. These
> currently provide no means to construct cross platform authentication and
> authorization strategies. Therefore we use a combination of Waggle Dance
> and network configuration to restrict writes and destructive Hive
> operations to specific user groups and applications.
>
> We currently operate many disparate Hive metastore instances whose tables
> must be shared across the organization. Therefore we are committed to the
> ongoing development of this project. However, should such federation
> features have broader appeal, we'd be keen to see similar features
> integrated into Hive, perhaps in a more accessible form echoing existing
> remote table or database link features present in some traditional RDBMSes.
>
> All feedback appreciated, many thanks for your time,
>
> Elliot.
>
