Hi Carter and Elliot,

First off:

Carter, as the JDBC endpoint is serviced by the HiveServer2 service and not the 
Hive Metastore Service (HMS), I’d assume that the answer to your question is no 
and that you’d still need your own HiveServer2 to interact with the 
Waggle-Dance HMS proxy to process your JDBC API requests.

Waggle-Dance question:

As the Waggle-Dance diagram shows only the HMS thrift API being federated, how 
is access provided to all the data that the LOCATION properties of the Hive 
database objects point to?  It seems that Waggle-Dance goes to great lengths to 
navigate the network topology to get from the proxy to the remote HMSs.  
However, there’s no mention of where the data is stored.  Clearly, if all the 
remote HMSs store their data in a common service, like AWS S3 or Azure Blob 
Storage, it will be easier for the HMS proxy consumers to access it, but there 
may still be configuration challenges around multiple accounts and different 
permissions and roles.  If each remote HMS stores its data in a separate local 
distributed file system, like an HDFS cluster, or if there is a mix of the two, 
there are additional network topology challenges similar to those of reaching 
the HMSs themselves.  Is there any solution or consideration for federated 
data access?
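
To make the concern concrete, here is a rough Python sketch of how a consumer 
might work out which storage endpoints it needs to reach from the LOCATION 
values that come back through the proxy.  The table names, buckets, and hosts 
are made up for illustration, and nothing here is part of Waggle-Dance itself; 
it only groups LOCATION URIs by scheme and authority.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical table LOCATION values, as a federated metastore might return them.
TABLE_LOCATIONS = {
    "sales.orders": "s3://acme-sales/warehouse/orders",
    "sales.refunds": "s3://acme-sales/warehouse/refunds",
    "web.clicks": "hdfs://onprem-nn:8020/warehouse/web.db/clicks",
    "ops.metrics": "wasb://metrics@acmeops.blob.core.windows.net/warehouse/metrics",
}


def storage_endpoints(locations):
    """Group table names by the (scheme, authority) of their LOCATION URI."""
    endpoints = defaultdict(list)
    for table, location in locations.items():
        parts = urlsplit(location)
        endpoints[(parts.scheme, parts.netloc)].append(table)
    return dict(endpoints)


ENDPOINTS = storage_endpoints(TABLE_LOCATIONS)

# Each distinct endpoint is something the consumer needs network access
# and credentials for, independently of reaching the metastore.
for (scheme, authority), tables in sorted(ENDPOINTS.items()):
    print(f"{scheme}://{authority} -> {sorted(tables)}")
```

Every distinct key in the result is a separate storage system with its own 
reachability and credential requirements, which is the part the HMS 
federation alone does not appear to address.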

Thanks!
-Kurt

From: Carter Shanklin [mailto:car...@hortonworks.com]
Sent: Thursday, July 27, 2017 10:57 AM
To: user@hive.apache.org
Subject: Re: Hive federation service

Elliot,

Interesting stuff

I have 3 questions
1. Can Waggle Dance deal with multiple kerberized Hadoop clusters?
2. Do you support 3 layers in the hierarchy (i.e. cluster.database.table) or 2 
layers, with a requirement to avoid any possible name collisions in the mapping 
layer?
3. Is it compatible with JDBC? It wasn't clear to me since the diagrams all 
mention thrift.
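
To illustrate what I mean in question 2, here is a small Python sketch of the 
two naming schemes.  The cluster and database names are hypothetical, and this 
is not a description of how Waggle Dance actually maps names; it only shows 
where a flat 2-layer namespace can collide and how a cluster qualifier avoids 
that.

```python
# Hypothetical metastores and the databases each one holds.
METASTORES = {
    "onprem": ["sales", "web"],
    "cloud": ["sales", "ops"],
}


def two_layer_collisions(metastores):
    """Return database names that appear in more than one metastore."""
    seen = {}
    collisions = set()
    for cluster, databases in metastores.items():
        for db in databases:
            if db in seen and seen[db] != cluster:
                collisions.add(db)
            seen.setdefault(db, cluster)
    return sorted(collisions)


def three_layer_names(metastores):
    """Qualify every database with its cluster name (cluster.database)."""
    return sorted(
        f"{cluster}.{db}"
        for cluster, databases in metastores.items()
        for db in databases
    )


print(two_layer_collisions(METASTORES))
print(three_layer_names(METASTORES))
```

With 2 layers, "sales" is ambiguous unless the mapping layer renames or 
rejects one of them; with 3 layers, both can coexist as long as the cluster 
names are unique.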

Thanks!


From: Elliot West <tea...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Thursday, July 27, 2017 at 06:21
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Hive federation service

Hello,

We've recently contributed our Hive federation service to the open source 
community:

https://github.com/HotelsDotCom/waggle-dance

Waggle Dance is a request-routing Hive metastore proxy that allows tables to be 
concurrently accessed across multiple Hive deployments. It was created to 
tackle the dataset silos that arose as our large organization gradually 
migrated from monolithic on-premises clusters to cloud-based platforms.

In short, Waggle Dance provides a unified endpoint with which you can describe, 
query, and join tables that may exist in multiple distinct Hive deployments. 
Such deployments may exist in disparate regions, accounts, or clouds (security 
and network permitting). Dataset access is not limited to the Hive query 
engine, and should work with any platform that can use a Hive metastore. For 
example, we've been using it successfully with Spark.

More recently, we've employed Waggle Dance to apply a simple security layer to 
cloud-based platforms such as Qubole, Databricks, and EMR. These currently 
provide no means to construct cross-platform authentication and authorization 
strategies. Therefore, we use a combination of Waggle Dance and network 
configuration to restrict writes and destructive Hive operations to specific 
user groups and applications.

We currently operate many disparate Hive metastore instances whose tables must 
be shared across the organization. Therefore we are committed to the ongoing 
development of this project. However, should such federation features have 
broader appeal, we'd be keen to see similar features integrated into Hive, 
perhaps in a more accessible form echoing existing remote table or database 
link features present in some traditional RDBMSes.

All feedback appreciated, many thanks for your time,

Elliot.
