Kurt,

You are correct: Waggle Dance makes no attempt to simplify access to the
underlying table data in the filestore(s). In practice we ensure that
consumers of a table in a given cluster have sufficient connectivity and
privilege to read the data referenced in the LOCATION properties. As you
suggest, the precise mechanisms for providing such access vary depending on
the location of both the consumer and the data. We find that implementing
these 'one off' configuration changes is a small price to pay for enabling
exploratory access to useful data that would otherwise remain isolated.
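
To illustrate the kind of bookkeeping this involves, here is a minimal
sketch (not part of Waggle Dance) that groups table LOCATION URIs by scheme
and authority, so that you can see which distinct storage endpoints a
consumer would need connectivity and privileges for. The table names,
bucket name, and namenode address are hypothetical, and the
`storage_endpoints` helper is an assumption made for illustration only.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def storage_endpoints(table_locations):
    """Group table names by the (scheme, authority) of their LOCATION URI."""
    endpoints = defaultdict(list)
    for table, location in table_locations.items():
        parts = urlsplit(location)
        # e.g. ("s3", "example-bucket-a") or ("hdfs", "example-namenode:8020")
        endpoints[(parts.scheme, parts.netloc)].append(table)
    return {key: sorted(tables) for key, tables in endpoints.items()}


# Hypothetical table names and locations, for illustration only.
example_locations = {
    "sales.orders": "s3://example-bucket-a/warehouse/sales.db/orders",
    "sales.refunds": "s3://example-bucket-a/warehouse/sales.db/refunds",
    "legacy.events": "hdfs://example-namenode:8020/apps/hive/warehouse/legacy.db/events",
}

# Each distinct endpoint is one place where access has to be arranged.
for (scheme, authority), tables in sorted(storage_endpoints(example_locations).items()):
    print(f"{scheme}://{authority} -> {', '.join(tables)}")
```

Each distinct endpoint in the output corresponds to one of the 'one off'
access arrangements described above.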

As mentioned in my original post, the motivations for Waggle Dance arose
during cloud migrations from a single on-premises cluster. Therefore, in
practice, we're not federating multiple HDFS backed Hive deployments; all
new deployments store their table data on S3, which perhaps simplifies
things a little.

We also use Data+Metadata replication to address our original problem of
data silos. While replication can reduce the complexity of the security and
network configuration somewhat, it does not avoid it completely.
Additionally, it can suffer from increased latency and storage cost, and it
discourages exploration (you need to know what you want to replicate
first). We have built an in-house tool to perform a variety of Hive
replications, taking advantage of cloud platform specific features where we
can (e.g. there are some optimal methods for moving data between S3
deployments). Replication is our preferred approach for frequently accessed
datasets, where repeated transfer costs can be far greater than duplicate
storage costs.
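
The trade-off between repeated transfer and duplicate storage can be
sketched as a rough comparison. This is a simplified illustration, not a
description of our in-house tool: the prices are hypothetical placeholders,
and it ignores the one-off cost of the initial copy and of keeping the
replica up to date.

```python
def monthly_remote_read_cost(size_gb, reads_per_month, transfer_cost_per_gb):
    """Cost of reading the full dataset remotely on every access."""
    return size_gb * reads_per_month * transfer_cost_per_gb


def monthly_replica_storage_cost(size_gb, storage_cost_per_gb_month):
    """Cost of holding a second copy of the dataset near the consumer."""
    return size_gb * storage_cost_per_gb_month


def prefer_replication(size_gb, reads_per_month,
                       transfer_cost_per_gb, storage_cost_per_gb_month):
    """True when repeated transfers cost more than duplicate storage."""
    remote = monthly_remote_read_cost(size_gb, reads_per_month, transfer_cost_per_gb)
    replica = monthly_replica_storage_cost(size_gb, storage_cost_per_gb_month)
    return remote > replica


# Hypothetical prices (per GB transferred, per GB-month stored).
TRANSFER_COST = 0.02
STORAGE_COST = 0.023

# A 1000 GB dataset read daily versus once a month.
print(prefer_replication(1000, 30, TRANSFER_COST, STORAGE_COST))
print(prefer_replication(1000, 1, TRANSFER_COST, STORAGE_COST))
```

With these placeholder prices, the frequently read dataset favours
replication while the rarely read one favours federated access.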

Thanks,

Elliot.

On 27 July 2017 at 16:40, Larson, Kurt <klar...@wbgames.com> wrote:

> Hi Carter and Elliot,
>
>
>
> First off:
>
>
>
> Carter, as the JDBC endpoint is serviced by the HiveServer2 service and
> not the Hive Metastore Service (HMS), I’d assume that the answer to your
> question is no and that you’d still need your own HiveServer2 to interact
> with the Waggle-Dance HMS proxy to process your JDBC API requests.
>
>
>
> Waggle-Dance question:
>
>
>
> As the Waggle-Dance diagram shows only the HMS thrift API being federated,
> how is access provided to all the data that the LOCATION properties of all
> Hive database objects point to?  It seems that Waggle-Dance goes to great
> lengths to navigate the network topology to get from the proxy to the
> remote HMSs.  However, there’s no mention of where the data is stored.
> Clearly if all the remote HMSs store their data in a common service, like
> AWS S3 or Azure Blob Storage, it will be easier for the HMS proxy consumers
> to access it, but there may still be configuration challenges of multiple
> accounts and different permissions and roles.  If each remote HMS stores
> its data in a separate local distributed file system, like an HDFS cluster,
> or a mix of the 2, there are additional network topology challenges similar
> to those of getting to the HMSs themselves.  Is there any solution or consideration for
> federated data access?
>
>
>
> Thanks!
>
> -Kurt
>
>
>
> *From:* Carter Shanklin [mailto:car...@hortonworks.com]
> *Sent:* Thursday, July 27, 2017 10:57 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Hive federation service
>
>
>
> Elliot,
>
>
>
> Interesting stuff
>
>
>
> I have 3 questions
>
> 1. Can Waggle Dance deal with multiple kerberized Hadoop clusters?
>
> 2. Do you support 3 layers in the hierarchy (i.e. cluster.database.table)
> or 2 layers, with a requirement to avoid any possible name collisions in
> the mapping layer?
>
> 3. Is it compatible with JDBC? It wasn't clear to me since the diagrams
> all mention thrift.
>
>
>
> Thanks!
>
>
>
>
>
> *From: *Elliot West <tea...@gmail.com>
> *Reply-To: *"user@hive.apache.org" <user@hive.apache.org>
> *Date: *Thursday, July 27, 2017 at 06:21
> *To: *"user@hive.apache.org" <user@hive.apache.org>
> *Subject: *Hive federation service
>
>
>
> Hello,
>
>
>
> We've recently contributed our Hive federation service to the open source
> community:
>
>
>
> https://github.com/HotelsDotCom/waggle-dance
>
>
>
> Waggle Dance is a request routing Hive metastore proxy that allows tables
> to be concurrently accessed across multiple Hive deployments. It was
> created to tackle the appearance of the dataset silos that arose as our
> large organization gradually migrated from monolithic on-premises clusters,
> to cloud based platforms.
>
>
>
> In short, Waggle Dance enables a unified end point with which you can
> describe, query, and join tables that may exist in multiple distinct Hive
> deployments. Such deployments may exist in disparate regions, accounts, or
> clouds (security and network permitting). Dataset access is not limited to
> the Hive query engine, and should work with any Hive metastore enabled
> platform. We've been successfully using it with Spark for example.
>
>
>
> More recently we've employed Waggle Dance to apply a simple security layer
> to cloud based platforms such as Qubole, Databricks, and EMR. These
> currently provide no means to construct cross platform authentication and
> authorization strategies. Therefore we use a combination of Waggle Dance
> and network configuration to restrict writes and destructive Hive
> operations to specific user groups and applications.
>
>
>
> We currently operate many disparate Hive metastore instances whose tables
> must be shared across the organization. Therefore we are committed to the
> ongoing development of this project. However, should such federation
> features have broader appeal, we'd be keen to see similar features
> integrated into Hive, perhaps in a more accessible form echoing existing
> remote table or database link features present in some traditional RDBMSes.
>
>
>
> All feedback appreciated, many thanks for your time,
>
>
>
> Elliot.
>
