Hello,

We've recently contributed our Hive federation service to the open source
community:

https://github.com/HotelsDotCom/waggle-dance


Waggle Dance is a request-routing Hive metastore proxy that allows tables
to be concurrently accessed across multiple Hive deployments. It was
created to tackle the dataset silos that arose as our large organization
gradually migrated from monolithic on-premises clusters to cloud-based
platforms.
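
To give a rough feel for what "request routing" means here, the following
is a minimal sketch and not the actual Waggle Dance implementation. It
assumes, purely for illustration, that each federated metastore is
identified by a database-name prefix; the prefixes, host names, and the
route() helper are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical routing table: database-name prefix -> metastore Thrift URI.
ROUTES: Dict[str, str] = {
    "onprem_": "thrift://onprem-metastore.example.com:9083",
    "cloud_": "thrift://cloud-metastore.example.com:9083",
}

# Hypothetical default metastore for databases that carry no known prefix.
PRIMARY = "thrift://primary-metastore.example.com:9083"


def route(database: str) -> Tuple[str, str]:
    """Return (metastore URI, database name as known to that metastore)."""
    for prefix, uri in ROUTES.items():
        if database.startswith(prefix):
            # Strip the prefix so the target metastore sees its own name.
            return uri, database[len(prefix):]
    # No prefix matched: forward the request unchanged to the default.
    return PRIMARY, database
```

In this sketch a client asking for "cloud_sales" would be forwarded to the
cloud metastore as "sales", while unprefixed names go to the default
metastore untouched.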

In short, Waggle Dance provides a unified endpoint with which you can
describe, query, and join tables that may exist in multiple distinct Hive
deployments. Such deployments may exist in disparate regions, accounts, or
clouds (security and network permitting). Dataset access is not limited to
the Hive query engine, and should work with any Hive metastore-enabled
platform. We've been using it successfully with Spark, for example.
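
Because the proxy is itself a metastore endpoint, pointing a client at it
should mostly be a matter of setting the standard hive.metastore.uris
property. As a small sketch, the snippet below generates a hive-site.xml
fragment for that purpose. The host name and port are placeholders rather
than real defaults, and hive_site_xml() is just an illustrative helper.

```python
import xml.etree.ElementTree as ET


def hive_site_xml(metastore_uri: str) -> str:
    """Build a minimal hive-site.xml fragment that sets hive.metastore.uris."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "hive.metastore.uris"
    ET.SubElement(prop, "value").text = metastore_uri
    return ET.tostring(configuration, encoding="unicode")


# Placeholder address for the proxy; substitute your own host and port.
WAGGLE_DANCE_URI = "thrift://waggle-dance.example.com:9083"

print(hive_site_xml(WAGGLE_DANCE_URI))
```

Any engine that reads this property (Hive, or Spark with Hive support
enabled) would then resolve tables through the proxy rather than through a
single metastore.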

More recently, we've employed Waggle Dance to apply a simple security layer
to cloud-based platforms such as Qubole, Databricks, and EMR. These
currently provide no means to construct cross-platform authentication and
authorization strategies. Therefore, we use a combination of Waggle Dance
and network configuration to restrict writes and destructive Hive
operations to specific user groups and applications.

We currently operate many disparate Hive metastore instances whose tables
must be shared across the organization. Therefore, we are committed to the
ongoing development of this project. However, should such federation
features have broader appeal, we'd be keen to see similar features
integrated into Hive, perhaps in a more accessible form echoing existing
remote table or database link features present in some traditional RDBMSes.

All feedback is appreciated. Many thanks for your time,

Elliot.
