Following up on this: I've spent some time trying to evaluate the Hive
replication features, but in truth it's been more an exercise in trying to
get them working! I thought I'd share my findings:

   - Conceptually this feature can sync (nearly) all Hive metadata and data
   changes between two clusters.
   - On the source cluster you require at least Hive 1.1.0
   (DbNotificationListener dependency).
   - On the destination cluster you require at least Hive 0.8.0 (IMPORT
   command dependency).
   - The environment in which you execute replication tasks requires at
   least Hive 1.2.0 (ReplicationTask dependency), although this is at the
   JAR level only (i.e. you do not need a 1.2.0 metastore running, etc.).
   - It is not an 'out-of-the-box' solution; you must still write some kind
   of service that instantiates, schedules, and executes ReplicationTasks.
   This can be quite simple (see the first sketch after this list).
   - Exporting into S3 using Hive on EMR (AMI 4.2.0) is currently broken,
   but apparently work is underway to fix it.
   - Data inserted into Hive tables using HCatalog writers will not be
   automatically synced (HIVE-9577).
   - Mappings can be applied to destination database names, table names,
   and table and partition locations (also shown in the first sketch below).
   - All tables at the destination are managed, even if they are external
   at the source.
   - The source and destination can be running different Hadoop
   distributions and use differing metastore database providers.
   - There is no real user level documentation.
   - It might be nice to add a Kafka based NotificationListener (a rough
   sketch follows this list).
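
To illustrate the 'simple service' and mapping points above, here is a
minimal sketch of a polling replication driver, assuming the HIVE-7973 API
that Sushanth describes below. The database name, poll interval, and the
runOnSource/runOnDestination helpers are my own hypothetical placeholders,
and checkpointing of the last-seen event id is omitted:

    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hive.hcatalog.api.HCatClient;
    import org.apache.hive.hcatalog.api.repl.Command;
    import org.apache.hive.hcatalog.api.repl.ReplicationTask;

    public class ReplicationDriver {

      public static void main(String[] args) throws Exception {
        HCatClient client = HCatClient.create(new HiveConf());
        long lastEventId = 0L; // in practice, restore this from durable storage

        while (true) {
          Iterator<ReplicationTask> tasks =
              client.getReplicationTasks(lastEventId, 100, "source_db", null);
          while (tasks.hasNext()) {
            ReplicationTask task = tasks.next()
                .withDbNameMapping(db -> "dest_" + db)   // database name mapping
                .withTableNameMapping(table -> table);   // identity table mapping
            if (task.isActionable()) {
              // EXIM-based tasks also need staging directory providers
              // to be set; omitted here for brevity.
              for (Command command : task.getSrcWhCommands()) {
                runOnSource(command.get());      // statements to run at the source
              }
              for (Command command : task.getDstWhCommands()) {
                runOnDestination(command.get()); // statements to run at the destination
              }
            }
            lastEventId = task.getEvent().getEventId(); // track our progress
          }
          Thread.sleep(60000L); // hypothetical poll interval
        }
      }

      // Hypothetical helpers: submit HQL statements to the respective
      // HiveServer2 instances, e.g. over JDBC.
      private static void runOnSource(List<String> statements) { /* ... */ }
      private static void runOnDestination(List<String> statements) { /* ... */ }
    }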
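
On the final point, a Kafka based listener could presumably be built by
extending the metastore's MetaStoreEventListener (registered via
hive.metastore.event.listeners) and publishing each event to a topic. A
rough sketch, where the topic name, the config key, and the payload format
are all my own assumptions rather than an existing Hive feature:

    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.metastore.MetaStoreEventListener;
    import org.apache.hadoop.hive.metastore.events.AddPartitionEvent;
    import org.apache.hadoop.hive.metastore.events.CreateTableEvent;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaNotificationListener extends MetaStoreEventListener {

      private static final String TOPIC = "hive-metastore-events"; // assumed name

      private final KafkaProducer<String, String> producer;

      public KafkaNotificationListener(Configuration config) {
        super(config);
        Properties props = new Properties();
        // 'kafka.listener.bootstrap.servers' is a made-up config key
        props.put("bootstrap.servers",
            config.get("kafka.listener.bootstrap.servers"));
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        producer = new KafkaProducer<String, String>(props);
      }

      @Override
      public void onCreateTable(CreateTableEvent event) {
        publish("CREATE_TABLE", event.getTable().getDbName(),
            event.getTable().getTableName());
      }

      @Override
      public void onAddPartition(AddPartitionEvent event) {
        publish("ADD_PARTITION", event.getTable().getDbName(),
            event.getTable().getTableName());
      }

      // Key by qualified table name so per-table ordering is preserved
      // within a Kafka partition.
      private void publish(String eventType, String dbName, String tableName) {
        String key = dbName + "." + tableName;
        producer.send(new ProducerRecord<String, String>(TOPIC, key,
            eventType + ":" + key));
      }
    }

A downstream consumer could then drive ReplicationTasks from the topic
rather than polling the metastore directly.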

In summary, it looks like quite a powerful and useful feature. However, as
I'm currently running Hive 1.0.0 at my source, I cannot use it in a
straightforward manner.

Thanks for your help.

Elliot.

On 18 December 2015 at 14:31, Elliot West <tea...@gmail.com> wrote:

> Eugene/Sushanth,
>
> Thank you for pointing me in the direction of these features. I'll
> investigate them further to see if I can put them to good use.
>
> Cheers - Elliot.
>
> On 17 December 2015 at 20:03, Sushanth Sowmyan <khorg...@gmail.com> wrote:
>
>> Also, while I have not wiki-ized the documentation for the above, I
>> have uploaded slides from talks that I've given at Hive user group
>> meetups on the subject, as well as a doc that describes the replication
>> protocol followed for the EXIM replication; both are attached at
>> https://issues.apache.org/jira/browse/HIVE-10264
>>
>> On Thu, Dec 17, 2015 at 11:59 AM, Sushanth Sowmyan <khorg...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > I think that the replication work added with
>> > https://issues.apache.org/jira/browse/HIVE-7973 is right up this
>> > alley.
>> >
>> > Per Eugene's suggestion of MetaStoreEventListener, this replication
>> > system plugs into that and gets you a stream of notification events
>> > from HCatClient for the exact purpose you mention.
>> >
>> > There's some work still outstanding on this task, most notably
>> > documentation (sorry!), but please have a look at
>> > HCatClient.getReplicationTasks(...) and
>> > org.apache.hive.hcatalog.api.repl.ReplicationTask. You can plug in
>> > your implementation of ReplicationTask.Factory to inject your own
>> > logic for how to handle the replication according to your needs.
>> > (Currently there is an implementation that uses Hive EXPORT/IMPORT
>> > to perform replication; you can look at the code for this, and the
>> > tests for these classes, to see how that is achieved. Falcon already
>> > uses this to perform cross-hive-warehouse replication.)
>> >
>> >
>> > Thanks,
>> >
>> > -Sushanth
>> >
>> > On Thu, Dec 17, 2015 at 11:22 AM, Eugene Koifman
>> > <ekoif...@hortonworks.com> wrote:
>> >> The metastore supports MetaStoreEventListener and
>> >> MetaStorePreEventListener, which may be useful here.
>> >>
>> >> Eugene
>> >>
>> >> From: Elliot West <tea...@gmail.com>
>> >> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>> >> Date: Thursday, December 17, 2015 at 8:21 AM
>> >> To: "user@hive.apache.org" <user@hive.apache.org>
>> >> Subject: Synchronizing Hive metastores across clusters
>> >>
>> >> Hello,
>> >>
>> >> I'm thinking about the steps required to repeatedly push Hive
>> >> datasets out from a traditional Hadoop cluster into a parallel
>> >> cloud-based cluster. This is not a one-off; it needs to be a
>> >> constantly running sync process. As new tables and partitions are
>> >> added in one cluster, they need to be synced to the cloud cluster.
>> >> Assuming for a moment that I have the HDFS data syncing working,
>> >> I'm wondering what steps I need to take to reliably ship the
>> >> HCatalog metadata across. I use HCatalog as the point of truth as
>> >> to when data is available and where it is located, and so I think
>> >> that metadata is a critical element to replicate in the cloud-based
>> >> cluster.
>> >>
>> >> Does anyone have any recommendations on how to achieve this in
>> >> practice? One issue (of many, I suspect) is that Hive appears to
>> >> store table/partition locations internally as absolute, fully
>> >> qualified URLs, therefore unless the target cloud cluster is
>> >> similarly named and configured, some path transformation step will
>> >> be needed as part of the synchronisation process.
>> >>
>> >> I'd appreciate any suggestions, thoughts, or experiences related to
>> >> this.
>> >>
>> >> Cheers - Elliot.
>> >>
>> >>
>>
>
>
