Following up on this: I've spent some time trying to evaluate the Hive replication features, but in truth it has been more an exercise in trying to get them working! I thought I'd share my findings:
- Conceptually, this feature can sync (nearly) all Hive metadata and data changes between two clusters.
- On the source cluster you require at least Hive 1.1.0 (DbNotificationListener dependency).
- On the destination cluster you require at least Hive 0.8.0 (IMPORT command dependency).
- The environment in which you execute replication tasks requires at least Hive 1.2.0 (ReplicationTask dependency), although this is at the JAR level only (i.e. you do not need a 1.2.0 metastore running).
- It is not an 'out of the box' solution; you must still write some kind of service that instantiates, schedules, and executes ReplicationTasks. This can be quite simple; see the first sketch below.
- Exporting into S3 using Hive on EMR (AMI 4.2.0) is currently broken, but apparently work is underway to fix it.
- Data inserted into Hive tables using HCatalog writers will not be automatically synced (HIVE-9577).
- Mappings can be applied to destination database names, table names, and table and partition locations; see the second sketch below.
- All tables at the destination are managed, even if they are external at the source.
- The source and destination can run different Hadoop distributions and use different metastore database providers.
- There is no real user-level documentation.
- It might be nice to add a Kafka-based NotificationListener.
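To illustrate the 'simple service' point, here is a rough sketch of the kind of polling loop I mean, built on the HCatClient.getReplicationTasks(...) API that Sushanth describes below. Take it as a sketch only: the ReplicationDriver class and the runOnSource/runOnDestination helpers are hypothetical, the database name is made up, and persisting the event id, scheduling, and error handling are all left out.

import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.hcatalog.api.HCatClient;
import org.apache.hive.hcatalog.api.repl.Command;
import org.apache.hive.hcatalog.api.repl.ReplicationTask;

public class ReplicationDriver {

  public static void main(String[] args) throws Exception {
    // HiveConf here should point at the *source* metastore (>= Hive 1.1.0,
    // with DbNotificationListener enabled).
    HCatClient client = HCatClient.create(new HiveConf());

    long lastEventId = 0L; // in practice, persist this between runs
    // Fetch up to 100 events for one database; passing null for the table
    // name is assumed here to mean 'all tables'.
    Iterator<ReplicationTask> tasks =
        client.getReplicationTasks(lastEventId, 100, "my_database", null);

    while (tasks.hasNext()) {
      ReplicationTask task = tasks.next();
      if (!task.isActionable()) {
        continue; // event type with nothing to replicate
      }
      // EXPORT/IMPORT-based tasks also need staging directories and name
      // mappings configured first -- see the second sketch below.
      for (Command command : task.getSrcWhCommands()) {
        runOnSource(command.get()); // e.g. EXPORT statements
      }
      for (Command command : task.getDstWhCommands()) {
        runOnDestination(command.get()); // e.g. IMPORT statements
      }
      lastEventId = task.getEvent().getEventId();
    }
    client.close();
  }

  // Hypothetical helpers: e.g. submit the statements over JDBC to the
  // source and destination HiveServer2 instances respectively.
  private static void runOnSource(List<String> statements) { }
  private static void runOnDestination(List<String> statements) { }
}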
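The name and location mappings mentioned above are applied per task. A second sketch; I'm assuming that the with* methods chain, that Guava's Function is the mapping type expected, and that StagingDirectoryProvider.TrivialImpl appends a key to a fixed prefix. The paths and the 'replica_' prefix are made up.

import com.google.common.base.Function;

import org.apache.hive.hcatalog.api.repl.ReplicationTask;
import org.apache.hive.hcatalog.api.repl.StagingDirectoryProvider;

public class TaskConfigurer {

  public ReplicationTask configure(ReplicationTask task) {
    if (task.needsStagingDirs()) {
      // Where EXPORT writes on the source and IMPORT reads on the
      // destination (hypothetical locations).
      task.withSrcStagingDirProvider(
              new StagingDirectoryProvider.TrivialImpl("/apps/hive/repl/staging", "/"))
          .withDstStagingDirProvider(
              new StagingDirectoryProvider.TrivialImpl("s3://replica-bucket/repl/staging", "/"));
    }
    // Prefix database names at the destination; keep table names as-is.
    return task
        .withDbNameMapping(new Function<String, String>() {
          @Override
          public String apply(String dbName) {
            return "replica_" + dbName;
          }
        })
        .withTableNameMapping(new Function<String, String>() {
          @Override
          public String apply(String tableName) {
            return tableName;
          }
        });
  }
}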
In summary, it looks like quite a powerful and useful feature. However, as I'm currently running Hive 1.0.0 at my source, I cannot use it in a straightforward manner.

Thanks for your help.

Elliot.

On 18 December 2015 at 14:31, Elliot West <tea...@gmail.com> wrote:

> Eugene/Sushanth,
>
> Thank you for pointing me in the direction of these features. I'll investigate them further to see if I can put them to good use.
>
> Cheers - Elliot.
>
> On 17 December 2015 at 20:03, Sushanth Sowmyan <khorg...@gmail.com> wrote:
>
>> Also, while I have not wiki-ized the documentation for the above, I have uploaded slides from talks that I've given at Hive user group meetups on the subject, and also a doc that describes the replication protocol followed for the EXIM replication; both are attached over at https://issues.apache.org/jira/browse/HIVE-10264
>>
>> On Thu, Dec 17, 2015 at 11:59 AM, Sushanth Sowmyan <khorg...@gmail.com> wrote:
>> > Hi,
>> >
>> > I think that the replication work added with https://issues.apache.org/jira/browse/HIVE-7973 is exactly up this alley.
>> >
>> > Per Eugene's suggestion of MetaStoreEventListener, this replication system plugs into that and gets you a stream of notification events from HCatClient for the exact purpose you mention.
>> >
>> > There's some work still outstanding on this task, most notably documentation (sorry!), but please have a look at HCatClient.getReplicationTasks(...) and org.apache.hive.hcatalog.api.repl.ReplicationTask. You can plug in your implementation of ReplicationTask.Factory to inject your own logic for how to handle the replication according to your needs. (Currently there exists an implementation that uses Hive EXPORT/IMPORT to perform replication; you can look at the code for this, and the tests for these classes, to see how that is achieved. Falcon already uses this to perform cross-hive-warehouse replication.)
>> >
>> > Thanks,
>> >
>> > -Sushanth
>> >
>> > On Thu, Dec 17, 2015 at 11:22 AM, Eugene Koifman <ekoif...@hortonworks.com> wrote:
>> >> Metastore supports MetaStoreEventListener and MetaStorePreEventListener, which may be useful here.
>> >>
>> >> Eugene
>> >>
>> >> From: Elliot West <tea...@gmail.com>
>> >> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>> >> Date: Thursday, December 17, 2015 at 8:21 AM
>> >> To: "user@hive.apache.org" <user@hive.apache.org>
>> >> Subject: Synchronizing Hive metastores across clusters
>> >>
>> >> Hello,
>> >>
>> >> I'm thinking about the steps required to repeatedly push Hive datasets out from a traditional Hadoop cluster into a parallel cloud-based cluster. This is not a one-off; it needs to be a constantly running sync process. As new tables and partitions are added in one cluster, they need to be synced to the cloud cluster. Assuming for a moment that I have the HDFS data syncing working, I'm wondering what steps I need to take to reliably ship the HCatalog metadata across. I use HCatalog as the point of truth as to when data is available and where it is located, and so I think that metadata is a critical element to replicate in the cloud-based cluster.
>> >>
>> >> Does anyone have any recommendations on how to achieve this in practice? One issue (of many, I suspect) is that Hive appears to store table/partition locations internally as absolute, fully qualified URLs; therefore, unless the target cloud cluster is similarly named and configured, some path transformation step will be needed as part of the synchronisation process.
>> >>
>> >> I'd appreciate any suggestions, thoughts, or experiences related to this.
>> >>
>> >> Cheers - Elliot.
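For completeness, here is a minimal sketch of the MetaStoreEventListener hook that Eugene suggests above. The listener runs inside the metastore process and is registered via the hive.metastore.event.listeners property in hive-site.xml; the class name and method bodies below are illustrative assumptions only. DbNotificationListener, on which the replication feature depends, is a real implementation of this same hook.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.MetaStoreEventListener;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.metastore.events.AddPartitionEvent;
import org.apache.hadoop.hive.metastore.events.CreateTableEvent;

public class SyncEventListener extends MetaStoreEventListener {

  public SyncEventListener(Configuration config) {
    super(config);
  }

  @Override
  public void onCreateTable(CreateTableEvent event) throws MetaException {
    // e.g. enqueue the new table for replication to the cloud cluster
    System.out.println("Created table: " + event.getTable().getDbName()
        + "." + event.getTable().getTableName());
  }

  @Override
  public void onAddPartition(AddPartitionEvent event) throws MetaException {
    // e.g. record that new data is available to sync
    System.out.println("Added partition(s) to: "
        + event.getTable().getTableName());
  }
}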