Hi Elliot.
Strictly speaking, I believe your question is what happens when the metastore in the replicate gets out of sync, so that any query against the cloud table only shows, say, the partitions as of time T0 rather than T1?

I don’t know what your metastore is on. With ours on Oracle this can happen when there is a network glitch, and the metadata tables can then get out of sync. Each table has a materialized view (MV) log that keeps the deltas for that table and pushes them to the replicate table every, say, 30 seconds (configurable). So these are the scenarios:

1. Network issue. The deltas cannot be delivered and the replicate table is out of sync. The deltas are kept in the primary table's MV log until the network is back and the next scheduled refresh delivers them. There could be a backlog.
2. The replicated table gets out of sync. In this case the Oracle procedure DBMS_MVIEW.REFRESH is used to resync the replicate table. Again, this is best done when there is no activity on the primary.

We use Oracle for our metastore as the Bank has many instances of Oracle, Sybase and Microsoft SQL Server, and it is pretty easy for DBAs to look after a small Hive schema on an Oracle instance.

I gather if we build a model based on what classic databases do to keep reporting database tables in sync (which is in essence what we are talking about) then we should be OK.

That takes care of the metadata, but I noticed that you also mention syncing the data on HDFS in the replicate. It sounds like many people go for DistCp <http://hadoop.apache.org/common/docs/current/distcp.html>, an application shipped with Hadoop that uses a MapReduce job to copy files in parallel. There seems to be a good article here <https://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920> on general replication at Facebook.
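As a rough sketch of the two pieces mentioned above, the Python below only builds the Oracle refresh statement and the DistCp command line; it does not run either. The materialized view name and the HDFS paths in the usage note are made up for illustration, and the choice of a complete refresh ('C') as the default is my assumption about how one might resync, not a description of our actual setup.

```python
def mview_refresh_statement(mview_name: str, method: str = "C") -> str:
    """Build a PL/SQL block that refreshes one materialized view.

    method: 'F' = fast refresh (apply the deltas from the MV log),
            'C' = complete refresh (rebuild from the primary table).
    """
    if method not in ("F", "C"):
        raise ValueError("method must be 'F' or 'C'")
    # Crude guard so the name can be embedded in the statement text safely.
    if not mview_name.replace("_", "").replace(".", "").isalnum():
        raise ValueError("unexpected characters in materialized view name")
    return (
        "BEGIN DBMS_MVIEW.REFRESH("
        f"list => '{mview_name}', method => '{method}'); END;"
    )


def distcp_command(source_dir: str, target_dir: str) -> list[str]:
    """Build the argument list for an incremental DistCp copy."""
    # -update copies only files that are missing or differ at the target.
    return ["hadoop", "distcp", "-update", source_dir, target_dir]
```

For example, mview_refresh_statement("HIVEUSER.PARTITIONS_MV") gives a block that could be run from SQL*Plus or a scheduler job, and distcp_command("hdfs://primary-nn:8020/warehouse", "hdfs://replica-nn:8020/warehouse") gives a list suitable for subprocess.run. Both names are hypothetical.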
HTH,

Mich Talebzadeh

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7.
Co-author of "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly
http://talebzadehmich.wordpress.com

From: Elliot West [mailto:tea...@gmail.com]
Sent: 17 December 2015 17:17
To: user@hive.apache.org
Subject: Re: Synchronizing Hive metastores across clusters

Hi Mich,

In your scenario is there any coordination of data syncing on HDFS and metadata in HCatalog? I.e. could a situation occur where the replicated metastore shows a partition as 'present' yet the data that backs the partition in HDFS has not yet arrived at the replica filesystem? I imagine one could avoid this by snapshotting the source metastore, then syncing HDFS, and then finally shipping the snapshot to the replica(?).

Thanks - Elliot.

On 17 December 2015 at 16:57, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Sounds like one-way replication of the metastore.
Depending on your metastore platform that could be achieved pretty easily. Mine is Oracle and I use Materialised View replication, which is pretty good but not the latest technology. Others would be GoldenGate or SAP Replication Server.

HTH,

Mich

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: 17 December 2015 16:47
To: user@hive.apache.org
Subject: RE: Synchronizing Hive metastores across clusters

Are both clusters in active/active mode, or is the cloud-based cluster a standby?

From: Elliot West [mailto:tea...@gmail.com]
Sent: 17 December 2015 16:21
To: user@hive.apache.org
Subject: Synchronizing Hive metastores across clusters

Hello,

I'm thinking about the steps required to repeatedly push Hive datasets out from a traditional Hadoop cluster into a parallel cloud-based cluster. This is not a one-off; it needs to be a constantly running sync process. As new tables and partitions are added in one cluster, they need to be synced to the cloud cluster. Assuming for a moment that I have the HDFS data syncing working, I'm wondering what steps I need to take to reliably ship the HCatalog metadata across. I use HCatalog as the point of truth as to when data is available and where it is located, and so I think that metadata is a critical element to replicate in the cloud-based cluster.

Does anyone have any recommendations on how to achieve this in practice? One issue (of many, I suspect) is that Hive appears to store table/partition locations internally with absolute, fully qualified URLs; therefore, unless the target cloud cluster is similarly named and configured, some path transformation step will be needed as part of the synchronisation process.

I'd appreciate any suggestions, thoughts, or experiences related to this.

Cheers - Elliot.
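To make the path transformation step mentioned above a little more concrete, here is a minimal Python sketch that rewrites a fully qualified location from one cluster's root to another's. The function name, the host names and the bucket name are all hypothetical, and it assumes every location sits under a single known source root; it does not touch the metastore itself.

```python
def rewrite_location(location: str, source_root: str, target_root: str) -> str:
    """Swap the source cluster's root prefix for the target cluster's.

    Raises ValueError if the location is not under source_root, so that
    unexpected paths are noticed rather than shipped unchanged.
    """
    src = source_root.rstrip("/")
    dst = target_root.rstrip("/")
    if location == src:
        return dst
    # Require a "/" after the prefix so "/warehouse2" does not match "/warehouse".
    if location.startswith(src + "/"):
        return dst + location[len(src):]
    raise ValueError(f"{location!r} is not under {src!r}")
```

For instance, with a source root of hdfs://onprem-nn:8020/warehouse and a target root of s3a://cloud-bucket/warehouse, the partition location hdfs://onprem-nn:8020/warehouse/sales.db/orders/dt=2015-12-17 becomes s3a://cloud-bucket/warehouse/sales.db/orders/dt=2015-12-17. Failing loudly on anything outside the source root is a deliberate choice here, since a location silently left pointing at the source cluster would be hard to spot later.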