Hi Elliot.

 

Strictly speaking, I believe your question is about what happens when the 
metastore on the replica gets out of sync, so that any query against a cloud 
table only shows, say, the partitions as of time T0 rather than T1?

 

I don’t know what your metastore runs on. With ours on Oracle this can happen 
when there is a network glitch, in which case the metadata tables can get out 
of sync. Each table has a materialized view (MV) log that keeps the deltas for 
that table and pushes them to the replicated table every, say, 30 seconds 
(configurable). So the scenarios are:

 

1.    Network issue. The deltas cannot be delivered and the replicated table 
falls out of sync. The pending changes are kept in the primary table's MV log 
until the network is back and the next scheduled refresh delivers them, so 
there could be a backlog.

2.    The replicated table itself gets out of sync. In this case the Oracle 
package DBMS_MVIEW.REFRESH is used to re-sync the replicated table. Again, 
this is best done when there is no activity on the primary.
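For case 2, a minimal sketch of forcing a refresh from the command line. The 
schema and MV names (HIVE_META, TBLS_MV) and the connection variable are 
placeholders, not actual metastore object names:

```shell
# Hypothetical sketch: manually re-sync one replicated metastore table.
# 'F' requests a fast refresh from the MV log; 'C' would force a
# complete rebuild if the log itself has been lost.
sqlplus -s "$REPLICA_CONN" <<'SQL'
EXEC DBMS_MVIEW.REFRESH(list => 'HIVE_META.TBLS_MV', method => 'F');
SQL
```

In practice you would run this per out-of-sync table, or refresh a whole 
refresh group, during a quiet period on the primary.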

 

 

We use Oracle for our metastore because the Bank has many instances of Oracle, 
Sybase and Microsoft SQL Server, and it is pretty easy for the DBAs to look 
after a small Hive schema on an Oracle instance.

 

I gather that if we build a model based on what classic databases do to keep 
reporting-database tables in sync (which is in essence what we are talking 
about), then we should be OK.

 

That takes care of the metadata, but I noticed that you also mention syncing 
the data on HDFS in the replica. It sounds like many people go for DistCp 
<http://hadoop.apache.org/common/docs/current/distcp.html>, an application 
shipped with Hadoop that uses a MapReduce job to copy files in parallel. There 
seems to be a good article here 
<https://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920>
  on large-scale Hadoop data migration at Facebook.
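As a rough sketch (the NameNode hostnames and warehouse path are made up), a 
recurring DistCp job for one database directory could look like:

```shell
# Hypothetical sketch: one-way sync of a warehouse directory to the
# cloud cluster. -update copies only files that differ at the target;
# -p preserves permissions and timestamps where the target supports it.
hadoop distcp -update -p \
  hdfs://prod-nn:8020/user/hive/warehouse/sales_db \
  hdfs://cloud-nn:8020/user/hive/warehouse/sales_db
```

Something like this would typically be driven from cron or an Oozie 
coordinator, one invocation per table or database directory.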

 

 

HTH,

 

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com 

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Ltd, its subsidiaries nor their employees accept any 
responsibility.

 

From: Elliot West [mailto:tea...@gmail.com] 
Sent: 17 December 2015 17:17
To: user@hive.apache.org
Subject: Re: Synchronizing Hive metastores across clusters

 

Hi Mich,

 

In your scenario is there any coordination of data syncing on HDFS and metadata 
in HCatalog? I.e. could a situation occur where the replicated metastore shows 
a partition as 'present' yet the data that backs the partition in HDFS has not 
yet arrived at the replica filesystem? I imagine one could avoid this by 
snapshotting the source metastore, then syncing HDFS, and then finally shipping 
the snapshot to the replica(?).

 

Thanks - Elliot.

 

On 17 December 2015 at 16:57, Mich Talebzadeh <m...@peridale.co.uk 
<mailto:m...@peridale.co.uk> > wrote:

Sounds like one-way replication of the metastore. Depending on your metastore 
platform that could be achieved pretty easily. 

 

Mine is Oracle and I use materialised view replication, which is pretty good 
but not the latest technology. Alternatives would be GoldenGate or SAP 
Replication Server.

 

HTH,

 

Mich

 

From: Mich Talebzadeh [mailto:m...@peridale.co.uk <mailto:m...@peridale.co.uk> 
] 
Sent: 17 December 2015 16:47
To: user@hive.apache.org <mailto:user@hive.apache.org> 
Subject: RE: Synchronizing Hive metastores across clusters

 

Are both clusters in active/active mode, or is the cloud-based cluster a standby?

 

From: Elliot West [mailto:tea...@gmail.com] 
Sent: 17 December 2015 16:21
To: user@hive.apache.org <mailto:user@hive.apache.org> 
Subject: Synchronizing Hive metastores across clusters

 

Hello,

 

I'm thinking about the steps required to repeatedly push Hive datasets out from 
a traditional Hadoop cluster into a parallel cloud-based cluster. This is not a 
one-off; it needs to be a constantly running sync process. As new tables and 
partitions are added in one cluster, they need to be synced to the cloud 
cluster. Assuming for a moment that I have the HDFS data syncing working, I'm 
wondering what steps I need to take to reliably ship the HCatalog metadata 
across. I use HCatalog as the point of truth as to when data is available 
and where it is located, and so I think that metadata is a critical element to 
replicate in the cloud-based cluster.

 

Does anyone have any recommendations on how to achieve this in practice? One 
issue (of many, I suspect) is that Hive appears to store table/partition 
locations internally as absolute, fully qualified URLs, so unless the 
target cloud cluster is similarly named and configured, some path transformation 
step will be needed as part of the synchronisation process.

 

I'd appreciate any suggestions, thoughts, or experiences related to this.

 

Cheers - Elliot.

 

 

 
