Hi guys!

1) Though it's quite interesting, I believe this discussion is not about Spark :)

2) If you are interested, there is a solution from Cloudera: https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cm_bdr_replication_intro.html (it requires that the *source cluster* has a Cloudera Enterprise license, so it's not free). Correct me if I'm wrong, but I don't remember a specialized replication solution from Hortonworks (Atlas, Falcon, etc. are not precisely about inter-cluster replication). Some projects in the Hadoop ecosystem implement replication of their own: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462 , http://highscalability.com/blog/2016/8/1/how-to-setup-a-highly-available-multi-az-cassandra-cluster-o.html

3) Read this discussion: https://community.hortonworks.com/questions/29645/hdfs-replication-for-dr.html

4) I prefer bash scripts / Python scripts / Oozie jobs + distcp: it's free, and I control precisely what's going on. But with huge clusters and sophisticated logic this approach becomes cumbersome (a minimal sketch follows below).

5) Don't forget about security and encryption: your sensitive data may be read by third parties during replication.
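A minimal sketch of the scripted distcp approach from point 4; the NameNode addresses and path (prod-nn, dr-nn, /data/warehouse) are placeholders, not real endpoints:

    #!/usr/bin/env bash
    # Minimal scripted DR copy with distcp (point 4). Endpoints are placeholders.
    set -euo pipefail

    SRC="hdfs://prod-nn:8020/data/warehouse"
    DST="hdfs://dr-nn:8020/data/warehouse"

    # -update copies only files that differ on the target; -p preserves
    # replication factor, block size, permissions and timestamps.
    # Per point 5, consider swebhdfs:// (WebHDFS over TLS) for the remote
    # endpoint if the link between the clusters is untrusted.
    hadoop distcp -update -p "$SRC" "$DST"

Scheduled from cron or wrapped in an Oozie workflow, this is the "batch oriented" model discussed in the thread below.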
On Sat, Nov 12, 2016 at 6:05 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Thanks Jörn.
>
> The way WanDisco promotes itself is doing block-level replication. As I understand it, you modify core-site.xml and add a couple of network server locations there. They call this tool Fusion. There are at least 2 Fusion servers for high availability; each one, among other things, has a database of its own. Once the client interacts with HDFS, the Fusion server behaves like a sniffer with its own port. As soon as the first HDFS block of 256 MB, out of say a file of 30 GB, is written, it starts sending that block to the recipient. The laws of physics, the pipeline size, etc. apply here. That is up to the consumer; it can send 10 files at the same time, etc. It is a known technology now labelled as streaming. So in summary it does not have to wait for the full file to be written to HDFS before replicating blocks. That is where it scores.
>
> It works over the WAN. Say the primary/active HDFS is in London and the replica is in Singapore, so users in Singapore can (eventually) see replicated data when it gets there. It can obviously be used for DR, in which case it is like hot standby (to borrow a term from Sybase). In contrast, one can do the same with periodic loads using homemade tools, or tools like BDR from Cloudera.
>
> I mentioned that Hive is going to have its metastore on HBase as well, and that can pose potential problems. The site is here <https://www.wandisco.com/>.
>
> They claim there are no competitors in the market for their streaming HA product.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 12 November 2016 at 11:17, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> What is wrong with the good old batch transfer for moving data from one cluster to another? I assume your use case is only business continuity in case of disasters such as data center loss, which are unlikely to happen (well, that does not mean they do not happen) and where you could afford to lose one day (or hour) of data (it depends!).
>>
>> Nevertheless, I assume he refers to the Hadoop storage policies: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html , but this still only works within the same cluster.
>>
>> You could also develop a custom secondary file system, similar to the Ignite cache file system, that sits on top of HDFS and, as soon as it receives data, sends it to another cluster while providing it to HDFS. Not knowing WanDisco, I assume that is what it does. Given the prices (and the fact that clusters tend to grow) you may want to evaluate whether buying or building makes sense. In any case, it also requires an evaluation of network throughput, because this may become the bottleneck somewhere (either within the cluster or, more likely, between the data centers).
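For reference, the storage-policy mechanism behind the ArchivalStorage link is driven from the CLI roughly like this; the path and policy name below are placeholders:

    # List the built-in policies (HOT, WARM, COLD, ALL_SSD, ONE_SSD, ...).
    hdfs storagepolicies -listPolicies

    # Pin a directory to a policy; newly written blocks under it follow it.
    hdfs storagepolicies -setStoragePolicy -path /data/warehouse -policy COLD

    # Confirm what is in effect.
    hdfs storagepolicies -getStoragePolicy -path /data/warehouse

As Jörn notes, this tiers data across storage types within a single cluster; by itself it is not a cross-cluster replication mechanism.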
>> As you mentioned, HBase & co. may require special consideration for the case where data is in memory and not yet persisted.
>>
>> On Sat, Nov 12, 2016 at 12:04 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Vince.
>>>
>>> Can you provide more details on this please?
>>>
>>> On 12 November 2016 at 09:52, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>>>
>>>> An HDFS tiering policy with good tags should be similar.
>>>>
>>>> On 11 Nov 2016 at 23:19, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> I really don't see why one would want to set up streaming replication, except for situations where functionality similar to that of transactional databases is required in big data?
>>>>>
>>>>> On 11 November 2016 at 17:24, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> I think it differs in that it starts streaming data through its own port as soon as the first block has landed, so the granularity is a block.
>>>>>>
>>>>>> However, think of it as Oracle GoldenGate replication or SAP replication for databases. The only difference is that if there is corruption in a block with HDFS, it will be replicated anyway, much like SRDF.
>>>>>>
>>>>>> Whereas with Oracle or SAP it is log-based replication, which stops when it encounters corruption.
>>>>>>
>>>>>> Replication depends on the block, so it can replicate Hive metadata, the fsimage, etc., but it cannot replicate the HBase memstore if HBase crashes.
>>>>>>
>>>>>> So that is the gist of it: streaming replication as opposed to snapshot.
>>>>>>
>>>>>> It sounds familiar: think of it as log shipping in the old Oracle days versus GoldenGate, etc.
>>>>>>
>>>>>> hth
>>>>>>
>>>>>> Dr Mich Talebzadeh
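The snapshot side of that contrast can be scripted with HDFS snapshots plus incremental distcp. A sketch, assuming placeholder NameNodes (prod-nn, dr-nn), a placeholder path, and snapshot names s1/s2:

    # One-time: make the directory snapshottable on both clusters (admin op).
    hdfs dfsadmin -allowSnapshot /data/warehouse

    # Baseline: snapshot the source, copy everything to DR, then create the
    # matching snapshot on the target so later diffs share a common base.
    hdfs dfs -createSnapshot /data/warehouse s1
    hadoop distcp -update hdfs://prod-nn:8020/data/warehouse hdfs://dr-nn:8020/data/warehouse
    # ... and on the DR cluster: hdfs dfs -createSnapshot /data/warehouse s1

    # Each cycle: snapshot again and ship only what changed between s1 and s2.
    # -diff requires the target to be unmodified since its s1 snapshot.
    hdfs dfs -createSnapshot /data/warehouse s2
    hadoop distcp -update -diff s1 s2 \
      hdfs://prod-nn:8020/data/warehouse hdfs://dr-nn:8020/data/warehouse

The data only moves when the job runs, so the recovery point is bounded by the scheduling interval; that is the trade-off against the block-level streaming described above.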
>>>>>>
>>>>>> On 11 November 2016 at 17:14, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>>
>>>>>>> The reason being that you can set up HDFS replication to some other cluster on your own.
>>>>>>>
>>>>>>> On Nov 11, 2016 22:42, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> The reason being?
>>>>>>>>
>>>>>>>> On 11 November 2016 at 17:11, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> This is a waste of money, I guess.
>>>>>>>>>
>>>>>>>>> On Nov 11, 2016 22:41, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> It starts at $4,000 per node per year, all inclusive.
>>>>>>>>>>
>>>>>>>>>> With a discount it can be halved, but the charge is per node, so if you have 5 nodes in primary and 5 nodes in DR we are already talking about $40K at list price (10 nodes x $4,000).
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> On 11 November 2016 at 16:43, Mudit Kumar <mkumar...@sapient.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Is it feasible cost-wise?
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Mudit
>>>>>>>>>>>
>>>>>>>>>>> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
>>>>>>>>>>> *Sent:* Friday, November 11, 2016 2:56 PM
>>>>>>>>>>> *To:* user @spark
>>>>>>>>>>> *Subject:* Possible DR solution
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Has anyone had experience of using WanDisco <https://www.wandisco.com/> block replication to create a fault-tolerant DR solution in Hadoop?
>>>>>>>>>>>
>>>>>>>>>>> The product claims that it starts replicating as soon as the first data block lands on HDFS, taking the block and sending it to the DR/replica site. The idea is that this is faster than doing it with traditional HDFS copy tools, which are normally batch oriented.
>>>>>>>>>>>
>>>>>>>>>>> It claims to replicate Hive metadata as well.
>>>>>>>>>>>
>>>>>>>>>>> I wanted to gauge whether anyone has used it or a competitor product. The claim is that they have no competitors!
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> Dr Mich Talebzadeh