Hi guys!

1) Though it's quite interesting, I believe this discussion is not about Spark :)

2) If you are interested, there is a solution from Cloudera: https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cm_bdr_replication_intro.html (it requires that the *source cluster* has a Cloudera Enterprise license, so it's not free). Correct me if I'm wrong, but I don't remember a specialized replication solution from Hortonworks (Atlas, Falcon, etc. are not precisely about inter-cluster replication). Some projects in the Hadoop ecosystem implement replication of their own: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462 , http://highscalability.com/blog/2016/8/1/how-to-setup-a-highly-available-multi-az-cassandra-cluster-o.html

3) Read this discussion: https://community.hortonworks.com/questions/29645/hdfs-replication-for-dr.html

4) I prefer bash scripts / Python scripts / Oozie jobs + distcp: it's free, and I control precisely what's going on. But with huge clusters and sophisticated logic this approach becomes cumbersome (a minimal sketch follows below).

5) Don't forget about security and encryption: your sensitive data may be read by third parties during replication.
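A minimal sketch of the scripted distcp approach from point 4; the NameNode addresses and path (prod-nn, dr-nn, /data/warehouse) are placeholders, not real endpoints:

    #!/usr/bin/env bash
    # Minimal scripted DR copy with distcp (point 4). Endpoints are placeholders.
    set -euo pipefail

    SRC="hdfs://prod-nn:8020/data/warehouse"
    DST="hdfs://dr-nn:8020/data/warehouse"

    # -update copies only files that differ on the target; -p preserves
    # replication factor, block size, permissions and timestamps.
    # Per point 5, consider swebhdfs:// (WebHDFS over TLS) for the remote
    # endpoint if the link between the clusters is untrusted.
    hadoop distcp -update -p "$SRC" "$DST"

Scheduled from cron or wrapped in an Oozie workflow, this is the "batch oriented" model discussed in the thread below.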
On Sat, Nov 12, 2016 at 6:05 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Thanks Jörn.
>
> The way WanDisco promotes itself is doing block-level replication. As I understand it, you modify core-site.xml and add a couple of network server locations there. They call this tool Fusion. There are at least 2 Fusion servers for high availability; each one, among other things, has a database of its own. Once the client interacts with HDFS, the Fusion server behaves like a sniffer with its own port. As soon as the first HDFS block of 256 MB, out of say a file of 30 GB, is written, it starts sending that block to the recipient. The laws of physics, the pipeline size, etc. apply here. That is up to the consumer; it can send 10 files at the same time, etc. It is a known technology now labelled as streaming. So in summary it does not have to wait for the full file to be written to HDFS before replicating blocks. That is where it scores.
>
> It works over the WAN. Say the primary/active HDFS is in London and the replica is in Singapore, so users in Singapore can (eventually) see replicated data when it gets there. It can obviously be used for DR, in which case it is like hot standby (to borrow a term from Sybase). In contrast, one can do the same with periodic loads using homemade tools, or tools like BDR from Cloudera.
>
> I mentioned that Hive is going to have its metastore on HBase as well, and that can pose potential problems. The site is here <https://www.wandisco.com/>.
>
> They claim there are no competitors in the market for their streaming HA product.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 12 November 2016 at 11:17, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> What is wrong with the good old batch transfer for moving data from one cluster to another? I assume your use case is only business continuity in case of disasters such as data center loss, which are unlikely to happen (well, that does not mean they do not happen) and where you could afford to lose one day (or hour) of data (it depends!).
>>
>> Nevertheless, I assume he refers to the Hadoop storage policies: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html , but this still only works within the same cluster.
>>
>> You could also develop a custom secondary file system, similar to the Ignite cache file system, that sits on top of HDFS and, as soon as it receives data, sends it to another cluster while providing it to HDFS. Not knowing WanDisco, I assume that is what it does. Given the prices (and the fact that clusters tend to grow) you may want to evaluate whether buying or building makes sense. In any case, it also requires an evaluation of network throughput, because this may become the bottleneck somewhere (either within the cluster or, more likely, between the data centers).
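For reference, the storage-policy mechanism behind the ArchivalStorage link is driven from the CLI roughly like this; the path and policy name below are placeholders:

    # List the built-in policies (HOT, WARM, COLD, ALL_SSD, ONE_SSD, ...).
    hdfs storagepolicies -listPolicies

    # Pin a directory to a policy; newly written blocks under it follow it.
    hdfs storagepolicies -setStoragePolicy -path /data/warehouse -policy COLD

    # Confirm what is in effect.
    hdfs storagepolicies -getStoragePolicy -path /data/warehouse

As Jörn notes, this tiers data across storage types within a single cluster; by itself it is not a cross-cluster replication mechanism.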
>> As you mentioned, HBase & co. may require special consideration for the case where data is in memory and not yet persisted.
>>
>> On Sat, Nov 12, 2016 at 12:04 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Vince.
>>>
>>> Can you provide more details on this please?
>>>
>>> On 12 November 2016 at 09:52, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
>>>
>>>> An HDFS tiering policy with good tags should be similar.
>>>>
>>>> On 11 Nov 2016 at 23:19, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> I really don't see why one would want to set up streaming replication, except for situations where functionality similar to that of transactional databases is required in big data?
>>>>>
>>>>> On 11 November 2016 at 17:24, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> I think it differs in that it starts streaming data through its own port as soon as the first block has landed, so the granularity is a block.
>>>>>>
>>>>>> However, think of it as Oracle GoldenGate replication or SAP replication for databases. The only difference is that if there is corruption in a block with HDFS, it will be replicated anyway, much like SRDF.
>>>>>>
>>>>>> Whereas with Oracle or SAP it is log-based replication, which stops when it encounters corruption.
>>>>>>
>>>>>> Replication depends on the block, so it can replicate Hive metadata, the fsimage, etc., but it cannot replicate the HBase memstore if HBase crashes.
>>>>>>
>>>>>> So that is the gist of it: streaming replication as opposed to snapshot.
>>>>>>
>>>>>> It sounds familiar: think of it as log shipping in the old Oracle days versus GoldenGate, etc.
>>>>>>
>>>>>> hth
>>>>>>
>>>>>> Dr Mich Talebzadeh
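The snapshot side of that contrast can be scripted with HDFS snapshots plus incremental distcp. A sketch, assuming placeholder NameNodes (prod-nn, dr-nn), a placeholder path, and snapshot names s1/s2:

    # One-time: make the directory snapshottable on both clusters (admin op).
    hdfs dfsadmin -allowSnapshot /data/warehouse

    # Baseline: snapshot the source, copy everything to DR, then create the
    # matching snapshot on the target so later diffs share a common base.
    hdfs dfs -createSnapshot /data/warehouse s1
    hadoop distcp -update hdfs://prod-nn:8020/data/warehouse hdfs://dr-nn:8020/data/warehouse
    # ... and on the DR cluster: hdfs dfs -createSnapshot /data/warehouse s1

    # Each cycle: snapshot again and ship only what changed between s1 and s2.
    # -diff requires the target to be unmodified since its s1 snapshot.
    hdfs dfs -createSnapshot /data/warehouse s2
    hadoop distcp -update -diff s1 s2 \
      hdfs://prod-nn:8020/data/warehouse hdfs://dr-nn:8020/data/warehouse

The data only moves when the job runs, so the recovery point is bounded by the scheduling interval; that is the trade-off against the block-level streaming described above.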
>>>>>>
>>>>>> On 11 November 2016 at 17:14, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>>
>>>>>>> The reason being that you can set up HDFS replication to some other cluster on your own.
>>>>>>>
>>>>>>> On Nov 11, 2016 22:42, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> The reason being?
>>>>>>>>
>>>>>>>> On 11 November 2016 at 17:11, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> This is a waste of money, I guess.
>>>>>>>>>
>>>>>>>>> On Nov 11, 2016 22:41, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> It starts at $4,000 per node per year, all inclusive.
>>>>>>>>>>
>>>>>>>>>> With a discount it can be halved, but the charge is per node, so if you have 5 nodes in primary and 5 nodes in DR we are already talking about $40K at list price (10 nodes x $4,000).
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> On 11 November 2016 at 16:43, Mudit Kumar <mkumar...@sapient.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Is it feasible cost-wise?
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Mudit
>>>>>>>>>>>
>>>>>>>>>>> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
>>>>>>>>>>> *Sent:* Friday, November 11, 2016 2:56 PM
>>>>>>>>>>> *To:* user @spark
>>>>>>>>>>> *Subject:* Possible DR solution
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Has anyone had experience of using WanDisco <https://www.wandisco.com/> block replication to create a fault-tolerant DR solution in Hadoop?
>>>>>>>>>>>
>>>>>>>>>>> The product claims that it starts replicating as soon as the first data block lands on HDFS, taking the block and sending it to the DR/replica site. The idea is that this is faster than doing it with traditional HDFS copy tools, which are normally batch oriented.
>>>>>>>>>>>
>>>>>>>>>>> It claims to replicate Hive metadata as well.
>>>>>>>>>>>
>>>>>>>>>>> I wanted to gauge whether anyone has used it or a competitor product. The claim is that they have no competitors!
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> Dr Mich Talebzadeh