Hello,

On Fri, 19 Apr 2019 at 20:25, Heitor Faria <hei...@bacula.com.br> wrote:

> Hello Radoslaw,
>
> Hello,
>
> On Fri, 19 Apr 2019 at 13:28, Heitor Faria <hei...@bacula.com.br> wrote:
>
>> Hello Radoslaw,
>>
>> Speaking of Bacula HA, I've been deploying a scenario with relative
>>> success.
>>> Primary Director & SD have copy jobs routines to a Secondary Remote SD
>>> that also has an independent working Director.
>>
>>
>> It sounds to me like a Disaster Recovery solution and absolutely not High
>> Availability.
>>
>> Is there any difference?
>>
>
> The difference is HUGE!!!!
>
>
>> For me there are two Disaster Recovery categories, Backup and
>> Replication. HA falls in the second category.
>>
>
> Disaster Recovery is a part of a more general Business Continuity Plan. BCP
> describes what to do when something wrong happens to our business and
> consists of a number of procedures and performances executed in hard times.
> DR focuses on recovery only.
> What is a disaster? Is a single disk failure a disaster? Is a single
> network adapter, single server or single rack failure a disaster? Is
> a single Datacenter failure a disaster? And what are the availability
> levels? How do they compare?
>
> We were discussing concepts used by the Dell/EMC Certification and the best
> scientific literature on the topic.
>

I'm using concepts from Veritas (i.e. Resilience Enterprise) and its Certification, so .... :)

Update: I checked the linked paper and it uses concepts I see as a disaster recovery solution and, what a surprise, it names it Disaster Recovery... (check below)

> I don't see how policies, use cases or plans affect that.
> Anyway, having director redundancy, as in the original proposal, allows
> Backup and Restore Services HA,
>

Yes, HA is different than DR. Thank you.

> since both would be almost always online (even lacking the failed running
> jobs redistribution, as pointed out by Dimitri).
>
> First of all, backup is one of the services managed by any IT
> department.
> So as a service it should run without problems and maintain a
> good availability level. Just take a look at maintaining an Oracle RDBMS with
> the best backup and recovery solution using the Bacula Oracle SBT Plugin. With
> this plugin you can set up two kinds of backups: online database file
> backups and archived log backups. Together they allow for perfect
> Point-In-Time-Recovery. The first one can be executed once a day, once a
> week, etc., but the second one should be executed as frequently as
> possible to maintain the best RPO possible.
>
> I see this as the Disaster Recovery levels or dimensions [T. Wood, E.
> Cecchet, K. K. Ramakrishnan, P. J. Shenoy, J. E. van der Merwe, and A.
> Venkataramani, “Disaster Recovery as a Cloud Service: Economic Benefits &
> Deployment Challenges.,” *HotCloud*, vol. 10, pp. 8–15, 2010.]:
>

I checked this paper and it proves my point of view on what DR is and what HA is... in every single word.
For a few minutes I thought that all I had learned about High Availability and Disaster Recovery in my >20 years of Enterprise experience had been redefined backwards. :) I see, not yet.

What I see in your post: every time you describe a great DR solution you do not name it DR but you name it HA, which is not true.

"Speaking of Bacula HA, I've been deploying a scenario with relative success. Primary Director & SD have copy jobs routines to a Secondary Remote SD..."

>
> Data level: Security of application data
> System level: Reducing recovery time as short as possible
> Application level: Application continuity
>
> To achieve this you have to maintain a backup service as highly available
> as possible by eliminating SPOFs (single points of failure). For the above
> breakdowns you have to multiply components, i.e. bring two network
> adapters, create a RAID, create a cluster, put every cluster node in a
> separate rack, etc. All this allows you to achieve a High Availability
> service with zero data loss in case of failover.
> For a Datacenter it is
> always a different story! If you need to fail over a datacenter then you
> always lose data! This is because Bacula replication is asynchronous,
> so it is not possible to have up-to-date archives on both sides at any
> given time.
>
> You will always have a lag. On the other hand, you can implement
> block-level replication, which could be synchronous, but this kind of solution does
> not work with tapes and, when synchronous, it has a huge impact on
> performance. In most cases synchronous block-level replication at large
> scale and over long distances requires a lot of cash! Synchronous block-level
> replication should never be used as part of a Backup DR solution, because a
> single block corruption can lead to corruption of the whole filesystem and loss
> of archive volumes! So, back to asynchronous Bacula replication - did I
> mention it will create a lag, so your RPO > 0. :)
>
> This is true for the most recent backups, but there are ways of mitigating
> this (redundant jobs, simultaneous backup to two different jobs (if ever
> developed)).
> Synchronous or Asynchronous replication will always have = 1 RPO, the only
> difference is the data outdating.
>

I see we have very different dictionaries here, so we cannot reach the same conclusion. In my dictionary RPO = Recovery Point Objective means the Point-In-Time to which I can recover my data.
It is a shame that in such a strict science as IT, in two different locations of the world using the same language, we have such a difference in the meanings of words and statements.

I can cite the RPO definition used in the linked paper:
*"Recovery Point Objective (RPO): The RPO of a DR system represents the point in time of the most recent backup prior to any failure.
The necessary RPO is generally a business decision—for some applications absolutely no data can be lost (RPO=0), requiring continuous synchronous replication to be used, while for other applications, the acceptable data loss could range from a few seconds to hours or even days."*

As I understand it: RPO defines the acceptable data loss (data outdating) in any DR system, and it can range from RPO=0 for a continuous synchronous replication solution up to a few seconds, hours or even days for others.

> In any HA solution you would assure that your services are running with the
>> highest uptime possible and this kind of solution in most cases is
>> implemented with clusters. In this case you can lose currently uncommitted
>> data (running jobs) but your services are ready to proceed with the next jobs as
>> soon as possible.
>>
>> I disagree a little bit. The purpose of replication is to provide the best
>> possible RTO.
>>
>
> So, let's compare:
> Shared storage Cluster HA: RPO - no data loss; RTO - automatic failover,
> seconds from failure detection to recovery;
> Asynchronous Replication in Bacula: RPO - hours, minutes at best, in most
> cases a single day; RTO - manual switchover - hours;
>
> Disagree. Director redundancy provides near zero RTO to the backup and
> restore service.
>

Why do you disagree with the same statement?
- RTO - automatic failover, seconds from failure detection to recovery;
vs.
- ... near zero RTO to the backup and restore service;
In my opinion near zero falls into the seconds timeline, no?

I can cite the RTO definition used in the linked paper:
*"Recovery Time Objective (RTO): The RTO is an orthogonal business decision that specifies a bound on how long it can take for an application to come back online after a failure occurs."*

> High Availability solutions focus on Service levels and are not designed
> to handle disasters.
>
> A power supply failure is a disaster. =)
>

SPOF? If you design a critical service then you have to follow the path of avoiding SPOF.
Single power supply failed? Not a problem, as we have a redundant power supply. So it is not a disaster at all. You can manage it transparently. The same applies to other hardware components.

> Disaster Recovery solutions focus on disasters and are not designed for
> fast and easy backup service switchover. Different solutions for different
> purposes. The Enterprise wants them both!
>
> For obvious reasons, Bacula cannot re-distribute a failed backup job yet
>> (perhaps never will), but I don't think it is necessarily a problem for
>> Replication.
>>
>> HA implementation in Bacula is extremely straightforward when using a
>> shared storage clustering solution.
>>
>>
>>> Both directors can access the Secondary SD.
>>> An Admin Job with a Shell Script daily bscans all volumes into the
>>> Secondary Director and its catalog.
>>> All bscanned volumes come with the Archived status, so they are
>>> basically Read-Only.
>>> Advantage: you can restore jobs from both environments, any time. =>
>>> http://bacula.us/bacula-server-and-backups-replication-for-high-availability/
>>> Perhaps a "bscan all" bconsole command would be a nice feature to sync
>>> all disk based volumes to the catalog and improve the process a little bit
>>> more.
>>>
>>
>> This is a Disaster Recovery solution. A One-Way Failover. :)
>>
>> Pot8to, potato. =)
>> From Bacula perspective that's what we have today.
>>
>
> What?
> When you implement Bacula in a shared storage cluster, you can fail over the
> backup service from node to node in any direction in just seconds.
>
> You will have running backups outdated anyway.
>

Yes! But backups scheduled even a few minutes in the future can execute without a problem. You only have to restart the backups that were running during the failure. A lot of enterprise customers value this kind of solution.

>
> Radoslaw: of course my proposal doesn't work for all case scenarios - far
> from that. It is conceptual and provocative.
> Bscan needs to be improved to have an option to skip already synced
> volumes (perhaps a volume metadata hash comparison? don't know).
> Also, Volume name wildcards or any way to easily select multiple volumes,
> maybe even allow bscan to be called from bconsole.
>

Heitor - I'm not criticizing your proposal. I am pointing out your mistake about what DR is and what HA is. That's all.

best regards
-- 
Radosław Korzeniewski
rados...@korzeniewski.net
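
P.S. The RPO and RTO definitions cited above from the paper can be made concrete with a minimal sketch in Python. Everything in it is an assumption made for illustration only: the helper names `achieved_rpo` and `achieved_rto`, the timestamps, and the 30-second and 3-hour recovery times are hypothetical and are not taken from Bacula or from any measurement.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_usable_backup: datetime, failure: datetime) -> timedelta:
    """Data-loss window: time between the most recent usable copy and the failure."""
    if last_usable_backup > failure:
        raise ValueError("the last usable backup must not be later than the failure")
    return failure - last_usable_backup


def achieved_rto(failure: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the failure and the service being back online."""
    if service_restored < failure:
        raise ValueError("the service cannot be restored before it failed")
    return service_restored - failure


# Hypothetical failure time, chosen only for the example.
failure = datetime(2019, 4, 19, 14, 0)

# Synchronous replication: the replica is current at the moment of failure.
sync_rpo = achieved_rpo(failure, failure)

# Asynchronous replication: assume the last copy job finished the night before.
async_rpo = achieved_rpo(datetime(2019, 4, 18, 23, 0), failure)

# Cluster failover (assumed 30 seconds) vs. manual switchover (assumed 3 hours).
cluster_rto = achieved_rto(failure, failure + timedelta(seconds=30))
manual_rto = achieved_rto(failure, failure + timedelta(hours=3))

print("sync RPO:", sync_rpo, "| async RPO:", async_rpo)
print("cluster RTO:", cluster_rto, "| manual RTO:", manual_rto)
```

With these assumed numbers the asynchronous copy is 15 hours behind at the moment of failure, which is the lag (RPO > 0) discussed above, while the two RTO values differ only in how the switchover is performed.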