GitHub user daviftorres added a comment to the discussion: Additional Zone vs 
Region - US East and West

Dear @justinestruch and @NuxRo ,

This is a discussion that we circle back to every week: how to expand 
geographically while keeping latency as low as possible and adding resiliency 
to our infrastructure.

Currently, we have one primary region with the core management of CloudStack, 
and multiple satellite (remote) Zones geographically dispersed. Here are my 
observations:

Facts:
- We use 4 management servers, where:
  - #1 primarily responds to UI/API. If unhealthy, it fails over to #3, then 
    #4, then #2.
  - #2 is the first priority for Agents to connect, followed by #4, #3, and #1.
    - Note:
      - #1 and #2 run on dedicated hosts (bare metal),
      - #3 and #4 share computing resources (virtualized) in a different 
        failure domain.
- Primary -> Secondary database replication.
  - The Primary has dedicated hardware,
  - The Secondary shares computing resources in a different failure domain.
  - Additionally, an encrypted database snapshot is taken every 15 min and 
    sent to two different geographies for DR.
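
To make the two priority orders above concrete, here is a minimal sketch of the selection logic. It is only an illustration: the `pick_server` helper and the set of healthy servers are hypothetical, and in a real deployment this decision would be made by the load balancer and the Agents' management server list, not by a script like this.

```python
from typing import Iterable, Optional, Sequence

# Priority orders taken from the Facts above.
UI_API_ORDER = (1, 3, 4, 2)  # UI/API traffic: #1 first, then #3, #4, #2
AGENT_ORDER = (2, 4, 3, 1)   # Agent connections: #2 first, then #4, #3, #1


def pick_server(order: Sequence[int], healthy: Iterable[int]) -> Optional[int]:
    """Return the first server in priority order that is healthy.

    `healthy` is a hypothetical input, e.g. the result of health checks.
    Returns None when no server in the list is healthy.
    """
    healthy_set = set(healthy)
    for server in order:
        if server in healthy_set:
            return server
    return None
```

For example, if #1 and #3 are down, `pick_server(UI_API_ORDER, {2, 4})` selects #4 for UI/API, while Agents stay on #2.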

Observations:
- Even with multiple Zones across North America connected to the same core 
management, we have:
  - No latency issue between management servers and database (all in the same 
    geography),
  - No perceptible latency to the satellite Zones (~5,000 km apart).
- Resiliency to a datacenter (region) blackout:
  - With automation, a new CloudStack core management can be deployed in 
    minutes,
  - Only the DB needs to be restored, which is similar to proposal number 1 
    from @NuxRo.
    - Note: promoting a DB to primary is not trivial and carries the risk of 
      causing a split brain.

Reflections:
- The focus of the architecture has to be on the DB.
  - Replication, Galera, InnoDB Cluster, you name it!
  - Point in time recovery: Snapshot, Dumps, Replication, etc.
- Placing management servers and databases in different failure domains 
  addresses partial DC outages.
- Automation plus Snapshots cover the need for DR after a total DC outage.
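
As a rough way to reason about what a snapshot-only restore would lose, here is a small sketch. It assumes a snapshot is only usable once it has fully arrived at the remote site; the `copy_time` parameter and both helper names are hypothetical, and the timestamps in the usage note are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def latest_usable_snapshot(
    taken_at: Sequence[datetime],
    outage: datetime,
    copy_time: timedelta,
) -> Optional[datetime]:
    """Return the newest snapshot that finished copying off-site before the outage.

    `copy_time` is an assumed, fixed time to encrypt and ship one snapshot.
    """
    usable = [t for t in taken_at if t + copy_time <= outage]
    return max(usable, default=None)


def data_loss_window(snapshot: datetime, outage: datetime) -> timedelta:
    """Writes made between the snapshot and the outage are lost on restore."""
    return outage - snapshot
```

With snapshots at 10:00, 10:15, and 10:30, a 2-minute copy time, and an outage at 10:31, the 10:30 snapshot has not finished copying, so the restore falls back to 10:15 and loses 16 minutes of writes. In general the worst case is the 15 min interval plus the copy time.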

There are many additional topics we can explore, such as serving ACS at the 
edge (similar to a CDN) with caching and BGP advertising in multiple locations, 
or placing one dedicated management server in each region to serve only that 
Zone’s Agents.

Please share your thoughts; this is a highly relevant subject that warrants 
thorough discussion.

GitHub link: 
https://github.com/apache/cloudstack/discussions/12115#discussioncomment-15109007
