> Very often the corrective and preventive actions appear to be
> different versions and wordings of 'don't make mistakes', in this case:
>
> - Reviewing and improving input safety checks for mapping components
> - Validating and strengthening the safety checks for the configuration
>   deployment zoning process
>
> 'Do better' doesn't seem like a tenable solution, since I'm sure
> whoever wrote those checks did their best in the first place. So we
> must assume there are fundamental limits to what 'do better' can
> achieve, and that a similar level of outage potential exists in all
> the work we've produced and continue to produce, over which we exert
> very little control.
>
> I think the mean-time-to-repair actions described are more actionable
> than 'do better'. However, Akamai already resolved this incident very
> fast, and it may not be reasonable to expect big improvements on a
> one-hour fault-to-fix time for a big organisation with a complex
> product.
>
> One thing that comes to mind: what if Akamai assumes it cannot
> reasonably make the system fail less often, and cannot fix it faster?
> Is this particular product such that entirely independent A and B
> sides, between which clients fail over, are not possible? If it was a
> DNS problem, it seems it might have been possible to let side A fail
> entirely while clients automatically revert to B, perhaps adding some
> latency but also allowing the system to automatically detect that A
> and B are performing at an unacceptable delta.
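
A minimal sketch of that client-side failover idea, assuming two hypothetical independent sides exposed as health endpoints. The URLs, timeout, and delta threshold below are all invented for illustration; a real CDN client would fail over between resolver pools rather than HTTP hosts:

```python
import time
import urllib.request

SIDE_A = "https://a.example.com/health"  # hypothetical independent side A
SIDE_B = "https://b.example.com/health"  # hypothetical independent side B
TIMEOUT_S = 2.0      # give up on a side after this long
MAX_DELTA_S = 0.5    # flag A/B latency divergence beyond this

def timed_fetch(url: str) -> tuple[bytes | None, float]:
    """Fetch url; return (body, elapsed seconds), with body None on failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
            return resp.read(), time.monotonic() - start
    except OSError:  # URLError subclasses OSError; covers timeouts too
        return None, time.monotonic() - start

def fetch_with_failover() -> bytes | None:
    """Serve from side A; revert to B if A fails entirely."""
    body_a, t_a = timed_fetch(SIDE_A)
    if body_a is not None:
        # Probe B as well, so a growing latency delta is detected while
        # A is still nominally healthy.
        _, t_b = timed_fetch(SIDE_B)
        if abs(t_a - t_b) > MAX_DELTA_S:
            print(f"A/B delta {abs(t_a - t_b):.2f}s exceeds {MAX_DELTA_S}s")
        return body_a
    return timed_fetch(SIDE_B)[0]
```

The probe of B on the success path is what makes the "unacceptable delta" observable before A fails outright, rather than only at failover time.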
Formal verification seems like the missing option here: instead of asking the authors of the safety checks to 'do better', machine-check that no configuration the deployment pipeline can produce violates its zoning invariants.
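
As a toy illustration of what that could mean for the zoning process, the sketch below exhaustively checks a staged-deployment invariant over every rollout order. The zone names and the canary-first rule are invented stand-ins; real work here would use a tool like TLA+/TLC or Alloy, not hand-rolled Python:

```python
from itertools import permutations

ZONES = ("canary", "east", "west")

def invariant(deployed: set[str]) -> bool:
    """Non-canary zones may hold the new config only after the canary does."""
    return "canary" in deployed or not (deployed - {"canary"})

def check_all_orders() -> None:
    """Enumerate every rollout order and report invariant violations."""
    for order in permutations(ZONES):
        deployed: set[str] = set()
        for zone in order:
            deployed.add(zone)
            if not invariant(deployed):
                print(f"violation: {order} reaches {zone} before the canary")
                break

if __name__ == "__main__":
    check_all_orders()  # flags the four orderings that skip the canary
```

The point is not the toy itself but the shape of the guarantee: the violating interleavings are found mechanically, instead of being a mistake a reviewer is asked to never make.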