Indeed. Worth rereading for that reason alone (or for that reason in particular).
Miles Fidelman
Hank Nussbacher wrote:
On 23/07/2021 09:24, Hank Nussbacher wrote:
From Akamai. How companies and vendors should report outages:
[07:35 UTC on July 24, 2021] Update:
Root Cause:
A configuration directive was sent to the load balancing component as
part of preparation for independent load balancing control of a
forthcoming product. Updates
to the configuration directive for this load balancing component have
routinely been made on approximately a weekly basis. (Further changes
to this configuration channel have been blocked until additional
safety measures have been implemented, as noted in Corrective and
Preventive Actions.)
The load balancing configuration directive included a formatting
error. As a safety measure, the load balancing component disregarded
the improper configuration and fell back to a minimal configuration.
In this minimal, VIP-only state, it did not support load balancing for
Enhanced TLS slots greater than 6145.
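
The fallback described above is a classic fail-safe parsing pattern: reject
the malformed directive rather than apply it, and drop to a known-minimal
state. A minimal sketch in Python, with hypothetical names (load_directive,
MINIMAL_VIP_ONLY, a JSON wire format) standing in for whatever Akamai
actually runs:

    import json

    # Assumed fallback state: VIP-only, no per-slot load balancing data.
    MINIMAL_VIP_ONLY = {"mode": "vip-only", "slots": {}}

    def load_directive(raw: str) -> dict:
        """Return the parsed directive, or fall back to the minimal
        configuration if the directive is malformed."""
        try:
            directive = json.loads(raw)      # formatting check
            if "slots" not in directive:     # basic sanity check
                raise ValueError("missing slots table")
            return directive
        except ValueError:                   # includes json.JSONDecodeError
            # Safety measure: disregard the improper configuration.
            # Side effect seen in this incident: the minimal state
            # carries no data for Enhanced TLS slots above 6145.
            return MINIMAL_VIP_ONLY

Note the trade-off the incident exposed: failing safe at the component
level still removed data that a downstream system depended on.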
The missing load balancing data meant that the Akamai authoritative
DNS system for the akamaiedge.net zone would not receive any directive
for how to respond to DNS queries for many Enhanced TLS slots. The
authoritative DNS system responds with a SERVFAIL when there is no
directive, because during localized failures resolvers will then retry
an alternate authority.
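
The SERVFAIL-on-missing-directive behavior can be sketched as follows.
The rcode values are the standard ones from RFC 1035; the function and
data shapes are illustrative assumptions, not Akamai's code:

    # Hypothetical sketch of the authoritative answer path.
    NOERROR, SERVFAIL = 0, 2     # rcodes per RFC 1035

    def answer_for_slot(directives: dict, slot: int) -> tuple:
        directive = directives.get(slot)
        if directive is None:
            # No load balancing data for this slot: SERVFAIL tells
            # the resolver to retry an alternate authority instead
            # of caching a wrong answer.
            return SERVFAIL, []
        return NOERROR, directive["records"]

With the minimal VIP-only configuration in place, every Enhanced TLS
slot above 6145 fell into the SERVFAIL branch, and since all
authorities were serving the same minimal data, retrying an alternate
authority could not help.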
The zoning process used for deploying configuration changes to the
network includes an alert check for potential issues caused by the
configuration changes. The zoning process did result in alerts during
the deployment. However, due to how the particular safety check was
configured, the alerts for this load balancing component did not
prevent the configuration from continuing to propagate, and did not
result in escalation to engineering SMEs. The input safety check on
the load balancing component also did not automatically roll back the
change upon detecting the error.
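
Taken together, the two gaps were an alert that did not gate propagation
and an input check that did not trigger rollback. A hedged sketch of what
a blocking gate with automatic rollback might look like, using invented
names throughout rather than the actual zoning process:

    from typing import Callable, List

    def gated_deploy(current: dict,
                     candidate: dict,
                     apply_config: Callable[[dict], None],
                     check_alerts: Callable[[], List[str]],
                     escalate: Callable[[List[str]], None]) -> bool:
        """Propagate the candidate config to one zone; on any alert,
        roll back to the previous config, escalate to the component
        SMEs, and block further propagation."""
        apply_config(candidate)          # propagate to the next zone
        alerts = check_alerts()
        if alerts:
            apply_config(current)        # automatic rollback
            escalate(alerts)             # page the component SMEs
            return False                 # block remaining zones
        return True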
Contributing Factors:
The internal alerting which was specific to the load balancing
component did not result in blocking the configuration from
propagating to the network, and did not result in an escalation to the
SMEs for the component.
The alert and associated procedure indicating widespread SERVFAILs
potentially due to issues with mapping systems did not lead to an
appropriately urgent and timely response.
The internal alerting that fired and was escalated to SMEs was
for a separate component that consumes the load balancing data. It
initially fired for the Edge DNS system rather than the mapping
system, which delayed troubleshooting of the mapping system and of
the load balancing component that had received the configuration
change. Subsequent internal alerts more clearly indicated an issue
with the mapping system.
The Enhanced TLS disruption also affected Akamai staff access
to internal tools and websites, which delayed escalation of
alerts, troubleshooting, and especially initiation of the incident
process.
Short Term
Completed:
Akamai completed rolling back the configuration change at 16:44
UTC on July 22, 2021.
Blocked any further changes to the involved configuration channel.
Other related channels are being reviewed and may be subject to a
similar block as reviews take place. Channels will be unblocked after
additional safety measures are assessed and implemented where needed.
In Progress:
Validate and strengthen the safety checks for the configuration
deployment zoning process.
Increase the sensitivity and priority of alerting for high rates
of SERVFAILs.
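
A SERVFAIL-rate alert of the kind described in the last item might look
roughly like the sketch below; the window size and the 5% threshold are
illustrative assumptions, not Akamai's values:

    from collections import deque

    class ServfailMonitor:
        def __init__(self, window: int = 10_000, threshold: float = 0.05):
            self.responses = deque(maxlen=window)   # recent rcodes
            self.threshold = threshold

        def record(self, rcode: int) -> bool:
            """Record one response; return True when the SERVFAIL
            rate over the recent window crosses the threshold."""
            self.responses.append(rcode == 2)       # 2 == SERVFAIL
            rate = sum(self.responses) / len(self.responses)
            return rate >= self.threshold

Raising both the sensitivity (threshold) and the priority of the
resulting page is what the corrective action amounts to in practice.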
Long Term
In Progress:
Reviewing and improving input safety checks for mapping components.
Auditing critical systems to identify gaps in monitoring and
alerting, then closing unacceptable gaps.
On 22/07/2021 19:34, Mark Tinka wrote:
https://edgedns.status.akamai.com/
Mark.
[18:30 UTC on July 22, 2021] Update:
Akamai experienced a disruption with our DNS service on July 22,
2021. The disruption began at 15:45 UTC and lasted for approximately
one hour. Affected customer sites were significantly impacted for
connections that had not been established before the incident began.
Our teams identified that a change made in a mapping component was
causing the issue, and in order to mitigate it we rolled the change
back at approximately 16:44 UTC. We can confirm this was not a
cyberattack against Akamai's platform. Immediately following the
rollback, the platform stabilized and DNS services resumed normal
operations. At this time the incident is resolved, and we are
monitoring to ensure that traffic remains stable.
--
In theory, there is no difference between theory and practice.
In practice, there is. .... Yogi Berra
Theory is when you know everything but nothing works.
Practice is when everything works but no one knows why.
In our lab, theory and practice are combined:
nothing works and no one knows why. ... unknown