Indeed. Worth rereading for that reason alone (or for that reason in particular).
Miles Fidelman
Hank Nussbacher wrote:
On 23/07/2021 09:24, Hank Nussbacher wrote:
From Akamai. How companies and vendors should report outages:
[07:35 UTC on July 24, 2021] Update:
Root Cause:
A configuration directive was sent to the load balancing component as
part of preparation for independent load balancing control of a
forthcoming product. Updates
to the configuration directive for this load balancing component have
routinely been made on approximately a weekly basis. (Further changes
to this configuration channel have been blocked until additional
safety measures have been implemented, as noted in Corrective and
Preventive Actions.)
The load balancing configuration directive included a formatting
error. As a safety measure, the load balancing component disregarded
the improper configuration and fell back to a minimal configuration.
In this minimal, VIP-only state, it did not support load balancing for
Enhanced TLS slots greater than 6145.
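
The fallback described above is a classic fail-safe parsing pattern: reject
the malformed directive rather than apply it, and drop to a known-minimal
state. A minimal sketch in Python, with hypothetical names (load_directive,
MINIMAL_VIP_ONLY, a JSON wire format) standing in for whatever Akamai
actually runs:

    import json

    # Assumed fallback state: VIP-only, no per-slot load balancing data.
    MINIMAL_VIP_ONLY = {"mode": "vip-only", "slots": {}}

    def load_directive(raw: str) -> dict:
        """Return the parsed directive, or fall back to the minimal
        configuration if the directive is malformed."""
        try:
            directive = json.loads(raw)      # formatting check
            if "slots" not in directive:     # basic sanity check
                raise ValueError("missing slots table")
            return directive
        except ValueError:                   # includes json.JSONDecodeError
            # Safety measure: disregard the improper configuration.
            # Side effect seen in this incident: the minimal state
            # carries no data for Enhanced TLS slots above 6145.
            return MINIMAL_VIP_ONLY

Note the trade-off the incident exposed: failing safe at the component
level still removed data that a downstream system depended on.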
The missing load balancing data meant that the Akamai authoritative
DNS system for the akamaiedge.net zone would not receive any directive
for how to respond to DNS queries for many Enhanced TLS slots. The
authoritative DNS system responds with a SERVFAIL when there is no
directive, because during localized failures resolvers will then retry
an alternate authority.
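
The SERVFAIL-on-missing-directive behavior can be sketched as follows.
The rcode values are the standard ones from RFC 1035; the function and
data shapes are illustrative assumptions, not Akamai's code:

    # Hypothetical sketch of the authoritative answer path.
    NOERROR, SERVFAIL = 0, 2     # rcodes per RFC 1035

    def answer_for_slot(directives: dict, slot: int) -> tuple:
        directive = directives.get(slot)
        if directive is None:
            # No load balancing data for this slot: SERVFAIL tells
            # the resolver to retry an alternate authority instead
            # of caching a wrong answer.
            return SERVFAIL, []
        return NOERROR, directive["records"]

With the minimal VIP-only configuration in place, every Enhanced TLS
slot above 6145 fell into the SERVFAIL branch, and since all
authorities were serving the same minimal data, retrying an alternate
authority could not help.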
The zoning process used for deploying configuration changes to the
network includes an alert check for potential issues caused by the
configuration changes. The zoning process did result in alerts during
the deployment. However, due to how the particular safety check was
configured, the alerts for this load balancing component did not
prevent the configuration from continuing to propagate, and did not
result in escalation to engineering SMEs. The input safety check on
the load balancing component also did not automatically roll back the
change upon detecting the error.
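
Taken together, the two gaps were an alert that did not gate propagation
and an input check that did not trigger rollback. A hedged sketch of what
a blocking gate with automatic rollback might look like, using invented
names throughout rather than the actual zoning process:

    from typing import Callable, List

    def gated_deploy(current: dict,
                     candidate: dict,
                     apply_config: Callable[[dict], None],
                     check_alerts: Callable[[], List[str]],
                     escalate: Callable[[List[str]], None]) -> bool:
        """Propagate the candidate config to one zone; on any alert,
        roll back to the previous config, escalate to the component
        SMEs, and block further propagation."""
        apply_config(candidate)          # propagate to the next zone
        alerts = check_alerts()
        if alerts:
            apply_config(current)        # automatic rollback
            escalate(alerts)             # page the component SMEs
            return False                 # block remaining zones
        return True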
Contributing Factors:
The internal alerting which was specific to the load balancing
component did not result in blocking the configuration from
propagating to the network, and did not result in an escalation to the
SMEs for the component.
The alert and associated procedure indicating widespread SERVFAILs
potentially due to issues with mapping systems did not lead to an
appropriately urgent and timely response.
The internal alerting that fired and was escalated to SMEs was
for a separate component that consumes the load balancing data. It
initially fired for the Edge DNS system rather than the mapping
system, which delayed troubleshooting of the mapping system and of
the load balancing component that had received the configuration
change. Subsequent internal alerts more clearly indicated an issue
with the mapping system.
The Enhanced TLS disruption also affected Akamai staff access
to internal tools and websites, which delayed escalation of
alerts, troubleshooting, and especially initiation of the incident
process.
Short Term
Completed:
Akamai completed rolling back the configuration change at 16:44
UTC on July 22, 2021.
Blocked any further changes to the involved configuration channel.
Other related channels are being reviewed and may be subject to a
similar block as reviews take place. Channels will be unblocked after
additional safety measures are assessed and implemented where needed.
In Progress:
Validate and strengthen the safety checks for the configuration
deployment zoning process.
Increase the sensitivity and priority of alerting for high rates
of SERVFAILs.
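
A SERVFAIL-rate alert of the kind described in the last item might look
roughly like the sketch below; the window size and the 5% threshold are
illustrative assumptions, not Akamai's values:

    from collections import deque

    class ServfailMonitor:
        def __init__(self, window: int = 10_000, threshold: float = 0.05):
            self.responses = deque(maxlen=window)   # recent rcodes
            self.threshold = threshold

        def record(self, rcode: int) -> bool:
            """Record one response; return True when the SERVFAIL
            rate over the recent window crosses the threshold."""
            self.responses.append(rcode == 2)       # 2 == SERVFAIL
            rate = sum(self.responses) / len(self.responses)
            return rate >= self.threshold

Raising both the sensitivity (threshold) and the priority of the
resulting page is what the corrective action amounts to in practice.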
Long Term
In Progress:
Reviewing and improving input safety checks for mapping components.
Auditing critical systems to identify gaps in monitoring and
alerting, then closing unacceptable gaps.
On 22/07/2021 19:34, Mark Tinka wrote:
https://edgedns.status.akamai.com/
Mark.
[18:30 UTC on July 22, 2021] Update:
Akamai experienced a disruption with our DNS service on July 22,
2021. The disruption began at 15:45 UTC and lasted for approximately
one hour. Affected customer sites were significantly impacted for
connections that had not been established before the incident began.
Our teams identified that a change made in a mapping component was
causing the issue, and in order to mitigate it we rolled the change
back at approximately 16:44 UTC. We can confirm this was not a
cyberattack against Akamai's platform. Immediately following the
rollback, the platform stabilized and DNS services resumed normal
operations. At this time the incident is resolved, and we are
monitoring to ensure that traffic remains stable.
--
In theory, there is no difference between theory and practice.
In practice, there is. .... Yogi Berra
Theory is when you know everything but nothing works.
Practice is when everything works but no one knows why.
In our lab, theory and practice are combined:
nothing works and no one knows why. ... unknown