Outage on apt.puppetlabs.com (and then yum.puppetlabs.com)

Timeline:

First reports of issues: 17 Nov 2015 1939GMT

Resolved at: 17 Nov 2015 2051GMT

Impact: users attempting to get packages from apt.puppetlabs.com were
unable to do so in the vast majority of cases for the duration of the
outage. Several users contacted us via IRC, Twitter, and other means.

What happened?

The release engineering team at Puppet Labs has been trying to make
shipping new versions of software more efficient, and to that end has been
refactoring our packaging repo <https://github.com/puppetlabs/packaging>
and workflows. For some background, much of the download infrastructure and
shipping workflow was designed and set up in 2011. The characteristics of
our systems were a bit different then. In 2011, we were getting thousands
of hits a day. Now we get 10+ million a week.

We’ve been migrating toward having a staging system inside Puppet Labs
where all metadata is generated and then syncing out the deltas between the
internal staging servers and the public infrastructure. This will allow for
more flexibility with our download servers as well as a more robust safety
net.

Our testing last week indicated this should work without issue. During the
preliminary steps of shipping puppet-agent 1.3.0[1] today, we found that
not to be the case. The sync job was syncing a very large amount of content
(80+ GB) and not just the deltas, due to the way freight
<https://github.com/rcrowley/freight>[2] processes Debian package metadata.
Because of that, we were in an inconsistent state for a while, which caused
403 or 404 errors at various points during the outage.
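
To illustrate what "syncing only the deltas" means, here is a minimal
sketch in Python. It is not our actual sync job: the compute_delta
function, the manifest layout (relative path mapped to a checksum) and the
example paths are all assumptions made for illustration.

```python
def compute_delta(staging, public):
    """Compare two {relative_path: checksum} manifests.

    Returns (to_upload, to_delete): paths that are new or changed on
    staging, and paths that exist only on the public side.
    """
    to_upload = sorted(
        path for path, digest in staging.items()
        if public.get(path) != digest
    )
    to_delete = sorted(path for path in public if path not in staging)
    return to_upload, to_delete


# Hypothetical manifests; real ones would be generated from the repo trees.
staging = {
    "pool/p/puppet-agent_1.3.0.deb": "aaa",
    "dists/trusty/Release": "bbb2",
    "pool/p/puppet-agent_1.2.7.deb": "ccc",
}
public = {
    "dists/trusty/Release": "bbb1",
    "pool/p/puppet-agent_1.2.7.deb": "ccc",
    "pool/o/old.deb": "ddd",
}

to_upload, to_delete = compute_delta(staging, public)
print(to_upload)  # ['dists/trusty/Release', 'pool/p/puppet-agent_1.3.0.deb']
print(to_delete)  # ['pool/o/old.deb']
```

If the metadata is regenerated in a way that changes most checksums, a
comparison like this marks nearly everything as changed, and the "delta"
becomes close to the full repository.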

Additionally, this uncovered some bad assumptions that we make in our
shipping automation, including that we will always be shipping deb
packages. Despite the fact that we were only trying to ship gems for the
individual components of puppet-agent 1.3.0, freight was running under the
assumption that deb packages had been shipped.

What we did to restore service:

We failed over to our secondary repository server at approximately
20:20GMT, but continued to serve 403 and 404 errors for another 30 minutes
while we worked out whether the problem was bad permissions or missing
files. We were hampered by the number of spurious 404 errors that
apt.puppetlabs.com regularly serves, because the apt-get tool looks for a
number of optional internationalization files whenever it runs. Once we
determined that disk permissions were the root issue, permissions were
corrected and all of the expected files were once again available.

The bad permissions were a result of the synchronization between our first
and second systems, which did not explicitly preserve file permissions on
copy, instead relying on the receiver to set the correct permissions for
access. This resulted in a number of directories being created without
global read permissions.
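
A check along the following lines could flag this class of problem after a
sync. This is a sketch only, not something we ran during the outage: the
find_unreadable name is hypothetical, and it assumes a POSIX filesystem
where the web server reads files through the "other" permission bits.

```python
import os
import stat


def find_unreadable(root):
    """Return sorted paths below root that are not world-accessible.

    Directories need both the world-read and world-execute bits so they
    can be listed and traversed; regular files only need world-read.
    """
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            if stat.S_ISDIR(mode):
                needed = stat.S_IROTH | stat.S_IXOTH
            else:
                needed = stat.S_IROTH
            if mode & needed != needed:
                problems.append(path)
    return sorted(problems)
```

Running this against the document root on the receiving server right after
a sync, and refusing to go live if it returns anything, would turn a silent
permissions problem into an explicit failure.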

A second outage

Around 18 Nov 2015 1800GMT we began hearing reports of additional outages
affecting yum.puppetlabs.com and apt.puppetlabs.com. As part of routine
maintenance on our repository host, we had failed over from the primary to
the backup. Unfortunately, as a result of yesterday's puppet-agent 1.3.0
shipping, the backup was out of sync with the primary. This caused metadata
mismatch errors. It also made the puppet-agent packages unavailable. Around
1845GMT we failed back over from the backup to the primary. Maintenance on
the primary server will be postponed until the backup is fully synced.


Next Steps:

To fix this, we’re going to ship all of the internal metadata that freight
uses, and ship much smaller payloads (as was the original plan).
Additionally, prior to transitioning from the primary to the backup server,
we will confirm that the two servers are in sync with each other.
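
As a rough sketch of what such a pre-failover check could look like (an
illustration, not our actual tooling), the Python below hashes every file
in two trees and only reports success when the results match. The function
names are hypothetical, and it assumes both repository trees are readable
from the machine running the check; in practice each server would more
likely produce its own manifest for comparison.

```python
import hashlib
import os


def tree_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large packages are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(path, root)] = digest.hexdigest()
    return manifest


def safe_to_fail_over(primary_root, backup_root):
    """Return True only if both trees hold identical files."""
    return tree_manifest(primary_root) == tree_manifest(backup_root)
```

Hashing a repository of this size is slow, so a cheaper variant could limit
the comparison to the repository metadata files.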

We apologize for the outage,

Puppet Labs Release Engineering




-----------

[1] puppet-agent 1.3.0 is on the public download sites as of ~1845 GMT
11/18.

[2] We’re also looking into replacing freight, because our repository has
scaled well past the size at which freight was an ideal tool for us.
-- 
Morgan Haskel
mor...@puppetlabs.com
Release Engineer
