Outage on apt.puppetlabs.com (and then yum.puppetlabs.com)

Timeline:
First reports of issues: 17 Nov 2015 1939 GMT
Resolved at: 17 Nov 2015 2051 GMT

Impact:
For the duration of the outage, users attempting to get packages from apt.puppetlabs.com were unable to do so in the vast majority of cases. Several users contacted us via IRC, Twitter and other means.

What happened?
The release engineering team at Puppet Labs has been trying to make shipping new versions of software more efficient, and has therefore been refactoring our packaging repo <https://github.com/puppetlabs/packaging> and workflows.

For some background, much of the download infrastructure and shipping workflow was designed and set up in 2011. The characteristics of our systems were a bit different then. In 2011, we were getting thousands of hits a day. Now we get 10+ million a week.

We've been migrating toward a staging system inside Puppet Labs where all metadata is generated, and then syncing out the deltas between the internal staging servers and the public infrastructure. This will allow for more flexibility with our download servers as well as a more robust safety net. Our testing last week indicated this should work without issue.

During the preliminary steps of shipping puppet-agent 1.3.0[1] today, we found that not to be the case. The sync job was syncing a very large amount of content (80+ GB), not just the deltas, due to the way freight <https://github.com/rcrowley/freight>[2] processes Debian package metadata. Because of that, the repository was in an inconsistent state for a while, which caused 403 or 404 errors at various points during the outage.

Additionally, this uncovered some bad assumptions that we make in our shipping automation, including that we will always be shipping deb packages. Although we were only trying to ship gems for the individual components of puppet-agent 1.3.0, freight was running under the assumption that deb packages had been shipped.
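
As an illustration of the "sync only the deltas" idea described above, here is a minimal sketch in Python that compares a staging tree with a public tree by content hash and lists what would have to be copied or deleted. The function names (build_manifest, compute_delta) and the directory layout are hypothetical; nothing here is taken from the packaging repo or from freight, and it is not a description of the actual sync job.

```python
import hashlib
import os


def build_manifest(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as fh:
                manifest[rel] = hashlib.sha256(fh.read()).hexdigest()
    return manifest


def compute_delta(staging, public):
    """Return (to_copy, to_delete) needed to make public match staging.

    Both arguments are manifests as produced by build_manifest.
    """
    # Copy anything that is new in staging or whose contents differ.
    to_copy = sorted(p for p, digest in staging.items() if public.get(p) != digest)
    # Delete anything that exists publicly but is gone from staging.
    to_delete = sorted(p for p in public if p not in staging)
    return to_copy, to_delete
```

A delta computed this way stays small only if files that have not logically changed keep identical contents from one run to the next. Comparing the size of to_copy against an expected threshold before starting a sync is one possible guard against unexpectedly large transfers.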

What we did to restore service:
We failed over to our secondary repository server at approximately 20:20 GMT, but continued to serve 403 and 404 errors for another 30 minutes while we worked out whether the problem was bad permissions or missing files. We were hampered by the number of spurious 404 errors that apt.puppetlabs.com regularly serves, because the apt-get tool looks for a number of optional internationalization files whenever it runs. Once we determined that disk permissions were the root issue, we corrected the permissions and all of the expected files were once again available.

The bad permissions were a result of the synchronization between our first and second systems, which did not explicitly preserve file permissions on copy and instead relied on the receiver to set the correct permissions for access. This resulted in a number of directories being created without global read permissions.

A second outage
Around 18 Nov 2015 1800 GMT we began hearing reports of additional outages affecting yum.puppetlabs.com and apt.puppetlabs.com. As part of routine maintenance on our repository host we failed over from primary to backup. Unfortunately, as a result of the puppet-agent 1.3.0 shipping from yesterday, the backup was out of sync with the primary. This caused metadata mismatch errors. It also made the puppet-agent packages unavailable.

Around 1845 GMT we failed back over from the backup to the primary. Maintenance on the primary server will be postponed until the backup is fully synced.

Next Steps:
To fix this, we're going to ship all the internal metadata bits that freight uses, and ship much smaller payloads (as was the original plan). Additionally, prior to transitioning from the primary to the backup server, we will confirm that the servers are up to date.

We apologize for the outage,
Puppet Labs Release Engineering

-----------
[1] puppet-agent 1.3.0 is on the public download sites as of ~1845 GMT 11/18.
[2] We're also looking into replacing freight, because our repository has scaled well past the size at which freight was an ideal tool for us.

--
Morgan Haskel
mor...@puppetlabs.com
Release Engineer