Should have sent this earlier, but here we go https://wikitech.wikimedia.org/wiki/Incidents/2024-10-09_calico-codfw
On Wed, Oct 9, 2024 at 4:20 PM Brooke Vibber <[email protected]> wrote:

> Yikes! And now... We know (knowing is half the battle)
>
> -- brooke
>
> On Wed, Oct 9, 2024, 12:12 AM Alexandros Kosiaris <[email protected]> wrote:
>
>> Hi,
>>
>> Well, there was one surprise, and it did explode! It was probably very
>> close to a full outage in codfw. We are still in the process of
>> documenting everything and we'll be publishing a full incident report,
>> but the TL;DR is that every single mwscript-k8s invocation in the for
>> loop creates an entire Helm release, including all of its Kubernetes
>> resources. This is by design, but it had an unforeseen consequence:
>> close to 2k Calico network policies were created (along with a ton of
>> other resources, which would have created their own set of problems).
>> Every Calico component in Kubernetes had to react to the steadily
>> increasing number of policies, which pushed some of them into their
>> resource limits, which led to throttling, then to failures, and finally
>> to a slowly cascading outage that was taking hardware nodes out of
>> rotation one by one. The last couple of hours were interesting for some
>> of us, I can tell you that.
>>
>> We are already working on plans for how to fix this (we have enough
>> action items already, including amending the design), but in the
>> meantime, until we announce that it's solved, please don't spawn
>> mwscript-k8s in a for loop or anything similar. You can of course keep
>> working with the tool to get acquainted with it, find bugs, etc.; just
>> please don't spawn hundreds, or even worse thousands, of invocations.
>>
>> Brooke, I had to kill your bash shell on deploy2002 that was doing the
>> transcodes. I am sorry about that, but despite attaching to your screen
>> I couldn't figure out how to stop it (it didn't respond to any of the
>> usual control sequences or shell job controls), and I didn't want to
>> risk one more outage (which would probably have happened once the
>> resources reached some critical number).
>>
>> On Wed, Oct 9, 2024 at 6:16 AM Reuven Lazarus <[email protected]> wrote:
>>
>>> Great to hear, thanks!
>>>
>>> As a side note for others, to highlight something Brooke said in
>>> passing: wrapping mwscript-k8s in a bash for loop is a fine idea, *as
>>> long as* you're running it with --follow or --attach. In that case,
>>> each mwscript-k8s invocation keeps running to monitor the job's
>>> output, and terminates when the job terminates. One job runs at a
>>> time, which is what you expect.
>>>
>>> Without --follow or --attach, mwscript-k8s is just the launcher: it
>>> kicks off your job, then terminates immediately. Your for loop will
>>> rapid-fire *launch* all the jobs one after another, which means
>>> hundreds of 'em might be executing simultaneously, and that might not
>>> be what you had in mind. If your job involves expensive DB operations,
>>> it *really* might not be what you had in mind.
>>>
>>> First-class dblist support will indeed make that pitfall easier to
>>> avoid. In the meantime there's nothing wrong with using a for loop,
>>> and it's what I'd do too -- but since this is a new system and nobody
>>> has well-honed intuition for it yet, I wanted to draw everyone's eye
>>> to that distinction.
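>>>
>>> To make the distinction concrete, here's a minimal sketch of the two
>>> shapes (the script name and wikis are illustrative, not a real run):
>>>
>>> # Serial: --follow blocks until each job finishes, so only one job
>>> # runs at a time.
>>> for wiki in testwiki test2wiki; do
>>>   mwscript-k8s --follow -- someScript.php --wiki="$wiki"
>>> done
>>>
>>> # Fire-and-forget: each invocation exits as soon as its job is
>>> # launched, so all the jobs end up running simultaneously.
>>> for wiki in testwiki test2wiki; do
>>>   mwscript-k8s -- someScript.php --wiki="$wiki"
>>> done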
>>>
>>> On Tue, Oct 8, 2024 at 4:37 PM Brooke Vibber <[email protected]> wrote:
>>>
>>>> I'm starting some batch maintenance of video transcodes, so I'm
>>>> exercising the new k8s-based maint script system on TMH's
>>>> requeueTranscodes.php. Good news: no surprises so far, everything's
>>>> working just fine. :D
>>>>
>>>> Since I'm running the same scripts over multiple wikis, I went ahead
>>>> and manually wrapped them in a bash for loop (rough sketch below) so
>>>> it submits one job at a time out of all.dblist. I'm using a screen
>>>> session for the wrapper loop and tailing the logs into the session so
>>>> they don't all smash out at once, plus a second manually started run
>>>> for Commons. :)
>>>>
>>>> First-class support for running over a dblist will be a very welcome
>>>> improvement, and should be pretty straightforward! Good work
>>>> everybody. :D
>>>>
>>>> The longest job (Commons) might take a couple of days to run, so
>>>> we'll see if anything explodes later! hehe
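>>>>
>>>> For the curious, the wrapper loop is roughly this shape (the script
>>>> path, task number, and dblist path are placeholders rather than the
>>>> exact invocation):
>>>>
>>>> # Inside a screen session. --follow keeps each invocation attached
>>>> # until its job finishes, so only one job is submitted at a time.
>>>> while read -r wiki; do
>>>>   mwscript-k8s --follow --comment="T123456" -- \
>>>>     requeueTranscodes.php --wiki="$wiki"
>>>> done < all.dblist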
>>>>
>>>> -- brooke
>>>>
>>>> On Wed, Sep 25, 2024 at 8:11 PM Reuven Lazarus <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> With MediaWiki at the WMF moving to Kubernetes, it's now time to
>>>>> start running manual maintenance scripts there. Any time you would
>>>>> previously SSH to a mwmaint host and run mwscript, follow these
>>>>> steps instead. The old way will continue working for a little while,
>>>>> but it will be going away.
>>>>>
>>>>> What's familiar:
>>>>>
>>>>> Starting a maintenance script looks like this:
>>>>>
>>>>> rzl@deploy2002:~$ mwscript-k8s --comment="T341553" -- Version.php --wiki=enwiki
>>>>>
>>>>> Any options for the mwscript-k8s tool, as described below, go before
>>>>> the --. After the --, the first argument is the script name;
>>>>> everything else is passed to the script. This is the same as you're
>>>>> used to passing to mwscript.
>>>>>
>>>>> What's different:
>>>>>
>>>>> - Run mwscript-k8s on a deployment host, not the maintenance host.
>>>>> Either deployment host will work; your job will automatically run in
>>>>> whichever data center is active, so you no longer need to change
>>>>> hosts when there's a switchover.
>>>>>
>>>>> - You don't need a tmux. By default the tool launches your
>>>>> maintenance script and exits immediately, without waiting for your
>>>>> job to finish. If you log out of the deployment host, your job keeps
>>>>> running on the Kubernetes cluster.
>>>>>
>>>>> - Kubernetes saves the maintenance script's output for seven days
>>>>> after completion. By default, mwscript-k8s prints a kubectl command
>>>>> that you (or anyone else) can paste and run to monitor the output or
>>>>> save it to a file.
>>>>>
>>>>> - As a convenience, you can pass -f (--follow) to mwscript-k8s to
>>>>> immediately begin tailing the script output. If you like, you can do
>>>>> this inside a tmux and keep the same workflow as before. Either way,
>>>>> you can safely disconnect and your script will continue running on
>>>>> Kubernetes.
>>>>>
>>>>> rzl@deploy2002:~$ mwscript-k8s -f -- Version.php --wiki=testwiki
>>>>> [...]
>>>>> MediaWiki version: 1.43.0-wmf.24 LTS (built: 22:35, 23 September 2024)
>>>>>
>>>>> - For scripts that take input on stdin, you can pass --attach to
>>>>> mwscript-k8s, either interactively or in a pipeline.
>>>>>
>>>>> rzl@deploy2002:~$ mwscript-k8s --attach -- shell.php --wiki=testwiki
>>>>> [...]
>>>>> Psy Shell v0.12.3 (PHP 7.4.33 — cli) by Justin Hileman
>>>>> > $wmgRealm
>>>>> = "production"
>>>>> >
>>>>>
>>>>> rzl@deploy2002:~$ cat example_url.txt | mwscript-k8s --attach -- purgeList.php
>>>>> [...]
>>>>> Purging 1 urls
>>>>> Done!
>>>>>
>>>>> - Your maintenance script runs in a Docker container which will not
>>>>> outlive it, so it can't save persistent files to disk. Ensure your
>>>>> script logs its important output to stdout, or persists it in a
>>>>> database or other remote storage.
>>>>>
>>>>> - The --comment flag sets an optional (but encouraged) descriptive
>>>>> label, such as a task number.
>>>>>
>>>>> - Using standard kubectl commands[1][2], you can check the status,
>>>>> and view the output, of your running jobs or anyone else's.
>>>>> (Example: `kube_env mw-script codfw; kubectl get pod -l username=rzl`
>>>>> -- see the short sketch after the references below.)
>>>>>
>>>>> [1]: https://wikitech.wikimedia.org/wiki/Kubernetes/Kubectl
>>>>> [2]: https://kubernetes.io/docs/reference/kubectl/quick-reference/
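>>>>>
>>>>> For example, finding one of your pods and tailing its output might
>>>>> look roughly like this (the pod name is made up; kubectl get pod
>>>>> shows the real one):
>>>>>
>>>>> rzl@deploy2002:~$ kube_env mw-script codfw
>>>>> rzl@deploy2002:~$ kubectl get pod -l username=rzl
>>>>> rzl@deploy2002:~$ kubectl logs -f mw-script.abc123
>>>>>
>>>>> kubectl logs -f streams the pod's output to your terminal until the
>>>>> job completes, much like --follow does at launch time.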
>>>>>
>>>>> What's not supported yet:
>>>>>
>>>>> - Maintenance scripts launched automatically on a timer. We're
>>>>> working on migrating them -- for now, this is for one-off scripts
>>>>> launched by hand.
>>>>>
>>>>> - If your job is interrupted (e.g. by hardware problems), Kubernetes
>>>>> can automatically move it to another machine and restart it,
>>>>> babysitting it until it completes. But we only want to do that if
>>>>> your job is safe to restart, so by default, if your job is
>>>>> interrupted, it will stay stopped until you restart it yourself.
>>>>> Soon we'll add an option to declare "this is idempotent, please
>>>>> restart it as needed," and that will be the recommended approach for
>>>>> new scripts.
>>>>>
>>>>> - No support yet for mwscriptwikiset, foreachwiki,
>>>>> foreachwikiindblist, etc., but we'll add similar functionality as
>>>>> flags to mwscript-k8s.
>>>>>
>>>>> Your feedback:
>>>>>
>>>>> Let me know by email or IRC, or on Phab (T341553
>>>>> <https://phabricator.wikimedia.org/T341553>). If mwscript-k8s
>>>>> doesn't work for you, for now you can fall back to using the mwmaint
>>>>> hosts as before -- but they will be going away. Please report any
>>>>> problems sooner rather than later, so that we can ensure the new
>>>>> system meets your needs before that happens.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Reuven, for Service Ops SRE
>>>
>>> --
>>> *Reuven Lazarus* (he/him)
>>> Staff Site Reliability Engineer
>>> Wikimedia Foundation <https://wikimediafoundation.org/>
>>
>> --
>> Alexandros Kosiaris
>> Principal Site Reliability Engineer
>> Wikimedia Foundation

--
Alexandros Kosiaris
Principal Site Reliability Engineer
Wikimedia Foundation
_______________________________________________
Wikitech-l mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
