Yikes! And now... We know (knowing is half the battle) -- brooke
On Wed, Oct 9, 2024, 12:12 AM Alexandros Kosiaris <[email protected]> wrote:

> Hi,
>
> Well, there was one surprise, and it did explode! It was probably very close to a full outage in codfw. We are still in the process of documenting everything and we'll be publishing a full incident report, but the TL;DR is that every single mwscript-k8s invocation in the for loop creates an entire helm release, including all of its k8s resources. This is by design, but it had an unforeseen consequence: close to 2k Calico Network Policies were created (along with a ton of other resources, which would have created their own set of problems). All the Calico components in k8s had to gradually react to the increasing number of policies; some of them hit their resource limits, which led to throttling, then to failures, and then to a slowly cascading outage that was taking hardware nodes out of rotation one by one. The last couple of hours were interesting for some of us, I can tell you that.
>
> We are already working on plans to fix this (we have enough action items already, including amending the design), but in the meantime, until we announce that we've solved this, please don't spawn mwscript-k8s in a for loop or anything similar. You can of course continue working with the tool to get acquainted with it, find bugs, etc.; just please don't spawn hundreds or, even worse, thousands of invocations.
>
> Brooke, I had to kill your bash shell on deploy2002 doing the transcodes. I am sorry about that, but despite attaching to your screen I didn't manage to find a way to stop it (it didn't respond to any of the usual control sequences or shell job controls), and I didn't want to risk one more outage (which would probably have happened once the resources reached some critical number).
>
> On Wed, Oct 9, 2024 at 6:16 AM Reuven Lazarus <[email protected]> wrote:
>
>> Great to hear, thanks!
>>
>> As a side note for others, to highlight something Brooke said in passing: wrapping mwscript-k8s in a bash for loop is a fine idea, *as long as* you're running it with --follow or --attach. In that case, each mwscript-k8s invocation keeps running to monitor the job's output, and terminates when the job terminates. One job runs at a time, which is what you expect.
>>
>> Without --follow or --attach, mwscript-k8s is just the launcher: it kicks off your job, then terminates immediately. Your for loop will rapid-fire *launch* all the jobs one after another, which means hundreds of 'em might be executing simultaneously, and that might not be what you had in mind. If your job involves expensive DB operations, it *really* might not be what you had in mind.
>>
>> First-class dblist support will indeed make that pitfall easier to avoid. In the meantime there's nothing wrong with using a for loop, and it's what I'd do too -- but since this is a new system and nobody has well-honed intuition for it yet, I wanted to draw everyone's eye to that distinction.
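>>
>> To make that concrete, here's a minimal sketch of the sequential pattern (the dblist path and script name are illustrative, and per the note at the top of this thread, please hold off on loops like this until the helm-release issue is fixed):
>>
>> # One wiki at a time: -f keeps each invocation attached until its job finishes.
>> for wiki in $(grep -v '^#' /srv/mediawiki/dblists/all.dblist); do
>>     mwscript-k8s -f --comment="T341553" -- requeueTranscodes.php --wiki="$wiki"
>> done
>>
>> Drop the -f and the same loop becomes fire-and-forget: every job is launched back to back, and they all run simultaneously.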
>> On Tue, Oct 8, 2024 at 4:37 PM Brooke Vibber <[email protected]> wrote:
>>
>>> I'm starting some batch maintenance of video transcodes, so I'm exercising the new k8s-based maint script system on TMH's requeueTranscodes.php. Good news: no surprises so far, everything's working just fine. :D
>>>
>>> Since I'm running the same scripts over multiple wikis, I went ahead and manually wrapped them in a bash for loop so it submits one job at a time out of all.dblist, using a screen session for the wrapper loop and tailing the logs into the session so they don't all smash out at once, plus a second manually-started run for Commons. :)
>>>
>>> First-class support for running over a dblist will be a very welcome improvement, and should be pretty straightforward! Good work, everybody. :D
>>>
>>> The longest job (Commons) might take a couple of days to run, so we'll see if anything explodes later! hehe
>>>
>>> -- brooke
>>>
>>> On Wed, Sep 25, 2024 at 8:11 PM Reuven Lazarus <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> With MediaWiki at the WMF moving to Kubernetes, it's now time to start running manual maintenance scripts there. Any time you would previously SSH to a mwmaint host and run mwscript, follow these steps instead. The old way will continue working for a little while, but it will be going away.
>>>>
>>>> What's familiar:
>>>>
>>>> Starting a maintenance script looks like this:
>>>>
>>>> rzl@deploy2002:~$ mwscript-k8s --comment="T341553" -- Version.php --wiki=enwiki
>>>>
>>>> Any options for the mwscript-k8s tool, as described below, go before the --. After the --, the first argument is the script name; everything else is passed to the script. This is the same as you're used to passing to mwscript.
>>>>
>>>> What's different:
>>>>
>>>> - Run mwscript-k8s on a deployment host, not the maintenance host. Either deployment host will work; your job will automatically run in whichever data center is active, so you no longer need to change hosts when there’s a switchover.
>>>>
>>>> - You don't need a tmux. By default the tool launches your maintenance script and exits immediately, without waiting for your job to finish. If you log out of the deployment host, your job keeps running on the Kubernetes cluster.
>>>>
>>>> - Kubernetes saves the maintenance script's output for seven days after completion. By default, mwscript-k8s prints a kubectl command that you (or anyone else) can paste and run to monitor the output or save it to a file.
>>>>
>>>> - As a convenience, you can pass -f (--follow) to mwscript-k8s to immediately begin tailing the script output. If you like, you can do this inside a tmux and keep the same workflow as before. Either way, you can safely disconnect and your script will continue running on Kubernetes.
>>>>
>>>> rzl@deploy2002:~$ mwscript-k8s -f -- Version.php --wiki=testwiki
>>>> [...]
>>>> MediaWiki version: 1.43.0-wmf.24 LTS (built: 22:35, 23 September 2024)
>>>>
>>>> - For scripts that take input on stdin, you can pass --attach to mwscript-k8s, either interactively or in a pipeline.
>>>>
>>>> rzl@deploy2002:~$ mwscript-k8s --attach -- shell.php --wiki=testwiki
>>>> [...]
>>>> Psy Shell v0.12.3 (PHP 7.4.33 — cli) by Justin Hileman
>>>> > $wmgRealm
>>>> = "production"
>>>> >
>>>>
>>>> rzl@deploy2002:~$ cat example_url.txt | mwscript-k8s --attach -- purgeList.php
>>>> [...]
>>>> Purging 1 urls
>>>> Done!
>>>>
>>>> - Your maintenance script runs in a Docker container which will not outlive it, so it can't save persistent files to disk. Ensure your script logs its important output to stdout, or persists it in a database or other remote storage.
>>>>
>>>> - The --comment flag sets an optional (but encouraged) descriptive label, such as a task number.
>>>>
>>>> - Using standard kubectl commands[1][2], you can check the status, and view the output, of your running jobs or anyone else's, as sketched below. (Example: `kube_env mw-script codfw; kubectl get pod -l username=rzl`)
>>>>
>>>> [1]: https://wikitech.wikimedia.org/wiki/Kubernetes/Kubectl
>>>> [2]: https://kubernetes.io/docs/reference/kubectl/quick-reference/
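>>>>
>>>> For instance, a minimal sketch of listing your pods and tailing one's output (the pod name on the last line is a placeholder; substitute one from the get pod listing):
>>>>
>>>> rzl@deploy2002:~$ kube_env mw-script codfw
>>>> rzl@deploy2002:~$ kubectl get pod -l username=rzl
>>>> rzl@deploy2002:~$ kubectl logs -f mw-script-example-pod    # placeholder name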
>>>>
>>>> What's not supported yet:
>>>>
>>>> - Maintenance scripts launched automatically on a timer. We're working on migrating them -- for now, this is for one-off scripts launched by hand.
>>>>
>>>> - If your job is interrupted (e.g. by hardware problems), Kubernetes can automatically move it to another machine and restart it, babysitting it until it completes. But we only want to do that if your job is safe to restart. So by default, if your job is interrupted, it will stay stopped until you restart it yourself. Soon, we'll add an option to declare "this is idempotent, please restart it as needed," and we recommend that design for new scripts.
>>>>
>>>> - No support yet for mwscriptwikiset, foreachwiki, foreachwikiindblist, etc., but we'll add similar functionality as flags to mwscript-k8s.
>>>>
>>>> Your feedback:
>>>>
>>>> Let me know by email or IRC, or on Phab (T341553 <https://phabricator.wikimedia.org/T341553>). If mwscript-k8s doesn't work for you, for now you can fall back to using the mwmaint hosts as before -- but they will be going away. Please report any problems sooner rather than later, so that we can ensure the new system meets your needs before that happens.
>>>>
>>>> Thanks,
>>>>
>>>> Reuven, for Service Ops SRE
>>
>> --
>> *Reuven Lazarus *(he/him)
>> Staff Site Reliability Engineer
>> Wikimedia Foundation <https://wikimediafoundation.org/>
>
> --
> Alexandros Kosiaris
> Principal Site Reliability Engineer
> Wikimedia Foundation
_______________________________________________
Wikitech-l mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
